30th, Monday

ANOVA, or Analysis of Variance, is a statistical technique used to analyze the variation between different groups or factors and determine whether there are significant differences among them. It’s commonly used in the field of statistics to compare means from more than two groups and to understand the sources of variability in a dataset. Here are some key points about ANOVA:

ANOVA is used to test whether there are statistically significant differences between the means of three or more groups or treatments. It helps determine if the variations between these groups are likely due to genuine differences or just random variation.

If ANOVA indicates significant differences between groups, post-hoc tests like Tukey’s HSD or Bonferroni tests can be used to determine which specific group means are different from each other.

ANOVA is widely used in various fields, including experimental research, social sciences, medicine, and quality control, to compare multiple groups and understand the impact of different factors on a dependent variable. It is a powerful tool for making statistical inferences about group differences.
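As a rough sketch, a one-way ANOVA can be run in Python with SciPy's `f_oneway`; the three groups below are simulated, with made-up means and sizes, purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical samples from three groups (e.g., three treatments)
rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.normal(12.0, 2.0, size=30)
group_c = rng.normal(10.5, 2.0, size=30)

# One-way ANOVA: tests the null hypothesis that all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Reject the null at the 5% level if p < 0.05
if p_value < 0.05:
    print("At least one group mean differs significantly.")
```

If the result is significant, a post-hoc test such as Tukey's HSD would then identify which specific pairs of groups differ.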


27th, Friday

K-Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm used for both classification and regression tasks. It works by finding the K data points in the training set that are closest to a given input data point and then making predictions based on the majority class (for classification) or the average (for regression) of those K neighbors. Here are some common applications of KNN:

Classification: KNN can be used for tasks such as image classification, text classification, and spam detection. Given a new data point, it can classify it into one of the predefined classes based on the majority class of its K nearest neighbors.

Regression: KNN can also be used for regression tasks, such as predicting the price of a house based on the prices of nearby houses or estimating a person’s income based on the incomes of their neighbors.

Anomaly Detection: KNN can be applied to detect anomalies or outliers in a dataset by identifying data points that are significantly different from their neighbors.

Recommender Systems: KNN can be used in collaborative filtering-based recommender systems to suggest items to users based on the preferences of users who are similar to them.

Clustering: Although KNN is primarily a supervised learning algorithm, it can also be used for unsupervised tasks like clustering. By finding the nearest neighbors of data points, it can group similar data points together.

Image and Handwriting Recognition: KNN has been used in image recognition tasks, such as identifying handwritten digits or recognizing patterns in images.

Geospatial Analysis: KNN can be used in geospatial applications to find the nearest points of interest, such as identifying the closest restaurants or hospitals based on a user’s location.
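A minimal sketch of the KNN classification idea, using only NumPy on a made-up two-class toy dataset (in practice a library such as scikit-learn would typically be used):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify a single point x by majority vote of its k nearest neighbors."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the k closest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority class among those neighbors
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Toy 2-D data: class 0 near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.5, 0.5])))  # 0
print(knn_predict(X, y, np.array([5.5, 5.5])))  # 1
```

For regression, the majority vote would simply be replaced by the average of the k neighbors' values.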

Week 7 – Monday

In today’s class we discussed k-means clustering.

K-means clustering: K-means clustering is a widely used technique in statistical analysis and unsupervised machine learning for partitioning a dataset into distinct groups or clusters based on the similarity of data points. It’s a straightforward and effective method for grouping data into clusters with similar characteristics.

K-means clustering aims to divide a dataset into ‘K’ clusters, where each data point belongs to the cluster with the nearest mean (centroid). The ‘K’ in K-means represents the number of clusters, which is typically pre-specified. The algorithm iteratively refines the clusters by assigning data points to the nearest centroid and recalculating the centroid as the mean of the points in each cluster.

In our project we are using k-means clustering to divide the dataset into a few clusters. Age is the variable that drives the clusters: we can group records into age bands such as 20-30, 40-50, and 50-60. We can then calculate the difference between each group’s mean, and from that obtain a p-value.
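A minimal NumPy sketch of the k-means loop described above, applied to hypothetical ages (the data and k=3 are illustrative, not our project's actual values):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: returns centroids and cluster labels."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Hypothetical 1-D ages, clustered into k=3 age groups
ages = np.array([[22], [25], [28], [43], [47], [52], [55], [58]], dtype=float)
centroids, labels = kmeans(ages, k=3)
print(sorted(centroids.ravel()))
```

In practice `sklearn.cluster.KMeans` is the usual choice; this sketch just makes the assign-then-recompute loop explicit.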

Week 6 – Friday

Clustering: It is a technique used to group similar data points or observations into clusters or categories. The primary goal of clustering is to find patterns or structure in data, with the assumption that data points within the same cluster are more similar to each other than to those in other clusters. Clustering is commonly used in various fields, including machine learning, data analysis, pattern recognition, and data mining.

Clustering is a powerful technique for organizing and summarizing large and complex datasets, enabling researchers and analysts to identify hidden patterns and structures within their data. The choice of the clustering algorithm and parameters depends on the nature of the data and the goals of the analysis.

With the help of clustering, we are able to group the records by geo position, race, and age group. This makes it easier to compare the different groups and see where incidents are occurring more often.

Week 6 – Wednesday

In today’s class we discussed Monte Carlo approximation, a statistical technique used to estimate the behavior of a system, process, or phenomenon by generating a large number of random samples and then analyzing the results of those samples. It is particularly useful when dealing with complex systems, mathematical models, or simulations where analytical solutions are difficult or impossible to obtain.

The fundamental idea behind Monte Carlo approximation is to use random sampling to approximate numerical solutions to problems. Monte Carlo simulations can provide valuable insights into the behavior and uncertainty of complex systems, allowing analysts and researchers to make informed decisions or predictions. The accuracy of Monte Carlo approximations generally improves as the number of random samples (iterations) increases. However, it may require a substantial computational effort when dealing with complex or high-dimensional problems.
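A classic illustration of the idea is estimating π by random sampling: the fraction of random points in the unit square that land inside the quarter circle approximates π/4.

```python
import numpy as np

# Sample n random points in the unit square and count how many
# fall inside the quarter circle of radius 1 (x^2 + y^2 <= 1)
rng = np.random.default_rng(42)
n = 1_000_000
x = rng.random(n)
y = rng.random(n)
inside = (x**2 + y**2) <= 1.0

pi_estimate = 4.0 * inside.mean()
print(pi_estimate)  # close to 3.1416; accuracy improves with more samples
```

Rerunning with larger n shrinks the error, which matches the point above: the approximation improves as the number of random samples grows, at the cost of more computation.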

Week 6 – Monday

Today in class I learnt about geo position. In the dataset provided by the Washington Post, there are two columns, latitude and longitude. With these columns we can explore the concept of geo position.

A geo position specifies a location on the Earth’s surface in terms of latitude and longitude. These coordinates provide a precise and unique identifier for any point on the Earth. Geographic positions are a fundamental concept in geography, cartography, navigation, and geographic information systems (GIS).

The components of geo position:

  1. Latitude: Latitude is the angular distance measured in degrees north or south of the equator (0° latitude). Lines of latitude run parallel to the equator, with positive values indicating locations north of the equator and negative values for locations south of the equator. The range of latitude is from -90° (South Pole) to +90° (North Pole).
  2. Longitude: Longitude is the angular distance measured in degrees east or west of the Prime Meridian (0° longitude), which passes through Greenwich, England. Lines of longitude, also known as meridians, run from the North Pole to the South Pole. Positive values indicate locations to the east of the Prime Meridian, while negative values indicate locations to the west. The range of longitude is from -180° (the International Date Line) to +180°.

With the latitude and longitude data in the dataset, we can explore geo position and narrow the data down to the regions where shootings are happening most.
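Given latitude/longitude pairs like those in the dataset, the distance between two points can be approximated with the haversine formula; the coordinates in this sketch are illustrative values roughly corresponding to Boston and New York City:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate distance from Boston to New York City (~306 km)
dist = haversine_km(42.3601, -71.0589, 40.7128, -74.0060)
print(round(dist, 1))
```

This kind of distance computation is what lets us group shootings by region or find incidents near a given point.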

Cohen’s d, Police Shootings Dataset – Week 5 – Friday

Cohen’s d is a widely used effect size measure in statistical analysis, especially in the context of hypothesis testing and comparing the means of two groups. It quantifies the standardized difference between two means, providing a measure of the magnitude of the effect or the strength of the relationship between two groups.

Cohen’s d is calculated as follows:

d = (Mean of Group 1 − Mean of Group 2) / Pooled Standard Deviation

The interpretation of Cohen’s d is as follows:

  • A value of 0 indicates no difference between the means of the two groups.
  • Small effect size: Typically, a value of d around 0.2 is considered a small effect.
  • Medium effect size: A value around 0.5 is considered a medium effect.
  • Large effect size: A value greater than 0.8 is considered a large effect.

From this police shootings dataset we can calculate Cohen’s d based on age groups, to see, for example, which age group is dying more, and extend the comparison to other groupings.
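The formula above can be sketched in Python; the two age groups below are hypothetical numbers, not values taken from the actual dataset:

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d: standardized mean difference using the pooled standard deviation."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    # Pooled standard deviation, weighting each sample variance (ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1))
                        / (n1 + n2 - 2))
    return (g1.mean() - g2.mean()) / pooled_sd

# Hypothetical ages in two groups
young = [22, 25, 27, 30, 24, 26]
older = [48, 52, 55, 50, 47, 53]

d = cohens_d(older, young)
print(round(d, 2))  # well above 0.8, i.e. a large effect by the scale above
```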

The Police Shootings, Washington Post

The dataset provided by the Washington Post covers police shootings that occurred at various places. It records the city where each shooting happened and the age, gender, and race of the person, along with the latitude and longitude coordinates of the shooting location. The data also notes whether the person was armed, whether they were mentally ill, and whether a body camera was in use, all of which help in categorizing the shootings. There is also a flee status, which describes whether and how the person was fleeing. Together these fields help in analyzing the data to find out why these shootings are happening, where most of them are happening, and who they are happening to most. We can do further research using this data and draw some conclusions.

10/04/23, Wednesday

In our project, we looked at data on levels of inactivity, obesity, and associated diabetes rates in various counties to try to forecast diabetes. When we first tried basic linear regression models, we found they were insufficient because the data exhibited heteroskedasticity. A quadratic model augmented with an interaction term proved more accurate for predicting diabetes.

In order to evaluate the test errors of several models, including the quadratic one, we picked counties having complete data for all three parameters. With more information, a wider pattern may be seen, allowing for a more straightforward and precise model. The results of our investigation are summarized below, along with a recommendation for potential future research using more information.
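A rough sketch of fitting a quadratic model with an interaction term by ordinary least squares, on simulated county-level data (the generating coefficients and noise level here are made up, not our project's fitted values):

```python
import numpy as np

# Simulated county-level data: % inactivity, % obesity, % diabetes
rng = np.random.default_rng(1)
inactivity = rng.uniform(15, 35, size=200)
obesity = rng.uniform(20, 40, size=200)
diabetes = (0.05 * inactivity + 0.08 * obesity + 0.002 * inactivity**2
            + 0.001 * inactivity * obesity + rng.normal(0, 0.5, size=200))

# Design matrix: intercept, linear, quadratic, and interaction terms
X = np.column_stack([
    np.ones_like(inactivity),
    inactivity,
    obesity,
    inactivity**2,           # quadratic term
    inactivity * obesity,    # interaction term
])
coef, *_ = np.linalg.lstsq(X, diabetes, rcond=None)

# R^2 as a quick in-sample fit check
pred = X @ coef
r2 = 1 - np.sum((diabetes - pred)**2) / np.sum((diabetes - diabetes.mean())**2)
print(f"R^2 = {r2:.3f}")
```

For a real comparison of models, R² on held-out counties (test error) would be the fairer metric, as described above.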

10/6/23 – Friday

Skewness:

Skewness measures the asymmetry of a distribution. It tells you whether the data is skewed to the left (negatively skewed), to the right (positively skewed), or is approximately symmetric. In a positively skewed distribution, the tail on the right side is longer or fatter than the left side, and the majority of the data points are concentrated on the left side. In a negatively skewed distribution, the tail on the left side is longer or fatter than the right side, and the majority of the data points are concentrated on the right side. A perfectly symmetric distribution has a skewness of zero.

Kurtosis:

Kurtosis measures the “tailedness” of a distribution, indicating whether the data has heavy tails (leptokurtic) or light tails (platykurtic) compared to a normal distribution. A positive excess kurtosis (leptokurtic) indicates that the distribution has heavier tails and a more peaked central region than a normal distribution. A negative excess kurtosis (platykurtic) indicates that the distribution has lighter tails and a flatter central region than a normal distribution. A normal distribution has a kurtosis of 3, or equivalently an excess kurtosis of 0, so any deviation from this baseline (greater or smaller) indicates the degree of departure from normality.

We calculated these before combining the data and after combining the data.
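Both quantities can be computed with SciPy; note that `scipy.stats.kurtosis` returns excess kurtosis by default (0 for a normal distribution, not 3). The data here is simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
symmetric = rng.normal(0, 1, size=10_000)         # roughly symmetric data
right_skewed = rng.exponential(1.0, size=10_000)  # long right tail

# skew() is ~0 for symmetric data; kurtosis() returns EXCESS kurtosis
sym_skew, sym_kurt = stats.skew(symmetric), stats.kurtosis(symmetric)
exp_skew, exp_kurt = stats.skew(right_skewed), stats.kurtosis(right_skewed)

print(f"normal:      skew={sym_skew:.3f}, excess kurtosis={sym_kurt:.3f}")
print(f"exponential: skew={exp_skew:.3f}, excess kurtosis={exp_kurt:.3f}")
```

Running the same two calls before and after combining the data, as we did, shows how merging groups changes the shape of the distribution.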