Cohen’s d, Police Shootings Dataset – Week 5 – Friday

Cohen’s d is a widely used effect size measure in statistical analysis, especially in the context of hypothesis testing and comparing the means of two groups. It quantifies the standardized difference between two means, providing a measure of the magnitude of the effect or the strength of the relationship between two groups.

Cohen’s d is calculated as follows:

d = (Mean of Group 1 − Mean of Group 2) / Pooled Standard Deviation

The interpretation of Cohen’s d is as follows:

  • A value of 0 indicates no difference between the means of the two groups.
  • Small effect size: Typically, a value of d around 0.2 is considered a small effect.
  • Medium effect size: A value around 0.5 is considered a medium effect.
  • Large effect size: A value of around 0.8 or greater is considered a large effect.

From this police shootings dataset we can calculate Cohen’s d across age groups (and other groupings, such as gender or race) to see which groups are over-represented among those killed.
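
As a rough sketch (not part of the original analysis), Cohen’s d could be computed with a small Python function; the group values below are hypothetical stand-ins for two age groups from the shootings data.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d: standardized difference between two group means."""
    n1, n2 = len(group1), len(group2)
    mean_diff = np.mean(group1) - np.mean(group2)
    # Pooled standard deviation, weighted by each group's degrees of freedom
    pooled_var = ((n1 - 1) * np.var(group1, ddof=1) +
                  (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2)
    return mean_diff / np.sqrt(pooled_var)

# Hypothetical ages for two groups (illustrative values only)
group_a = [23, 31, 27, 35, 29, 41, 26]
group_b = [34, 45, 38, 52, 47, 40, 39]
print(cohens_d(group_a, group_b))
```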

The Police Shootings Dataset, Washington Post

The dataset provided by the Washington Post records police shootings that have occurred at various places. It covers the city where each shooting occurred, along with the age, gender, and race of the person killed, and the coordinates (latitude and longitude) of the location. The data also record what the person was armed with, whether they were mentally ill, and whether a body camera was present, all of which help in categorizing the shootings. There is also a flee status field indicating whether and how the person was fleeing. Together these variables help us analyze why these shootings are happening, where most of them are happening, and to whom they are happening most often. We can do more research using these data and draw some conclusions.
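
A minimal sketch of how this dataset might be loaded and inspected with pandas; the file name and column names here (age, flee_status, etc.) are assumptions based on the description above and may differ from the actual download.

```python
import pandas as pd

# Assumed local copy of the Washington Post shootings CSV (file name is hypothetical)
df = pd.read_csv("fatal-police-shootings-data.csv")

print(df.shape)               # number of records and fields
print(df.columns.tolist())    # e.g. age, gender, race, city, flee_status, ...
print(df["age"].describe())   # quick summary of the age variable, if present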

10/04/23, Wednesday

In our project, we looked at data on levels of inactivity, obesity, and associated diabetes rates in various counties to try to forecast diabetes. When we first tried basic linear regression models, we found that they were insufficient because the data exhibited heteroskedasticity. A quadratic model augmented with an interaction term proved more accurate for predicting diabetes.

To evaluate the test errors of several models, including the quadratic one, we restricted the analysis to counties with complete data for all three variables. With more data, a broader pattern may emerge, allowing for a simpler and more precise model. The results of our investigation are summarized below, along with a recommendation for potential future research using additional data.
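
A minimal sketch of the kind of model described above, fit with statsmodels; the file name and column names (inactivity, obesity, diabetes) are assumptions, and the train/test split stands in for the test-error comparison.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

# Hypothetical county-level data; file and column names are assumptions
data = pd.read_csv("county_health_data.csv")  # expects inactivity, obesity, diabetes columns

train, test = train_test_split(data, test_size=0.2, random_state=0)

# Quadratic model with an interaction term, as described above
model = smf.ols(
    "diabetes ~ inactivity + obesity + I(inactivity**2) + I(obesity**2) + inactivity:obesity",
    data=train,
).fit()

pred = model.predict(test)
test_mse = np.mean((test["diabetes"] - pred) ** 2)  # test error used to compare models
print(test_mse)
```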

10/6/23 – Friday

Skewness:

Skewness measures the asymmetry of a distribution. It tells you whether the data is skewed to the left (negatively skewed), to the right (positively skewed), or is approximately symmetric. In a positively skewed distribution, the tail on the right side is longer or fatter than the left side, and the majority of the data points are concentrated on the left side. In a negatively skewed distribution, the tail on the left side is longer or fatter than the right side, and the majority of the data points are concentrated on the right side. A perfectly symmetric distribution has a skewness of zero.

Kurtosis:

Kurtosis measures the “tailedness” of a distribution, indicating whether the data has heavy tails (leptokurtic) or light tails (platykurtic) compared to a normal distribution. A positive excess kurtosis (leptokurtic) indicates that the distribution has heavier tails and a more peaked central region than a normal distribution. A negative excess kurtosis (platykurtic) indicates that the distribution has lighter tails and a flatter central region than a normal distribution. A normal distribution has a kurtosis of 3 (equivalently, an excess kurtosis of 0), so any deviation from this value (greater or smaller) indicates the degree of departure from normality.

We calculated these statistics both before and after combining the data.
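
As a sketch, skewness and kurtosis can be computed with scipy; note that scipy’s kurtosis returns excess kurtosis (0 for a normal distribution) by default, so fisher=False is needed to report it on the scale where a normal distribution equals 3. The file and column names here are assumptions.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

# Hypothetical column from the project data (names are assumptions)
values = pd.read_csv("county_health_data.csv")["diabetes"]

print(skew(values, nan_policy="omit"))
# fisher=False reports kurtosis on the scale where a normal distribution equals 3
print(kurtosis(values, fisher=False, nan_policy="omit"))
```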

September 29th

K-Fold Cross-Validation

K-fold cross-validation is a technique commonly used in machine learning and statistical modeling to assess the performance and generalization ability of a predictive model. It is particularly helpful when you have a limited amount of data and want to make efficient use of it while avoiding overfitting. K-fold cross-validation involves the following steps (a short sketch follows the list below):

  1. Data Splitting
  2. Model Training and Evaluation
  3. Performance Metric Calculation
  4. Cross-Validation Results
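
A short sketch of these steps with scikit-learn; the features X and target y are synthetic placeholders rather than the project data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data standing in for the project's features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=100)

# 1. Split the data into k folds; 2-3. train and evaluate on each fold; 4. aggregate
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")
print(scores)         # per-fold performance metric
print(scores.mean())  # overall cross-validation result
```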

October 2nd

We used the concept of regularization in our project. Regularization is a set of techniques used to prevent overfitting and improve the generalization performance of statistical models. Overfitting occurs when a model learns to fit the training data too closely, capturing noise and idiosyncrasies in the data rather than the underlying patterns. This can lead to poor performance when the model is applied to new, unseen data. Regularization methods are particularly important in situations where the number of features (variables) is large relative to the number of observations (data points), which is common in many statistical modeling problems. Regularization helps to control the complexity of a model and reduce the risk of overfitting.
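
A minimal sketch of ridge and lasso regularization with scikit-learn; the data here are synthetic placeholders with many features relative to observations, not the project data.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data with many features relative to observations (where regularization helps)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 40))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=60)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1)):
    # alpha controls the penalty strength: larger alpha = simpler, more constrained model
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())
```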

27thSeptember,Wednesday

Cross-validation is a statistical technique used in mathematical statistics and machine learning to assess the performance and generalization ability of a predictive model. It involves partitioning a dataset into multiple subsets, training the model on some of these subsets, and then evaluating its performance on the remaining data. The primary goal of cross-validation is to estimate how well the model will perform on unseen data. Cross-validation helps in model selection by comparing the performance of different models on the same dataset, and it aids in hyperparameter tuning by assessing how different hyperparameter settings affect model performance. By using cross-validation, researchers and data scientists can make more informed decisions about which models and parameter settings are likely to perform well when applied to new, unseen data, thereby improving the reliability of statistical analyses and predictions.
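
As an illustration of using cross-validation for hyperparameter tuning, here is a hedged sketch with scikit-learn’s GridSearchCV; the candidate alpha values and the synthetic data are arbitrary placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=80)

# Cross-validation compares candidate hyperparameter settings on the same data
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```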

25th Sept – Monday

A key idea in machine learning and statistical modeling is “Cross-Validation: The Right and Wrong Ways”. The cross-validation technique is used to evaluate the performance and generalizability of predictive models, and both proper and improper cross-validation procedures are likely covered in this video or topic. The “right way” probably entails carefully dividing the data into training and validation sets, choosing an appropriate number of folds (for example, 5 or 10), and thoroughly assessing a model’s performance to prevent overfitting or underfitting. By contrast, the “wrong way” could involve typical errors such as data leakage, using inappropriate evaluation metrics, or handling imbalanced datasets improperly. Building trustworthy and dependable machine learning models requires an understanding of these correct and incorrect methods.
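
A sketch of one “right vs. wrong” contrast mentioned above: fitting a scaler on the full dataset before cross-validation leaks information from the validation folds, while putting the scaler inside a pipeline keeps each fold’s preprocessing fit on training data only. The data are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Wrong way: scaling on the full dataset leaks validation-fold statistics into training
X_leaky = StandardScaler().fit_transform(X)
wrong = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Right way: the scaler is refit inside each training fold via a pipeline
right = cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5)

print(wrong.mean(), right.mean())
```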

22nd September, Friday

 

K-nearest neighbors (KNN) is a simple and popular machine learning algorithm used for classification and regression tasks. Here’s a brief overview: 

KNN is based on the idea that similar data points tend to be close to each other in the feature space. It makes predictions by finding the K data points in the training set that are closest (most similar) to a given input data point.

Parameters: The main parameter in KNN is ‘K,’ which represents the number of nearest neighbors to consider when making a prediction. A smaller K value makes the algorithm more sensitive to noise, while a larger K value makes it smoother but might miss fine-grained patterns.

We used KNN in our project; it gave R^2 = 0.23, which is lower than what we obtained with linear regression.
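
A hedged sketch of KNN regression and its R^2 score with scikit-learn; the data are synthetic stand-ins for the project data, so the score will not match the 0.23 reported above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic placeholder data
rng = np.random.default_rng(4)
X = rng.normal(size=(150, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=150)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K (n_neighbors) controls how many nearest neighbors are averaged for each prediction
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print(r2_score(y_test, knn.predict(X_test)))
```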