## Distributions

### Standard Normal Distribution

Standardization can be applied to any distribution: subtract the mean from every value and divide by the standard deviation. A standardized normal distribution has a mean of 0 and a standard deviation of 1.

#### Data Set

original_dataset <- c(567.45, 572.45, 572.45, 589.12, 613.87, 615.78, 628.45, 644.87, 650.45, 652.20, 656.87, 661.45, 666.45, 667.70, 668.95, 675.28, 675.78, 685.53, 694.28, 697.62, 705.78, 705.87, 708.12, 711.03, 714.03, 716.03, 722.28, 728.12, 728.70, 729.03, 730.12, 731.95, 735.03, 736.95, 737.37, 738.28, 739.78, 740.62, 743.62, 747.20, 748.20, 748.28, 748.53, 750.03, 752.12, 754.70, 755.03, 758.37, 760.53, 764.03, 769.28, 775.45, 781.20, 781.70, 785.62, 792.78, 793.37, 795.28, 797.62, 798.95, 799.70, 799.95, 810.87, 811.53, 813.62, 814.03, 814.78, 817.87, 818.87, 820.70, 821.12, 825.62, 828.62, 841.45, 842.03, 842.87, 849.62, 874.70, 878.78, 897.45)

ggplot(mapping = aes(original_dataset)) +
  geom_histogram(bins = 20)

#### Calculate the mean and standard deviation of the dataset

mean <- mean(original_dataset)
standard_deviation <- sd(original_dataset)

mean

##  743.027

standard_deviation

##  73.95317


#### Standardize the dataset

The formula:

$$z=\frac{x-\mu}{\sigma}$$

Here the population parameters are estimated with the sample mean and standard deviation calculated above.

standardized_dataset <- (original_dataset-mean)/standard_deviation

standardized_dataset

##   -2.374164556 -2.306554204 -2.306554204 -2.081141291 -1.746470048
##   -1.720642893 -1.549318261 -1.327285865 -1.251832712 -1.228169089
##  -1.165021020 -1.103089938 -1.035479586 -1.018576998 -1.001674410
##  -0.916079704 -0.909318669 -0.777478482 -0.659160366 -0.613996651
##  -0.503656557 -0.502439570 -0.472014912 -0.432665687 -0.392099476
##  -0.365055335 -0.280542395 -0.201573504 -0.193730703 -0.189268420
##  -0.174529363 -0.149783974 -0.108135997 -0.082173622 -0.076494352
##  -0.064189268 -0.043906163 -0.032547623  0.008018588  0.056427600
##   0.069949670  0.071031436  0.074411953  0.094695059  0.122956186
##   0.157843128  0.162305411  0.207469126  0.236676798  0.284004045
##   0.354994914  0.438426089  0.516177994  0.522939029  0.575945545
##   0.672763569  0.680741591  0.706568745  0.738210390  0.756194744
##   0.766336296  0.769716814  0.917377823  0.926302389  0.954563516
##   0.960107565  0.970249118  1.012032316  1.025554386  1.050299775
##   1.055979045  1.116828361  1.157394573  1.330882736  1.338725537
##   1.350084076  1.441358051  1.780491577  1.835661624  2.088118679
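As a sanity check, base R's built-in `scale()` function performs the same centering and scaling, so the manual calculation can be verified against it. The snippet below uses a short example vector so it is self-contained; the same call works on `original_dataset`:

```r
# scale() computes (x - mean(x)) / sd(x) and returns a matrix,
# so as.vector() is used to flatten the result for comparison
x <- c(567.45, 572.45, 613.87, 743.62, 897.45)  # a few values from the dataset

manual  <- (x - mean(x)) / sd(x)
builtin <- as.vector(scale(x))

all.equal(manual, builtin)  # TRUE
```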


Now that we have the standardized dataset, let's verify its mean and standard deviation before plotting it.

new_mean <- mean(standardized_dataset)
new_standard_deviation <- sd(standardized_dataset)

round(new_mean)

##  0

round(new_standard_deviation)

##  1


#### Plot the data on a graph to see the change

ggplot(mapping = aes(standardized_dataset)) +
  geom_histogram(bins = 20)

# Confidence Interval

## Single Population

### z-Statistic

$$CI = \overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

#### Dataset

Given:

dataScientist_salaries <- c(117313, 104002, 113038, 101936, 84560, 113136, 80740, 100536, 105052, 87201, 91986, 94868, 90745, 102848, 85927, 112276, 108637, 96818, 92307, 114564, 109714, 108833, 115295, 89279, 81720, 89344, 114426, 90410, 95118, 113382)

std_dev <- 15000


Population standard deviation = $15,000

#### Calculate the mean

salary_mean <- mean(dataScientist_salaries)

salary_mean

##  100200.4

#### Find the appropriate z-score for calculating a 90% confidence interval

R has a couple of handy functions for the standard normal distribution. pnorm() takes a z-value and returns the cumulative probability; qnorm() does the reverse.

# 95% confidence interval, z 0.025
pnorm(1.96)

##  0.9750021

# 99% confidence interval, z 0.005
qnorm(0.995)

##  2.575829

Knowing this, to get the z-value for a 90% confidence interval: alpha is 1 - 0.9 = 0.1. Divide that by 2 to get alpha/2 = 0.05, and look up the z-value for 1 - 0.05 = 0.95.

z <- qnorm(0.95)

z

##  1.644854

I find this whole thing unnecessarily complex. Let's compute the z-value by simply plugging in the desired confidence level:

confidence <- 90
z <- round(qnorm(1-(1-(confidence/100))/2), 2)

lower_int <- round(salary_mean - z*(std_dev/sqrt(30)))
upper_int <- round(salary_mean + z*(std_dev/sqrt(30)))

Now we can say with 90% confidence that the average salary for data scientists falls between 95709 and 104692.

### Student's t Distribution

Developed by William Sealy Gosset, a researcher at the Guinness brewery, who published under the pseudonym "Student".

#### Dataset

t_dataset <- c(78000, 90000, 75000, 117000, 105000, 96000, 89500, 102300, 80000)

#### Calculate the mean and the standard deviation of the dataset

t_mean <- mean(t_dataset)
t_sd <- sd(t_dataset)

#### Determine which statistic to use for inference

If the population standard deviation is not known, we should use the t-statistic. If it is known, we use the z-statistic.

#### Find the appropriate statistic, taking into consideration the degrees of freedom, for 99% confidence

The degrees of freedom equal the sample size minus one.

# Find df with R
t_df <- length(t_dataset)-1

t_df

##  8

Like the z-value, R also has handy functions for the t distribution: pt() and qt().

# Getting the t-statistic
# 99% confidence interval
t_value <- round(qt(1-0.01/2, t_df), 2)

t_value

##  3.36

#### Find the 99% confidence interval

Formula:

$$CI = \overline{x} \pm t_{n-1, \alpha/2} \frac{s}{\sqrt{n}}$$

lower_int <- round(t_mean - t_value*(t_sd/sqrt(length(t_dataset))))
upper_int <- round(t_mean + t_value*(t_sd/sqrt(length(t_dataset))))

Using the t-statistic, we can say with 99% confidence that the average salary for data scientists falls between 76930 and 108137.

## Two Variables

### Dependent Samples

#### Background

The 365 team has developed a diet and exercise program for losing weight, and it seems to work like a charm. However, you are interested in how much weight you are likely to lose. We are provided with the following dataset:

subject <- c(1:10)
weight_before <- c(228.58, 244.01, 262.46, 224.32, 202.14, 246.98, 195.86, 231.88, 243.32, 266.74)
weight_after <- c(204.74, 223.95, 232.94, 212.04, 191.74, 233.47, 177.60, 213.85, 218.85, 236.86)
weight_loss_df <- data.frame(subject, weight_before, weight_after)

kable(weight_loss_df)

| subject | weight_before | weight_after |
|--------:|--------------:|-------------:|
|       1 |        228.58 |       204.74 |
|       2 |        244.01 |       223.95 |
|       3 |        262.46 |       232.94 |
|       4 |        224.32 |       212.04 |
|       5 |        202.14 |       191.74 |
|       6 |        246.98 |       233.47 |
|       7 |        195.86 |       177.60 |
|       8 |        231.88 |       213.85 |
|       9 |        243.32 |       218.85 |
|      10 |        266.74 |       236.86 |

Adding a new column to find the difference:

df <- weight_loss_df %>% mutate("difference" = weight_after-weight_before)

kable(df)

| subject | weight_before | weight_after | difference |
|--------:|--------------:|-------------:|-----------:|
|       1 |        228.58 |       204.74 |     -23.84 |
|       2 |        244.01 |       223.95 |     -20.06 |
|       3 |        262.46 |       232.94 |     -29.52 |
|       4 |        224.32 |       212.04 |     -12.28 |
|       5 |        202.14 |       191.74 |     -10.40 |
|       6 |        246.98 |       233.47 |     -13.51 |
|       7 |        195.86 |       177.60 |     -18.26 |
|       8 |        231.88 |       213.85 |     -18.03 |
|       9 |        243.32 |       218.85 |     -24.47 |
|      10 |        266.74 |       236.86 |     -29.88 |

Now that we've calculated the difference, we can treat it as a single population. We'll be using the t-statistic because of the small (<30) sample size and unknown population variance.

# Calculating the mean and standard deviation
mean <- mean(df$difference)
sd <- sd(df$difference)

# Finding the t-statistic for a 95% confidence interval
# Degrees of freedom: number of subjects minus one (nrow, not length,
# since length(df) counts columns)
t_value <- qt(1-0.05/2, nrow(df)-1)

# Calculating the confidence interval
lower_limit <- mean-t_value*(sd/sqrt(10))
upper_limit <- mean+t_value*(sd/sqrt(10))

Results:

• Mean = -20.025
• Standard deviation = 6.8617754
• $$t_{9,\ 0.025}$$ = 2.2621572

95% confidence interval: (-24.93, -15.12)

Inferences:

1. We are 95% confident that the true mean weight loss falls within the above interval.
2. The whole interval is negative, so the program leads to weight loss on average.
3. In this sample, every subject lost weight.

"I am 95% confident that by following the diet and exercise program you will lose between 15 and 25 lbs."

Extra

Let's try calculating the 90% and 99% confidence intervals to see the differences.

# Creating a helper function to avoid redundant typing
find_lower_CI <- function(c) {
  round(mean-(qt(1-(1-(c/100))/2, nrow(df)-1))*(sd/sqrt(10)), 2)
}
find_upper_CI <- function(c) {
  round(mean+(qt(1-(1-(c/100))/2, nrow(df)-1))*(sd/sqrt(10)), 2)
}

# Putting the CIs in a dataframe
CI <- c("Lower Int", "Upper Int")
T90 <- c(find_lower_CI(90), find_upper_CI(90))
T95 <- c(find_lower_CI(95), find_upper_CI(95))
T99 <- c(find_lower_CI(99), find_upper_CI(99))

kable(data.frame(CI, T90, T95, T99))

| CI        |    T90 |    T95 |    T99 |
|-----------|-------:|-------:|-------:|
| Lower Int | -24.00 | -24.93 | -27.08 |
| Upper Int | -16.05 | -15.12 | -12.97 |

We can clearly see that the higher the confidence level, the wider the interval. This is the trade-off: with higher confidence the inference is more likely to be correct, but the range can become so wide that it is useless for any meaningful statement.
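As a cross-check on the dependent-samples interval, R's built-in t.test() with paired = TRUE computes the interval directly from the two measurement vectors, using n - 1 = 9 degrees of freedom internally, so it is a useful check on the manual qt() call:

```r
weight_before <- c(228.58, 244.01, 262.46, 224.32, 202.14,
                   246.98, 195.86, 231.88, 243.32, 266.74)
weight_after  <- c(204.74, 223.95, 232.94, 212.04, 191.74,
                   233.47, 177.60, 213.85, 218.85, 236.86)

# paired = TRUE treats the vectors as before/after measurements on the
# same subjects, i.e. a one-sample t-interval on the differences
result <- t.test(weight_after, weight_before,
                 paired = TRUE, conf.level = 0.95)

round(result$conf.int, 2)  # -24.93 -15.12 (plus a conf.level attribute)
```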
### Independent Samples

#### Variance of the difference

##### z-statistic

$$\sigma^2_{\overline{x}-\overline{y}} = \frac{\sigma^2_x}{n_x} + \frac{\sigma^2_y}{n_y}$$

##### t-statistic

$$s^2_p = \frac{(n_x-1)s^2_x+(n_y-1)s^2_y}{n_x+n_y-2}$$

#### CI Formulas

##### z-statistic

$$CI = (\overline{x}-\overline{y}) \pm z_{\alpha/2} \sqrt{\frac{\sigma^2_x}{n_x} + \frac{\sigma^2_y}{n_y}}$$

##### t-statistic

$$CI = (\overline{x}-\overline{y}) \pm t_{n_x+n_y-2,\alpha/2} \sqrt{\frac{s^2_p}{n_x} + \frac{s^2_p}{n_y}}$$

#### Exercise 1

Confidence interval for the difference of two means; independent samples, variances unknown but assumed to be equal.

# Dataset
ny_apple <- c(3.80, 3.76, 3.87, 3.99, 4.02, 4.25, 4.13, 3.98, 3.99, 3.62)
la_apple <- c(3.02, 3.22, 3.24, 3.02, 3.06, 3.15, 3.81, 3.44)

Since the population variance is unknown and the sample sizes are small, we will be using the t-statistic again for this exercise.

##### Calculation

# Find mean, standard deviation, variance
ny_mean <- mean(ny_apple)
ny_sd <- sd(ny_apple)
ny_var <- var(ny_apple)
la_mean <- mean(la_apple)
la_sd <- sd(la_apple)
la_var <- var(la_apple)
ny_len <- length(ny_apple)
la_len <- length(la_apple)

# Pooled variance
s2p <- ((ny_var*(ny_len-1))+(la_var*(la_len-1)))/(ny_len+la_len-2)

# Find t-values
t_90 <- round(qt(1-(1-(90/100))/2, ny_len+la_len-2), 2)
t_95 <- round(qt(1-(1-(95/100))/2, ny_len+la_len-2), 2)

# Calculate the 90% CI
lower_int <- round((ny_mean-la_mean)-t_90*sqrt((s2p/ny_len)+(s2p/la_len)), 2)
upper_int <- round((ny_mean-la_mean)+t_90*sqrt((s2p/ny_len)+(s2p/la_len)), 2)

The 90% confidence interval is CI = (0.51, 0.88).

#### Exercise 2

Confidence interval for the difference of two means.
Independent samples, variance known.

# Dataset
index <- c("Sample Size", "Sample Mean", "Population Std")
engineering <- c(100, 58, 10)
management <- c(70, 65, 5)
difference <- c(NA, -7, 1.16)
data <- data.frame(index, engineering, management, difference)

data

##            index engineering management difference
## 1    Sample Size         100         70         NA
## 2    Sample Mean          58         65      -7.00
## 3 Population Std          10          5       1.16

We've pretty much been handed all the information needed on a silver platter here. (Note that the 1.16 in the difference column is the standard error of the difference, sqrt(10^2/100 + 5^2/70).) Since we know the population standard deviations, we will be using the z-statistic.

##### Calculation

# Find z-value for 99% confidence
z_value <- round(qnorm(1-(1-(99/100))/2), 2)

# Pull the values out of the table
eng_n <- 100; eng_sd <- 10
mgmt_n <- 70; mgmt_sd <- 5
mean_diff <- 58 - 65

# Standard error of the difference of means: sigma^2/n for each group
se_diff <- sqrt(eng_sd^2/eng_n + mgmt_sd^2/mgmt_n)

# Calculate CI
lower_int <- round(mean_diff - z_value*se_diff, 2)
upper_int <- round(mean_diff + z_value*se_diff, 2)


CI = (-10.01, -3.99)
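Base R has no built-in z-interval function, so the pattern used in this exercise can be wrapped in a small helper. The function name z_ci_diff and its arguments are hypothetical, not from any library:

```r
# Hypothetical helper (not a base R function): z confidence interval
# for the difference of two means with known population std deviations
z_ci_diff <- function(mean_diff, sd_x, n_x, sd_y, n_y, conf = 99) {
  z  <- qnorm(1 - (1 - conf/100)/2)      # exact z-value, not rounded
  se <- sqrt(sd_x^2/n_x + sd_y^2/n_y)    # standard error of the difference
  round(c(lower = mean_diff - z*se, upper = mean_diff + z*se), 2)
}

z_ci_diff(-7, sd_x = 10, n_x = 100, sd_y = 5, n_y = 70, conf = 99)
```

Because the helper uses the exact z-value (2.5758...) rather than the rounded 2.58, it returns (-10.00, -4.00), which differs from the hand calculation only in the final rounding.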