These are the lecture notes from the Intro to Descriptive Statistics course (see the review here). Most of the material here was taken from the course and its original lecture notes, which can be found here:
- Intro to statistical research methods (Lesson1.PDF)
- Visualizing Data (Lesson2.PDF)
- Central Tendency (Lesson3.PDF)
- Variability (Lesson4.pdf)
- Standardizing (Lesson5.pdf)
- Normal Distribution (Lesson6.pdf)
- Sampling Distributions (Lesson7.pdf)
Please note that the original lecture notes contain quite a lot of errors, which I have corrected in this post.
Lesson 1: Intro to statistical research methods
A construct is anything that is difficult to measure because it can be defined and measured in many different ways.
- Operational Definition
The operational definition of a construct is the unit of measurement we are using for the construct. Once we operationally define something it is no longer a construct.
Volume is a construct. We know volume is the space something takes up, but we haven’t defined how we are measuring that space (e.g. in liters or gallons).
Had we said volume in liters, then this would not be a construct because now it is operationally defined.
Minutes is already operationally defined; there is no ambiguity in what we are measuring.
1.2 Population vs Sample
The population is all the individuals in a group.
The sample is some of the individuals in a group.
- Parameter vs Statistic
A parameter defines a characteristic of the population whereas a statistic defines a characteristic of the sample.
The mean of a population is defined with the symbol µ whereas the mean of a sample is defined as x̄.
The number of members in a sample is denoted n, whereas the number of members in the population is denoted N.
In an experiment, the manner in which researchers handle subjects is called a treatment. Researchers are specifically interested in how different treatments might yield differing results.
- Observational Study
An observational study is when an experimenter watches a group of subjects and does not introduce a treatment.
- A survey is an example of an observational study
- Independent Variable
The independent variable of a study is the variable that experimenters choose to manipulate; it is usually plotted along the x-axis of a graph.
- Dependent Variable
The dependent variable of a study is the variable that experimenters choose to measure during an experiment; it is usually plotted along the y-axis of a graph.
- Treatment Group
The group of a study that receives varying levels of the independent variable. These groups are used to measure the effect of a treatment.
- Control Group
The group of a study that receives no treatment. This group is used as a baseline when comparing treatment groups.
A placebo is something given to subjects in the control group so they think they are getting the treatment, when in reality they are getting something that has no effect on them (e.g. a sugar pill).
Blinding is a technique used to reduce bias. Double blinding ensures that both those administering treatments and those receiving treatments do not know who is receiving which treatment.
Lesson 2: Visualizing Data
The frequency of a data set is the number of times a certain outcome occurs.
A proportion is the fraction of counts over the total sample. A proportion can be turned into a percentage by multiplying the proportion by 100.
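The frequency, proportion and percentage definitions can be checked on a tiny data set. The following is a minimal sketch; the `answers` list is made-up data used only for illustration.

```python
from collections import Counter

# Made-up survey answers, used only for illustration.
answers = ["yes", "no", "yes", "yes", "maybe", "no", "yes", "no"]

frequency = Counter(answers)   # how many times each outcome occurs
total = len(answers)           # total sample size

# proportion = count / total; percentage = proportion * 100
proportion = {outcome: count / total for outcome, count in frequency.items()}
percentage = {outcome: p * 100 for outcome, p in proportion.items()}

for outcome, count in frequency.items():
    print(f"{outcome}: frequency={count}, "
          f"proportion={proportion[outcome]}, "
          f"percentage={percentage[outcome]}%")
```

The proportions always add up to 1 (and the percentages to 100), which is a quick sanity check on a frequency table.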
A histogram is a graphical representation of the distribution of data; discrete intervals (bins) are chosen to set the widths of the boxes.
- Adjusting the bin size of a histogram will compact (or spread out) the distribution.
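To see how the bin size changes the picture, the sketch below counts the same data with two different bin widths. The `histogram` helper and the `ages` data are hypothetical and exist only for this illustration; each bin is assumed to include its lower edge and exclude its upper edge.

```python
def histogram(data, bin_width, start=0):
    """Count the values in each bin [lower, lower + bin_width)."""
    counts = {}
    for x in data:
        k = (x - start) // bin_width      # index of the bin that x falls in
        lower = start + k * bin_width     # lower edge of that bin
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

# Made-up data, used only for illustration.
ages = [3, 7, 12, 15, 18, 21, 22, 25, 29, 34, 38, 41, 47, 55]

for width in (10, 20):
    print(f"bin width {width}")
    for lower, count in histogram(ages, width).items():
        print(f"  [{lower}, {lower + width}): {'#' * count}")
```

With a width of 10 the data spread over six bins; with a width of 20 the same data are compacted into three.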
2.2.1 Skewed Distribution
- Positive Skew
A positive skew is when the tail of the distribution stretches to the right, with outliers at the right-most end of the distribution.
- Negative Skew
A negative skew is when the tail of the distribution stretches to the left, with outliers at the left-most end of the distribution.
Lesson 3: Central Tendency
3.1 Mean, Median and Mode
The mean of a dataset is the numerical average and can be computed by dividing the sum of all the data points by the number of data points:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
- The mean is heavily affected by outliers, therefore we say the mean is not a robust measurement.
The median of a dataset is the data point that is directly in the middle of the sorted data set. If two numbers are in the middle then the median is the average of the two.
- If n is odd, the median is the value at position (n + 1)/2 of the sorted data.
- If n is even, the median is the average of the two middle data points, at positions n/2 and n/2 + 1:
$$\text{median} = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}$$
- The median is robust to outliers, so an outlier has little or no effect on the value of the median.
The mode of a dataset is the datapoint that occurs the most frequently in the data set.
- The mode is robust to outliers as well.
- In the normal distribution the mean = median = mode.
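The robustness claims can be tried out with Python's standard `statistics` module. This is a small sketch with made-up `salaries` data; the single value 500 plays the role of the outlier.

```python
import statistics

# Made-up data, used only for illustration.
salaries = [30, 32, 35, 35, 40, 42, 45]
with_outlier = salaries + [500]          # add one extreme value

for label, data in (("original", salaries), ("with outlier", with_outlier)):
    print(label,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))
```

The mean jumps from 37 to 94.875 when the outlier is added, while the median only moves from 35 to 37.5 and the mode stays at 35.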
Lesson 4: Variability
What’s the difference between these two distributions:
The range is the difference between the maximum and the minimum: range = max – min
- A more spread-out distribution has a bigger range.
4.1 Box Plots and the IQR
- Interquartile range.
The Interquartile range (IQR) is the distance between the 1st quartile and 3rd quartile and gives us the range of the middle 50% of our data. The IQR is easily found by computing: Q3 – Q1
- A box plot is a great way to show the 5 number summary of a data set (the minimum, first quartile, median, third quartile, and the maximum) in a visually appealing way.
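The five-number summary and the IQR can be computed as in the sketch below. Several conventions for quartiles exist; this sketch assumes the one where Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd). The `five_number_summary` helper and the data are hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]                                   # below the median
    upper = values[half + 1:] if n % 2 else values[half:]   # above the median
    return (values[0],
            statistics.median(lower),
            statistics.median(values),
            statistics.median(upper),
            values[-1])

# Made-up data, used only for illustration.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1                 # spread of the middle 50% of the data
data_range = maximum - minimum

print("five-number summary:", (minimum, q1, median, q3, maximum))
print("IQR =", iqr, "range =", data_range)
```

Other tools may use a different quartile convention and report slightly different values for Q1 and Q3 on the same data.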
4.1.1 Finding outliers
- How to identify outliers
A commonly used rule for identifying outliers:
- A value is an upper outlier if it is > Q3 + 1.5 * IQR
- A value is a lower outlier if it is < Q1 – 1.5 * IQR
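The 1.5 * IQR rule translates directly into code. The sketch below is one possible implementation; the `quartiles` and `find_outliers` helpers are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[half + 1:] if len(values) % 2 else values[half:]
    return statistics.median(lower), statistics.median(upper)

def find_outliers(data):
    """Return (lower_outliers, upper_outliers) using the 1.5 * IQR rule."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    lower = [x for x in data if x < low_fence]
    upper = [x for x in data if x > high_fence]
    return lower, upper

# Made-up data, used only for illustration.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15, 40]
print(find_outliers(data))
```

Here Q1 = 4 and Q3 = 12, so the IQR is 8 and the fences are at -8 and 24; only the value 40 is flagged, as an upper outlier.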
Problem with the IQR: two distributions with different shapes can share the same quartiles, so it shows the same picture even when the distributions are different.
4.2 Variance and Standard Deviation
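The variance is the average of the squared differences from the mean. The formula for computing variance is:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
The intuition for variance is the average area of the boxes that represent the squared differences: each box has a side length equal to one data point's difference from the mean.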
- Standard Deviation
The standard deviation is the square root of the variance and measures the typical distance of the data from the mean, in the same units as the data.
- In a normal distribution approximately 68% of the data lies within 1 standard deviation from the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
- The intuition for SD is the side length of a box whose area equals the variance.
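As a worked example, the population variance and standard deviation can be computed step by step. This is a minimal sketch with made-up data, treated here as a whole population (so we divide by N).

```python
import math

# Made-up data, treated as a complete population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

N = len(data)
mu = sum(data) / N                               # population mean
squared_diffs = [(x - mu) ** 2 for x in data]    # area of each "box"
variance = sum(squared_diffs) / N                # average box area
sd = math.sqrt(variance)                         # side length of the average box

print("mean =", mu)
print("variance =", variance)
print("standard deviation =", sd)
```

For this data the mean is 5, the squared differences add up to 32, the variance is 32 / 8 = 4 and the standard deviation is 2.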
Why use the standard deviation (instead of, e.g., the average absolute difference from the mean)? What’s the point of using standard deviation:
4.2.1 Bessel’s Correction
- Bessel’s Correction
Corrects the bias in the estimation of the population variance, and some (but not all) of the bias in the estimation of the population standard deviation. To apply Bessel’s correction we multiply the variance by $\frac{n}{n-1}$, which is the same as dividing the sum of squared differences by n – 1 instead of n.
- Use Bessel’s correction primarily to estimate the population standard deviation.
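Sample standard deviation (i.e. SD for sample) is denoted with lowercase s:
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$
Don’t confuse this with the SD of a small population, which still divides by N!
Sample SD is used to estimate population SD:
$$\sigma \approx s$$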
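The effect of Bessel's correction can be seen by computing the variance both ways. The sketch below uses made-up data treated as a sample, and compares the hand computation with `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n – 1) from the Python standard library.

```python
import math
import statistics

# Made-up data, treated as a sample drawn from a larger population.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
x_bar = sum(sample) / n                                    # sample mean
uncorrected = sum((x - x_bar) ** 2 for x in sample) / n    # divide by n
corrected = uncorrected * n / (n - 1)                      # Bessel's correction
s = math.sqrt(corrected)                                   # sample SD

print("uncorrected variance =", uncorrected)
print("corrected variance   =", corrected)
print("sample SD s          =", s)

# The standard library exposes both versions directly.
print(statistics.pvariance(sample), statistics.variance(sample))
```

The corrected variance (32 / 7) is slightly larger than the uncorrected one (32 / 8); the gap shrinks as n grows.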
Lesson 5: Standardizing
5.1 Z score
- Standard Score
Given an observed value x, the Z score is the number of standard deviations x is away from the mean:
$$z = \frac{x - \mu}{\sigma}$$
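A short sketch of the Z score calculation follows. The `z_score` helper and the numbers (a mean of 100 and a standard deviation of 15) are hypothetical and chosen only to keep the arithmetic easy.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean mu."""
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))   # two standard deviations above the mean
print(z_score(85, 100, 15))    # one standard deviation below the mean
print(z_score(100, 100, 15))   # exactly at the mean
```

A positive Z score means the value is above the mean, a negative one means it is below.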
5.1.1 Standard Normal Curve
- The standard normal curve is the curve we will be using for most problems in this section. This curve is the resulting distribution we get when we standardize our scores. We will use this distribution along with the Z table to compute percentages above, below, or in between observations in later sections.
Lesson 6: Normal Distribution
6.1 Probability Density Function (PDF)
- Probability Density Function
A probability density function is a curve with a total area of 1 beneath it; the area under the curve over an interval gives the proportion of values that fall in that interval. The normal curve is one such function.
- Z-table: gives the proportion of the population that is less than a given z-score.
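A Z-table lookup can also be reproduced in code. This is a minimal sketch, assuming Python 3.8 or newer, where `statistics.NormalDist` is available; its `cdf` method returns the area under the curve to the left of a value. The helper names `proportion_below` and `proportion_between` are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)   # the standard normal curve

def proportion_below(z):
    """Area under the standard normal curve to the left of z."""
    return standard_normal.cdf(z)

def proportion_between(z_low, z_high):
    """Area under the standard normal curve between two z-scores."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(proportion_below(0))          # half of the distribution lies below the mean
print(proportion_below(1.96))       # roughly 0.975
print(1 - proportion_below(1.96))   # proportion above z = 1.96
print(proportion_between(-1, 1))    # roughly 0.68
```

The proportion above a z-score is 1 minus the proportion below it, because the total area under the curve is 1.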
Lesson 7: Sampling Distributions
- Imagine a tetrahedral die
- Population: 1, 2, 3, 4
- µ = 2.5
- Roll twice
- Number of possible samples of size 2: 16
- Sample mean = mean of each sample
- e.g. on the first roll we have 1, and on the second roll we have 2, then the mean is 1.5
- Sample means:
- calculate the mean of each sample (remember there are 16 samples)
- Mean of sample means: 2.5
- calculate the average of sample means above
- Mean of sample means is the same as population mean
- Sampling distribution = distribution of sample means
- Standard deviation of the population: σ = 1.12
- Standard deviation of all of sample means: SE = 0.790569415
- The relationship is $SE = \frac{\sigma}{\sqrt{n}}$, where n is the sample size (here n = 2)
- ⇑ That is the Central Limit Theorem
- SE ← Standard Error
- Standard Error (SE): standard deviation of sample means
- Distribution of sample means is approximately normal for a large enough sample size, regardless of the distribution of the original population
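The tetrahedral die example is small enough to enumerate completely. The sketch below lists all 16 samples of size 2 and checks the numbers quoted in the list; the variable names are illustrative only.

```python
import math
import statistics
from itertools import product

population = [1, 2, 3, 4]   # faces of the tetrahedral die
n = 2                       # roll twice

samples = list(product(population, repeat=n))   # all 16 ordered samples
sample_means = [sum(s) / n for s in samples]    # mean of each sample

mu = statistics.mean(population)                # population mean
sigma = statistics.pstdev(population)           # population SD
mean_of_means = statistics.mean(sample_means)   # mean of the sample means
se = statistics.pstdev(sample_means)            # SD of the sample means

print("number of samples =", len(samples))
print("mu =", mu, "mean of sample means =", mean_of_means)
print("sigma =", sigma)
print("SE =", se, "sigma / sqrt(n) =", sigma / math.sqrt(n))
```

Both `pstdev` calls divide by the number of values, because the four faces and the 16 sample means are each treated as complete populations here.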
7.1 Central Limit Theorem
The Central Limit Theorem is used to help us understand the following facts regardless of whether the population distribution is normal or not:
- the mean of the sample means is the same as the population mean
- the standard deviation of the sample means is equal to the standard error, $SE = \frac{\sigma}{\sqrt{n}}$
- the distribution of sample means will become increasingly more normal as the sample size, n, increases.
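These facts can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not part of the course material: the population is taken to be exponential with rate 1 (strongly skewed, with mean 1 and standard deviation 1), and the sample size of 30, the 5000 repetitions and the seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

POP_MEAN = 1.0   # mean of an exponential population with rate 1
POP_SD = 1.0     # standard deviation of the same population

def sample_mean(n):
    """Draw one sample of size n from the skewed population; return its mean."""
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])

n = 30
means = [sample_mean(n) for _ in range(5000)]

mean_of_means = statistics.mean(means)
observed_se = statistics.pstdev(means)
expected_se = POP_SD / math.sqrt(n)

print("mean of sample means:", mean_of_means, "vs population mean:", POP_MEAN)
print("SD of sample means:", observed_se, "vs sigma / sqrt(n):", expected_se)

# Share of sample means within one SE of the population mean;
# for a normal distribution this would be roughly 68%.
within_one_se = sum(abs(m - POP_MEAN) <= expected_se for m in means) / len(means)
print("share within 1 SE:", within_one_se)
```

Rerunning with a smaller n (for example 2) makes the distribution of sample means visibly more skewed, in line with the last point in the list.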
- Sampling Distribution
The sampling distribution of a statistic is the distribution of that statistic. It may be considered as the distribution of the statistic for all possible samples from the same population of a given size.