AP Statistics Exploring Data
Collecting Data
Descriptive methods
methods for organizing and summarizing collected data
the appropriate method depends on the type of data collected
Types of Variables
Categorical (qualitative)
places the individuals being studied into one of several groups or categories.
ex. color, sex
Numerical (quantitative)
can be analyzed using arithmetic operations
ex. weight, IQ
Types of Descriptive Methods
Tabular Methods
$n$
number of observations
frequency $f$
number of times a value occurs
relative frequency $rf$
proportion of times a value occurs
\[rf = \frac{f}{n}\]
cumulative frequency $cf$
number of observations less than or equal to a specific value
frequency distribution table
table giving all possible values of a variable and their frequencies
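The tabular quantities above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `frequency_table` and the quiz scores are made up for the example.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, f, rf, cf), sorted by value."""
    n = len(values)              # number of observations
    counts = Counter(values)     # frequency f of each distinct value
    rows = []
    cf = 0
    for value in sorted(counts):
        f = counts[value]
        cf += f                  # observations less than or equal to this value
        rows.append((value, f, f / n, cf))
    return rows

# Hypothetical quiz scores, invented for illustration.
scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 7]
for value, f, rf, cf in frequency_table(scores):
    print(f"value={value}  f={f}  rf={rf:.2f}  cf={cf}")
```

The last cumulative frequency always equals $n$, which is a quick check that no observation was missed.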
Graphical Methods:
Qualitative Data
Categorical Data
- Bar Charts
- Pie Charts
rarely the best way to display data; they have been shown to be misleading and hard to read accurately
Comparing Categorical Data with Two or More Groups
- Segmented Bar Charts
one bar per group, arranged along either the horizontal or the vertical axis; each bar is divided into segments showing the distribution of categories within that group
can show frequency, where bar sizes vary with group size, or relative frequency, where all bars are the same size regardless of group size
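The segment sizes for a relative frequency segmented bar chart can be computed as below. This is only a sketch: the helper name `segment_heights` and the survey answers are hypothetical, and no plotting is done.

```python
from collections import Counter

def segment_heights(groups):
    """Map each group to the relative frequency of each category within it."""
    result = {}
    for name, observations in groups.items():
        counts = Counter(observations)
        total = len(observations)
        result[name] = {category: count / total
                        for category, count in counts.items()}
    return result

# Hypothetical survey answers for two groups of different sizes.
groups = {
    "group A": ["yes"] * 30 + ["no"] * 10,
    "group B": ["yes"] * 5 + ["no"] * 15,
}
heights = segment_heights(groups)
print(heights)
```

Because each group's segments sum to 1, the two bars have the same total height even though group A has twice as many observations as group B.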
- Mosaic Plots
alternative way to compare the distributions of categorical data across groups
almost identical to a relative frequency segmented bar chart
difference: how the x-axis is used (in a mosaic plot, the width of each bar is proportional to the size of its group)
Quantitative Data
Examining Graphs
- center
- mean
- median
- mode
- spread
- range
- standard deviation, variance of a distribution
- shape
- Symmetric distribution
- Skewed distribution
- Patterns & Deviations
- clusters
- gaps
- outliers
a value more than 1.5 times the interquartile range below the first quartile (25th percentile) or above the third quartile (75th percentile)
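The 1.5 × IQR rule for outliers can be sketched as follows. Quartile conventions differ between textbooks and software; this example assumes the quartiles are the medians of the lower and upper halves of the ordered data, with the overall median left out when $n$ is odd. The function names and the data are invented for illustration.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention).

    Needs at least two observations.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

def outliers(values):
    """Values beyond 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [2, 4, 5, 5, 6, 7, 8, 9, 30]   # hypothetical measurements
print(quartiles(data))                 # (4.5, 8.5)
print(outliers(data))                  # [30]
```

Here the IQR is 4, so the fences sit at -1.5 and 14.5, and only the value 30 falls outside them.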
Continuous Variables
- Dot plots
effective for smaller data sets
for large data sets, a box plot is a better choice
- Stemplots
advantage | disadvantage |
---|---|
shows every value | inconvenient for large data sets |
Stem
left-most part of each observation
Leaf
remaining part
shows the shape, spread, and center of the data
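A text stemplot can be built with a short function. This sketch assumes non-negative whole numbers, with the ones digit as the leaf and everything to its left as the stem; the function name `stemplot` and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: stem = tens and above, leaf = ones digit."""
    leaves = defaultdict(list)
    for x in sorted(values):
        leaves[x // 10].append(x % 10)
    lines = []
    # Empty stems are kept so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    return lines

data = [12, 15, 21, 23, 23, 41]   # hypothetical observations
print("\n".join(stemplot(data)))
```

Every original value can be read back from the plot, which is the advantage noted above; the cost is one printed digit per observation.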
- Histograms
useful for displaying patterns in large data sets
individual values are not shown, so the exact spread of the data cannot be seen
can be drawn using frequency $f$, relative frequency $rf$, or percent
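The three possible bar heights for a histogram (frequency, relative frequency, percent) come from the same class counts. A minimal sketch, assuming equal-width classes that include their lower boundary and exclude their upper boundary; the function name and the heights are made up.

```python
def histogram_counts(values, start, width, bins):
    """Count observations in classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < bins:          # values outside the classes are ignored
            counts[index] += 1
    return counts

heights = [150, 152, 158, 160, 161, 165, 167, 171, 172, 179]  # hypothetical, in cm
f = histogram_counts(heights, start=150, width=10, bins=3)
rf = [count / len(heights) for count in f]
percent = [100 * count / len(heights) for count in f]
print(f, rf, percent)
```

Whichever scale is used, the bars keep the same relative shape; only the labels on the vertical axis change.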
- Cumulative Frequency Charts
S-shaped
Univariate Data
Summarizing Distributions
Population
entire group of individuals or things that we are interested in
Sample
part of the population that is actually studied
Describing Distribution
- Center
- Spread
- Shape
Numerical Methods
Measures of Central Tendency
Mean
arithmetic mean (average)
affected by extreme or outlier measurements
- Population Mean $\mu$
- Sample Mean $\bar X$
Median $M$
the middle value when the measurements are arranged in order
not affected by outliers
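The different behavior of the mean and the median with an outlier can be checked directly. This is a small sketch with invented salary figures; `median_by_position` is a hypothetical helper that applies the position rule $l = \frac{n+1}{2}$ from these notes, averaging the two middle values when $l$ is not a whole number.

```python
from statistics import mean, median

def median_by_position(values):
    """Median found from the position l = (n + 1) / 2 in the ordered data."""
    data = sorted(values)
    l = (len(data) + 1) / 2          # 1-based position of the median
    if l.is_integer():
        return data[int(l) - 1]
    # l falls halfway between two positions: average the two middle values
    return (data[int(l) - 1] + data[int(l)]) / 2

salaries = [30, 32, 35, 36, 40]      # hypothetical values, in thousands
with_outlier = salaries + [400]      # add one extreme measurement

print(mean(salaries), median(salaries))            # 34.6 35
print(mean(with_outlier), median(with_outlier))    # 95.5 35.5
print(median_by_position(with_outlier))            # 35.5
```

One extreme value moves the mean from 34.6 to 95.5, while the median only shifts from 35 to 35.5.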
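\[l = \frac{n+1}{2}\]
\[M = \text{the value of the } l\text{th measurement in the ordered data}\]
when $n$ is even, $l$ is not a whole number and $M$ is the average of the two middle measurements
Measures of Variation (Spread)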
Range $R$
difference between the largest and the smallest measurement in a data set
not reliable because it depends only on the two extreme measurements and doesn’t account for the others
\[R = \text{largest measurement} - \text{smallest measurement}\]
Interquartile range $IQR$
range of middle 50% of data
not affected by outliers
\[IQR = Q_3 - Q_1\]
Standard Deviation
more useful measure than range
takes every measurement into account
affected by outliers
if there are outliers, the IQR may be a more useful measure of spread
- Population Standard Deviation $\sigma$
- Sample Standard Deviation $s$
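The population and sample standard deviations differ only in the divisor: the standard definitions divide the sum of squared deviations by $n$ for $\sigma$ and by $n - 1$ for $s$. A short sketch using Python's `statistics` module, with made-up data:

```python
from statistics import pstdev, stdev

data = [4, 8, 6, 5, 3, 7]            # hypothetical measurements

data_range = max(data) - min(data)   # uses only the two extreme values
sigma = pstdev(data)                 # population standard deviation (divide by n)
s = stdev(data)                      # sample standard deviation (divide by n - 1)

print(data_range)                    # 5
print(round(sigma, 4), round(s, 4))
```

For the same data, $s$ is always slightly larger than $\sigma$, and the gap shrinks as $n$ grows.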
Variance
never negative (zero only when all measurements are equal)
\[\text{variance} = (\text{standard deviation})^2\]
Measures of Position
Percentiles
divide a set of values into 100 equal parts
Quartiles
\[Q_1 = P_{25}\]
\[Q_2 = P_{50} = \text{median}\]
\[Q_3 = P_{75}\]
Standardized Scores or z-scores
\[z = \frac{\text{measurement} - \text{mean}}{\text{standard deviation}}\]
negative z-score | positive z-score |
---|---|
measurement smaller than mean | measurement larger than mean |
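The z-score formula can be applied to a whole data set at once. This sketch assumes the data are a sample, so it uses the sample standard deviation (`stdev`); for a full population, `pstdev` would take its place. The function name and the data are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize each measurement: (measurement - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)    # assumption: sample standard deviation
    return [(x - m) / s for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical measurements, mean 5
for x, z in zip(data, z_scores(data)):
    print(x, round(z, 2))
```

Measurements below the mean of 5 get negative z-scores, those above it get positive ones, and a measurement equal to the mean gets exactly 0.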
Graphing Univariate Data
Graphical Summaries
the same data can look different depending on the y-axis scale (for example, the same revenue data plotted with different y-axis ranges)
- Box plots
also called box-and-whisker plots
based on measures of position (minimum, $Q_1$, median, $Q_3$, maximum)
useful for identifying outliers and the general shape of the distribution
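A box plot is drawn from the five-number summary, which can be computed as below. As a stated assumption, the quartiles here are the medians of the lower and upper halves of the ordered data (one common classroom convention; software often uses others). The function name and data are invented.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least two values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])     # median of the lower half
    q3 = median(data[-half:])    # median of the upper half
    return min(data), q1, median(data), q3, max(data)

data = [1, 3, 4, 6, 7, 8, 10, 12]    # hypothetical measurements
print(five_number_summary(data))      # (1, 3.5, 6.5, 9.0, 12)
```

The box spans $Q_1$ to $Q_3$ with a line at the median, and the whiskers extend toward the minimum and maximum.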
Effect of changing units on summary measures
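Summary Measure | Adding a constant: $Y = X + a$ | Multiplying by a constant: $Y = bX$, $b > 0$ |
---|---|---|
Mean | $\bar Y = \bar X + a$ | $\bar Y = b\bar X$ |
Median | $M_Y = M_X + a$ | $M_Y = bM_X$ |
Range | $R_Y = R_X$ unaffected | $R_Y = bR_X$ |
Standard Deviation | $s_Y = s_X$ unaffected | $s_Y = bs_X$ |
Quartiles | $Q_Y = Q_X + a$ | $Q_Y = bQ_X$ |
Interquartile Range | $IQR(Y) = IQR(X)$ unaffected | $IQR(Y) = b \cdot IQR(X)$ |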
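The effect of a change of units can be verified numerically. In this sketch the data and the constants `a = 5` and `b = 2` are hypothetical; adding `a` shifts the measures of center but leaves the measures of spread alone, while multiplying by a positive `b` scales both.

```python
from statistics import mean, median, stdev

def summary(values):
    """Mean, median, range, and sample standard deviation of a data set."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": stdev(values),
    }

x = [10, 15, 20, 25, 40]                  # hypothetical measurements
a, b = 5, 2                               # hypothetical constants

original = summary(x)
shifted = summary([v + a for v in x])     # Y = X + a
scaled = summary([b * v for v in x])      # Y = bX

for name in ("mean", "median", "range", "sd"):
    print(name, original[name], shifted[name], scaled[name])
```

The printed rows show the range and standard deviation unchanged in the shifted column and doubled in the scaled column.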
Comparing Distributions of Two or More Groups
- Parallel dot plots
(figure: errors made by both scanners)
used to compare two or more data sets
- Parallel box plots
(figure: errors made by both scanners)
used to compare two or more data sets
- Back-to-back stem plots
(figure: errors made by both scanners)
can only be used to compare two data sets
- Two histograms
(figure: errors made by both scanners)
not a great option
use the same scale for both
- Multiple frequency polygons (line graphs)
(figure: frequency polygon)
Exploring Bivariate Data
Bivariate Data
two different variables collected from each item in a study
Linear regression
used when two different quantitative variables have a linear relation
Scatterplot | graphical summary measure |
---|---|
correlation coefficient | numerical summary measure |
Graphing Continuous Bivariate Data
Scatterplot
used to describe the nature, degree, and direction of the relation between two variables x and y
Shape
linear or non-linear
Direction
Positive (direct) relation | Negative (inverse) relation |
---|---|
as x increases, y increases | as x increases, y decreases |
Strength of relationship
Strong | Weak | None |
---|---|---|
points lie close to the line | points loosely scattered around the line | points scattered without a pattern |
Numerical methods for continuous bivariate data
Correlation Coefficient
Pearson’s correlation coefficient
numeric measure of the strength and direction of the linear relation between two quantitative variables
population correlation coefficient | sample correlation coefficient |
---|---|
$\rho$ | $r$ |
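The sample correlation coefficient can be computed by hand as a check on the definition. This sketch uses the standard z-score form, $r = \frac{1}{n-1}\sum z_x z_y$, which is an assumption about the formula intended here; the function names and the data are hypothetical. It also checks the slope relation $b = r\left(\frac{s_y}{s_x}\right)$ from these notes against the least-squares slope.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

def least_squares_slope(xs, ys):
    """Slope of the least-squares line, computed directly from deviations."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

hours = [1, 2, 3, 4, 5]               # hypothetical x values
scores = [52, 60, 61, 70, 77]         # hypothetical y values

r = correlation(hours, scores)
b = r * stdev(scores) / stdev(hours)  # slope from r and the two standard deviations
print(round(r, 4), round(b, 4), least_squares_slope(hours, scores))
```

A value of $r$ close to 1 here reflects a strong positive linear relation, and the two slope calculations agree up to rounding.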
Direction
Positive | Negative |
---|---|
$r > 0$ | $r < 0$ |
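Strength
the closer $|r|$ is to 1, the stronger the linear relation; $r$ near 0 indicates a weak or no linear relation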
relation between the slope of the line ($b$) and the correlation coefficient ($r$)
\[b = r\left(\frac{s_y}{s_x}\right)\]
where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$
Originally from The Princeton Review