AP Statistics Exploring Data

AP Statistics · 21 Mar 2022

Collecting Data

Descriptive methods

methods for organizing and summarizing collected data.

the appropriate method depends on the type of data collected.

Types of Variables

Categorical (qualitative)

places the individuals being studied into one of several groups or categories.

ex. color, sex

Numerical (quantitative)

can be analyzed using arithmetic operations

ex. weight, IQ

Types of Descriptive Methods

Tabular Methods

$n$

number of observations

frequency $f$

number of times a value occurs

relative frequency $rf$

\[rf = \frac{f}{n}\]

proportion (or percentage) of times a value occurs

cumulative frequency $cf$

number of observations ≤ a specific value

frequency distribution table

table giving all possible values of a variable and their frequencies
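As a rough illustration of the quantities above, the following Python sketch builds a frequency distribution table with $f$, $rf$, and $cf$ for each value. The function name `frequency_table` and the sibling counts are made up for this example.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(observations):
    """Return rows of (value, f, rf, cf), sorted by value."""
    n = len(observations)                    # number of observations
    counts = Counter(observations)           # frequency f of each value
    values = sorted(counts)
    freqs = [counts[v] for v in values]
    cum_freqs = list(accumulate(freqs))      # cumulative frequency cf
    return [(v, f, f / n, cf) for v, f, cf in zip(values, freqs, cum_freqs)]

# Hypothetical data: number of siblings reported by 10 students.
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
table = frequency_table(siblings)
for value, f, rf, cf in table:
    print(f"{value}\t{f}\t{rf:.2f}\t{cf}")
```

The last cumulative frequency always equals $n$, and the relative frequencies sum to 1.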

Graphical Methods:

Qualitative Data

Categorical Data

Bar Charts
Pie Charts

rarely the best method of displaying data and have been shown to be misleading and hard to read accurately

Comparing Categorical Data with Two or More Groups

Segmented Bar Charts

arranges each group's distribution as one bar, placed along either the horizontal or vertical axis, with segments showing the categories within that group

can show frequency (bar sizes vary with group size) or relative frequency (all bars are the same size regardless of group size)
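As a sketch of the numbers behind a relative frequency segmented bar chart, the Python below computes the relative frequency of each category within each group. The function name and the survey responses are hypothetical.

```python
from collections import Counter

def within_group_relative_frequencies(pairs):
    """Map each group to {category: relative frequency within that group}."""
    counts_by_group = {}
    for group, category in pairs:
        counts_by_group.setdefault(group, Counter())[category] += 1
    result = {}
    for group, counts in counts_by_group.items():
        total = sum(counts.values())         # size of this group
        result[group] = {category: f / total for category, f in counts.items()}
    return result

# Hypothetical survey: group A has 4 respondents, group B has 20.
responses = [("A", "yes")] * 3 + [("A", "no")] + [("B", "yes")] * 10 + [("B", "no")] * 10
rf_by_group = within_group_relative_frequencies(responses)
print(rf_by_group)
```

Each group's relative frequencies sum to 1, which is why every bar has the same size even though the groups differ in size.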
Mosaic Plots

alternative way to compare groups of categorical data distributions

almost identical to a relative frequency segmented bar chart

difference: how the x-axis is used; in a mosaic plot the width of each bar is proportional to the size of its group

Quantitative Data

Examining Graphs

center
- mean
- median
- mode
spread
- range
- standard deviation, variance of a distribution
shape
- Symmetric distribution
- Skewed distribution
Patterns & Deviations
- clusters
- gaps
- outliers
  
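  an outlier is commonly flagged as a value more than 1.5 times the interquartile range (IQR) below the 25th percentile or above the 75th percentile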
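The 1.5 × IQR rule can be sketched in Python as follows. Quartile conventions vary between textbooks and software; this sketch assumes the quartiles are the medians of the lower and upper halves of the ordered data (middle value excluded when $n$ is odd). The function names and data are made up for illustration.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    Assumed convention: the middle value is excluded when n is odd.
    Needs at least two observations.
    """
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[len(xs) - half:]
    return median(lower), median(upper)

def outliers(data):
    """Return values more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
print(quartiles(data))   # Q1 and Q3
print(outliers(data))    # values outside the fences
```

For this data, Q1 = 4 and Q3 = 8.5, so the fences are -2.75 and 15.25, and only 30 is flagged.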

Continuous Variables

Dot plots

effective for smaller data sets

→ for large data sets, box plot
Stemplots

advantage	disadvantage
shows every value	inconvenient for large data sets

Stem: left-most part of each observation

Leaf: remaining part

shows the shape, spread, and center of the data
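A minimal sketch of building a stemplot in Python, assuming non-negative integer data where the stem is everything except the units digit and the leaf is the units digit. The function name `stemplot` and the scores are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines: stem = tens and above, leaf = units digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Hypothetical quiz scores.
scores = [12, 15, 21, 23, 23, 40, 47]
for line in stemplot(scores):
    print(line)
```

The empty stem 3 shows a gap in the data, one of the patterns a stemplot makes easy to spot.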
Histograms

useful for displaying patterns in large data sets

individual values are not shown, so the exact spread of the data within each bar cannot be seen

can be drawn using $f$, $rf$, or $\%$
Cumulative Frequency Charts

S-shaped


Univariate Data

Summarizing Distributions

Population

entire group of individuals or things that we are interested in

Sample

part of the population that is actually studied

Describing Distribution

Center
Spread
Shape

Numerical Methods

Measures of Central Tendency

Mean

arithmetic mean (average)

affected by extreme or outlier measurements

Population Mean $\mu$

\[\mu = \frac{\sum^N_{i =1}X_i}{N}\]

Sample Mean $\bar X$

\[\bar X = \frac{\sum_{i=1}^n X_i}{n}\]

Median $M$

measure of central tendency

not affected by outliers

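\[l = \frac{n+1}{2}\] \[M = \text{the value of the } l\text{th measurement in the ordered data}\]

when $n$ is even, $l$ is not a whole number and $M$ is the mean of the two middle measurements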
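To illustrate how an outlier affects the mean but barely moves the median, here is a short Python sketch. `median_by_position` is a hypothetical helper that locates the median at position $l = (n+1)/2$ in the ordered data; the salary figures are made up.

```python
from statistics import mean, median

def median_by_position(data):
    """Median via the 1-based position l = (n + 1) / 2 in the ordered data."""
    xs = sorted(data)
    l = (len(xs) + 1) / 2
    if l.is_integer():
        return xs[int(l) - 1]            # odd n: the single middle value
    lower = int(l) - 1                   # even n: average the two middle values
    return (xs[lower] + xs[lower + 1]) / 2

# Hypothetical salaries in thousands; the second list adds one extreme value.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]

print(mean(salaries), median_by_position(salaries))
print(mean(with_outlier), median_by_position(with_outlier))
```

Adding the single value 400 pulls the mean from 35 to about 95.8, while the median only moves from 35 to 36.5.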

Measures of Variation (Spread)

Range $R$

difference between the largest and the smallest measurement in a data set

not reliable because it depends only on the two extreme measurements and doesn’t account for the others

\[R = \text{largest measurement} - \text{smallest measurement}\]

Interquartile range $IQR$

range of middle 50% of data

not affected by outliers

\[IQR = Q_3 - Q_1\]

Standard Deviation

more useful measure than range

takes every measurement into account

affected by outliers

if there are outliers, the IQR may be a more useful measure of spread

Population Standard Deviation $\sigma$

\[\sigma = \sqrt{\frac{\sum_{i=1}^N(x_i-\mu)^2}{N}}\]

Sample Standard Deviation $s$

\[s = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar x)^2}{n-1}}\]
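The two formulas differ only in the mean used and the divisor ($N$ versus $n-1$). A small Python sketch, with made-up data and illustrative function names, makes the difference concrete and cross-checks against the standard library's `statistics.pstdev` and `statistics.stdev`.

```python
from math import sqrt
from statistics import pstdev, stdev

def population_sd(xs):
    """Population standard deviation: divide by N."""
    mu = sum(xs) / len(xs)
    return sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    """Sample standard deviation: divide by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))

# Hypothetical measurements with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(data), pstdev(data))
print(sample_sd(data), stdev(data))
```

Here the population standard deviation is $\sqrt{32/8} = 2$, while the sample standard deviation is $\sqrt{32/7}$, slightly larger because of the smaller divisor.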

Variance

never negative (zero only when all measurements are equal)

\[\text{variance} = (\text{standard deviation})^2\]

Measures of Position

Percentiles

divide an ordered set of values into 100 equal parts

the $k$th percentile $P_k$ is the value at or below which $k\%$ of the observations fall

Quartiles

\[Q_1 = P_{25}\] \[Q_2 = P_{50} = median\] \[Q_3 = P_{75}\]

Standardized Scores or z-scores

\[z = \frac{\text{measurement} - \text{mean}}{\text{standard deviation}}\]

negative $z$	positive $z$
smaller than mean	larger than mean
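As an illustration, the following Python sketch standardizes a small data set. It assumes the data are treated as a whole population, so it uses `statistics.pstdev`; for a sample, `statistics.stdev` would be used instead. The data are made up.

```python
from statistics import mean, pstdev

def z_scores(xs):
    """Standardize each measurement: (measurement - mean) / standard deviation."""
    m = mean(xs)
    sd = pstdev(xs)   # assumption: population standard deviation
    return [(x - m) / sd for x in xs]

# Hypothetical measurements with mean 5 and standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(data))
```

Values below the mean of 5 get negative z-scores (2 becomes -1.5) and values above it get positive ones (9 becomes 2.0).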

Graphing Univariate Data

Graphical Summaries

the same data (e.g. revenue) can look very different depending on the scale chosen for the y-axis

Box plots

box-and-whiskers plot

based on measures of position

useful for identifying outliers and the general shape of the distribution

Effect of changing units on summary measures

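Summary Measure	Adding a constant: $Y = X + a$	Multiplying by a constant: $Y = bX$, $b > 0$
Mean	$\bar Y = \bar X + a$	$\bar Y = b\bar X$
Median	$M_Y = M_X + a$	$M_Y = bM_X$
Range	$R_Y = R_X$ (unaffected)	$R_Y = bR_X$
Standard Deviation	$s_Y = s_X$ (unaffected)	$s_Y = bs_X$
Quartiles	$Q_Y = Q_X + a$	$Q_Y = bQ_X$
Interquartile Range	$IQR(Y) = IQR(X)$ (unaffected)	$IQR(Y) = b \cdot IQR(X)$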
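The effect of changing units can be checked numerically. The sketch below, with made-up data and a hypothetical `summary` helper, adds a constant to every value and then multiplies every value by a positive constant, and compares the mean, median, range, and standard deviation.

```python
from statistics import mean, median, pstdev

def summary(xs):
    """Return a few summary measures for a list of numbers."""
    return {
        "mean": mean(xs),
        "median": median(xs),
        "range": max(xs) - min(xs),
        "sd": pstdev(xs),
    }

x = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical measurements
shifted = [v + 10 for v in x]      # adding a constant: Y = X + 10
scaled = [3 * v for v in x]        # multiplying by a constant: Y = 3X

for name, values in [("X", x), ("X + 10", shifted), ("3X", scaled)]:
    print(name, summary(values))
```

Adding 10 shifts the measures of center (mean 5 to 15, median 4.5 to 14.5) but leaves the measures of spread unchanged (range 7, standard deviation 2). Multiplying by 3 scales both center and spread by 3.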

Comparing Distributions of Two or More Groups

Parallel dot plots

showing the errors made by both scanners

used to compare two or more data sets
Parallel box plots

showing the errors by both scanners

used to compare two or more data sets
Back-to-back stem plots

showing the errors by both scanners

can only be used to compare two data sets
Two histograms

showing the errors by both scanners

not a great option

use same scale for both
Multiple frequency polygons (line graph)

frequency polygon

Exploring Bivariate Data

Bivariate Data

two different variables collected from each item in a study

Linear regression: used when two different quantitative variables have a linear relation

Scatterplot	graphical summary measure
correlation coefficient	numerical summary measure

Graphing Continuous Bivariate Data

Scatterplot

used to describe the nature, degree, and direction of the relation between two variables x and y

Shape

linear	non-linear

Direction

Positive (direct) relation	Negative (inverse) relation
y increases as x increases	y decreases as x increases

Strength of relationship

Strong	Weak	No
close to the line	loosely scattered	scattered w/o pattern

Numerical methods for continuous bivariate data

Correlation Coefficient

Pearson’s correlation coefficient

numeric measure of the strength and direction of the linear relation between two quantitative variables

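population correlation coefficient	sample correlation coefficient
$\rho$	$r$
	$r = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar x}{S_x}\right)\left(\frac{y_i - \bar y}{S_y}\right)$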
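A minimal Python sketch of the sample correlation coefficient, computed as the sum of products of standardized $x$ and $y$ values divided by $n-1$. As a cross-check it also computes the least-squares slope directly and compares it with $r(S_y/S_x)$, the slope relation given at the end of these notes. The function names and data are made up for illustration.

```python
from math import sqrt

def sample_sd(xs):
    """Sample standard deviation (divisor n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))

def correlation(xs, ys):
    """Sample correlation: average product of standardized x and y (divisor n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

def least_squares_slope(xs, ys):
    """Slope of the least-squares line, computed directly from deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
b = least_squares_slope(x, y)
print(r, b, r * sample_sd(y) / sample_sd(x))
```

For this data $r = 6/\sqrt{60} \approx 0.775$, a fairly strong positive linear relation, and both ways of computing the slope give 0.6.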

Direction

Positive ($r > 0$)	Negative ($r < 0$)

Strength

relation between the slope of the line ($b$) and the correlation coefficient ($r$)

\[b = r\left(\frac{S_y}{S_x}\right)\]

Originally from The Princeton Review