AP Statistics Exploring Data

Collecting Data

Descriptive methods

methods for organizing and summarizing collected data.

the appropriate method depends on the type of data being collected.

Types of Variables

Categorical (qualitative)

places the individuals being studied into one of several groups or categories.

ex. color, sex

Numerical (quantitative)

can be analyzed using arithmetic operations

ex. weight, IQ

Types of Descriptive Methods

Tabular Methods

$n$

number of observations

frequency $f$

number of times a value occurs

relative frequency $rf$

\[rf = \frac{f}{n}\]

proportion (or percentage) of times a value occurs

cumulative frequency $cf$

number of observations less than or equal to a specific value

frequency distribution table

table giving all possible values of a variable and their frequencies
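The quantities above can be put together in a small sketch. The data set and the function name `frequency_table` are made up for illustration; only the definitions of $n$, $f$, $rf$, and $cf$ come from the notes.

```python
from collections import Counter

def frequency_table(values):
    """Return one row (value, f, rf, cf) per distinct value, sorted by value."""
    n = len(values)              # n: number of observations
    counts = Counter(values)     # f: how many times each value occurs
    rows = []
    cf = 0
    for value in sorted(counts):
        f = counts[value]
        cf += f                  # cf: observations less than or equal to value
        rows.append((value, f, f / n, cf))   # rf = f / n
    return rows

# Hypothetical data: number of siblings reported by 10 students
data = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
print("value   f     rf  cf")
for value, f, rf, cf in frequency_table(data):
    print(f"{value:>5} {f:>3} {rf:>6.2f} {cf:>3}")
```

The last cumulative frequency always equals $n$, and the relative frequencies add up to 1, which makes a quick check on any hand-built table.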

Graphical Methods:

Qualitative Data

Categorical Data

Comparing Categorical Data with Two or More Groups

Quantitative Data

Examining Graphs

Continuous Variables

Univariate Data

Summarizing Distributions

Population

entire group of individuals or things that we are interested in

Sample

part of the population that is actually studied

Describing Distribution

Numerical Methods

Measures of Central Tendency

Mean

arithmetic mean (average)

affected by extreme or outlier measurements

\[\mu = \frac{\sum^N_{i =1}X_i}{N}\] \[\bar X = \frac{\sum_{i=1}^n X_i}{n}\]

Median $M$

middle value of the data when ordered from smallest to largest

not affected by outliers

\[l = \frac{n+1}{2}\] \[M = \text{the value of the } l\text{th measurement in the ordered data}\]
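A short sketch of how an outlier pulls the mean but not the median. The data are invented, and the even-$n$ rule used here (average the two middle measurements when $(n+1)/2$ is not a whole number) is the usual convention rather than something stated in the notes.

```python
def mean(values):
    """Arithmetic mean: sum of the measurements divided by how many there are."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the ordered data, at 1-based position (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

# Hypothetical measurements, then the same data with one extreme value
typical = [30, 32, 35, 38, 40]
with_outlier = [30, 32, 35, 38, 400]

print(mean(typical), median(typical))            # 35.0 35
print(mean(with_outlier), median(with_outlier))  # 107.0 35
```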

Measures of Variation (Spread)

Range $R$

difference between the largest and the smallest measurement in a data set

not reliable because it depends only on the two extreme measurements and doesn't account for the others

\[R = \text{largest measurement} - \text{smallest measurement}\]

Interquartile range $IQR$

range of middle 50% of data

not affected by outliers

\[IQR = Q_3 - Q_1\]
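The contrast between the range and the IQR can be sketched as follows. Quartile conventions vary between textbooks and software; this sketch assumes the "median of each half" rule (with the median itself excluded when $n$ is odd), and the data are invented.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle value belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

def data_range(values):
    return max(values) - min(values)

def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1

data = [2, 4, 5, 7, 8, 10, 12]
with_outlier = [2, 4, 5, 7, 8, 10, 100]

print(data_range(data), iqr(data))                  # 10 6
print(data_range(with_outlier), iqr(with_outlier))  # 98 6
```

Replacing the largest value with an extreme one changes the range from 10 to 98, while the IQR stays at 6.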

Standard Deviation

more useful measure than range

takes every measurement into account

affected by outliers

if there is an outlier, the IQR may be more useful

\[\sigma = \sqrt{\frac{\sum_{i=1}^N(x_i-\mu)^2}{N}}\] \[s = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar x)^2}{n-1}}\]

Variance

never negative (zero only when all measurements are equal)

\[\text{variance} = (\text{standard deviation})^2\]
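A minimal sketch of the population and sample standard deviations, which differ only in the divisor ($N$ versus $n-1$), with the variance obtained by squaring. The function names and data are illustrative.

```python
import math

def population_sd(values):
    """sigma: divide the sum of squared deviations by N."""
    N = len(values)
    mu = sum(values) / N
    return math.sqrt(sum((x - mu) ** 2 for x in values) / N)

def sample_sd(values):
    """s: divide the sum of squared deviations by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

# Hypothetical data with mean 5 and squared deviations summing to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = population_sd(data)   # sqrt(32 / 8) = 2.0
s = sample_sd(data)           # sqrt(32 / 7), slightly larger than sigma

print(sigma, sigma ** 2)      # standard deviation and variance (population)
print(s, s ** 2)              # standard deviation and variance (sample)
```

For the same data the sample value is always at least as large as the population value, because it divides by the smaller number $n-1$.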

Measures of Position

Percentiles

divide a set of values into 100 equal parts

Quartiles

\[Q_1 = P_{25}\] \[Q_2 = P_{50} = median\] \[Q_3 = P_{75}\]
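The link between quartiles and percentiles can be checked with Python's standard `statistics` module. This is only a sketch: `statistics.quantiles` uses one particular interpolation rule by default (its "exclusive" method), and other textbooks or calculators may give slightly different quartiles for the same data. The data set is invented.

```python
import statistics

data = [2, 4, 5, 7, 8, 10, 12]

# n=100 returns the 99 cut points P1 ... P99 (index 0 holds P1).
percentiles = statistics.quantiles(data, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

# n=4 returns the three quartiles Q1, Q2, Q3 directly.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(p25, p50, p75)              # 4.0 7.0 10.0
print(q1, q2, q3)                 # 4.0 7.0 10.0
print(statistics.median(data))    # 7
```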

Standardized Scores or z-scores

\[z = \frac{\text{measurement} - \text{mean}}{\text{standard deviation}}\]

| negative z-score | positive z-score |
| --- | --- |
| measurement is smaller than the mean | measurement is larger than the mean |
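A small worked example of the z-score formula. The exam mean of 70 and standard deviation of 10 are made-up numbers.

```python
def z_score(measurement, mean, standard_deviation):
    """Number of standard deviations a measurement lies from the mean."""
    return (measurement - mean) / standard_deviation

# Hypothetical exam: mean 70, standard deviation 10
print(z_score(85, 70, 10))   # 1.5  -> above the mean
print(z_score(55, 70, 10))   # -1.5 -> below the mean
print(z_score(70, 70, 10))   # 0.0  -> exactly at the mean
```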

Graphing Univariate Data

Graphical Summaries

the same revenue data can look different depending on the scale of the y-axis

Effect of changing units on summary measures

Summary Measure
Mean
Median
Range
Standard Deviation
Quartiles
Interquartile Range: \(IQR(Y) = IQR(X)\) (unaffected)
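Most of the table's entries did not survive in these notes, so the following is a sketch under stated assumptions rather than a reconstruction. It assumes the change of units is either a shift \(Y = X + a\) or a rescaling \(Y = bX\), uses invented data, and relies on `statistics.quantiles` for quartiles (one of several conventions). Under a shift, measures of center and position move by \(a\) while measures of spread are unaffected, which agrees with the surviving IQR entry. Under a rescaling with \(b > 0\), every measure is multiplied by \(b\).

```python
import statistics

def summarize(values):
    """Collect the summary measures listed in the table for one data set."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

x = [2, 4, 5, 7, 8, 10, 12]
shifted = [v + 10 for v in x]   # assumed change of units: Y = X + 10
scaled = [v * 3 for v in x]     # assumed change of units: Y = 3X

base = summarize(x)
after_shift = summarize(shifted)
after_scale = summarize(scaled)

for name in base:
    print(f"{name:>6}: X={base[name]:.3f}  X+10={after_shift[name]:.3f}  3X={after_scale[name]:.3f}")
```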

Comparing Distributions of Two or More Groups

Exploring Bivariate Data

Bivariate Data

two different variables collected from each item in a study

Linear regression: used when two different quantitative variables have a linear relation

Scatterplot: graphical summary measure
Correlation coefficient: numerical summary measure

Graphing Categorical Bivariate Data

Scatterplot

used to describe the nature, degree, and direction of the relation between two variables x and y

Shape

linear or non-linear

Direction

| Positive (direct) relation | Negative (inverse) relation |
| --- | --- |
| y increases as x increases | y decreases as x increases |

Strength of relationship

| Strong | Weak | No relation |
| --- | --- | --- |
| points close to the line | points loosely scattered around the line | points scattered without a pattern |

Numerical methods for continuous bivariate data

Correlation Coefficient

Pearson’s correlation coefficient

numeric measure of the strength and direction of the linear relation between two quantitative variables

| population correlation coefficient | sample correlation coefficient |
| --- | --- |
| \(\rho\) | \(r\) |
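The notes do not give a formula for the sample correlation coefficient, so the sketch below assumes the standard Pearson formula: the sum of the products of the z-scores of x and y, divided by \(n - 1\). The data and function names are illustrative.

```python
import math

def sample_sd(values):
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correlation(xs, ys):
    """Sample Pearson correlation: sum of z_x * z_y over the pairs, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2 * x + 1 for x in xs]))   # approximately 1: perfect positive linear relation
print(correlation(xs, [-x for x in xs]))          # approximately -1: perfect negative linear relation
print(correlation(xs, [2, 1, 4, 3, 5]))           # between 0 and 1: positive but not perfect
```

The result always lies between -1 and 1. Its sign gives the direction of the relation and its distance from 0 gives the strength.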
 

Direction

| Positive | Negative |
| --- | --- |
| \(r > 0\) | \(r < 0\) |

Strength

relation between the slope of the line (\(b\)) and the correlation coefficient (\(r\))

\[b = r\left(\frac{s_y}{s_x}\right)\]

Originally from The Princeton Review