__STATISTICS__

**Univariate statistics**

Looks at the variability of a single variable. The analysis of one variable for the purpose of description.

**Bivariate statistics**

Looks at the variability of two variables. The analysis of 2 variables simultaneously for the purpose of determining the empirical relationship between them. This type of analysis is primarily aimed at prediction.

**Multivariate statistics**

Multivariate statistics provide for analyses in which there are many independent variables (IVs) and dependent variables (DVs) that are correlated with each other to varying degrees. They help you understand complex relationships among variables. Typically this means 2 or more DVs.

**Inferential statistics**

Tells us how much confidence we have when we generalize from a sample to a population. Infer from a sample to a population.

**Parametric statistics**

A group of statistical techniques that make strong assumptions about the distribution of the outcome variable (e.g., that it is normally distributed). In short, if we have a basic knowledge of the underlying distribution of a variable, then we can make predictions about how, in repeated samples of equal size, this particular statistic will "behave," that is, how it is distributed. **Assumptions:** normal distribution, equal variances, interval level of measurement for the DV, independent observations.

**Nonparametric statistics**

A group of statistical techniques that don't make strong assumptions about the distribution of the outcome variable. Specifically, nonparametric methods were developed to be used in cases when the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Therefore, these methods are also sometimes (and more appropriately) called parameter-free methods or distribution-free methods. These tests have less statistical power than parametric tests and are more likely to make Type II errors. In general, these tests fall into the following categories:

Tests of differences between groups (independent samples);

Tests of differences between variables (dependent samples);

Tests of relationships between variables.

**Assumptions:** independent observations, nominal or ordinal level of measurement, sample size less than 30, skewed distribution.

__STATISTICAL HYPOTHESIS TESTING__

**Steps of statistical hypothesis testing**

- State the research problem and nature of the data
- State null and alternative hypotheses
- Choose the level of significance (alpha level)
- Select the test statistic
- Determine the critical value needed for statistical significance
- State a decision rule for rejecting the null
- Compute the test statistic
- Compare the test statistic to the decision rule and make a decision (a worked example follows this list)
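A minimal sketch of these steps in Python (not part of the original notes; it assumes SciPy is available and the data are invented), comparing two group means with an independent-samples t-test:

```python
from scipy import stats

# Steps 1-2: two groups; H0: equal means, H1: different means (illustrative data)
group_a = [23, 25, 28, 31, 26, 27, 30, 24]
group_b = [30, 33, 29, 35, 32, 34, 31, 36]

alpha = 0.05                                   # Step 3: significance level
# Step 4: the independent-samples t-test is the chosen test statistic
t_stat, p_value = stats.ttest_ind(group_a, group_b)   # Step 7: compute it

# Steps 5-8: decision rule via the p-value (equivalent to the critical-value rule)
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}: fail to reject the null hypothesis")
```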

**Null hypothesis**

Also called the "hypothesis of chance." The null hypothesis usually states that the observations are the result purely of chance, that is, that there is no difference or relationship. The null hypothesis is what statistical procedures test: the purpose of most statistical tests is to determine whether the obtained results provide a reason to reject it.

**Alternative hypothesis**

Also known as the "competing hypothesis". This states that the results are not due to chance. It says that there is a real effect, that the observations are the result of this real effect, plus chance variation.

**Type I error**

A type I error occurs if, based on the sample data, we decide to reject the null hypothesis when in fact the null hypothesis is true. This is like having a fire alarm without a fire (detecting an effect which is not there). Reducing the chances of making this type of error may increase the chance of a Type II error.

**Type II error**

A type II error occurs if, based on the sample data, we decide not to reject the null hypothesis when in fact the null hypothesis is false. This is like having a fire without an alarm (having an effect but not detecting it). Reducing the chances of making this kind of error can increase the chance of making a Type I error.

**Statistical significance**

A result is described as "statistically significant" when it can be demonstrated that the probability of obtaining such a difference by chance alone is relatively low.

**Test statistic**

This is the statistic that will assess the evidence against the null hypothesis.

**Alpha level or P-value**

The p-value represents a decreasing index of the reliability of a result. Specifically, the p-level represents the probability of error involved in accepting our observed result as valid, that is, as "representative of the population." The higher the p-level, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. For example, a p-level of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke."

**Alpha**

The Type I error rate. The probability of rejecting the null hypothesis when it is true.

**Beta**

The Type II error rate: the probability of failing to reject the null hypothesis when it is false and a specific alternative hypothesis is true. For a given test, the value of beta is determined by the previously chosen value of alpha, certain features of the statistic being calculated (particularly the sample size), and the specific alternative hypothesis being entertained. Common values are 0.1 or 0.2.

**Power**

The power of a test refers to the probability of detecting an effect when the effect truly does exist. To calculate the power of a given test it is necessary to specify alpha (the probability that the test will lead to the rejection of the hypothesis tested when that hypothesis is true) and to specify a specific alternative hypothesis. Power is affected by sample size, effect size, and the alpha level set by the researcher, and is equal to 1 - beta. Statistical power increases as sample size increases because larger samples decrease the opportunity for sampling error: the larger the sample size, the better the odds that the sample is representative of the population.
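Power can be estimated by simulation, a sketch of which follows (not from the original notes; it assumes NumPy and SciPy, and the sample size and effect size are invented parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, effect = 0.05, 30, 0.5   # per-group n; true effect in SD units

# Simulate many studies in which the alternative hypothesis is true,
# and count how often the t-test detects the effect (power = 1 - beta).
n_sims, rejections = 5000, 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1
print(f"Estimated power: {rejections / n_sims:.2f}")
```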

**Effect size**

The magnitude of a finding. It is the proportion (or %) of variance in the DV which is explained by the IV. It is usually expressed as a number between 0 and 1, with larger numbers representing larger effects; values of .80 and above are considered "large" and values around .20 "small." For example, in ANOVA, you can use the eta-squared statistic to gauge the effect size.

__MEASURES OF CENTRAL TENDENCY__

**Mean**

A measure of central tendency (the center of the data). The mean is the arithmetic average of the scores in the population. Numerically, it equals the sum of the scores divided by the number of scores.

**Median**

A measure of central tendency (the center of the data). The median of a population is the point that divides the distribution of scores in half. Numerically, half of the scores in a population will have values that are equal to or larger than the median and half will have values that are equal to or smaller than the median.

**Mode**

A measure of central tendency (the center of the data). It is the score in the population that occurs most frequently.
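All three measures are available in Python's standard library; a quick sketch with invented scores:

```python
import statistics

scores = [2, 3, 3, 5, 7, 8, 8, 8, 10]   # illustrative data

print(statistics.mean(scores))    # sum / count -> 6.0
print(statistics.median(scores))  # middle score of the sorted list -> 7
print(statistics.mode(scores))    # most frequent score -> 8
```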

__MEASURES OF DISPERSION__

Measures of spread; how far the data tend to range from the center.

**Range**

A measure of dispersion. The range is the difference between the highest and lowest score. Numerically, the range equals the highest score minus the lowest score.

**Interquartile range**

Divides the data into 4 equal groups and sees how far apart the extreme groups are. To do this you put the data in numerical order; divide the data into two equal high and low groups at the median; find the median of the low group, which is called the first quartile; and find the median of the high group, which is the third quartile. The interquartile range is the difference between the first and third quartiles.

**Variance**

A measure of dispersion. The variance is a statistical measure of the variation, dispersion, or scatter of a set of values from their mean; it summarizes the variation in the data.

**Standard deviation**

A measure of dispersion. Measures the spread from the mean: it is the typical or "standard" amount by which scores deviate from their mean, roughly the average distance from the data to the mean. To calculate the standard deviation of a population it is first necessary to calculate that population's variance; numerically, the standard deviation is the square root of the variance. Unlike the variance, which is a somewhat abstract measure of variability, the standard deviation is easier to conceptualize.
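A sketch of all of these dispersion measures (plus the sum of squares, defined later) with NumPy, using invented data:

```python
import numpy as np

scores = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])     # illustrative data

data_range = scores.max() - scores.min()            # range: highest minus lowest
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                                       # interquartile range
ss = np.sum((scores - scores.mean()) ** 2)          # sum of squares
variance = scores.var(ddof=0)                       # population variance = SS / N
sd = np.sqrt(variance)                              # standard deviation
print(data_range, iqr, ss, round(variance, 2), round(sd, 2))
```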

__MISCELLANEOUS__

**The normal curve**

The normal curve is bell-shaped and symmetrical and has certain mathematical properties. The normal distribution spans about six standard deviations (three on each side of the mean). The mean, median, and mode all occur at the same point (the center and highest point of the curve). It is the basis for inferential statistics and hypothesis testing.
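One of those mathematical properties is the 68-95-99.7 rule, which can be verified numerically (a sketch assuming SciPy):

```python
from scipy.stats import norm

# Proportion of a normal distribution within 1, 2, and 3 SDs of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {p:.4f}")   # ~0.68, ~0.95, ~0.997
```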

**Measurement bias**

A systematic distortion which can affect the quality of data collected. It can result in inaccurate findings and inaccurate conclusions drawn from those findings.

**Alternative explanations**

These include measurement bias, rival hypotheses, and chance. Research designs help eliminate bias and rival hypotheses, while statistics help eliminate chance explanations.

**Central limit Theorem**

The Central Limit Theorem is a statement about the characteristics of the sampling distribution of means of random samples from a given population. The Central Limit Theorem consists of three statements:

[1] The mean of the sampling distribution of means is equal to the mean of the population from which the samples were drawn.

[2] The variance of the sampling distribution of means is equal to the variance of the population from which the samples were drawn divided by the size of the samples.

[3] If the original population is distributed normally (i.e. it is bell shaped), the sampling distribution of means will also be normal.
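All three statements can be checked by simulation; a sketch with NumPy (not from the original notes; the skewed population and sample size are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, not normal
n = 50                                                   # sample size

# Draw many random samples and record each sample's mean
sample_means = [rng.choice(population, n).mean() for _ in range(2000)]

print(np.mean(sample_means), population.mean())      # statement [1]: nearly equal
print(np.var(sample_means), population.var() / n)    # statement [2]: nearly equal
# Statement [3] (and the CLT generally): a histogram of sample_means
# looks approximately normal even though the population is skewed.
```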

**Sum of Squares**

The sum of squared differences of data values from their mean.

**Z-score**

The z-score for an item indicates how far and in what direction that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation. Z-scores are useful when we encounter variables based on measurements from two different populations, since converting to z-scores puts them on a common standard scale.
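A minimal sketch of the conversion z = (x - mean) / SD, with invented scores:

```python
import numpy as np

scores = np.array([55, 60, 65, 70, 75, 80, 85])   # illustrative data
z = (scores - scores.mean()) / scores.std()        # z = (x - mean) / SD
print(z)   # standardized scores: mean 0, SD 1
```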

**F-statistic**

The ratio of two variance estimates (i.e., estimates of a population variance, based on the information in two or more random samples). When employed in the procedure entitled ANOVA, the obtained value of F provides a test for the statistical significance of the observed differences among the means of two or more random samples.

**Degrees of freedom**

The number of values in the final calculation of a statistic that are free to vary; the number of independent pieces of information contained in the data set that are used for computing a given summary measure or statistic (like the mean). For example, if you have 1 df, you have one independent piece of information.

**Independent observations**

A subject’s scores on a DV are not influenced by the other subjects in the group.

**Randomization**

The process of randomly assigning study units between the study treatments.

**Confounding**

In estimating the effect of a factor 'A' on a response, confounding is the distortion of this effect by a second factor 'B' that is associated both with 'A' and with the response.

**Ecological fallacy**

Falsely drawing conclusions about individuals based on the observation of groups.

**Reductionism**

A strict limitation of the kinds of concepts to be considered relevant to a phenomenon.

__NONPARAMETRIC STATISTICS__

**Nonparametric statistics**

A group of statistical techniques that don't make strong assumptions about the distribution of the outcome variable. Specifically, nonparametric methods were developed to be used in cases when the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Therefore, these methods are also sometimes (and more appropriately) called parameter-free methods or distribution-free methods. These tests have less statistical power than parametric tests and are more likely to make Type II errors. In general, these tests fall into the following categories:

Tests of differences between groups (independent samples);

Tests of differences between variables (dependent samples);

Tests of relationships between variables.

**Assumptions:** independent observations, nominal or ordinal level of measurement, sample size less than 30, skewed distribution.

**Nonparametric tests- differences between independent groups**

Usually, when we have two samples that we want to compare concerning their mean value for some variable of interest, we would use the t-test for independent samples (in Basic Statistics); nonparametric alternatives for this test are the Mann-Whitney U test and the Kolmogorov-Smirnov two-sample test. If we have multiple groups, we would use analysis of variance (see ANOVA/MANOVA); the nonparametric equivalents to this method are the Kruskal-Wallis analysis of ranks and the Median test.

**Nonparametric tests- differences between dependent groups**

If we want to compare two variables measured in the same sample we would customarily use the t-test for dependent samples (in Basic Statistics); for example, if we wanted to compare students' math skills at the beginning of the semester with their skills at the end of the semester. Nonparametric alternatives to this test are the Sign test and Wilcoxon's matched pairs test. If the variables of interest are dichotomous in nature (i.e., "pass" vs. "no pass"), then McNemar's Chi-square test is appropriate.

**Nonparametric tests- relationship between 2 variables**

To express a relationship between two variables one usually computes the correlation coefficient. Nonparametric equivalents to the standard correlation coefficient are Spearman R, Kendall Tau, and coefficient Gamma (see Nonparametric Correlations). If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs. "female"), appropriate nonparametric statistics for testing the relationship between the two variables are the Chi-square test, the Phi coefficient, and the Fisher exact test. In addition, a simultaneous test for relationships between multiple variables is available: the Kendall coefficient of concordance. This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli.

**Chi-Square measure of association**

A nonparametric statistical test. The Pearson Chi-square is the most common test for significance of the relationship between categorical/nominal variables. This measure is based on the fact that we can compute the expected frequencies in a two-way table (i.e., frequencies that we would expect if there was no relationship between the variables). You first look at the X² to see if there is a statistically significant association beyond chance. If significant, you then look at Phi if the table is square (2x2, 3x3, etc.) and Cramer's V if not square (2x3, etc.). A "moderate" magnitude for Phi and Cramer's V is .26 to .40; over .50 is considered "strong." **# of variables:** 2. **Level of measurement:** nominal. **Assumptions:** independent observations, all expected frequencies are greater than 5.
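A sketch of the whole procedure with SciPy (not from the original notes; the observed frequencies are invented):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 table of observed frequencies
observed = np.array([[30, 10],
                     [20, 40]])

# correction=False gives the plain Pearson X^2 described above
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"X^2 = {chi2:.2f}, p = {p:.4f}")
print("expected frequencies:\n", expected)      # all should exceed 5

n = observed.sum()
k = min(observed.shape) - 1
print("Cramer's V:", round(np.sqrt(chi2 / (n * k)), 2))  # equals Phi for 2x2
```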

**Chi-Square goodness of fit test**

Is a nonparametric statistical test. Assesses whether observed frequency counts fit some pre-existing "model distribution" or deviate reliably from that model.

**Wilcoxon signed rank test for two correlated samples**

A nonparametric equivalent of the paired t-test, used for matched samples. The null hypothesis states that the population distributions corresponding to the two types of observations are identical, while the alternative hypothesis states that they are different. Each subject is measured on the variable at two different times; the two sets of scores are subtracted and the differences are assigned ranks. **Assumptions:** the two groups are randomly and independently selected; variables are at least at the ordinal level.
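A minimal sketch with SciPy, using invented before/after scores for the same subjects:

```python
from scipy.stats import wilcoxon

# Same subjects measured twice (illustrative data)
before = [125, 115, 130, 140, 140, 115, 140, 125]
after  = [110, 122, 125, 120, 140, 124, 123, 137]

stat, p = wilcoxon(before, after)   # ranks the paired differences
print(f"W = {stat}, p = {p:.3f}")
```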

**Mann-Whitney U-test**

This is the nonparametric equivalent to the independent samples t-test and tests the difference between two population distributions. It deals with the ranks of observations, based on the sum of ranks for each group: if one group has a larger sum of ranks than the other, we suspect that the two samples did not come from the same distribution. The null hypothesis states that the populations from which the two samples were drawn are identical, while the alternative hypothesis states that they are not. If the null hypothesis is rejected, it is concluded that the two population distributions are not identical but differ somehow. **Assumptions:** the two groups are randomly and independently selected; the DV is at least at the ordinal level; no tied ranks.
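A minimal sketch with SciPy and invented data:

```python
from scipy.stats import mannwhitneyu

group_a = [12, 15, 14, 10, 18, 11]   # illustrative scores, two independent groups
group_b = [22, 25, 17, 24, 16, 29]

u_stat, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.3f}")
```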

**Kruskal-Wallis test for more than two independent samples**

A nonparametric equivalent of one-way analysis of variance, used for 3 or more non-related groups. Tests the differences between more than two population distributions; it is similar to the Mann-Whitney but involves more than 2 groups. The null hypothesis states that the several samples have identical population distributions, while the alternative hypothesis says they do not. All scores from the several groups are put together in ascending order and assigned ranks, and the ranks are then totaled within each group. The null hypothesis is rejected when the totals of ranks are unequal between the groups, showing that there are actual differences in the populations. **Assumptions:** groups are randomly and independently selected; the DV is at least at the ordinal level; at least 5 cases or subjects per group.
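A minimal sketch with SciPy, using three invented groups:

```python
from scipy.stats import kruskal

# Three unrelated groups with at least 5 cases each (illustrative data)
g1 = [27, 2, 4, 18, 7, 9]
g2 = [20, 8, 14, 36, 21, 22]
g3 = [34, 31, 3, 23, 30, 6]

h_stat, p = kruskal(g1, g2, g3)
print(f"H = {h_stat:.2f}, p = {p:.3f}")
```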

**Spearman's rank-order coefficient of correlation**

A nonparametric alternative to correlation, used to correlate ordinal-level data. It is a coefficient which is applied to ordered, equally spaced ranks of pairs of scores. Data appear as matched pairs of scores; ranks are assigned and subtracted. **Assumptions:** data should be at the ordinal level. Other assumptions??
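A minimal sketch with SciPy, using invented rankings from two judges:

```python
from scipy.stats import spearmanr

ranks_x = [1, 2, 3, 4, 5, 6, 7, 8]   # e.g., judge 1's rankings (illustrative)
ranks_y = [2, 1, 4, 3, 6, 5, 8, 7]   # e.g., judge 2's rankings

rho, p = spearmanr(ranks_x, ranks_y)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```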

__MEASURES OF RELATIONSHIP__

**Chi-Square measure of association**

A nonparametric statistical test. The Pearson Chi-square is the most common test for significance of the relationship between categorical/nominal variables. This measure is based on the fact that we can compute the expected frequencies in a two-way table (i.e., frequencies that we would expect if there was no relationship between the variables). You first look at the X² to see if there is a statistically significant association beyond chance. If significant, you then look at Phi if the table is square (2x2, 3x3, etc.) and Cramer's V if not square (2x3, etc.). A "moderate" magnitude for Phi and Cramer's V is .26 to .40; over .50 is considered "strong." **# of variables:** 2. **Level of measurement:** nominal. **Assumptions:** independent observations, all expected frequencies are greater than 5.

**Correlation**

Correlation is a measure of the relation between two or more variables and describes the strength or degree of a linear relationship; it involves strength and direction. It produces correlation coefficients (r) which can range from -1.00 to +1.00. Correlation lets us specify to what extent the two variables behave alike or vary together. Variables should be at least at the interval level of measurement. Spearman's rank and Kendall's tau are the nonparametric alternatives to correlation. Correlation is not causation! **Correlation coefficient:** the correlation coefficient (r) represents the linear relationship between two variables, providing an index of the degree to which the paired measures co-vary in a linear fashion. An r of .70 and above is considered a high correlation. **r²:** if the correlation coefficient is squared, the resulting value (r², the coefficient of determination) represents the proportion of common variation in the two variables. This is important in determining the significance of the correlation. **Assumptions:** data must be at the interval level; the pattern of relationship must be linear; data must be homoscedastic (equal variance in Y across X, visible as a cigar shape when running a scattergram).
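A minimal sketch of r and r² with SciPy, using invented paired scores:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # illustrative paired scores
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")
print(f"r^2 = {r**2:.3f}  (proportion of shared variance)")
```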

**Linear Regression**

A regression analysis which involves only one predictor is called simple linear regression analysis. Linear regression is used to make predictions about a single value and uses r to predict future outcomes. Simple linear regression involves discovering the equation for the line that most nearly fits the given data; that linear equation is then used to predict values for the data. The regression line is the one which shows the best fit relating y to x. Basically, you build the regression model, evaluate it, and use it to partition variance (breaking the Y variance into two parts: a proportion that is predictable from X and a proportion that is not explained or accounted for by X). **Assumptions:** data are in the form of pairs of scores; there is a correlation between the X and Y variables; data are at the interval level. (Not completely sure about these.)
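A minimal sketch of fitting the line and using it for prediction, with SciPy and invented data:

```python
import numpy as np
from scipy.stats import linregress

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # predictor (illustrative)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])    # outcome

fit = linregress(x, y)                            # best-fitting line
print(f"y = {fit.slope:.2f}x + {fit.intercept:.2f}, r = {fit.rvalue:.3f}")
print("predicted y at x = 7:", fit.slope * 7 + fit.intercept)
```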

__MEASURES OF MEAN DIFFERENCES BETWEEN GROUPS__

**T-Test**

A parametric statistical test. The t-test is used to evaluate the differences in means between two groups. Theoretically, the t-test can be used even if the sample sizes are very small, as long as the variables are normally distributed within each group and there is equality of variances. **# of variables:** 2. **Level of measurement:** 1 nominal (2 categories) & 1 interval.

**Independent samples T-test**

A parametric statistical test. In order to perform the t-test for independent samples, one independent (grouping) variable (e.g., Gender: male/female) and at least one dependent variable (e.g., a test score) are required. The means of the dependent variable are compared between groups defined by the specified values (e.g., male and female) of the independent variable. **Assumptions:** independent observations; DV is at the interval or ratio level of measurement; DV is normally distributed; equal variances.
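A minimal sketch with SciPy, using an invented grouping variable and test scores:

```python
from scipy.stats import ttest_ind

male_scores   = [78, 85, 90, 72, 88, 81]   # illustrative DV by grouping variable
female_scores = [84, 92, 79, 95, 89, 91]

t, p = ttest_ind(male_scores, female_scores)  # assumes equal variances by default
print(f"t = {t:.2f}, p = {p:.3f}")
```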

**Mann-Whitney U-test**

This is the nonparametric equivalent to the independent samples t-test and tests whether two independent groups have been drawn from the same population.

**Paired/Dependent samples T-test**

A parametric statistical test. Two groups of observations (that are to be compared) are based on the same sample of subjects who were tested twice (e.g., before and after a treatment). **Assumptions:** normal distribution, equal variances, equal means (?)

**Wilcoxon signed rank test**

A non-parametric equivalent of the paired sample t-test, for testing whether two populations have the same distribution.

**ANOVA**

A parametric statistical test. In general, the purpose of analysis of variance (ANOVA) is to test for significant differences between the means of 3 or more groups. This procedure employs the F statistic to test the statistical significance of the differences among the obtained means of two or more random samples from a given population. The Kruskal-Wallis is the nonparametric equivalent to the 1-way ANOVA. **# of variables:** 2. **Level of measurement:** 1 nominal IV (grouping variable, 3 or more groups) and 1 interval or ratio DV. **Assumptions:** independent observations; the DV is interval or ratio scale; DV normally distributed; equal variances.
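A minimal one-way ANOVA sketch with SciPy and three invented groups:

```python
from scipy.stats import f_oneway

# One nominal IV with three groups, one interval DV (illustrative data)
g1 = [85, 86, 88, 75, 78, 94]
g2 = [91, 92, 93, 85, 87, 84]
g3 = [79, 78, 88, 94, 92, 85]

f_stat, p = f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p:.3f}")
```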

**Kruskal-Wallis**

A non-parametric equivalent of one-way analysis of variance. This is used for 3 or more non-related groups.

**2-way ANOVA**

A type of elaboration. 2 nominal IVs and 1 interval or ratio DV. Looks at main effects of each IV on the DV and interaction effects of the IVs combined on the DV.

**MANOVA**

A parametric and multivariate statistical test. MANOVA is used to assess the statistical significance of the effect of one or more IVs on a set of two or more DVs. It differs from ANOVA in that ANOVA uses only 1 DV. Use a MANOVA instead of conducting multiple ANOVAs to control for Type I errors. With MANOVA, you can see if mean scores among groups are significantly different.

__REGRESSION__

**Regression**

Regression is a class of statistical methods in which 1 dependent variable is related to 1 or more independent variables. Regression is used to make predictions of values: based on existing data values, predictions are made about other, similar values.

**Linear Regression**

A regression analysis which involves only one predictor is called simple linear regression analysis. Linear regression is used to make predictions about a single value and uses r to predict future outcomes. Simple linear regression involves discovering the equation for the line that most nearly fits the given data; that linear equation is then used to predict values for the data. The regression line is the one which shows the best fit relating y to x.

**Multiple Regression**

The general purpose of multiple regression is to learn more about the relationship between several independent/predictor variables and 1 dependent/criterion variable (and to find an equation that describes that relationship). In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". We want to predict 1 continuous dependent variable using 2 or more continuous or nominal independent variables, and we want to determine the utility of the predictor variables for predicting a criterion variable. Multiple regression assumes multivariate normality of the data. **Assumptions:** multivariate normality; 1 interval DV and 2 or more interval or nominal IVs; relationships among variables must be linear; all relevant predictors must be included and no irrelevant predictors included; error scores have a mean of 0, are homoscedastic (have equal variances at all values of the predictors), and are uncorrelated.
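A minimal least-squares sketch with NumPy (not from the original notes; two invented predictors and one invented outcome):

```python
import numpy as np

# Two interval IVs predicting one interval DV (illustrative data)
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)
y  = np.array([3.1, 3.8, 7.2, 7.9, 11.1, 11.8, 15.2, 15.9])

X = np.column_stack([np.ones_like(x1), x1, x2])   # add an intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares fit
print("intercept, b1, b2:", np.round(coef, 2))

y_hat = X @ coef
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r2:.3f}")   # proportion of Y variance predictable from the Xs
```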

__DATA REDUCTION AND UNDERLYING CONSTRUCTS__

**Principal component analysis**

PCA is a data reduction technique, trying to reduce large numbers of variables to a few composite indices. It involves the formation of new variables that are linear combinations of the original variables. The original items must be interrelated/intercorrelated (the less correlation between variables, the less data reduction that can be achieved). **Assumptions:** multivariate normality; variables/items should be interrelated/correlated among themselves.
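A minimal sketch with scikit-learn (an assumed tool, not named in the notes), using simulated intercorrelated items:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Four intercorrelated items (invented): all driven by one latent score
latent = rng.normal(size=(100, 1))
items = latent + 0.3 * rng.normal(size=(100, 4))

pca = PCA(n_components=2)
components = pca.fit_transform(items)    # new composite variables
print(pca.explained_variance_ratio_)     # most variance lands on component 1
```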

**Factor analysis**

Like PCA, FA involves data reduction, trying to reduce large numbers of variables to a few composite indices; both involve the formation of new variables that are linear combinations of the original variables. Factor analysis goes one step further and tries to determine an underlying structure or construct: it seeks to explain how certain variables are correlated. It can be used to develop scales and measure constructs. **Assumptions:** multivariate normality; variables/items should be interrelated/correlated among themselves.

__PREDICTION__

**Discriminant analysis**

Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups and to use those variables to predict group membership of future cases. In general, discriminant analysis is a very useful tool (1) for detecting the variables that allow the researcher to discriminate between different (naturally occurring) groups, and (2) for classifying cases into different groups with better-than-chance accuracy. The main use of discriminant analysis is to predict group membership from a set of predictors. The DV is nominal and the IVs are interval. **Assumptions:** homogeneity of variances/covariances and a multivariate normal distribution.
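A minimal sketch with scikit-learn (an assumed tool; the two-group data are invented):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Interval IVs and a nominal DV with two naturally occurring groups (toy data)
X = np.array([[2.0, 3.1], [1.8, 2.9], [2.2, 3.3],
              [6.1, 7.0], [5.9, 6.8], [6.3, 7.2]])
groups = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, groups)
print(lda.predict([[2.1, 3.0], [6.0, 7.1]]))   # predicted group membership
```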

**Logistic regression**

An alternative procedure to DA (as it also can predict group membership) and an extension of multiple regression. You can use logistic regression when you violate the normality assumption of DA (either because the IVs are a mix of categorical and continuous variables or because the continuous variables are not normally distributed). Logistic regression is used to predict a dichotomous DV from 1 or more IVs; the DV usually represents the occurrence or non-occurrence of some outcome event. The procedure produces a formula which predicts the probability of the occurrence as a function of the IVs, and it also produces an odds ratio associated with each predictor value. **Assumptions:** independent observations; mutually exclusive and exhaustive categories; specificity (the model must contain all relevant predictors and no irrelevant predictors).
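A minimal sketch with scikit-learn (an assumed tool; note it regularizes by default, so estimates differ slightly from a textbook fit; the pass/fail data are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One continuous IV predicting a dichotomous DV (toy data)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
print(model.predict_proba([[4.5]])[:, 1])   # predicted probability of occurrence
print(np.exp(model.coef_))                  # odds ratio per extra hour
```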

**Linear Regression**

A regression analysis which involves only one predictor is called simple linear regression analysis. Linear regression is used to make predictions about a single value and uses r to predict future outcomes. Simple linear regression involves discovering the equation for the line that most nearly fits the given data; that linear equation is then used to predict values for the data. The regression line is the one which shows the best fit relating y to x.

**Multiple Regression**

The general purpose of multiple regression is to learn more about the relationship between several independent/predictor variables and 1 dependent/criterion variable (and to find an equation that describes that relationship). In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". We want to predict 1 continuous dependent variable using 2 or more continuous or nominal independent variables, and we want to determine the utility of the predictor variables for predicting a criterion variable. Multiple regression assumes multivariate normality of the data. **Assumptions:** multivariate normality; 1 interval DV and 2 or more interval or nominal IVs; relationships among variables must be linear; all relevant predictors must be included and no irrelevant predictors included; error scores have a mean of 0, are homoscedastic (have equal variances at all values of the predictors), and are uncorrelated.

__MULTIVARIATE STATISTICS__

**Multivariate statistics**

Multivariate statistics provide for analyses in which there are many independent variables (IVs) and dependent variables (DVs) that are correlated with each other to varying degrees. They help you understand complex relationships among variables. Typically this means 2 or more DVs.

**Multivariate normality**

To have multivariate normality, the IVs must be distributed normally, any linear combination of the DVs must be normally distributed, and all subsets of the variables must have a multivariate normal distribution.

**Centered data**

Data are represented as deviations from the mean. When we center data, points take a new position in relation to the axes; the variance does not change but the mean becomes 0. To center data, take each score in the original data and subtract the mean from it.

**Standardized data**

Standardizing either stretches out or squeezes in the data to make the SD equal 1, making the dispersion in each direction about the same; the variance and the SD both become 1. To standardize data, divide the mean-corrected (centered) data by the respective standard deviation.
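A minimal sketch of both operations with NumPy and invented scores:

```python
import numpy as np

x = np.array([4.0, 8.0, 6.0, 2.0, 10.0])   # illustrative scores

centered = x - x.mean()                     # mean becomes 0; variance unchanged
standardized = centered / x.std()           # SD (and variance) becomes 1
print(centered.mean(), standardized.std())  # -> 0.0, 1.0
```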

**Trace**

Represents the total variability by a single score.

**Determinant**

Represents the generalized variance by a single score.

**Eigenvalue**

The variance (might be more to it?). Has a magnitude but no direction.

**Eigenvector**

The vector that corresponds to an eigenvalue. A vector is a quantity that has a magnitude and direction.

**Covariance**

The variance in one variable that is shared by another variable.
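Several of these quantities (covariance matrix, trace, determinant, eigenvalues/eigenvectors) can be seen together in a short NumPy sketch (not from the original notes; the two correlated variables are simulated):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.8 * x + 0.6 * rng.normal(size=200)    # y shares variance with x

S = np.cov(x, y)                            # 2x2 covariance matrix
print("trace (total variance):", np.trace(S))
print("determinant (generalized variance):", np.linalg.det(S))

eigenvalues, eigenvectors = np.linalg.eig(S)
print("eigenvalues:", eigenvalues)          # variances along the eigenvectors
```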

**Loadings**

The correlations between the original and new variables in principal component analysis.

**Collinearity**

A numerical problem that results when explanatory variables in a regression model are highly correlated.

**Communality**

The common, or shared variance.

**Factor rotation**

A technique used in factor analysis when you want to achieve a simpler factor structure which can be easily interpreted. Rotation separates the data out. You can use varimax or quartimax rotation methods.

**Principal component analysis**

PCA is a data reduction technique, trying to reduce large numbers of variables to a few composite indices. It involves the formation of new variables that are linear combinations of the original variables. The original items must be interrelated/intercorrelated (the less correlation between variables, the less data reduction that can be achieved). **Assumptions:** multivariate normality; variables/items should be interrelated/correlated among themselves.

**Factor analysis**

Like PCA, FA involves data reduction, trying to reduce large numbers of variables to a few composite indices; both involve the formation of new variables that are linear combinations of the original variables. Factor analysis goes one step further and tries to determine an underlying structure or construct: it seeks to explain how certain variables are correlated. It can be used to develop scales and measure constructs. **Assumptions:** multivariate normality; variables/items should be interrelated/correlated among themselves.

**Discriminant analysis**

Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups and to use those variables to predict group membership of future cases. In general, discriminant analysis is a very useful tool (1) for detecting the variables that allow the researcher to discriminate between different (naturally occurring) groups, and (2) for classifying cases into different groups with better-than-chance accuracy. The main use of discriminant analysis is to predict group membership from a set of predictors. The DV is nominal and the IVs are interval. **Assumptions:** homogeneity of variances/covariances and a multivariate normal distribution.

**Logistic regression**

An alternative procedure to DA (as it also can predict group membership) and an extension of multiple regression. You can use logistic regression when you violate the normality assumption of DA (either because the IVs are a mix of categorical and continuous variables or because the continuous variables are not normally distributed). Logistic regression is used to predict a dichotomous DV from 1 or more IVs; the DV usually represents the occurrence or non-occurrence of some outcome event. The procedure produces a formula which predicts the probability of the occurrence as a function of the IVs, and it also produces an odds ratio associated with each predictor value. **Assumptions:** independent observations; mutually exclusive and exhaustive categories; specificity (the model must contain all relevant predictors and no irrelevant predictors).

**MANOVA**

A parametric and multivariate statistical test. MANOVA is used to assess the statistical significance of the effect of one or more IVs on a set of two or more DVs. It differs from ANOVA in that ANOVA uses only 1 DV. Use a MANOVA instead of conducting multiple ANOVAs to control for Type I errors. With MANOVA, you can see if mean scores among groups are significantly different. Often a precursor to discriminant analysis. **Assumptions:** independent observations; multivariate normality (though this can be violated); equal covariances of the DVs.

**Multiple Regression**

The general purpose of multiple regression is to learn more about the relationship between several independent/predictor variables and 1 dependent/criterion variable (and to find an equation that describes that relationship). In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". We want to predict 1 continuous dependent variable using 2 or more continuous or nominal independent variables, and we want to determine the utility of the predictor variables for predicting a criterion variable. Multiple regression assumes multivariate normality of the data. **Assumptions:** multivariate normality; 1 interval DV and 2 or more interval or nominal IVs; relationships among variables must be linear; all relevant predictors must be included and no irrelevant predictors included; error scores have a mean of 0, are homoscedastic (have equal variances at all values of the predictors), and are uncorrelated.

**Cluster Analysis**

Cluster analysis (CA) is a multivariate procedure for detecting natural groupings in data. Cluster analysis classification is based upon placing objects into more or less homogeneous groups, in a manner such that the relationship between groups is revealed. The two key steps within cluster analysis are the measurement of distances between objects and the grouping of objects based upon the resultant distances (linkages).
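A minimal sketch of those two steps (distances, then linkages) using SciPy's hierarchical clustering on simulated data with two natural groupings:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
# Two natural groupings in 2-D (invented data)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="ward")                   # distances -> linkages
labels = fcluster(Z, t=2, criterion="maxclust") # cut the tree into 2 clusters
print(labels)                                   # cluster membership per object
```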