In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data.
Formally, random variables are dependent if they do not satisfy a mathematical property of probabilistic independence.
In informal usage, correlation is synonymous with dependence. However, when used in a technical sense, correlation refers to any of several specific types of relationship between mean values. There are several correlation coefficients, each measuring the degree of correlation.
The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables. Other correlation coefficients have been developed to be more robust than the Pearson correlation, that is, more sensitive to nonlinear relationships. Mutual information can also be applied to measure dependence between two variables.
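To illustrate the difference between a coefficient that is sensitive only to linear relationships and one that also picks up nonlinear ones, the following minimal sketch compares the Pearson coefficient with a rank-based coefficient on monotonic but nonlinear data. The choice of Spearman's rank correlation as the alternative, the helper names pearson, ranks and spearman, and the sample data are assumptions made for this example, not part of the text above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank the values from 1 to n (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y grows monotonically with x, but far from linearly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [math.exp(x) for x in xs]

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

On data like this the rank-based coefficient reaches its maximum because the relationship is perfectly monotonic, while the Pearson coefficient stays noticeably below 1 because the relationship is not a straight line.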
Correlation and linearity
The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. In particular, if the conditional mean of Y given X, denoted E(Y | X), is not linear in X, the correlation coefficient will not fully determine the form of E(Y | X).
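A small worked example may make this concrete. In the sketch below, Y is completely determined by X through Y = X², so E(Y | X) is clearly not linear in X, and yet the Pearson coefficient comes out as zero. The data and the helper name pearson are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: x is symmetric around zero and y depends on x exactly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson(xs, ys)
print(f"Pearson correlation of x and x^2: {r:.3f}")
```

The coefficient vanishes because the positive and negative halves of the parabola cancel in the covariance, even though knowing X tells us Y exactly.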
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called “simple linear regression”.
Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms “least squares” and “linear model” are closely linked, they are not synonymous.
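As a rough sketch of the difference between ordinary and penalized least squares, the function below fits a simple linear regression in closed form and optionally adds an L2 (ridge) penalty on the slope. The function name fit_line, the parameter ridge_penalty, the decision to leave the intercept unpenalized, and the data are all assumptions made for this example.

```python
def fit_line(xs, ys, ridge_penalty=0.0):
    """Fit y ≈ intercept + slope * x by (optionally penalized) least squares.

    With ridge_penalty = 0 this is ordinary least squares. A positive value
    adds ridge_penalty * slope**2 to the squared-error cost; the intercept
    is left unpenalized.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + ridge_penalty)
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ols_intercept, ols_slope = fit_line(xs, ys)
ridge_intercept, ridge_slope = fit_line(xs, ys, ridge_penalty=10.0)

print(f"OLS:   intercept={ols_intercept:.3f}, slope={ols_slope:.3f}")
print(f"Ridge: intercept={ridge_intercept:.3f}, slope={ridge_slope:.3f}")
```

The penalty shrinks the slope toward zero, which is the characteristic effect of ridge regression. A lasso or least absolute deviations fit has no comparably simple closed form and is not attempted here.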
A fitted linear regression model can be used to identify the relationship between a single predictor variable Xj and the response variable y when all the other predictor variables in the model are “held fixed”. Specifically, the interpretation of βj is the expected change in y for a one-unit change in Xj when the other covariates are held fixed, that is, the expected value of the partial derivative of y with respect to Xj. This is sometimes called the unique effect of Xj on y. In contrast, the marginal effect of Xj on y can be assessed using a correlation coefficient or a simple linear regression model relating only Xj to y; this effect is the total derivative of y with respect to Xj.
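The distinction between the unique and the marginal effect can be sketched with two correlated predictors. In the example below, y is constructed as x1 + 2·x2 with no noise, so the unique effect of x1 (with x2 held fixed) is 1, while the simple regression of y on x1 alone also absorbs part of the effect of x2. The data, the variable names and the helper centered are assumptions for illustration; the two-predictor fit solves the normal equations for centered data directly.

```python
def centered(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

# Hypothetical data: x2 tends to rise with x1, and y = 1*x1 + 2*x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [a + 2.0 * b for a, b in zip(x1, x2)]

c1, c2, cy = centered(x1), centered(x2), centered(y)

# Sums of squares and cross-products of the centered variables.
s11 = sum(a * a for a in c1)
s22 = sum(b * b for b in c2)
s12 = sum(a * b for a, b in zip(c1, c2))
s1y = sum(a * t for a, t in zip(c1, cy))
s2y = sum(b * t for b, t in zip(c2, cy))

# Multiple regression of y on x1 and x2 (2x2 normal equations, Cramer's rule).
det = s11 * s22 - s12 ** 2
beta1 = (s22 * s1y - s12 * s2y) / det   # unique effect of x1
beta2 = (s11 * s2y - s12 * s1y) / det   # unique effect of x2

# Simple regression of y on x1 alone.
marginal_slope_x1 = s1y / s11

print(f"Unique effect of x1 (x2 held fixed): {beta1:.3f}")
print(f"Marginal effect of x1 (x2 ignored):  {marginal_slope_x1:.3f}")
```

Because x1 and x2 are positively correlated in this data, the marginal slope for x1 is larger than its unique effect.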
Pearson’s chi-squared test (χ²) is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance. It is suitable for unpaired data from large samples. It is the most widely used of many chi-squared tests, statistical procedures whose results are evaluated by reference to the chi-squared distribution. It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The events considered must be mutually exclusive and have total probability 1. A common case for this is where the events each cover an outcome of a categorical variable.
Pearson’s chi-squared test is used to assess three types of comparison: goodness of fit, homogeneity, and independence. Here we examine the test of independence:
Test of independence
It assesses whether unpaired observations on two variables, expressed in a contingency table, are independent of each other. For all three tests, the computational procedure includes the following steps:
- Calculate the chi-squared test statistic, χ², which resembles a normalized sum of squared deviations between observed and theoretical frequencies.
- Determine the degrees of freedom, df, of the statistic. For a test of goodness-of-fit, this is essentially the number of categories reduced by the number of parameters of the fitted distribution. For the test of homogeneity, df = (Rows – 1)×(Cols – 1), where Rows corresponds to the number of categories and Cols corresponds to the number of independent groups. For the test of independence, df = (Rows – 1)×(Cols – 1), where, in this case, Rows corresponds to the number of categories in one variable, and Cols corresponds to the number of categories in the second variable.
- Select the desired level of confidence for the result of the test.
- Compare χ² to the critical value from the chi-squared distribution with df degrees of freedom and the selected confidence level, which in many cases gives a good approximation of the distribution of χ².
- Reject or fail to reject the null hypothesis that the observed frequency distribution is the same as the theoretical distribution, based on whether the test statistic exceeds the critical value of χ². If the test statistic exceeds the critical value, the null hypothesis (H₀ = there is no difference between the distributions) can be rejected, and the alternative hypothesis (H₁ = there is a difference between the distributions) can be accepted, both with the selected level of confidence.
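The steps above can be sketched for the test of independence as follows. The statistic is computed as the sum over all cells of (observed − expected)² / expected, where the expected count of a cell under independence is its row total times its column total divided by the grand total. The function name chi_squared_independence, the 2×2 table of counts and the choice of a 5% significance level are assumptions for this example; the critical values are the usual tabulated upper 5% points of the chi-squared distribution, hard-coded here to avoid any external dependency. No continuity correction is applied.

```python
def chi_squared_independence(table):
    """Return (statistic, df) for an R x C contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Tabulated upper 5% critical values of the chi-squared distribution, by df.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

# Hypothetical counts: rows are categories of one variable,
# columns are categories of the other.
observed = [
    [30, 10],
    [20, 40],
]

statistic, df = chi_squared_independence(observed)
critical = CRITICAL_05[df]
reject_null = statistic > critical

print(f"chi-squared = {statistic:.3f}, df = {df}, critical value = {critical}")
print("Reject H0 (independence)" if reject_null else "Fail to reject H0")
```

For larger tables or other significance levels, the critical value would normally come from a statistics library or a fuller table rather than from the small dictionary used here.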