Does this look familiar to you? And do you know what this table tells you?
You have probably stumbled upon regression tables from time to time. Whether colleagues are sharing results with you, you are trying to decide which features to select for an algorithm, or you are reading a scientific paper, chances are high that you will be confronted with regression tables.
The amount of information and statistics they contain, however, can seem a little intimidating. This article therefore shows you how to read them and how to extract the information you need.
Regression tables usually contain 3 parts:
A description of the data and model
Detailed results on the independent variables
Test statistics that indicate the quality and robustness of the model
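These three parts map directly onto the output of most statistical software. As a minimal sketch of where the reported coefficients come from, here is an ordinary least squares fit on entirely made-up data (the variable names only mirror the example table, the values are simulated):

```python
import numpy as np

# Made-up data: wage explained by hours worked and IQ
# (variable names mirror the example table, values are simulated).
rng = np.random.default_rng(0)
n = 100
hours = rng.normal(45, 5, n)
iq = rng.normal(100, 15, n)
wage = 50 + 2.0 * hours + 1.5 * iq + rng.normal(0, 10, n)

# Design matrix with an intercept column; fit via ordinary least squares.
X = np.column_stack([np.ones(n), hours, iq])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)

# A minimal "regression table": one row per independent variable.
for name, b in zip(["const", "hours", "iq"], beta):
    print(f"{name:>6}: {b:8.3f}")
```

In a real regression table, each of these coefficient rows would additionally carry a standard error, t-statistic, and significance markers.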
Before you can assess a model, you should get an overview of the model(s) and data. Find out what has been estimated (dependent variable) using which (independent) variables, and for what kind of dataset.
If various models have been estimated, it is common to summarize all models in a single regression table. In this case, each model is reported in a single column. As these models usually differ in the set of independent variables, you will see some white space in rows of independent variables that have not been considered for specific models.
Information on the dependent variable is often at the top of regression tables. Depending on the tools you use or the domain you work in, the dependent variable is either specified directly, given as a column name, or mentioned in the caption of the table (Effect of … on <dependent variable>).
Pay attention to how your dependent variable is encoded. Is it a binary variable (either 0 or 1), is it a continuous variable (all values possible), or can it take only positive/negative values?
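One quick way to run this encoding check, assuming your data lives in a pandas DataFrame (the column names and values below are purely illustrative):

```python
import pandas as pd

# Hypothetical dataset (column names and values are illustrative).
df = pd.DataFrame({
    "married": [1, 0, 1, 1, 0],                      # binary dummy
    "wage": [808.5, 650.0, 1200.0, 950.25, 780.0],   # continuous, positive
})

# A binary variable has exactly the two distinct values 0 and 1.
for col in df.columns:
    values = sorted(df[col].unique())
    kind = "binary" if values == [0, 1] else "continuous"
    print(f"{col}: {kind}, range [{min(values)}, {max(values)}]")
```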
The independent variables are usually easy to locate. They are the ones in the middle of the table, for which coefficient estimates, standard errors, t-statistics, or p-values are reported. The independent variables are the variables that are expected to determine the dependent variable. As such, they should be chosen carefully and are expected to have a significant effect on the dependent variable.
Again, check how they are encoded. Like the dependent variable, they can also be binary or restricted to a certain range of numbers.
Finally, check the size of the dataset. The number of observations is often abbreviated as n = . Make sure it is sufficiently large and meets your expectations. During data cleaning, some observations are usually dropped due to missing values, poor data quality, or domain-specific selection rules.
Especially if you have not been involved in the data wrangling process, it is always a good idea to check how many observations have been considered in the final model.
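A quick sanity check, again with hypothetical data: compare the raw row count to the number of complete rows the model can actually use.

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset with missing values (columns are illustrative).
raw = pd.DataFrame({
    "wage": [800, 650, np.nan, 1200, 950],
    "educ": [12, 10, 14, 16, np.nan],
})

# Observations a regression can actually use: rows without missing values.
clean = raw.dropna()
print(f"raw n = {len(raw)}, model n = {len(clean)}")
```

If the n reported in the table is much smaller than the raw dataset, it is worth asking which observations were dropped and why.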
First of all, this regression table reports the result of 4 models (see the 4 grey columns). The first three models describe the effect on wages, whereas the fourth model describes the effect on working hours (dependent variable).
To estimate wages, the first model only considers the hours worked, IQ, education (measured in years), work experience (measured in years), and age.
Additionally, the second model also takes into account whether an individual is married (encoded 1 if married and 0 otherwise), black (encoded 1 if black and 0 otherwise), and how many siblings an individual has.
Finally, the third model does not take the number of siblings into account, but whether an individual is living in a rural area.
To estimate the number of hours worked, the fourth model considers the IQ, education, work experience, age, as well as a dummy variable for being married.
At the bottom of the table, we can see that all models include the full dataset of 935 observations. So there are no differences between the samples.
Once we have an overall impression of what has been estimated using which variables, the next step is to look at the detailed results for the independent variables. As mentioned earlier, they are supposed to explain the value of the dependent variable. But do they really help in explaining it?
To assess the effect of each independent variable, regression tables provide a coefficient for each independent variable.
Note, the interpretation of these coefficients crucially depends on the units and values of the dependent and independent variables. If any of the variables is expressed as a logarithm or as a binary variable, the interpretation changes!
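As an illustration of how the interpretation changes with a logarithm: in a log-level model, a coefficient approximately equals the fractional change in the dependent variable per unit of the regressor (the 0.05 below is a made-up coefficient, not one from the example table):

```python
import numpy as np

# Hypothetical log-level model: dependent variable is log(wage).
# A coefficient of 0.05 on education then means roughly a 5% wage
# increase per additional year of education; the exact change is
# exp(beta) - 1, which matters for larger coefficients.
beta_educ = 0.05
exact_pct_change = (np.exp(beta_educ) - 1) * 100
print(f"approx: {beta_educ * 100:.1f}%, exact: {exact_pct_change:.2f}%")
```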
Finally, p-values, standard-errors, t-statistics, and confidence intervals indicate the significance and precision of single coefficient estimates.
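To see where these numbers come from, here is a minimal sketch that computes standard errors, t-statistics, and two-sided p-values from scratch for a simulated toy dataset (none of the numbers relate to the example table):

```python
import numpy as np
from scipy import stats

# Simulated data: one regressor plus an intercept (purely illustrative).
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=n)

# OLS fit via least squares on the design matrix [1, x].
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS standard errors: diagonal of sigma^2 * (X'X)^-1.
resid = y - X @ beta
df = n - X.shape[1]
sigma2 = resid @ resid / df
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistics and two-sided p-values, as reported in regression tables.
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df)
for name, b, s, t, p in zip(["const", "x"], beta, se, t_stats, p_values):
    print(f"{name:>5}: coef={b:6.3f}  se={s:5.3f}  t={t:6.2f}  p={p:.4f}")
```

The asterisks in a regression table are simply shorthand for these p-values falling below conventional thresholds such as 0.05 or 0.01.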
Let’s have a look at our example, again, and see what our estimations suggest:
First, let’s only consider the models that estimate wages (coefficients with yellow background): All models suggest that the number of hours worked, IQ, education, work experience, and age have a positive effect on wages.
The second and third models suggest that married people have higher wages on average, whereas black people receive lower wages.
The second model suggests that individuals with siblings would have lower wages, but this effect is not significant as indicated by the missing asterisks.
Finally, from the third model, we learn that individuals living in urban areas tend to have higher wages.
With the fourth model, however, we do not find any significant variables that predict the number of hours worked. Having a high IQ, a good education, lots of experience, old age, or being married cannot tell us how many hours an individual works. In this case, we should come up with new independent variables, such as occupation, that are better suited to predict the number of hours worked.
Lastly, numerous test statistics allow you to assess the overall quality of a model. Some of them give hints on the robustness and validity of a model and others check whether the assumptions hold.
The most prominent test statistics are the R2 and the adjusted R2.
The R2 indicates how much of the variation in the dependent variable is explained by the model. The adjusted R2 does the same but also takes the number of independent variables into account and favors parsimonious models, i.e. models with few independent variables.
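The adjusted R2 can be computed directly from the plain R2, the number of observations, and the number of independent variables with the standard textbook formula; the numbers plugged in below are illustrative, loosely matching the example's wage models:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k independent variables
    (intercept not counted in k) -- the standard textbook formula."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers: R2 of 0.20, 935 observations, 8 regressors.
print(round(adjusted_r2(0.20, 935, 8), 4))
```

With a large n relative to k, the penalty is tiny; it only bites when many regressors chase few observations.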
Please do not assess the quality of a model solely based on the R2 or adjusted R2. Although they have their benefits, it would be naive to pay too much attention to them.
The R2 and adjusted R2 are around 20% for the three models predicting wages. As we do not find any significant variables for the number of hours worked, the last model explains hardly 1% of the variation in hours worked.
Regression tables are a great way to communicate the results of linear regression models. Do not be intimidated by the number of statistics they provide but read them in a systematic way:
First, understand what has been estimated using which variables. Find out how variables have been encoded and what size the dataset has.
Second, take a closer look at the effects of the single independent variables: which variables have a positive or negative effect on the dependent variable, and which of them are significant?
Finally, check the overall quality of the model.