Causal Inference - Lalonde dataset

24 minute read

Based on: https://rugg2.github.io/, https://github.com/iqDF/Lalonde-Causal-Matching

Causal Inference in Python

This notebook is an exploration of causal inference in Python using the famous Lalonde dataset.

  • Causal inference is a set of techniques for estimating the effect of one variable on another, in the presence of other influencing variables (confounding factors) that we try to keep ‘controlled’.

The study looked at the effect of a job training program (the treatment) on an individual’s real earnings, a couple of years after completion of the program.

The data consists of a number of demographic variables (age, race, academic background, and previous real earnings), as well as a treatment indicator, and the real earnings in the year 1978 (the response).

Robert LaLonde, “Evaluating the Econometric Evaluations of Training Programs with Experimental Data”, American Economic Review, Vol. 76, No. 4 (1986), pp. 604-620

import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import causalinference

%matplotlib inline
sns.set_style('darkgrid')
plt.style.use('ggplot')

# Silence warnings
import warnings
warnings.filterwarnings("ignore")

In this notebook we’ll be using the tools provided by Laurence Wong in the CausalInference package. Comments on what each function does come from the very good package documentation: http://laurence-wong.com/software/

This package relies heavily on the Rubin causal model, and so will this analysis: https://en.wikipedia.org/wiki/Rubin_causal_model

The reason why several models exist is that it is impossible to observe the causal effect on a single unit, and so assumptions must be made to estimate the missing counterfactuals. We’ll explain what all that means in this post.
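As a tiny illustration of the missing-counterfactual problem, here is a hypothetical three-subject table (made-up values): each subject reveals only one of their two potential outcomes, and the other has to be estimated.

import pandas as pd

# Hypothetical potential outcomes: Y1 is observed only for treated subjects
# (D=1), Y0 only for controls (D=0); the NaN entries are the missing
# counterfactuals that causal inference has to estimate
toy = pd.DataFrame({'D':  [1, 0, 1],
                    'Y1': [9930.0, None, 24909.5],
                    'Y0': [None, 6984.2, None]})
print(toy)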

# https://pypi.org/project/CausalInference/
from causalinference import CausalModel
lalonde = pd.read_csv('../data/lalonde.csv', index_col=0)
lalonde.head()
      treat  age  educ  black  hispan  married  nodegree  re74  re75        re78
NSW1      1   37    11      1       0        1         1   0.0   0.0   9930.0460
NSW2      1   22     9      0       1        0         1   0.0   0.0   3595.8940
NSW3      1   30    12      1       0        0         0   0.0   0.0  24909.4500
NSW4      1   27    11      1       0        0         1   0.0   0.0   7506.1460
NSW5      1   33     8      1       0        0         1   0.0   0.0    289.7899
# let's have an overview of the data
lalonde.describe()
            treat         age        educ       black      hispan     married    nodegree          re74          re75          re78
count  614.000000  614.000000  614.000000  614.000000  614.000000  614.000000  614.000000    614.000000    614.000000    614.000000
mean     0.301303   27.363192   10.268730    0.395765    0.117264    0.415309    0.630293   4557.546569   2184.938207   6792.834483
std      0.459198    9.881187    2.628325    0.489413    0.321997    0.493177    0.483119   6477.964479   3295.679043   7470.730792
min      0.000000   16.000000    0.000000    0.000000    0.000000    0.000000    0.000000      0.000000      0.000000      0.000000
25%      0.000000   20.000000    9.000000    0.000000    0.000000    0.000000    0.000000      0.000000      0.000000    238.283425
50%      0.000000   25.000000   11.000000    0.000000    0.000000    0.000000    1.000000   1042.330000    601.548400   4759.018500
75%      1.000000   32.000000   12.000000    1.000000    0.000000    1.000000    1.000000   7888.498250   3248.987500  10893.592500
max      1.000000   55.000000   18.000000    1.000000    1.000000    1.000000    1.000000  35040.070000  25142.240000  60307.930000

Here is the raw difference in earnings between the control group and the treated group:

lalonde.groupby('treat')['re78'].agg(['median','mean'])
          median         mean
treat
0       4975.505  6984.169742
1       4232.309  6349.143530

The control group has higher earnings than the treatment group - does this mean the treatment had a negative impact?

lalonde.groupby('treat')['re78'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
[Histogram: re78 distribution, control vs treatment]

This dataset is not a balanced trial. Indeed people in the control group are very different from people in the test (treatment) group. Below is a plot of the different income distributions:

lalonde.groupby('treat')['re74'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
[Histogram: re74 distribution, control vs treatment]

lalonde.groupby('treat')['age'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
[Histogram: age distribution, control vs treatment]

lalonde.groupby('treat')['educ'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
[Histogram: educ distribution, control vs treatment]

Notation, aims and assumptions

Notation.

  • Y represents the response; here it is the 1978 earnings (‘re78’)
  • D represents the treatment: the job training program (‘treat’)
  • X represents the confounding variables; here these are likely age, education, race, and marital status. X is also called the covariates. (The unobserved potential outcome, by contrast, is called the counterfactual.)

Aims. What we want to know here is the Average Treatment Effect (ATE): \(\Delta = E[Y_{1} - Y_{0}]\)

However, as we saw, if we try to estimate this quantity from the raw observational distribution, we get: \(\Delta_{raw} = E[Y|D=1] - E[Y|D=0] = E[Y_{1}|D=1] - E[Y_{0}|D=0] \\ \neq \Delta = E[Y_{1} - Y_{0}]\)

because: \(E[Y_{i}|D=i] \neq E[Y_{i}]\)

General problem. If we believe that age, education, race, and marital status all have a likely influence on earnings Y, we need a way to disentangle the effect of D on Y from the perturbative effect of X on Y.
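To make the gap explicit, the raw difference decomposes into the effect on the treated plus a selection-bias term: \[\Delta_{raw} = E[Y_{1}|D=1] - E[Y_{0}|D=0] = E[Y_{1} - Y_{0}|D=1] + \big(E[Y_{0}|D=1] - E[Y_{0}|D=0]\big)\] If poorer people are over-represented in the treatment group, the second term is negative, which is exactly what the raw -$635 difference suggests.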

Assumptions. The Causalinference package is based on a typical assumption called unconfoundedness or ignorability: \((Y(0), Y(1)) \perp D \; | \; X\)

Indeed we saw that the treatment assignment is probably not independent of each subject’s potential outcomes, e.g. poorer people are more represented in the treatment group than in the control group.

However the treatment is assumed to be unconfounded in the sense that the dependence between the treatment assignment and the outcomes is only through something we observe, namely the covariates X.

What this means is that if we control for X, i.e. look across people with similar levels of X, then the difference between treated and control should be attributable to the treatment itself, just as a randomized experiment would be.

This is the assumption, and if it doesn’t hold our results could be completely wrong.

Simple approach

The simplest type of model we can use is a linear model:

\(Y_0 = \alpha + \beta X + \epsilon\) and \(Y_1 = Y_0 + \gamma\)

If this is accurate, fitting the following model to the data using linear regression will give us an estimate of the Average Treatment Effect (ATE): \(Y = \alpha + \beta X + \gamma D + \epsilon\)

\(\epsilon\) is called the residual and represents noise.
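A one-line check of why the coefficient \(\gamma\) is the ATE under this model: the observed outcome is a switch between the two potential outcomes, \[Y = (1-D)\,Y_0 + D\,Y_1 = Y_0 + \gamma D = \alpha + \beta X + \gamma D + \epsilon\] so regressing Y on X and D recovers \(\gamma\) directly.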

covariates = ['age', 'educ', 'black', 'hispan', 'married', 'nodegree', 're74', 're75']

# we use the CausalModel class from the causalinference package

causal = CausalModel(
    Y=lalonde['re78'].values,
    D=lalonde['treat'].values,
    X=lalonde[covariates].values)

causal.est_via_ols(adj=1)
# adj=1 adjusts for the covariates but assumes a constant treatment effect,
# matching the simple linear model above

print(causal.estimates)
Treatment Effect Estimates: OLS

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE   1548.244    734.521      2.108      0.035    108.584   2987.904

This model predicts that the Average Treatment Effect (ATE) of the job training is $1548 of extra annual earnings. This is very different from our raw results, which suggested that the job training had a negative effect on earnings!

Assuming that our model correctly captures the influence of the covariates X, CausalModel provides the 95% confidence interval. What this means is that, if we were to repeat this experiment many times, 95% of the intervals constructed in this way would contain the true Average Treatment Effect. It does not mean that the true value has a 95% probability of lying within this particular interval.

Based on the assumption that the estimator is approximately normally distributed, the 95% confidence interval is calculated as: \(Est. \pm 1.96 \times S.e.\)
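A quick sanity check of the reported interval, recomputed from the estimate and standard error printed above:

# recompute the 95% confidence interval from the printed OLS output
est, se = 1548.244, 734.521
print(est - 1.96 * se, est + 1.96 * se)  # ~108.6 and ~2987.9, as reported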

In practice, as the confidence interval is very wide, my interpretation is that the experiment would have needed more participants if a better estimate of the extra earnings was desired. Ways to reduce the variance of the estimate could also be explored.

Overall, assuming that we controlled for all the confounders, and did it well, it seems that the job training had a positive effect on earnings. Indeed, although the standard error is large, the p-value of 0.035 rejects the null hypothesis (no effect) at the 5% significance level. However, the truth is that we don’t know whether we modelled the effect of the covariates well, and this could change everything… As we will see later, estimators such as the Ordinary Least Squares (OLS) estimator can behave poorly when there is not enough covariate overlap, because the estimator needs to extrapolate too much from one group to the other.

A more structured approach, as we will see below, can increase our confidence that the covariates are well controlled for. There will be several steps, but one simple idea is the technique of matching: for each sample that received the treatment, find a similar sample in the control group, and compare these values directly.

formula = 're78 ~ treat + age + educ + black + hispan + married + nodegree + re74 + re75'
ols_mod = smf.ols(formula=formula, data=lalonde).fit()

print(ols_mod.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   re78   R-squared:                       0.148
Model:                            OLS   Adj. R-squared:                  0.135
Method:                 Least Squares   F-statistic:                     11.64
Date:                Fri, 29 Jan 2021   Prob (F-statistic):           5.99e-17
Time:                        11:52:28   Log-Likelihood:                -6297.8
No. Observations:                 614   AIC:                         1.262e+04
Df Residuals:                     604   BIC:                         1.266e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     66.5145   2436.746      0.027      0.978   -4719.009    4852.038
treat       1548.2438    781.279      1.982      0.048      13.890    3082.598
age           12.9776     32.489      0.399      0.690     -50.827      76.783
educ         403.9412    158.906      2.542      0.011      91.865     716.017
black      -1240.6441    768.764     -1.614      0.107   -2750.420     269.132
hispan       498.8969    941.943      0.530      0.597   -1350.983    2348.777
married      406.6208    695.472      0.585      0.559    -959.217    1772.458
nodegree     259.8174    847.442      0.307      0.759   -1404.474    1924.108
re74           0.2964      0.058      5.086      0.000       0.182       0.411
re75           0.2315      0.105      2.213      0.027       0.026       0.437
==============================================================================
Omnibus:                      216.123   Durbin-Watson:                   2.009
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1182.664
Skew:                           1.467   Prob(JB):                    1.54e-257
Kurtosis:                       9.134   Cond. No.                     7.64e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.64e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Structure for a more complete approach

Pre-processing phase:

  1. assess covariate balance
  2. estimate propensity score
  3. trim sample
  4. stratify sample

Estimation phase:

  1. blocking estimator, and/or
  2. matching estimator

Pre-processing phase

In the pre-processing phase, the data is inspected and manipulated to allow credible analysis to be conducted on it.

As we discussed in the previous section, one key method for disentangling the treatment effect from the covariate effects is the matching technique. In this technique we compare subjects that have similar covariate values (i.e. same age, race, income, etc.). However, our ability to compare such pairs depends heavily on the degree of overlap of the covariates between the treatment and control groups. This is called covariate balance.

Put differently, to control for the effect of education, one way is to look at people in the treatment group and in the control group who have the same level of education, say a bachelor’s degree. However, if nobody in the treatment group has a bachelor’s degree while many do in the control group, this procedure is impossible.

(1) Assess covariate balance, to evaluate how easily people can be matched. If there is too much imbalance, direct matching will rarely be possible, and we may need to use more complex techniques, if at all possible.

lalonde.columns
Index(['treat', 'age', 'educ', 'black', 'hispan', 'married', 'nodegree',
       're74', 're75', 're78'],
      dtype='object')
print(causal.summary_stats)
Summary Statistics

                       Controls (N_c=429)         Treated (N_t=185)             
       Variable         Mean         S.d.         Mean         S.d.     Raw-diff
--------------------------------------------------------------------------------
              Y     6984.170     7294.162     6349.144     7867.402     -635.026

                       Controls (N_c=429)         Treated (N_t=185)             
       Variable         Mean         S.d.         Mean         S.d.     Nor-diff
--------------------------------------------------------------------------------
             X0       28.030       10.787       25.816        7.155       -0.242
             X1       10.235        2.855       10.346        2.011        0.045
             X2        0.203        0.403        0.843        0.365        1.668
             X3        0.142        0.350        0.059        0.237       -0.277
             X4        0.513        0.500        0.189        0.393       -0.719
             X5        0.597        0.491        0.708        0.456        0.235
             X6     5619.237     6788.751     2095.574     4886.620       -0.596
             X7     2466.484     3291.996     1532.055     3219.251       -0.287

Raw-diff is the raw difference between the means of the treatment and control groups.

As we saw previously, the treated group (trained) is earning $635 less than the control group, which is surprising.

Nor-diff in this package is Imbens and Rubin’s (2015) normalized difference in average covariates, defined as: \(\frac{\bar{X}_{k,t}-\bar{X}_{k,c}}{\sqrt{(s_{k,t}^2 + s_{k,c}^2) / 2}}\)

Here \(\bar{X}_{k,t}\) and \(s_{k,t}\) are the sample mean and sample standard deviation of the kth covariate of the treatment group, and \(\bar{X}_{k,c}\) and \(s_{k,c}\) are the analogous statistics for the control group.

The aim here is to assess the overlap between the control and treatment groups. It can be seen that X2, X4, and X6 (black, married, and 1974 earnings) have a large normalized difference, beyond 0.5. This can be interpreted as an imbalance. Concretely, the treatment group has far more black people, fewer married people, and lower 1974 income than the control group.

The impact of imbalance is to make the matching technique harder to apply. We’ll see later how we can try to correct for it (however, ideally the study would be more balanced!).

After computing this measurement for all of our features, there is a commonly used rule of thumb to determine whether each feature is balanced or not (similar in spirit to the 0.05 threshold for p-values).

  • Smaller than 0.1: for a randomized trial, the SMDs of all covariates should typically fall into this bucket.
  • 0.1 - 0.2: not necessarily balanced, but small enough that people are usually not too worried. Even after matching, a few covariates’ SMDs may still fall in this range.
  • Greater than 0.2: values above this threshold are considered seriously imbalanced.
# Define params and inputs
covariates = ['age', 'educ', 'black', 'hispan', 'married', 'nodegree', 're74', 're75']
agg_operations = {'treat': 'count'}
agg_operations.update({
    feature: ['mean', 'std'] for feature in covariates
})
def compute_table_one_smd(table_one, covariates, round_digits=4):
    feature_smds = []
    for feature in covariates:
        feature_table_one = table_one[feature].values
        neg_mean = feature_table_one[0, 0]
        neg_std = feature_table_one[0, 1]
        pos_mean = feature_table_one[1, 0]
        pos_std = feature_table_one[1, 1]

        smd = (pos_mean - neg_mean) / np.sqrt((pos_std ** 2 + neg_std ** 2) / 2)
        smd = round(abs(smd), round_digits)
        feature_smds.append(smd)

    return pd.DataFrame({'covariates': covariates, 'smd': feature_smds})
table_one_matched = lalonde.groupby('treat').agg(agg_operations)
table_one_smd_matched = compute_table_one_smd(table_one_matched, covariates)
table_one_smd_matched
  covariates     smd
0        age  0.2419
1       educ  0.0448
2      black  1.6677
3     hispan  0.2769
4    married  0.7195
5   nodegree  0.2350
6       re74  0.5958
7       re75  0.2870
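Applying the rule of thumb above to this table, a small helper sketch (table_one_smd_matched is the dataframe just computed):

# label each covariate according to the SMD rule of thumb
def smd_bucket(smd):
    if smd < 0.1:
        return 'balanced'
    if smd < 0.2:
        return 'borderline'
    return 'imbalanced'

table_one_smd_matched['balance'] = table_one_smd_matched['smd'].apply(smd_bucket)
print(table_one_smd_matched)
# all covariates except educ exceed 0.2 here, with black (1.67) the worst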

(2) Propensity Score - the probability of receiving the treatment, conditional on the covariates.

Propensity scores are useful for assessing and improving covariate balance. Indeed, a theorem by Rosenbaum and Rubin (1983) shows that, for subjects that share the same propensity score (even if their covariate vectors are different), the difference between the treated and the control units actually identifies a conditional average treatment effect.

Thus, instead of matching on the covariate vectors X themselves, we can also match on the single-dimensional propensity score p(X), aggregate across subjects, and still arrive at a valid estimate of the overall average treatment effect.

\[E[Y(1) - Y(0) | p(X)] \approx E[Y(1) - Y(0)]\]
This holds if \(p(X) = P(D=1|X)\), which the CausalInference package estimates for us using a sequence of likelihood ratio tests.

reference: http://laurence-wong.com/software/propensity-score

# this function estimates the propensity score, so that propensity methods can be employed
causal.est_propensity_s()
print(causal.propensity)
Estimated Parameters of Propensity Score

                    Coef.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
     Intercept    -21.096      2.687     -7.851      0.000    -26.363    -15.829
            X2      2.635      0.367      7.179      0.000      1.915      3.354
            X4     -3.026      0.717     -4.222      0.000     -4.431     -1.621
            X6      0.000      0.000      0.847      0.397     -0.000      0.000
            X3      5.137      1.845      2.785      0.005      1.521      8.753
            X1      1.175      0.316      3.713      0.000      0.555      1.796
            X5      0.376      0.450      0.836      0.403     -0.505      1.258
            X7      0.000      0.000      1.496      0.135     -0.000      0.000
            X0      0.988      0.142      6.983      0.000      0.711      1.266
         X0*X0     -0.015      0.002     -6.524      0.000     -0.019     -0.010
         X1*X1     -0.064      0.018     -3.539      0.000     -0.099     -0.028
         X6*X0     -0.000      0.000     -2.876      0.004     -0.000     -0.000
         X6*X6      0.000      0.000      2.420      0.016      0.000      0.000
         X3*X0     -0.223      0.081     -2.752      0.006     -0.382     -0.064
         X4*X3      2.845      1.071      2.656      0.008      0.746      4.945
         X2*X4      1.525      0.781      1.952      0.051     -0.006      3.055

(3) Trim sample. This excludes subjects with extreme propensity scores. Indeed it will be very hard to match those extreme subjects, so the usual strategy is to focus attention on the remaining units that exhibit a higher degree of covariate balance.

# an extreme propensity score is a very high probability of being in either
# the control group or the treatment group, which makes matching difficult

# by default, causal.cutoff is set to 1
# the trim function will drop units whose estimated propensity score is <= 0.1 or >= 0.9
# causal.cutoff = 0.1
# causal.trim()

# however, there is a procedure that tries to select an optimal cutoff value
causal.trim_s()
print(causal.summary_stats)
Summary Statistics

                       Controls (N_c=156)         Treated (N_t=140)             
       Variable         Mean         S.d.         Mean         S.d.     Raw-diff
--------------------------------------------------------------------------------
              Y     5511.740     6023.366     6351.987     6397.833      840.246

                       Controls (N_c=156)         Treated (N_t=140)             
       Variable         Mean         S.d.         Mean         S.d.     Nor-diff
--------------------------------------------------------------------------------
             X0       23.603        7.112       24.986        7.510        0.189
             X1       10.250        2.360       10.329        2.177        0.035
             X2        0.468        0.501        0.836        0.372        0.834
             X3        0.250        0.434        0.071        0.258       -0.500
             X4        0.263        0.442        0.221        0.417       -0.096
             X5        0.647        0.479        0.664        0.474        0.035
             X6     2690.978     4488.665     2110.413     4215.220       -0.133
             X7     1918.904     3088.360     1635.661     3414.688       -0.087

In this new subset, the normalized difference for most variables is rather balanced. Only X2 (black) is still unbalanced.

It is worth noting that the initial sample of 614 people (429 controls, 185 treated) has been drastically trimmed to 296 people (156 controls, 140 treated).

In this more balanced sub-sample, without using any model, the average 1978 earnings are more in line with what we would expect: the population that received training (treated) earns on average $840 more than the control group.

(4) Stratify sample - group similar subjects together. People are grouped in layers of similar propensity scores. These bins should have an improved covariate balance, and we should be able to compare and match samples within those bins.

# the default is to have 5 bins with an equal number of samples
# however, it is possible to split the sample in a more data-driven way
# the larger the sample, the more bins we can afford, so that samples become increasingly similar within smaller bins
# the limit to subdividing is that each bin ends up with too few datapoints for the bins to be statistically distinguishable (t-test)

causal.stratify_s()
print(causal.strata)
Stratification Summary

              Propensity Score         Sample Size     Ave. Propensity   Outcome
   Stratum      Min.      Max.  Controls   Treated  Controls   Treated  Raw-diff
--------------------------------------------------------------------------------
         1     0.091     0.205        67         9     0.139     0.171  1480.667
         2     0.205     0.465        46        28     0.328     0.390  1105.270
         3     0.466     0.676        27        46     0.555     0.579  -319.253
         4     0.677     0.909        16        57     0.779     0.814  2244.311

Within bins, the raw difference in outcome should be a good representation of the real treatment effect. For example:

  • People in stratum 1 are unlikely to be in the treatment group (well off?). For them, the training improved their earnings by $1481 on average.

  • People in stratum 4 are likely to be in the treatment group (poor?). For them, the training improved their earnings even more, with a mean of $2244 for the year 1978.

Something that looks quite bad is that the outcome for stratum 3 is totally different from that of the other strata.

The trend seems to be that the higher the propensity score, the higher the raw difference in outcome for each stratum, but stratum 3 shows the opposite…

This may be a sign that we haven’t controlled for enough factors (or that the propensity calculation is wrong?). Or it might be a true reflection of reality: some people may benefit from the job training while others may not. It might also be random noise, given that we are working with a relatively small sample (73 elements in stratum 3).

Let’s see in the analysis phase if regressions within each stratum will be able to control for confounding variables better.

Estimation phase

In the estimation phase, treatment effects of the training can be estimated in several ways.

(1) The blocking estimator - although each stratum is fairly balanced and gives reasonable raw results, this estimator goes further and controls for the confounding factors within each stratum. More precisely, this estimator uses a least squares estimate within each propensity bin, and from these produces an overall average treatment effect estimate.

# causal.est_via_blocking()
# print(causal.estimates)

# for some reason I'm getting a singular matrix when calculating this blocking estimator
# on one of the strata
# I've tried changing the strata structure and the set of variables,
# however, the singularity persists when calculating the covariance matrix
# this would need a closer look at the dataset, which I haven't taken the time to do yet

# this is one of the issues with this causalinference package:
# it needs to invert large matrices, which can fail
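As a workaround, here is a minimal sketch of the blocking idea (not the package’s est_via_blocking): fit a least squares estimate inside each propensity bin, then average the per-bin ATEs weighted by bin size. It assumes the strata in causal.strata expose est_via_ols, estimates, and summary_stats as in the package vignette, and it may hit the same singular-matrix issue on this data.

import numpy as np

# manual blocking sketch: within-bin OLS, then a sample-size-weighted
# average of the within-bin ATE estimates
ates, sizes = [], []
for stratum in causal.strata:
    stratum.est_via_ols(adj=1)                      # least squares within the bin
    ates.append(stratum.estimates['ols']['ate'])
    sizes.append(stratum.summary_stats['N_c'] + stratum.summary_stats['N_t'])
print(np.average(ates, weights=sizes))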

(2) The matching estimator - although each stratum is fairly balanced and gives reasonable raw results, this matching estimator controls for the confounding factors by matching samples even more finely within each stratum. More precisely, the pairing is done via nearest-neighbour matching. If the matching is imperfect, bias correction is recommended.

If issues arise with least squares, such as excessive extrapolation, this matching estimator pushes the unconfoundedness assumption to its limit and nonparametrically matches subjects with similar covariate values together. In other words, if the confounding factors are equal for both elements of a pair, the difference between the two is the real treatment effect. In the causalinference package, covariates are weighted by the inverse of their sample standard deviation, so as to normalize them.

Where matching discrepancies exist, least squares will be used, but very locally, so large extrapolations should be less of a problem.

causal.est_via_matching(bias_adj=False)
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE    546.115   1211.957      0.451      0.652  -1829.321   2921.550
           ATC    544.509   1502.414      0.362      0.717  -2400.223   3489.241
           ATT    547.904   1389.501      0.394      0.693  -2175.519   3271.327

The model provides estimates of three quantities: ATE, ATT and ATC:

  • ATE is the Average Treatment Effect, and this is what we are most interested in. \(ATE = E[Y_1-Y_0] \approx E[Y_1-Y_0 | X]\)
  • Here it seems that the average effect of the treatment (job training) was to increase earnings by $546.
  • However, this effect may just be random variation, and the treatment may well have no impact (the null hypothesis). With a p-value of 0.65, we are nowhere near rejecting the null hypothesis. The most common interpretation is that the job training did not have a statistically significant impact on earnings, given the models and data processing we did.
  • ATT is the Average Treatment effect on the Treated: \(ATT = E[Y_1-Y_0 | D=1]\)
  • ATC is the Average Treatment effect on the Controls: \(ATC = E[Y_1-Y_0 | D=0]\)
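As a consistency check, the ATE is the sample-weighted average of ATT and ATC; with the trimmed sample sizes (\(N_t = 140\), \(N_c = 156\)): \[ATE = \frac{N_t \cdot ATT + N_c \cdot ATC}{N_t + N_c} = \frac{140 \times 547.9 + 156 \times 544.5}{296} \approx 546.1\] which matches the estimate in the table above.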
# allowing several matches
causal.est_via_matching(bias_adj=True, matches=4)
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE   1015.624    881.557      1.152      0.249   -712.228   2743.476
           ATC    711.019   1003.465      0.709      0.479  -1255.772   2677.809
           ATT   1355.041    934.752      1.450      0.147   -477.073   3187.156

Allowing several matches attributes $1016 of extra earnings to the treatment, but with a p-value of 0.25 this is still not statistically significant, and the common interpretation would be to refrain from claiming an effect.

Conclusions

The effect of training is hard to establish firmly. Although the Lalonde sample seems to show positive effects, it is actually quite possible that the program had no effect at all.

This isn’t so far from what Lalonde concluded: http://people.hbs.edu/nashraf/LaLonde_1986.pdf. At a glance, Lalonde seems to have had access to the gender of participants, which does not appear in this dataset, or may be hidden in the NSW vs AFDC distinction.

More work could be done to better estimate the counterfactual. For instance, we could introduce polynomial variables to capture non-linear effects, and/or introduce categorical variables to bin numerical variables such as education.

This was an example of how the CausalInference package can be used; our conclusions are tied to these models. The package relies heavily on propensity matching and its unconfoundedness (ignorability) assumption.

Other models exist, e.g. Bayesian models. This will be for another time for us. Meanwhile, the curious can have a look at this other post: https://engl.is/causal-analysis-introduction-examples-in-python-and-pymc.html

Alternative propensity score estimation

formula = 'treat ~ age + educ + black + hispan + married + nodegree + re74 + re75'
glm_model = smf.glm(formula=formula, data=lalonde, family=sm.families.Binomial()).fit()
glm_model.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                  treat   No. Observations:                  614
Model:                            GLM   Df Residuals:                      605
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -243.92
Date:                Fri, 29 Jan 2021   Deviance:                       487.84
Time:                        16:17:12   Pearson chi2:                     556.
No. Iterations:                     6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -4.7286      1.017     -4.649      0.000      -6.722      -2.735
age            0.0158      0.014      1.162      0.245      -0.011       0.042
educ           0.1613      0.065      2.477      0.013       0.034       0.289
black          3.0654      0.287     10.698      0.000       2.504       3.627
hispan         0.9836      0.426      2.311      0.021       0.149       1.818
married       -0.8321      0.290     -2.866      0.004      -1.401      -0.263
nodegree       0.7073      0.338      2.095      0.036       0.045       1.369
re74       -7.178e-05   2.87e-05     -2.497      0.013      -0.000   -1.54e-05
re75        5.345e-05   4.63e-05      1.153      0.249   -3.74e-05       0.000
==============================================================================
pscore = glm_model.predict()
# create separate structures for data and target
treatment = lalonde['treat']
rev78 = lalonde['re78']
cleaned_df = lalonde.drop(['treat', 're78'], axis=1)
mask = lalonde['treat'] == 1
lalonde_treat = lalonde[mask]
lalonde_notreat = lalonde[~mask]
# drop labeling (e.g. NSW3) from the targets data
targets = treatment.reset_index().drop('index', axis=1)
X = cleaned_df.reset_index().drop('index', axis=1)
# we separate the pscores based on their
# corresponding true label value
mask = targets.squeeze() == 1
pos_pscore = pscore[mask]
neg_pscore = pscore[~mask]

print('treatment count:', pos_pscore.shape)
print('control count:', neg_pscore.shape)
treatment count: (185,)
control count: (429,)
boxplot_df = pd.DataFrame({'pscore': pscore, 'treat': targets.squeeze()})

# plot boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='treat', y='pscore', data=boxplot_df)
plt.ylabel('Propensity Score')
plt.xlabel('Treatment')

[Boxplot: propensity score by treatment group]

Looking at the plot below, we can see that our features X do in fact contain information about whether a subject received the treatment. The distributional difference between the propensity scores of the two groups justifies the need for matching, since they are not directly comparable otherwise.

Although there is a distributional difference in the density plot, there is overlap everywhere, and this is actually the kind of plot we would like to see if we are going to do propensity score matching.

What we mean by overlap is that no matter where we look on the plot, even though there might be more controls than treated subjects or vice versa, there are still some subjects from each group.

This overlap suggests that our positivity assumption is reasonable. Remember, positivity refers to the situation where every subject in the study has at least some chance of receiving either treatment. That appears to be the case here, so we can feel comfortable proceeding with propensity score matching.
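A quick numeric complement to the visual check, using the pos_pscore and neg_pscore arrays computed above; if either group’s scores left large regions of [0, 1] uncovered, positivity would be in doubt:

# compare the range of estimated propensity scores in each group;
# broadly overlapping ranges support the positivity assumption
print('treated:', pos_pscore.min(), '-', pos_pscore.max())
print('control:', neg_pscore.min(), '-', neg_pscore.max())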

# change default style figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12

sns.distplot(neg_pscore, label='control')
sns.distplot(pos_pscore, label='treatment')
plt.xlim(0, 1)
plt.title('Propensity Score Distribution of Control vs Treatment')
plt.ylabel('Density')
plt.xlabel('Scores')
plt.legend()
plt.tight_layout()
plt.show()

[Density plot: propensity scores of control vs treatment]

# Define params and inputs
mask = lalonde['treat'] == 1
covariates = ['age', 'educ', 'black', 'hispan', 'married', 'nodegree', 're74', 're75']
agg_operations = {'treat': 'count'}
agg_operations.update({
    feature: ['mean', 'std'] for feature in covariates
})
from sklearn.neighbors import NearestNeighbors

# Helper function that uses the k-NN algorithm to calculate
# similarity distances and get the similarity indices for matching
def get_similar(pos_pscore, neg_pscore, topn=5, n_jobs=1):
    # fit on the control scores and query with the treated scores;
    # since the two sets are disjoint, there is no self-match to discard,
    # so we keep all topn neighbours
    knn = NearestNeighbors(n_neighbors=topn, metric='euclidean', n_jobs=n_jobs)
    knn.fit(neg_pscore.reshape(-1, 1))

    sim_distances, sim_indices = knn.kneighbors(pos_pscore.reshape(-1, 1))
    return sim_distances, sim_indices
sim_distances, sim_indices = get_similar(pos_pscore, neg_pscore, topn=1)
# Display matching table
df_pos = lalonde[mask]
df_neg = lalonde[~mask].iloc[sim_indices[:, 0]]
df_matched = (df_pos.reset_index(drop=True)
              .merge(df_neg.reset_index(drop=True), 
                     left_index=True, 
                     right_index=True))

num_matched_pairs = df_neg.shape[0]

print('Number of matched pairs: ', num_matched_pairs)
df_matched
Number of matched pairs:  185
treat_x age_x educ_x black_x hispan_x married_x nodegree_x re74_x re75_x re78_x treat_y age_y educ_y black_y hispan_y married_y nodegree_y re74_y re75_y re78_y
0 1 37 11 1 0 1 1 0.00 0.00 9930.0460 0 55 4 1 0 0 1 0.0000 0.0000 0.00000
1 1 22 9 0 1 0 1 0.00 0.00 3595.8940 0 26 8 0 1 0 1 3168.1340 5872.2580 11136.15000
2 1 30 12 1 0 0 0 0.00 0.00 24909.4500 0 26 12 1 0 0 0 0.0000 1448.3710 0.00000
3 1 27 11 1 0 0 1 0.00 0.00 7506.1460 0 18 11 1 0 0 1 0.0000 1367.8060 33.98771
4 1 33 8 1 0 0 1 0.00 0.00 289.7899 0 19 11 1 0 0 1 5607.4220 3054.2900 94.57450
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
180 1 33 12 1 0 1 0 20279.95 10941.35 15952.6000 0 17 11 0 1 0 1 0.0000 873.6774 7759.54200
181 1 25 14 1 0 1 0 35040.07 11536.57 36646.9500 0 32 15 0 0 0 0 489.8167 968.5645 7684.17800
182 1 35 9 1 0 1 1 13602.43 13830.64 12803.9700 0 52 8 1 0 1 1 5454.5990 666.0000 0.00000
183 1 35 8 1 0 1 1 13732.07 17976.15 3786.6280 0 47 10 1 0 0 1 21918.3200 4323.6290 19438.02000
184 1 33 11 1 0 1 1 14660.71 25142.24 4181.9420 0 18 9 1 0 0 1 1183.3970 1822.5480 803.88330

185 rows × 20 columns

Assess covariate balance after matching. For this, compute the absolute standardized differences in means of the covariates after matching (Rosenbaum and Rubin, 1985): \(ASMD_a(x)=\frac{|\bar{x}_{t,a}-{\bar{x}_{c,a}}|}{\sqrt{\frac{s^{2}_{t,b} + s^{2}_{c,b}}{2}}},\) where \(\bar{x}_{t,a}\) and \(\bar{x}_{c,a}\) are, respectively, the means of covariate \(x\) in the treatment and control groups after matching, and \(s^{2}_{t,b}\) and \(s^{2}_{c,b}\) are, correspondingly, the sample variances of the treatment and control groups before matching. (One reason to use the sample variances before matching rather than after matching is to free the comparison of the means after matching from simultaneous changes in the variances.) Note that the code below, by reusing compute_table_one_smd on the matched table, uses post-matching variances in the denominator, a common simplification. We can then comment on covariate balance.

# Create matching table
df_pos = lalonde[mask]
df_neg = lalonde[~mask].iloc[sim_indices[:, 0]]
df_matched = pd.concat([df_pos, df_neg], axis=0) # this time we join by axis-0

# Calculate Rosenbaum ASMD
table_one_matched = df_matched.groupby('treat').agg(agg_operations)
table_one_smd_matched = compute_table_one_smd(table_one_matched, covariates)

table_one_smd_matched
  covariates     smd
0        age  0.0234
1       educ  0.0965
2      black  0.0147
3     hispan  0.0438
4    married  0.1157
5   nodegree  0.1098
6       re74  0.0700
7       re75  0.0576
# pair t-test
ttest = scipy.stats.ttest_rel(df_pos['re78'].values, df_neg['re78'].values)

print("The statistic of t-test:", ttest.statistic)
print("The p-value of t-test:", ttest.pvalue)
df_pos['re78'].mean() - df_neg['re78'].mean()
The statistic of t-test: 1.8691170316786174
The p-value of t-test: 0.0631955759711721
1411.4997588108117

After matching, all covariate SMDs fall below 0.12, i.e. in the balanced or borderline range. The matched-pairs difference in mean 1978 earnings is about $1411 in favour of the treated group, but with a p-value of 0.063 it falls just short of significance at the 5% level, consistent with our earlier estimates.