Analyzing Wage Determinants with Econometrics, Machine Learning, and Causal Inference
In this notebook, we will:
- Simulate a Dataset: Create a synthetic dataset that includes features such as Education and Experience and an outcome variable Wage.
- Econometric Analysis (OLS Regression): Estimate the relationship between the predictors and Wage using Ordinary Least Squares.
- Machine Learning (Random Forest): Predict Wage using a Random Forest model and evaluate its performance.
- Causal Inference (Double Machine Learning): Use Double Machine Learning (DML) to estimate the causal effect of Education on Wage.
- Causal Forest (Heterogeneous Treatment Effects): Estimate heterogeneous treatment effects (HTE) with a causal forest.
- Visualization & Subgroup Analysis: Visualize the results and conduct subgroup analysis by education levels.
Each section includes detailed explanations to help you understand the process.
1. Import Libraries
We import all the necessary libraries, including those for numerical computing (numpy, pandas), econometric and statistical analysis (statsmodels), machine learning (sklearn), causal inference (econml), and visualization (matplotlib, seaborn).
# Import necessary libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LassoCV
from econml.dml import LinearDML, CausalForestDML
# Suppress warnings to keep the notebook output clean
import warnings
warnings.filterwarnings('ignore')
# This magic command ensures plots are shown inline in the notebook
%matplotlib inline
2. Data Simulation
In this section, we simulate a dataset with 1000 observations.
- Education: Random integer values between 10 and 19 (years of education).
- Experience: Random integer values between 1 and 39 (years of experience).
- Ability: A normally distributed variable representing unobserved ability; it enters the wage equation but is deliberately left out of the returned DataFrame.
The outcome variable Wage is generated as a linear combination of these variables with added random noise.
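Concretely, the data-generating process used below is

$$\text{Wage} = 2 + 0.08\,\text{Education} + 0.02\,\text{Experience} + 0.5\,\text{Ability} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\ 0.5^2)$$

so the true effect of one extra year of education is 0.08.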
# ---------------------------
# Data Simulation
# ---------------------------
def simulate_data(n=1000, random_state=42):
"""
Simulate a dataset with Education, Experience, and Wage.
Parameters:
- n: Number of observations.
- random_state: Seed for reproducibility.
Returns:
- data: A pandas DataFrame with columns 'Education', 'Experience', and 'Wage'.
"""
np.random.seed(random_state)
# Generate features
education = np.random.randint(10, 20, size=n) # Years of education (10 to 19)
experience = np.random.randint(1, 40, size=n) # Years of experience (1 to 39)
ability = np.random.randn(n) # Unobserved ability (standard normal)
    # Generate outcome variable: linear in the features plus Gaussian noise
wage = 2 + 0.08 * education + 0.02 * experience + 0.5 * ability + np.random.normal(0, 0.5, size=n)
    # Create and return DataFrame (Ability is intentionally omitted: it is unobserved)
data = pd.DataFrame({
'Education': education,
'Experience': experience,
'Wage': wage
})
return data
# Data Simulation
data = simulate_data()
print("Data simulation complete.\n")
data.head()
Data simulation complete.
|   | Education | Experience | Wage     |
|---|-----------|------------|----------|
| 0 | 16        | 33         | 4.168880 |
| 1 | 13        | 12         | 2.225531 |
| 2 | 17        | 36         | 3.099148 |
| 3 | 14        | 4          | 3.889295 |
| 4 | 16        | 5          | 3.198863 |
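As a quick sanity check (a minimal sketch reusing the data DataFrame created above), we can confirm that the simulated columns span the expected ranges and that Wage correlates positively with both features:

# Quick sanity check on the simulated data
print(data.describe())        # ranges should match the simulation bounds
print(data.corr()['Wage'])    # Wage should correlate positively with both features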
3. Econometric Analysis: OLS Regression
We run an Ordinary Least Squares (OLS) regression to model Wage as a function of Education and Experience.
This classic econometric technique yields coefficient estimates, standard errors, and diagnostic statistics.
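The estimated specification is

$$\text{Wage}_i = \beta_0 + \beta_1\,\text{Education}_i + \beta_2\,\text{Experience}_i + u_i.$$

Because Ability is not in the data, it is absorbed into the error term $u_i$; since it was simulated independently of the regressors, OLS should still recover the true coefficients (0.08 and 0.02).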
# ---------------------------
# Econometrics Approach (OLS)
# ---------------------------
def run_ols(data):
"""
Run an OLS regression using Education and Experience to predict Wage.
Parameters:
- data: The input DataFrame.
Returns:
- model: The fitted OLS regression model.
"""
# Add a constant term for the intercept
X = sm.add_constant(data[['Education', 'Experience']])
Y = data['Wage'].values
model = sm.OLS(Y, X).fit()
print("Econometrics (OLS) Results:")
print(model.summary())
return model
# Econometrics Approach (OLS Regression)
ols_model = run_ols(data)
Econometrics (OLS) Results:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.176
Model: OLS Adj. R-squared: 0.174
Method: Least Squares F-statistic: 106.6
Date: Tue, 11 Feb 2025 Prob (F-statistic): 1.15e-42
Time: 14:16:56 Log-Likelihood: -1109.6
No. Observations: 1000 AIC: 2225.
Df Residuals: 997 BIC: 2240.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.9147 0.127 15.084 0.000 1.666 2.164
Education 0.0850 0.008 10.638 0.000 0.069 0.101
Experience 0.0217 0.002 10.486 0.000 0.018 0.026
==============================================================================
Omnibus: 0.458 Durbin-Watson: 2.050
Prob(Omnibus): 0.795 Jarque-Bera (JB): 0.546
Skew: 0.014 Prob(JB): 0.761
Kurtosis: 2.889 Cond. No. 146.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4. Machine Learning Approach: Random Forest Regression
We apply a Random Forest regression to predict Wage.
The dataset is split into training (80%) and testing (20%) sets.
The model is then evaluated using Mean Squared Error (MSE).
# ---------------------------
# Machine Learning Approach (Random Forest)
# ---------------------------
def run_random_forest(data):
"""
Train a Random Forest to predict Wage and print the test MSE.
Parameters:
- data: The input DataFrame.
Returns:
- rf: The fitted Random Forest model.
- mse: The mean squared error on the test set.
"""
X = data[['Education', 'Experience']]
Y = data['Wage']
# Train-test split (80% train, 20% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Train Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, Y_train)
# Predict on test set and calculate MSE
Y_pred = rf.predict(X_test)
mse = mean_squared_error(Y_test, Y_pred)
print("Machine Learning (Random Forest) MSE:", mse)
return rf, mse
# Machine Learning Approach (Random Forest)
rf_model, rf_mse = run_random_forest(data)
Machine Learning (Random Forest) MSE: 0.7115261448887159
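For context, here is a small sketch (not part of the original pipeline) that fits an OLS baseline on the identical train/test split and compares its test MSE with the Random Forest's, reusing sm, train_test_split, and mean_squared_error from Section 1:

# Compare the Random Forest against an OLS baseline on the same split
X = data[['Education', 'Experience']]
Y = data['Wage']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
ols_baseline = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
ols_pred = ols_baseline.predict(sm.add_constant(X_test))
print("OLS baseline MSE:", mean_squared_error(Y_test, ols_pred))
print("Random Forest MSE:", rf_mse)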
5. Causal Inference: Double Machine Learning (DML)
We use the Double Machine Learning (DML) approach to estimate the causal effect of Education on Wage.
In this setup:
- Treatment (T): Education
- Outcome (Y): Wage
- Confounders (W): Experience
Two nuisance models are used: one for predicting the outcome and one for the treatment assignment.
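DML is based on the partially linear model; with Experience as the only confounder $W$ and no effect modifiers, the target is a constant effect $\theta$:

$$Y = \theta T + g(W) + \varepsilon, \qquad T = m(W) + \eta, \qquad \mathbb{E}[\varepsilon \mid W] = \mathbb{E}[\eta \mid W] = 0$$

The nuisance functions $g$ and $m$ are fit with flexible ML learners (here gradient boosting for the outcome and LassoCV for the treatment), and $\theta$ is estimated from the residual-on-residual regression.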
# ---------------------------
# Double Machine Learning (DML)
# ---------------------------
def run_dml(data):
"""
Run Double Machine Learning (DML) using LinearDML to estimate
the causal effect of Education on Wage.
Parameters:
- data: The input DataFrame.
Returns:
- dml: The fitted DML model.
- treatment_effect: The estimated constant marginal treatment effect.
"""
Y = data['Wage'].values
T = data['Education'].values.reshape(-1, 1) # Treatment: Education
W = data[['Experience']].values # Confounder: Experience
# Define ML models for the nuisance parameters:
# ml_model_y: Model for predicting the outcome
# ml_model_t: Model for predicting the treatment
ml_model_y = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
ml_model_t = LassoCV(cv=5, n_jobs=-1, max_iter=10000)
# Fit the DML model
dml = LinearDML(model_y=ml_model_y, model_t=ml_model_t, random_state=42)
dml.fit(Y, T, X=None, W=W)
# Estimate the constant marginal treatment effect
treatment_effect = dml.const_marginal_effect()
print(f"Estimated Causal Effect of Education on Wage (DML): {treatment_effect.flatten()[0]:.4f}")
return dml, treatment_effect
# Double Machine Learning (DML) Causal Estimation
dml_model, dml_treatment_effect = run_dml(data)
Estimated Causal Effect of Education on Wage (DML): 0.0847
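Beyond the point estimate, LinearDML also supports inference on the effect; a minimal sketch, assuming the fitted dml_model above (output formatting may differ across econml versions):

# Inference on the constant marginal effect of Education
print(dml_model.summary())  # coefficient table with standard errors
lb, ub = dml_model.const_marginal_effect_interval(alpha=0.05)
print(f"95% CI: [{lb.flatten()[0]:.4f}, {ub.flatten()[0]:.4f}]")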
6. Causal Forest: Estimating Heterogeneous Treatment Effects (HTE)
We apply a Causal Forest to estimate the heterogeneous treatment effects of Education on Wage.
In the first run, we use Experience as the effect modifier.
In the second run, we add an interaction term between Education and Experience as a second effect modifier.
The causal forest allows us to see how the treatment effect may vary across different levels of the modifiers.
# ---------------------------
# Causal Forest (Heterogeneous Treatment Effects)
# ---------------------------
def run_causal_forest(data):
"""
Run Causal Forest (CausalForestDML) to estimate heterogeneous treatment effects (HTE).
First, using Experience as the effect modifier.
Then, re-fit using an interaction term between Education and Experience.
Parameters:
- data: The input DataFrame.
Returns:
- cfdml: The fitted causal forest model.
- hte_effects: Estimated HTE when using Experience as the effect modifier.
- hte_effects_interaction: Estimated HTE when using both Experience and the interaction term as modifiers.
"""
Y = data['Wage'].values
T = data['Education'].values.reshape(-1, 1)
ml_model_y = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
ml_model_t = LassoCV(cv=5, n_jobs=-1, max_iter=10000)
# First, use Experience as the effect modifier
cfdml = CausalForestDML(model_y=ml_model_y, model_t=ml_model_t, random_state=42)
cfdml.fit(Y, T, X=data[['Experience']].values, W=None)
hte_effects = cfdml.effect(data[['Experience']].values)
print(f"Estimated HTE Mean (Experience as modifier): {np.mean(hte_effects):.4f}")
    # Next, create an interaction term and use both Experience and Interaction as modifiers.
    # Note: this adds an 'Interaction' column to the original DataFrame, which is
    # reused for plotting in Section 8.
    data['Interaction'] = data['Education'] * data['Experience']
X_effect = data[['Experience', 'Interaction']].values
cfdml.fit(Y, T, X=X_effect, W=None)
hte_effects_interaction = cfdml.effect(X_effect)
return cfdml, hte_effects, hte_effects_interaction
# Causal Forest for Heterogeneous Treatment Effects (HTE)
cfdml_model, hte_effects, hte_effects_interaction = run_causal_forest(data)
Estimated HTE Mean (Experience as modifier): 0.0842
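CausalForestDML also provides pointwise confidence intervals for the individual effects; a minimal sketch, assuming the fitted cfdml_model above (which was last re-fit with Experience and the Interaction term as modifiers):

# Pointwise 95% confidence intervals for the heterogeneous effects
X_effect = data[['Experience', 'Interaction']].values
lb, ub = cfdml_model.effect_interval(X_effect, alpha=0.05)
print(f"Share of CIs excluding zero: {np.mean((lb > 0) | (ub < 0)):.2%}")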
7. Visualization Functions
We define helper functions to visualize:
- The distribution of the estimated treatment effects (histogram).
- The relationship between the treatment effect and a variable (scatter plot).
- Subgroup analysis by binning the Education variable.
# ---------------------------
# Visualization Functions
# ---------------------------
def plot_histogram(data_array, xlabel, ylabel, title, color='blue'):
"""
Plot a histogram with a kernel density estimate.
Parameters:
- data_array: The data to plot.
- xlabel: Label for the x-axis.
- ylabel: Label for the y-axis.
- title: Plot title.
- color: Color for the plot.
"""
plt.figure(figsize=(8, 5))
sns.histplot(data_array, bins=30, kde=True, color=color)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
plt.show()
def plot_scatter(x, y, xlabel, ylabel, title, color='blue'):
"""
Plot a scatter plot.
Parameters:
- x: Data for the x-axis.
- y: Data for the y-axis.
- xlabel: Label for the x-axis.
- ylabel: Label for the y-axis.
- title: Plot title.
- color: Color for the markers.
"""
plt.figure(figsize=(8, 5))
sns.scatterplot(x=x, y=y, color=color, alpha=0.6)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
plt.show()
def subgroup_analysis(data, hte_effects):
"""
Perform subgroup analysis by binning Education and plotting the average HTE.
Parameters:
- data: The input DataFrame.
- hte_effects: Estimated treatment effects.
"""
data = data.copy() # Work on a copy to avoid modifying the original DataFrame
data['HTE'] = hte_effects
# Bin Education into categories; include the lowest value explicitly
education_bins = pd.cut(
data['Education'],
bins=[10, 12, 14, 16, 18, 20],
labels=['10-12', '12-14', '14-16', '16-18', '18-20'],
include_lowest=True
)
edu_means = data.groupby(education_bins)['HTE'].mean()
plt.figure(figsize=(8, 5))
edu_means.plot(kind='bar', color='purple')
plt.xlabel("Education Level (Years)")
plt.ylabel("Average Estimated Treatment Effect")
plt.title("HTE by Education Level")
plt.show()
8. Visualizing the Results
We now generate several plots:
- Histogram: Distribution of the estimated treatment effects when using Experience as the effect modifier.
- Scatter Plot: Relationship between Experience and the estimated treatment effects.
- Subgroup Analysis: Average estimated treatment effects by education levels.
- Interaction Plot: Relationship between the interaction term (Education × Experience) and the estimated treatment effects.
# Histogram of HTE (using Experience as effect modifier)
plot_histogram(
hte_effects,
xlabel="Estimated Treatment Effect",
ylabel="Frequency",
title="Distribution of HTE (Experience as Modifier)"
)
# Scatter plot: HTE vs. Experience
plot_scatter(
x=data['Experience'],
y=hte_effects,
xlabel="Experience (Years)",
ylabel="Estimated Treatment Effect",
title="HTE vs. Experience"
)
# Subgroup Analysis by Education Levels
subgroup_analysis(data, hte_effects)
# Scatter plot: HTE vs. Interaction of Education & Experience
plot_scatter(
x=data['Interaction'],
y=hte_effects_interaction,
xlabel="Education * Experience Interaction",
ylabel="Estimated Treatment Effect",
title="HTE vs. Interaction of Education & Experience",
color='green'
)