{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 20 - Plug-and-Play Estimators\n", " \n", "So far, we've seen how to debias our data in the case where the treatment is not randomly assigned, which results in confounding bias. That helps us with the identification problem in causal inference. In other words, once the units are exchangeable, or $Y(0), Y(1) \\perp T$, it becomes possible to learn the treatment effect. But we are far from done. \n", " \n", "Identification means that we can find the average treatment effect. In other words, we know how effective a treatment is on average. Of course, this is useful, as it helps us decide whether we should roll out a treatment or not. But we want more than that. We want to know if there are subgroups of units that respond better or worse to the treatment. That should allow for a much better policy, one where we treat only those who will benefit from it.\n", " \n", " \n", "## Problem Setup\n", " \n", "Let's recall our setup of interest. Given the potential outcomes, we can define the individual treatment effect as the difference between the potential outcomes,\n", " \n", "$\n", "\\tau_i = Y_i(1) − Y_i(0),\n", "$\n", " \n", "or, in the continuous treatment case, $\\tau_i = \\frac{\\partial Y_i(t)}{\\partial t}$, where $t$ is the treatment variable. Of course, we can never observe the individual treatment effect, because we only get to see one of the potential outcomes\n", " \n", "$\n", "Y^{obs}_i(t)= \n", "\\begin{cases}\n", "Y_i(1), & \\text{if } t=1\\\\\n", "Y_i(0), & \\text{if } t=0\n", "\\end{cases}\n", "$\n", " \n", "We can define the average treatment effect (ATE) as\n", " \n", "$\n", "\\tau = E[Y_i(1) − Y_i(0)] = E[\\tau_i]\n", "$\n", " \n", "and the conditional average treatment effect (CATE) as\n", " \n", "$\n", "\\tau(x) = E[Y_i(1) − Y_i(0)|X] = E[\\tau_i|X]\n", "$\n", " \n", "In Part I of this book, we've focused mostly on the ATE. Now, we are interested in the CATE. 
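To make the distinction concrete, here is a tiny sketch with fully made-up potential outcomes (in practice we never observe both, which is the whole problem): the ATE averages the effect over everyone, while the CATE averages it within groups defined by $X$.

```python
import pandas as pd

# four hypothetical units where we (impossibly) see both potential outcomes
df = pd.DataFrame({
    "x":  [0, 0, 1, 1],          # a single binary feature
    "y0": [1.0, 2.0, 1.0, 2.0],  # outcome without treatment
    "y1": [2.0, 3.0, 4.0, 5.0],  # outcome with treatment
})
df["tau"] = df["y1"] - df["y0"]

print(df["tau"].mean())               # ATE: 2.0
print(df.groupby("x")["tau"].mean())  # CATE: 1.0 for x=0, 3.0 for x=1
```

Units with $x=1$ benefit three times as much as units with $x=0$, which is exactly the kind of heterogeneity a CATE model should pick up.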
The CATE is useful for personalizing a decision-making process. For example, if you have a drug as the treatment $t$, you want to know which types of patients are more responsive to the drug (higher CATE) and if there are some types of patients with a negative response (CATE < 0). \n", " \n", "We've seen how to estimate the CATE using a linear regression with interactions between the treatment and the features\n", " \n", "$\n", "y_i = \\beta_0 + \\beta_1 t_i + \\beta_2 X_i + \\beta_3 t_i X_i + e_i.\n", "$\n", " \n", "If we estimate this model, we can get estimates for $\\tau(x)$\n", " \n", "$\n", "\\hat{\\tau}(x) = \\hat{\\beta}_1 + \\hat{\\beta}_3 X_i\n", "$\n", " \n", "Still, linear models have some drawbacks, the main one being the linearity assumption on $X$. Notice that you don't even care about $\\beta_2$ in this model. But if the features $X$ don't have a linear relationship with the outcome, your estimates of the causal parameters $\\beta_1$ and $\\beta_3$ will be off. \n", " \n", "It would be great if we could replace the linear model with a more flexible machine learning model. We could even plug the treatment as a feature into an ML model, like boosted trees or a neural network\n", " \n", "$\n", "y_i = M(X_i, T_i) + e_i\n", "$\n", " \n", "but from there, it is not clear how we can get treatment effect estimates, since this model will output $\\hat{y}$ predictions, not $\\hat{\\tau}(x)$ predictions. Ideally, we would use a machine learning regression model that, instead of minimizing the outcome MSE\n", " \n", "$\n", "E[(Y_i - \\hat{Y}_i)^2]\n", "$\n", " \n", "would minimize the treatment effect MSE\n", " \n", "$\n", "E[(\\tau(x)_i - \\hat{\\tau}(x)_i)^2] = E[(Y_i(1) - Y_i(0) - \\hat{\\tau}(x)_i)^2].\n", "$\n", " \n", "However, this criterion is what we call infeasible. Again, the problem here is that $\\tau(x)_i$ is not observable, so we can't optimize it directly. This puts us in a tough spot... 
Let's try to simplify it a bit and maybe we can think of something.\n", "\n", "![img](./data/img/plug-and-play-estimators/infeasible.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Target Transformation\n", " \n", "Suppose your treatment is binary. Let's say you are an investment firm testing the effectiveness of sending a financial education email. You hope the email will make people invest more. Also, let's say you ran a randomized study where 50% of the customers got the email and the other 50% didn't. \n", " \n", "Here is a crazy idea: let's transform the outcome variable by combining it with the treatment.\n", " \n", "$\n", "Y^*_i = 2 Y_i * T_i - 2 Y_i*(1-T_i)\n", "$\n", " \n", " \n", "So, if the unit was treated, you would take the outcome and multiply it by 2. If it wasn't treated, you would take the outcome and multiply it by -2. For example, if one of your customers invested BRL 2000.00 and got the email, the transformed target would be 4000. However, if he or she didn't get the email, it would be -4000. \n", " \n", "This seems very odd, because you are saying that the effect of the email can be a negative number, but bear with me. If we do a little bit of math, we can see that, on average or in expectation, this transformed target equals the treatment effect. This is nothing short of amazing. What I'm saying is that by applying this somewhat wacky transformation, I get to estimate something that I can't even observe. \n", " \n", "To understand that, we need a bit of math. Because of random assignment, we have that $T \\perp Y(0), Y(1)$, which is our old unconfoundedness friend. That implies that $E[T * Y(t)]=E[T]*E[Y(t)]$, which follows from the definition of independence.\n", "\n", "\n", "Also, we know that\n", "\n", "$\n", "Y_i*T_i = Y(1)_i*T_i \\text{ and } Y_i*(1-T_i) = Y(0)_i*(1-T_i)\n", "$\n", "\n", "because the treatment is what materializes one or the other potential outcome. 
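Before doing the algebra, a quick simulation can make this plausible. All the numbers below are invented for illustration: we build potential outcomes with a known effect of exactly 1, assign the treatment 50-50 at random, and check that the transformed target averages out to that effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# invented potential outcomes: the true effect is exactly 1 for everyone
y0 = rng.normal(10, 2, n)
y1 = y0 + 1

# 50-50 random assignment materializes one of the two potential outcomes
t = rng.binomial(1, 0.5, n)
y = np.where(t == 1, y1, y0)

# transformed target: 2*Y for the treated, -2*Y for the untreated
y_star = 2 * y * t - 2 * y * (1 - t)

print(y_star.mean())  # close to the true effect of 1
```

Notice that each individual $Y^*_i$ is a terrible estimate on its own (around $\pm 20$ here); it is only in expectation that the noise cancels out.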
With that in mind, let's take the expected value of $Y^*_i$ and see what we end up with. \n", " \n", "\n", "\\begin{align}\n", "E[Y^*_i|X_i=x] &= E[2 Y(1)_i * T_i - 2 Y(0)_i*(1-T_i)|X_i=x] \\\\\n", "&= 2E[Y(1)_i * T_i | X_i=x] - 2E[Y(0)_i*(1-T_i)|X_i=x]\\\\\n", "&= 2E[Y(1)_i| X_i=x] * E[ T_i | X_i=x] - 2E[Y(0)_i| X_i=x]*E[(1-T_i)|X_i=x] \\\\\n", "&= 2E[Y(1)_i| X_i=x] * 0.5 - 2E[Y(0)_i| X_i=x]*0.5 \\\\ \n", "&= E[Y(1)_i| X_i=x] - E[Y(0)_i| X_i=x] \\\\\n", "&= \\tau(x)_i\n", "\\end{align}\n", "\n", " \n", "So, this apparently crazy idea ended up giving us an unbiased estimate of the treatment effect $\\tau(x)_i$. Now, we can replace our infeasible optimization criterion with the feasible\n", " \n", "$\n", "E[(Y^*_i - \\hat{\\tau}(x)_i)^2]\n", "$\n", " \n", "In simpler terms, all we have to do is use any machine learning regression model to predict $Y^*_i$, and this model will output treatment effect predictions. \n", " \n", "Now that we've solved the simple case, what about the more complicated case, where the treatment is not split 50-50, or not even randomly assigned? As it turns out, the answer is a bit more complicated, but not much. First, if we don't have random assignment, we need at least conditional independence, $T \\perp Y(0), Y(1) | X$. That is, controlling for $X$, $T$ is as good as random. With that, we can generalize the transformed target to\n", " \n", "$\n", "Y^*_i = Y_i * \\dfrac{T_i - e(X_i)}{e(X_i)(1-e(X_i))}\n", "$\n", " \n", "where $e(X_i)$ is the propensity score. So, if the treatment is not split 50-50, but randomized with a different probability $p$, all you have to do is replace the propensity score in the above formula with $p$. If the treatment is not random, then you have to use the propensity score, either saved from the assignment mechanism or estimated from the data. \n", " \n", "If you take the expectation of this, you will see that it also matches the treatment effect. The proof is left as an exercise to the reader. Just kidding, here it is. It's a bit cumbersome, so feel free to skip it. 
\n", " \n", "\n", "\\begin{align}\n", "E[Y^*_i|X_i=x] &= E\big[Y_i * \dfrac{T_i - e(X_i)}{e(X_i)(1-e(X_i))}|X_i=x\big] \\\n", "&= E\big[Y_i T_i * \dfrac{T_i - e(X_i)}{e(X_i)(1-e(X_i))} + Y_i (1-T_i) * \dfrac{T_i - e(X_i)}{e(X_i)(1-e(X_i))}|X_i=x\big]\\\n", "&= E\big[Y(1)_i * \dfrac{T_i(1 - e(X_i))}{e(X_i)(1-e(X_i))} | X_i=x\big] - E\big[Y(0)_i * \dfrac{(1-T_i)e(X_i)}{e(X_i)(1-e(X_i))}|X_i=x\big]\\\n", "&= \dfrac{1}{e(X_i)} E[Y(1)_i * T_i|X_i=x] - \dfrac{1}{1-e(X_i)} E[Y(0)_i * (1-T_i)| X_i=x]\\\n", "&= \dfrac{1}{e(X_i)} E[Y(1)_i|X_i=x] * E[T_i|X_i=x] - \dfrac{1}{1-e(X_i)} E[Y(0)_i|X_i=x] * E[(1-T_i)| X_i=x]\\\n", "&= E[Y(1)_i|X_i=x] - E[Y(0)_i|X_i=x]\\\n", "&= \tau(x)_i\n", "\end{align}\n", "\n", " \n", "As always, I think this will become much more concrete with an example. Again, consider the investment emails we've sent trying to make people invest more. The outcome variable is the binary converted (invested vs. didn't invest)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns\n", "from nb21 import cumulative_gain, elast" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>age</th>\n", "      <th>income</th>\n", "      <th>insurance</th>\n", "      <th>invested</th>\n", "      <th>em1</th>\n", "      <th>em2</th>\n", "      <th>em3</th>\n", "      <th>converted</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>44.1</td>\n", "      <td>5483.80</td>\n", "      <td>6155.29</td>\n", "      <td>14294.81</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>39.8</td>\n", "      <td>2737.92</td>\n", "      <td>50069.40</td>\n", "      <td>7468.15</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>49.0</td>\n", "      <td>2712.51</td>\n", "      <td>5707.08</td>\n", "      <td>5095.65</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>39.7</td>\n", "      <td>2326.37</td>\n", "      <td>15657.97</td>\n", "      <td>6345.20</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>35.3</td>\n", "      <td>2787.26</td>\n", "      <td>27074.44</td>\n", "      <td>14114.86</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age income insurance invested em1 em2 em3 converted\n", "0 44.1 5483.80 6155.29 14294.81 0 1 1 0\n", "1 39.8 2737.92 50069.40 7468.15 1 0 0 0\n", "2 49.0 2712.51 5707.08 5095.65 1 0 1 1\n", "3 39.7 2326.37 15657.97 6345.20 1 1 1 0\n", "4 35.3 2787.26 27074.44 14114.86 1 1 1 0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email = pd.read_csv(\"./data/invest_email_rnd.csv\")\n", "email.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal here is one of personalization. Let's focus on email-1. We wish to send it only to those customers who will respond better to it. In other words, we wish to estimate the conditional average treatment effect of email-1\n", " \n", "$\n", "E[Converted(1)_i - Converted(0)_i|X_i=x] = \\tau(x)_i\n", "$\n", " \n", "so that we can target those customers who will have the best response to the email (higher CATE).\n", " \n", "But first, let's break our dataset into a training and a test set. We will estimate $\\tau(x)_i$ on one set and evaluate the estimates on the other." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(9000, 8) (6000, 8)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "np.random.seed(123)\n", "train, test = train_test_split(email, test_size=0.4)\n", "print(train.shape, test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we will apply the target transformation we've just learned. Since the emails were randomly assigned (although not on a 50-50 basis), we don't need a model to estimate the propensity score. It is simply constant and equal to the treatment probability." 
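As a quick sanity check on the formula (pure arithmetic, no data involved; `transform` is just a throwaway helper for this check), with a constant propensity of 0.5 the general transformation collapses to the doubling rule from the beginning of the chapter:

```python
def transform(y, t, p):
    # general transformed target with a constant propensity score p
    return y * (t - p) / (p * (1 - p))

# with p = 0.5, this is exactly 2*y for treated and -2*y for untreated units
print(transform(100.0, 1, 0.5))   # 200.0
print(transform(100.0, 0, 0.5))   # -200.0

# with an unbalanced assignment, say p = 0.3, the weights compensate
# for treated units being rarer than untreated ones
print(transform(100.0, 1, 0.3))   # ~333.3
print(transform(100.0, 0, 0.3))   # ~-142.9
```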
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "y = \"converted\"\n", "T = \"em1\"\n", "X = [\"age\", \"income\", \"insurance\", \"invested\"]\n", "\n", "ps = train[T].mean()\n", "\n", "y_star_train = train[y] * (train[T] - ps)/(ps*(1-ps))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the transformed target, we can pick any ML regression algorithm to predict it. Let's use boosted trees here." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from lightgbm import LGBMRegressor\n", "\n", "np.random.seed(123)\n", "cate_learner = LGBMRegressor(max_depth=3, min_child_samples=300, num_leaves=5)\n", "cate_learner.fit(train[X], y_star_train);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model can now estimate $\\tau(x)_i$. In other words, what it outputs is $\\hat{\\tau}(x)_i$. For example, if we make predictions on the test set, we will see that some units have higher CATE than others. For example, customer 6958 has a CATE of about 0.1, meaning the probability he or she will buy our investment product is predicted to increase by 0.1 if we send the email to this customer. In contrast, for customer 3903, the probability of buying the product is predicted to increase by only about 0.05." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>age</th>\n", "      <th>income</th>\n", "      <th>insurance</th>\n", "      <th>invested</th>\n", "      <th>em1</th>\n", "      <th>em2</th>\n", "      <th>em3</th>\n", "      <th>converted</th>\n", "      <th>cate</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>6958</th>\n", "      <td>40.9</td>\n", "      <td>4486.14</td>\n", "      <td>37320.33</td>\n", "      <td>12559.25</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0.105665</td>\n", "    </tr>\n", "    <tr>\n", "      <th>7534</th>\n", "      <td>42.6</td>\n", "      <td>6386.19</td>\n", "      <td>13270.47</td>\n", "      <td>29114.42</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0.121922</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2975</th>\n", "      <td>47.6</td>\n", "      <td>1900.26</td>\n", "      <td>25588.72</td>\n", "      <td>2420.39</td>\n", "      <td>0</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>0.034161</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3903</th>\n", "      <td>41.0</td>\n", "      <td>5802.19</td>\n", "      <td>57087.37</td>\n", "      <td>20182.20</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>1</td>\n", "      <td>0.046805</td>\n", "    </tr>\n", "    <tr>\n", "      <th>8437</th>\n", "      <td>49.1</td>\n", "      <td>2202.96</td>\n", "      <td>5050.81</td>\n", "      <td>9245.88</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>1</td>\n", "      <td>0</td>\n", "      <td>-0.009099</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>
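The `cumulative_gain` helper imported earlier is the natural way to validate these scores. As a self-contained stand-in, the sketch below checks the same basic property on simulated data (all numbers invented; the column names just mirror `em1`, `converted` and the `cate` column above): the difference-in-means effect should be larger among units the model ranks high than among units it ranks low.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

cate_hat = rng.uniform(0.0, 0.15, n)   # stand-in for the model's CATE predictions
t = rng.binomial(1, 0.5, n)            # randomized treatment, like em1
# simulate conversions whose true uplift tracks the predicted CATE
y = rng.binomial(1, 0.2 + cate_hat * t)

df = pd.DataFrame({"cate": cate_hat, "em1": t, "converted": y})

def diff_in_means(d):
    # difference-in-means treatment effect estimate within a group
    treated = d.loc[d["em1"] == 1, "converted"].mean()
    control = d.loc[d["em1"] == 0, "converted"].mean()
    return treated - control

top = df[df["cate"] >= df["cate"].median()]
bottom = df[df["cate"] < df["cate"].median()]

# a useful CATE model shows a bigger effect where it predicts a bigger effect
print(diff_in_means(top), diff_in_means(bottom))
```

With the real test set, one would group by quantiles of `cate_learner.predict(test[X])` instead of the simulated `cate` column.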