
Causality

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import sklearn as skl


Associational models versus causal models

Causal models differ from associational models in that they codify causal directions, not just associations. In our program, you might have learned about the use of propensity scores, counterfactuals or randomization to study causality. There, the goal is typically to make causal statements with as few assumptions as possible, or at least with clearly understood assumptions, and the object of study is usually the estimation of an effect averaged over covariates.

Causal graphs take a different approach, even if they wind up in the same place. Here, the goal is to postulate hypothetical causal relationships and use those hypothetical relationships to estimate causal effects.

Graphs and graphical models

A graph, $G$, is a collection of nodes, say $V=\{1,\ldots, p\}$, and a set of edges, $E$, between the nodes, i.e. a set of elements $(i,j)$. The graph is directed if $(i,j)$ is considered different from $(j,i)$.

Node $i$ is a parent of node $j$ if $(i,j) \in E$ and $(j,i)\notin E$. Similarly, node $i$ is a child of node $j$ if $(j,i) \in E$ and $(i,j)\notin E$. A node is a descendant of another if it is a child of that node, a child of a child, and so on.
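
As a small, hypothetical illustration of these definitions using the networkx package imported above (`predecessors`, `successors` and `descendants` are standard networkx calls), parents, children and descendants can be read directly off a directed graph object:

# A sketch: parents, children and descendants in a small directed graph.
H = nx.DiGraph()
H.add_edges_from([(1, 2), (1, 3), (2, 3)])

# In a DAG, predecessors are parents and successors are children.
print("Parents of 3:", list(H.predecessors(3)))    # [1, 2]
print("Children of 1:", list(H.successors(1)))     # [2, 3]
print("Descendants of 1:", nx.descendants(H, 1))   # {2, 3}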

DAG and SCMs

DAGs define a unique factorization (a set of independence relationships) shared by all compatible probability models. I find it useful to think of causal DAGs in terms of structural causal models (SCMs). Such models are an example of generative models that satisfy the DAG and have clear connections with the compatible probability models. An SCM over a collection of variables, $X=(X_1, \ldots, X_p)$, postulates a set of functional relationships

$$X_j = f_j(P_j, \epsilon_j)$$

where $P_j$ denotes the antecedent causes of $X_j$, called the parents of $X_j$, and $\epsilon_j$ is an accumulation of unmodeled variables treated as mutually independent. This defines a directed graph, $G$ say, with vertices $V=\{1,\ldots, p\}$ corresponding to the $X_j$, and edges, $E$, a set of ordered pairs of nodes.
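
As a concrete sketch (coefficients and functional forms chosen arbitrarily for illustration), here is a linear SCM for the three-variable graph drawn below; each variable is computed from its parents plus independent noise:

# A minimal, hypothetical SCM for the DAG X1 -> X2, X1 -> X3, X2 -> X3.
rng = np.random.default_rng(42)
n = 10_000

x1 = rng.normal(size=n)                        # X1 = eps1 (no parents)
x2 = 0.8 * x1 + rng.normal(size=n)             # X2 = f2(X1, eps2)
x3 = 0.5 * x1 + 0.7 * x2 + rng.normal(size=n)  # X3 = f3(X1, X2, eps3)

Any dataset simulated this way has a joint distribution compatible with the DAG.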

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("X1",  pos = (0.5,   1) )
G.add_node("X2",  pos = (0  ,   0) )
G.add_node("X3",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X1", "X2"],
  ["X1", "X3"],
  ["X2", "X3"],
] )

nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])
plt.show()

In this case, $P_1 = \{\}$, $P_2 = \{1\}$ and $P_3 = \{1,2\}$. DAGs in general define the independence assumptions associated with compatible probability models; SCMs are one example that clearly defines compatible probability models. Of course, given a large enough cross-sectional sample, we can estimate the joint distribution $P(X_1,\ldots, X_p)$ and all of its conditionals. Disregarding statistical variation, which can be accounted for using traditional inferential methods, these conditionals should agree with the independence relationships from the DAG, if the DAG is correct. This yields a fruitful way to consider probability models. For example, one could use a DAG as a heuristic and see how well the observed data agree with the independence relationships in the compatible probability models implied by the DAG.

By itself, this does not create any causal claims. However, the following strategy could: postulate a causal model, such as an SCM; derive the independence relationships implied by it; and compare those independence relationships with those seen in the observed data. This gives us a method to falsify causal models using the data.

One specific way in which we use the assumptions is to investigate how the graph changes when we fix a node at a specific value, like an intervention, thus breaking its association with its parents. This operation is conceptual, but at times we can relate it to probabilities of interventions that were never actually realized. Consider an instance where $X_1$ is a collection of confounders, $X_2$ is an exposure and $X_3$ is an outcome. Ideally, we'd like to know

$$P(X_3 ~|~ do(X_2) = x_2)$$

That is, the impact on the outcome if we were to set the exposure to $x_2$.

Blocking and d-separation

Before we talk about interventions, let's discuss the compatibility of the hypothetical directed graph with our observed data. Return to our previous diagram.

  • $X_1$ is a confounder between $X_2$ and $X_3$

  • $X_2$ is a mediator between $X_1$ and $X_3$

  • $X_3$ is a collider between $X_1$ and $X_2$

Consider an example: $X_1$ is having a BMI > 35, $X_2$ is sleep disordered breathing (SDB) and $X_3$ is hypertension (HTN).

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("BMI35",  pos = (0.5,   1) )
G.add_node("SDB",  pos = (0  ,   0) )
G.add_node("HTN",  pos =  (1 ,   0) )
G.add_edges_from([
  ["BMI35", "SDB"],
  ["BMI35", "HTN"],
  ["SDB", "HTN"],
] )

nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 4000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])
plt.show()

Here if we’re studying whether SDB causes HTN, BMI35 confounds the relationship as a possible common cause of both. We would need to adjust for BMI35 to make sure the association between SDB and HTN isn’t just due to this common cause.

If we were studying whether BMI35 causes HTN, we might be interested in how much of that effect is mediated (indirectly) through SDB and how much is directly from BMI35.

If we are studying the relationship between BMI35 and SDB directly, adjusting for HTN may create an association. Consider the (fictitious) case where there is a large number of people who have SDB, are not obese, and yet all have hypertension, for whatever reason. Then, among those with HTN, there could be a negative association between BMI35 and SDB, because the many patients who have SDB but are not obese all end up in the HTN group, and similarly for the obese patients without SDB. That is, adjusting for HTN created an association. This is an example of Berkson’s paradox. It is a somewhat contrived example, but hopefully you get the point. The Wikipedia article has a funny example where $X_1$ is whether or not the hamburger is good at a fast food restaurant, $X_2$ is whether or not the fries are good and $X_3$ is whether or not people eat there. Since few people would eat at a place where both the hamburger and fries are bad, conditioning on $X_3$ can create a negative association.

The main point here is that adjusting for colliders may open up a pathway between the nodes.
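
The hamburger-and-fries version of this is easy to simulate (a sketch with made-up probabilities): generate two independent binary variables and a collider that depends on both, then compare the marginal association with the association among those who eat there.

# A hypothetical simulation of collider (selection) bias.
rng = np.random.default_rng(0)
n = 100_000

burger_good = rng.binomial(1, 0.5, n)   # X1
fries_good  = rng.binomial(1, 0.5, n)   # X2, independent of X1
# X3: people mostly eat there only if at least one item is good.
eats_there = rng.binomial(1, np.where((burger_good == 1) | (fries_good == 1), 0.9, 0.05))

print("Marginal correlation:    ", np.corrcoef(burger_good, fries_good)[0, 1])               # about 0
keep = eats_there == 1
print("Correlation among diners:", np.corrcoef(burger_good[keep], fries_good[keep])[0, 1])   # negative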

A path between two nodes $v_1$ and $v_k$ is a sequence of nodes, $v_1, v_2,\ldots, v_{k}$, where $v_{i}$ and $v_{i+1}$ are connected by an edge. The path is directed if $v_{i}$ points to $v_{i+1}$ for $i=1,\ldots,k-1$. A graph is a directed acyclic graph (DAG) if all edges are directed and there is no pair of nodes $v_i$ and $v_j$ with directed paths between them in both directions.

A path between $v_1$ and $v_k$, $v_1,\ldots, v_k$, is blocked by a set of nodes, $S$, if for some $v_j$ on the path

  1. $v_j\in S$ and $v_j$ is a mediator or confounder between $v_{j-1}$ and $v_{j+1}$ (in either direction), or

  2. $v_j\notin S$, none of the descendants of $v_j$ are in $S$, and $v_{j}$ is a collider between $v_{j-1}$ and $v_{j+1}$.

In other words, a path is blocked if a mediator or confounder on it is included in $S$, or if a collider on it and all of that collider's descendants are excluded from $S$.

For 1. this is equivalent to saying that one of

  • $v_{j-1}\rightarrow v_{j} \rightarrow v_{j+1}$

  • $v_{j-1}\leftarrow v_{j} \leftarrow v_{j+1}$

  • $v_{j-1}\leftarrow v_{j} \rightarrow v_{j+1}$

holds. For 2. recall that a collider is $v_{j-1}\rightarrow v_{j} \leftarrow v_{j+1}$.

This translates into the following statistical statement: conditioning on a mediator or confounder blocks a path, not conditioning on a collider (or any of its descendants) blocks a path, and conditioning on a collider opens a path.

We say that two nodes or groups of nodes are d-separated by a set of nodes, $S$, if every path between nodes in the two groups is blocked by $S$. d-separation is useful because it gives us conditional independence relationships, in the sense that if $X_i$ is d-separated from $X_j$ given $S$, then $X_i \perp X_j ~|~ S$ in every probability distribution compatible with the graph.

Consider the following graph.

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("X1",  pos = (0,   0) )
G.add_node("X2", pos = (0,   1) )
G.add_node("X3", pos = (1, 1.5) )
G.add_node("X4", pos = (1,  .5) )
G.add_node("X5", pos = (2,   1) )
G.add_node("X6",  pos = (2,   0) )
G.add_edges_from([
  ["X2", "X1" ],
  ["X2", "X3"],
  ["X3", "X1" ],
  ["X3", "X4"],
  ["X4", "X6" ],
  ["X5", "X3"],
  ["X5", "X6" ]
] )

nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 2.3])
ax.set_ylim([-.3, 2.3])
plt.show()

$X_1$ and $X_5$ are conditionally independent given $X_2$ and $X_3$. Why? Conditioning on $X_2$ blocks the path $X_1 \leftarrow X_2 \rightarrow X_3 \leftarrow X_5$, even though the collider $X_3$ on that path is opened by conditioning on it. Furthermore, conditioning on $X_2$ or $X_3$ blocks the path $X_1 \leftarrow X_2 \rightarrow X_3 \rightarrow X_4 \rightarrow X_6 \leftarrow X_5$. Finally, conditioning on $X_3$ blocks the path $X_1 \leftarrow X_3 \leftarrow X_5$.

Another interesting one is that $X_2$ and $X_5$ are marginally independent. This is because not conditioning on $X_3$ blocks the path $X_2 \rightarrow X_3 \leftarrow X_5$, not conditioning on $X_6$ blocks the path $X_2 \rightarrow X_3 \rightarrow X_4 \rightarrow X_6 \leftarrow X_5$, and not conditioning on $X_1$ blocks the path $X_2 \rightarrow X_1 \leftarrow X_3 \leftarrow X_5$.

Here’s the complete set of conditional independence relationships.

  1. $X_1$ and $X_4$ are d-separated by $\{X_3\}$

  2. $X_1$ and $X_5$ are d-separated by $\{X_2, X_3\}$

  3. $X_1$ and $X_6$ are d-separated by $\{X_4, X_5\}$, $\{X_3, X_5\}$, $\{X_2, X_3\}$

  4. $X_2$ and $X_4$ are d-separated by $\{X_3\}$

  5. $X_2$ and $X_5$ are d-separated by $\{\}$ (the null set, i.e. they’re marginally independent).

  6. $X_2$ and $X_6$ are d-separated by $\{X_4, X_5\}$, $\{X_3, X_5\}$

  7. $X_3$ and $X_6$ are d-separated by $\{X_4, X_5\}$

  8. $X_4$ and $X_5$ are d-separated by $\{X_3\}$.

These all imply conditional independence relationships, such as $X_1 \perp X_4 ~|~ X_3$.
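
These claims can also be checked programmatically. Recent versions of networkx include a d-separation test (exposed as `nx.is_d_separator` in newer releases and as `nx.d_separated` in older ones; the snippet below assumes at least one of these is available), applied here to the graph `G` defined in the previous cell:

# Check a few of the d-separation claims above.
d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated

print(d_sep(G, {"X1"}, {"X5"}, {"X2", "X3"}))  # True  (claim 2)
print(d_sep(G, {"X2"}, {"X5"}, set()))         # True  (claim 5: marginal independence)
print(d_sep(G, {"X3"}, {"X6"}, {"X4", "X5"}))  # True  (claim 7)
print(d_sep(G, {"X1"}, {"X5"}, {"X3"}))        # False (conditioning on the collider X3 alone opens a path)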

Do calculus and backdoor criterion

Recall that specifying a causal graph implies the independence relationships of a probability distribution, under assumptions such as the SCM. Sometimes we’re interested in the causal relationship between an exposure, $X$, and an outcome, $Y$. Consider a theoretical intervention obtained by setting $X = a$, which we write as $do(X) = a$. We want to estimate $P(Y ~|~ do(X) = a)$.

A set $S$ satisfies the backdoor criterion with respect to nodes $X$ and $Y$ if

  1. No descendant of $X$ is in $S$.

  2. $S$ blocks every path between $X$ and $Y$ that contains an arrow pointing into $X$.

The backdoor criterion is similar to d-separation. However, we only consider paths with arrows pointing into $X$, and we don't allow descendants of $X$ in the adjustment set.

The magic of the backdoor adjustment comes from the following relationship, the adjustment formula:

$$P(Y ~|~ do(X) = x) = \sum_{z} P(Y ~|~ x, z)\, p(z)$$

where $z$ ranges over the values of the variables in $S$, and $S$ satisfies the backdoor criterion. If the variables in $S$ are all observed, then the right-hand side of this equation is estimable. Note the interesting point that not all variables need to be observed, just $Y$, $X$ and the variables in $S$.

So, in the example below, adjusting for $S = \{X_2, X_3\}$ allows us to estimate the causal effect of $X$ on $Y$, even if $X_4$ and $X_5$ are not measured.

It’s important to emphasize that every aspect of the adjustment formula is theoretically estimable if $Y$, $X$ and the nodes in $S$ are observed.
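
As a sketch of how the adjustment formula becomes an estimator, suppose the observed cross-sectional data sit in a pandas DataFrame with binary columns (pandas is an extra, assumed import here, and the column names are hypothetical):

import pandas as pd

def backdoor_adjust(df, x_col, y_col, s_cols, x_val):
    """Plug-in estimate of P(Y=1 | do(X=x_val)) = sum_z P(Y=1 | X=x_val, Z=z) P(Z=z)."""
    total = 0.0
    for _, stratum in df.groupby(s_cols):
        p_z = len(stratum) / len(df)              # estimate of P(Z = z)
        cell = stratum[stratum[x_col] == x_val]
        if len(cell) == 0:
            continue                              # no data for this (x, z) cell (positivity issue)
        total += cell[y_col].mean() * p_z         # P(Y=1 | X=x_val, Z=z) * P(Z=z)
    return total

# Hypothetical usage, assuming df has binary columns "X", "Y", "X2", "X3":
# backdoor_adjust(df, "X", "Y", ["X2", "X3"], 1) - backdoor_adjust(df, "X", "Y", ["X2", "X3"], 0)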

Consider the following graph.

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("X",  pos = (0,   0) )
G.add_node("Y",  pos = (2,   0) )
G.add_node("X2", pos = (0,   1) )
G.add_node("X3", pos = (1, 1.5) )
G.add_node("X4", pos = (1,  .5) )
G.add_node("X5", pos = (2,   1) )
G.add_edges_from([
  ["X", "Y"  ],
  ["X2", "X" ],
  ["X2", "X3"],
  ["X3", "X" ],
  ["X3", "X4"],
  ["X4", "Y" ],
  ["X5", "X3"],
  ["X5", "Y" ]
] )

nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 2.3])
ax.set_ylim([-.3, 2.3])
plt.show()

Here are the minimal backdoor adjustment sets for $X$ and $Y$:

  1. $S = \{X_2, X_3\}$

  2. $S = \{X_3, X_5\}$

  3. $S = \{X_4, X_5\}$

Here are some invalid backdoor sets of variables.

  1. $S$ equal to any single node:

    1. $S=\{X_3\}$ does not block the path $X\leftarrow X_2 \rightarrow X_3 \leftarrow X_5 \rightarrow Y$.

    2. $S=\{X_4\}$ or $S=\{X_2\}$ does not block the path $X \leftarrow X_3 \leftarrow X_5 \rightarrow Y$.

    3. $S=\{X_5\}$ does not block the path $X \leftarrow X_3 \rightarrow X_4 \rightarrow Y$.

  2. $S=\{X_3, X_4\}$ does not block the path $X \leftarrow X_2 \rightarrow X_3 \leftarrow X_5 \rightarrow Y$.

  3. $S=\{X_2, X_4\}$ does not block the path $X\leftarrow X_3 \leftarrow X_5 \rightarrow Y$.
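
One way to check a candidate set programmatically (a sketch, not a packaged routine): $S$ satisfies the backdoor criterion if it contains no descendant of $X$ and it d-separates $X$ from $Y$ in the graph with all edges out of $X$ removed. Using the graph `G` above and the same networkx d-separation helper as before:

def satisfies_backdoor(G, x, y, S):
    """Rough check of the backdoor criterion for exposure x and outcome y."""
    S = set(S)
    if S & nx.descendants(G, x):                   # condition 1: no descendants of x in S
        return False
    H = G.copy()
    H.remove_edges_from(list(G.out_edges(x)))      # keep only backdoor paths into x
    d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated
    return d_sep(H, {x}, {y}, S)                   # condition 2: S blocks every backdoor path

print(satisfies_backdoor(G, "X", "Y", {"X2", "X3"}))  # True
print(satisfies_backdoor(G, "X", "Y", {"X3"}))        # False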

Example graphs

In all the following, we’re interested in the causal effect of $X$ on $Y$. $Z$ variables are observed and $U$ variables are unobserved. Every variable is binary to make the discussion easier.

Randomization

If $X$ is randomized and everyone takes the treatment assigned to them (left plot), then $X$ has no parents other than the randomization mechanism, $R$. We’re omitting any descendants of $X$ since we don’t have to worry about them. Regardless of the complexity of the relationships among the observed, unobserved, known and unknown variables, $Z$, $U$, and $Y$, we can estimate the causal effect without conditioning on anything.

In contrast, if some people ignore their randomized treatment assignment and elect to choose a different treatment, one may have opened a backdoor path (right plot). For example, the treatment can’t be blinded and those randomized to the control arm with the worst baseline symptoms elect to obtain the treatment elsewhere.

plt.figure(figsize=[6, 3])

plt.subplot(1, 2, 1)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("R",  pos = (0  ,   1) )
G.add_node("U,Z",  pos = (1,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["R", "X"],
  ["X", "Y"],
  ["U,Z", "Y"],
  ["Y", "U,Z"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])




plt.subplot(1, 2, 2)

G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("R",  pos = (0  ,   1) )
G.add_node("U,Z",  pos = (1,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["R", "X"],
  ["X", "Y"],
  ["U,Z", "Y"],
  ["U,Z", "X"],
  ["Y", "U,Z"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])



plt.show()

Simple confounding

The diagram below shows classic confounding. Conditioning on $Z$ allows for the estimation of the causal effect.

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("Z",  pos = (0.5,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["Z", "X"],
  ["Z", "Y"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])
plt.show()

Now the estimate of the adjusted effect (under our assumptions) is

$$P(Y ~|~ do(X) = x) = P(Y ~|~ X=x, Z = 0)P(Z = 0) + P(Y ~|~ X=x, Z=1)P(Z=1)$$
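
With binary $Z$ this is just a weighted average of two stratum-specific conditional probabilities. A quick worked example with made-up numbers (all probabilities below are purely illustrative):

# Hypothetical inputs to the binary-Z adjustment formula.
p_z1 = 0.3                           # P(Z = 1)
p_y_x1 = {0: 0.50, 1: 0.80}          # P(Y = 1 | X = 1, Z = z)
p_y_x0 = {0: 0.30, 1: 0.60}          # P(Y = 1 | X = 0, Z = z)

def adjusted(p_y_given_z):
    return p_y_given_z[0] * (1 - p_z1) + p_y_given_z[1] * p_z1

print(adjusted(p_y_x1) - adjusted(p_y_x0))  # 0.59 - 0.39 = 0.20 with these made-up numbers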

In the following two examples, the unmeasured confounder $U$ can be controlled for by conditioning on $Z$, and exactly the same estimate can be used as in the simple confounding model.

plt.figure(figsize=[6, 3])

plt.subplot(1, 2, 1)
G = nx.DiGraph()
G.add_node("X",  pos = (0,   0) )
G.add_node("Z",  pos = (0,   1) )
G.add_node("U",  pos = (1,   1) )

G.add_node("Y",  pos = (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["U", "Z"],
  ["Z", "X"],
  ["U", "Y"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])


plt.subplot(1, 2, 2)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("U",  pos = (0,   1) )
G.add_node("Z",  pos = (1,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["U", "X"],
  ["U", "Z"],
  ["Z", "Y"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])

Mediation

In mediation, all or part of the effect of $X$ on $Y$ flows through another variable, $Z$.

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("Z",  pos = (0.5,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["X", "Z"],
  ["Z", "Y"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])
plt.show()

The backdoor criterion does not apply here, since $Z$ is a descendant of $X$. To answer the question “What is the causal effect of $X$ on $Y$?” one need not adjust for anything. However, mediation is typically studied in a different way. Instead, one asks questions such as “How much of the effect of $X$ on $Y$ flows, or doesn’t flow, through $Z$?” To answer this question, one usually conditions on $Z$, but for a different goal than the backdoor adjustment is accomplishing.
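
A tiny linear SCM makes this concrete (a sketch; the coefficients are arbitrary). In the linear case the indirect effect is the product of the $X \rightarrow Z$ and $Z \rightarrow Y$ coefficients, and the total effect is direct plus indirect:

# Hypothetical linear SCM: X -> Z -> Y with a direct X -> Y edge.
rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)              # mediator
y = 0.4 * x + 0.5 * z + rng.normal(size=n)    # direct effect 0.4, indirect 0.6 * 0.5 = 0.3

total  = np.polyfit(x, y, 1)[0]                                                     # regress Y on X alone: ~0.7
direct = np.linalg.lstsq(np.column_stack([x, z, np.ones(n)]), y, rcond=None)[0][0]  # coefficient on X given Z: ~0.4
print(total, direct)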

cinelli2021crash shows an interesting example of mediation where one would want to adjust for $Z$ (left plot below).

plt.figure(figsize=[10, 3])

plt.subplot(1, 2, 1)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("U",  pos = (0  ,   1) )
G.add_node("Z",  pos = (1,   1) )
G.add_node("M",  pos =  (1 ,   0) )
G.add_node("Y",  pos =  (2 ,   0) )

G.add_edges_from([
  ["U", "X"],
  ["U", "Z"],
  ["X", "M"],
  ["Z", "M"],
  ["M", "Y"]
] )

nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 2.3])
ax.set_ylim([-.3, 1.3])

plt.subplot(1, 2, 2)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("Z",  pos = (1,   1) )
G.add_node("M",  pos =  (1 ,   0) )
G.add_node("Y",  pos =  (2 ,   0) )

G.add_edges_from([
  ["Z", "X"],
  ["X", "M"],
  ["Z", "M"],
  ["M", "Y"]
] )

nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 2.3])
ax.set_ylim([-.3, 1.3])
plt.show()

In this case, $M$ still mediates the relationship between $X$ and $Y$. However, $Z$ lies on a backdoor path into $M$, so some of the variation in $M$ that impacts $Y$ could be due to $Z$ rather than $X$. The right plot is similar and makes the point more explicit: $Z$ confounds the relationship between $X$ and $Y$ through $M$. Without adjusting for $Z$, the path $X\leftarrow Z \rightarrow M \rightarrow Y$ remains unblocked.

Bad controls

In each of the following, conditioning on $Z$ is unhelpful under the backdoor criterion.

Upper left. Adjusting for a collider is the standard bad control. Here, adjusting for $Z$ opens up a non-causal path between $X$ and $Y$ that was closed. From a common sense perspective, why would you want to adjust for a consequence of $X$ and $Y$ when exploring their relationship?

In the upper right diagram below, $Z$ is a so-called instrumental variable. A good example is $Z$ being the randomization indicator and $X$ being the treatment the person actually took. It is important to emphasize that use of an instrumental variable is often a very fruitful method of analysis. However, it’s not a useful backdoor adjustment, and conditioning on $Z$ simply removes most of the relevant variation in $X$. If one wants to use $Z$ as an instrumental variable in this setting, then specific methods tailored to instrumental variables need to be employed.

In the lower left plot, $Z$ is a descendant of $X$ lying on the causal path. Conditioning on $Z$ removes relevant pathway information regarding the relationship between $X$ and $Y$.

The lower right plot is similar. Conditioning on $Z$ removes variation in $M$, which hinders our ability to study the relationship between $X$ and $Y$ through $M$.

plt.figure(figsize=[6, 6])


#| echo: false
plt.subplot(2, 2, 1)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("Z",  pos = (0.5,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["X", "Z"],
  ["Y", "Z"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])

plt.subplot(2, 2, 2)
G = nx.DiGraph()
G.add_node("X",  pos = (0, 0) )
G.add_node("Z",  pos = (0, 1) )
G.add_node("U",  pos = (1, 1) )
G.add_node("Y",  pos = (1, 0) )
G.add_edges_from([
  ["X", "Y"],
  ["U", "X"],
  ["U", "Y"],
  ["Z", "X"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])



plt.subplot(2, 2, 3)
G = nx.DiGraph()
G.add_node("Z",  pos = (0, 0) )
G.add_node("X",  pos = (0, 1) )
G.add_node("Y",  pos = (1, 0) )
G.add_edges_from([
  ["X", "Z"],
  ["Z", "Y"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])

plt.subplot(2, 2, 4)
G = nx.DiGraph()
G.add_node("X",  pos = (0, 1) )
G.add_node("M",  pos = (0, 0) )
G.add_node("Z",  pos = (1, 1) )
G.add_node("Y",  pos = (1, 0) )
G.add_edges_from([
  ["X", "M"],
  ["M", "Z"],
  ["M", "Y"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])


plt.show();

Conditioning may help

In the left plot, adjusting for $Z$ may reduce variability in $Y$, helping to isolate the relationship between $X$ and $Y$.

In the right plot, adjusting for $Z$ may reduce variation in the mediator that is unrelated to the relationship between $X$ and $Y$.
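
A quick simulation illustrates the left plot (a sketch with arbitrary numbers): $Z$ is not needed for identification, but including it soaks up variance in $Y$ and tightens the estimate of the $X$ coefficient.

# Hypothetical simulation: Z affects Y but not X, so adjustment is optional but improves precision.
rng = np.random.default_rng(2)
n = 500

x = rng.normal(size=n)
z = rng.normal(size=n)
y = 0.3 * x + 1.5 * z + rng.normal(size=n)

def slope_and_se(design, y):
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta[0], np.sqrt(cov[0, 0])          # coefficient on x and its standard error

print("unadjusted:", slope_and_se(np.column_stack([x, np.ones(n)]), y))
print("adjusted:  ", slope_and_se(np.column_stack([x, z, np.ones(n)]), y))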

plt.figure(figsize=[6, 3])

plt.subplot(1, 2, 1)
G = nx.DiGraph()
G.add_node("X",  pos = (0, 0) )
G.add_node("Z",  pos = (1, 1) )
G.add_node("Y",  pos = (1, 0) )
G.add_edges_from([
  ["X", "Y"],
  ["Z", "Y"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])


plt.subplot(1, 2, 2)
G = nx.DiGraph()
G.add_node("X",  pos = (0, 1) )
G.add_node("M",  pos = (0, 0) )
G.add_node("Z",  pos = (1, 1) )
G.add_node("Y",  pos = (1, 0) )
G.add_edges_from([
  ["X", "M"],
  ["Z", "M"],
  ["M", "Y"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])

Exercises

Graphs

Consider the following graph, where we want to answer the question: what is $P(Y ~|~ do(X) = x)$? Every variable is binary.

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("X",   pos = (0, 0) )
G.add_node("Z1",  pos = (0, 1) )
G.add_node("Z2",  pos = (0.5, 0.5) )
G.add_node("Z3",  pos = (1, 1) )
G.add_node("Y",  pos = (1, 0) )
G.add_edges_from([
  ["X", "Y"],
  ["Z2", "X"],
  ["Z1", "X"],
  ["Z1", "Z2"],
  ["Z3", "Y"],
  ["Z3", "Z2"],
  ["Z2", "Y"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])
  1. What are the minimal sets of adjustment variables for the backdoor criterion?

  2. Is $X \perp Y ~|~ Z_1, Z_2$?

  3. Is $X \perp Y ~|~ Z_2, Z_3$?

  4. Given a cross-sectional sample, if $Z_3$ is unobserved, give a formula for the estimation of $P(Y ~|~ do(X) = x)$ that only requires observed variables.

plt.figure(figsize=[3, 3])
G = nx.DiGraph()
G.add_node("X",   pos = (0, 0) )
G.add_node("Z1",  pos = (0, 1) )
G.add_node("Z2",  pos = (1, 1) )
G.add_node("Y",  pos = (1, 0) )
G.add_edges_from([
  ["X", "Y"],
  ["X", "Z2"],
  ["Z1", "X"],
  ["Z1", "Z2"],
  ["Z2", "Y"]
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3]);
  1. What are the minimal sets of adjustment variables for the backdoor criterion?

  2. Given a cross-sectional sample, give a formula for the estimation of $P(Y ~|~ do(X) = x)$ that only requires observed variables.

Data exercise

The Wikipedia page on Simpson’s paradox gives the following data on two treatments for kidney stones: the number and proportion of successful procedures by stone size. Note that within both large stones and small stones, A is better than B. However, combining the sizes, B appears preferable to A.

Size    Treatment   Success     N     Prop
Large   A               192   263      73%
Large   B                55    80      69%
Small   A                81    87      93%
Small   B               234   270      87%
Both    A               273   350      78%
Both    B               289   350      83%
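
For convenience, the stratified counts can be entered as a pandas DataFrame (pandas is an extra, assumed import; this merely transcribes the table above and is not a solution):

import pandas as pd

kidney = pd.DataFrame({
    "size":      ["Large", "Large", "Small", "Small"],
    "treatment": ["A", "B", "A", "B"],
    "success":   [192, 55, 81, 234],
    "n":         [263, 80, 87, 270],
})
kidney["prop"] = kidney["success"] / kidney["n"]
print(kidney)  # the "Both" rows of the table are the column sums over the two sizes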

Estimate the treatment effect difference

$$P(\text{Success} ~|~ do(\text{Treatment}) = B) - P(\text{Success} ~|~ do(\text{Treatment}) = A)$$

under the following graphical models, where $X$ is treatment, $Y$ is success and $Z$ is stone size:

plt.figure(figsize=[6, 6])


#| echo: false
plt.subplot(2, 2, 1)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("Z",  pos = (0.5,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["Z", "X"],
  ["Z", "Y"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])

plt.subplot(2, 2, 2)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("Z",  pos = (0.5,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["X", "Z"],
  ["Z", "Y"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])

plt.subplot(2, 2, 3)
G = nx.DiGraph()
G.add_node("X",  pos = (0  ,   0) )
G.add_node("Z",  pos = (0.5,   1) )
G.add_node("Y",  pos =  (1 ,   0) )
G.add_edges_from([
  ["X", "Y"],
  ["X", "Z"],
  ["Y", "Z"],
] )
nx.draw(G,
        nx.get_node_attributes(G, 'pos'),
        with_labels=True,
        font_weight='bold',
        node_size = 2000,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-.3, 1.3])
ax.set_ylim([-.3, 1.3])


plt.show();

Comment on how reasonable each of these models is, given the setting. Here’s a reference: julious1994confounding.

Give any other DAGs, perhaps including unmeasured variables, that you think are relevant.

Reading