1  Introduction

It is instructive to start with the first sentence of Train (2009) Section 1.3, “Discrete choice analysis consists of two interrelated tasks: specification of the behavioral model and estimation of the parameters of that model” (emphasis added). In particular, chapters 1 and 2 of Train (2009) only specify models; there is not even a hint of model estimation. And since I follow Train (2009), we too begin with a focus on model specification.

Let me be clear: many students will see \(y\) and \(x\) later in this section and immediate think about “fitting” a model; that is, they assume they have data that they will put into an estimation routine to find estimates of the parameters of the model. We are not there yet! At this early stage in the book, we are only specifying models; that is, listing sets of assumptions about data generating processes and exploring the implications of those assumptions. There is no data yet. There might be parameters introduced in our choice of model specification, but even if so, we are not yet estimating those parameters. Have patience, we will eventually do these things.

1.1 Recap of Train Ch. 1

Train denotes the outcome in any given situation as \(y\), determined by some observable factors collected in the vector \(\mathbf{x}\) and some unobservable factors collected in the vector \(\boldsymbol{\varepsilon}\). The factors (\(\mathbf{x}\) and \(\boldsymbol{\varepsilon}\)) relate to the agent’s choice (\(y\)) through a function \(y = h(\mathbf{x}, \boldsymbol{\varepsilon})\). We assume for the moment that we know \(h(\cdot)\) and that \(\mathbf{x}\) and \(\boldsymbol{\varepsilon}\) are length-one vectors (i.e., scalars) denoted \(x\) and \(\varepsilon\).

Since we do not observe \(\varepsilon\), we can’t predict \(y\) exactly. Instead, we focus on the probability of \(y\), that is:

\[ \begin{aligned} p(y|x) &= \Pr \left( \varepsilon \textrm{ such that } h(x,\varepsilon)=y \right) \\ &= \Pr \left( I \left[ h(x,\varepsilon)=y \right] = 1 \right) \\ &= \int I \left[ h(x,\varepsilon)=y \right] f(\varepsilon) \, d\varepsilon \end{aligned} \tag{1.1}\]

For certain special choices of \(h\) and \(f\), a closed-form expression1 for the integral is available. But more generally, for almost any choice of \(h\) and \(f\), we can approximate the integral through simulation. Train provides pseudo code on how to do so:

  1. Repeat the following two steps many (\(r=1, \ldots, R\)) times:
    • Draw \(\varepsilon^{(r)}\) from \(f(\varepsilon)\).
    • Determine whether \(h(x,\varepsilon^{(r)}) = y\). If so, set \(I^{(r)}=1\); else set \(I^{(r)}=0\).
  2. Average the \(R\) values of \(I^{(r)}\)

Next we look at two examples where we use this procedure to approximate the \(p(y|x)\) integral.

1.2 A Simple Example

Let’s first set up a toy example to demonstrate how simulation can approximate the \(p(y|x)\) integral. Suppose \(x=0.5\) and \(\varepsilon\) is uniformly distributed between \(-1\) and \(1\). Define \(h(x, \varepsilon)\) to be:

\[ h(x, \varepsilon) = \begin{cases} 0 & \text{if } x + \varepsilon < 0 \\ 1 & \text{if } x + \varepsilon \in [0,1] \\ 2 & \text{if } x + \varepsilon > 1 \end{cases} \tag{1.2}\]

We’ll focus on the outcome \(y=2\). You can probably intuit that the \(p(y=2 | x) = 0.25\) since only one quarter of the time will \(\varepsilon\) be sufficiently positive to make \(x + \varepsilon > 1\).2 Nevertheless, let’s approximate the integral representation of \(p(y=2|x)\) through simulation to ensure we understand the process.

To walk you through the code, we first set a seed so that the pseudo-random numbers generated by runif() can be replicated exactly each time the code is run (even on different computers). We then specify that we will use 1,000 draws in the simulation and we create an vector I to hold our results. The simulation occurs via a for() loop where each time through the loop we take a draw of \(\varepsilon\), calculate \(0.5 + \varepsilon\) and check whether that sum is greater than one. If so, then \(h(x,\varepsilon)=2\) matching the value of \(y\) for the choice probability we want to assess — i.e., \(p(y=2|x)\) — and thus we store a \(1\) in the \(r^\textrm{th}\) position of I; otherwise we store a 0. We then average the values in I to get our approximation of \(p(y=2|x)\).


R <- 1000
I <- vector(length=R)

for(r in 1:R) {
    eps  <- runif(1, min=-1, max=1)
    h    <- 0.5 + eps 
    I[r] <- as.integer(h > 1)
[1] 0.258

The simulated value 0.258 approximates the exact value 0.25 and can be made closer by increasing the number of draws used in the simulation.

R users will recognize that we can shorten the code by taking advantage of R’s vectorized functions and its conversion of boolean values to 0/1 when used in mathematical operations. Here is a shorter implementation of the simulation; whether it’s “better” code is a matter of preference.

R <- 1000
mean( runif(R, min=-1, max=1) + 0.5 > 1 )
[1] 0.258

That’s it. If you can generate pseudo-random draws from the density \(f\) and you know \(h\), approximating a choice probability by simulation requires only a handful of lines of code.

1.3 A Binary Logit Model Example

As an example of a model with a complete closed-form solution, Train provides the binary logit model.

The “binary” part refers to the aspect of the model whereby the decision maker does one of two things; they either take an action (\(y=1\)) or not (\(y=0\)). To tie this model into a framework of behavior, we start with a utility function \(U\). In Train’s specific example, utility is specified as

\[ U(\mathbf{x}, \boldsymbol{\beta}, \varepsilon) = \mathbf{x}'\boldsymbol{\beta}+ \varepsilon \tag{1.3}\]

where \(\mathbf{x}\) is a vector of observable explanatory variables, \(\boldsymbol{\beta}\) is a vector of parameters that through the functional form \(\mathbf{x}'\boldsymbol{\beta}\) effectively serve as weights on the covariates, and \(\varepsilon\) is a scalar index collecting the value of information used by the decision maker but unobserved to the researcher.

In this model, the threshold for action is 0. That is, we can specify \(h\) as:

\[ h(\mathbf{x}, \boldsymbol{\beta}, \varepsilon) = \begin{cases} 0 & \text{if } U \le 0 \\ 1 & \text{if } U > 0 \end{cases} \tag{1.4}\]

The “logit” part of the model’s name refers to the choice of \(f\). The binary logit model assumes \(f\) is the logistic distribution:

\[ f(\varepsilon) = \frac{e^{-\varepsilon}}{(1+e^{-\varepsilon})^2} \tag{1.5}\]

Having specified \(h\) and \(f\), let’s choose some values for \(\mathbf{x}\) and \(\boldsymbol{\beta}\) and use simulation to approximate the integral for \(p(y|\mathbf{x},\boldsymbol{\beta})\). Let’s pick \(\mathbf{x}=(0.5, 2)\) and \(\boldsymbol{\beta}=(3,-1)\) such that \(\mathbf{x}'\boldsymbol{\beta}= (0.5)(3)+(2)(-1) = 1.5 - 2 = -0.5\). We know from the closed-form solution to this model provided by Train that, with these values of \(\mathbf{x}\) and \(\boldsymbol{\beta}\), the probability the decision maker takes action is:

\[ p(y=1 | \mathbf{x}, \boldsymbol{\beta}) = \frac{e^{\mathbf{x}'\boldsymbol{\beta}}}{1 + e^{\mathbf{x}'\boldsymbol{\beta}}} = \frac{e^{-0.5}}{1+e^{-0.5}} = 0.3775407 \tag{1.6}\]

We can approximate this integral as before. Below I use the function rlogis() to take R=1000 draws from the binary logistic distribution, and I approximate the integral with the proportion of times \(\mathbf{x}'\boldsymbol{\beta}+ \varepsilon\) is greater than \(0\):

R <- 1000

x    <- c(0.5, 2)
beta <- c(3, -1)

U <- as.vector(x %*% beta) + rlogis(R)
mean(U > 0)
[1] 0.364

Our simulated value 0.364 approximates the exact value of the integral 0.378.

1.4 Key Learnings

The key learning from this chapter is that with discrete choice models our focus is on the probability of outcome \(y\). That outcome results from the joint distribution \(f\) of unobserved factors \(\boldsymbol{\varepsilon}\) and the behavior model \(h\) that relates \(y\) to \((\mathbf{x}, \boldsymbol{\varepsilon})\). The probability \(p(y|x)\) can be written in closed form for only very special choices of \(f\) and \(h\), but for almost any choice of \(f\) and \(h\) we can simulate \(p(y|x)\).

  1. In this context, a closed-form expression means a way of writing the integral so that the anti-derivative sign is not part of solution. For example, the integral \(\int x \, dx\) has the closed for expression \(x^2/2\) plus some constant. We will see later that the Extreme Value distribution is often chosen for \(f\) predominantly because it leads to a closed for expression for the choice probability \(p(y|x)\).↩︎

  2. More precisely, \(p(y=2|x) = \Pr(x+\varepsilon > 1|x=0.5) = \Pr(\varepsilon > 0.5) = \int_{0.5}^1 f(\varepsilon) d\varepsilon = (0.5\varepsilon)\vert_{0.5}^1 = 0.25\).↩︎