R Companion to Discrete Choice Methods with Simulation

Author

Dan Yavorsky

Warning

This is an early draft. The book is incomplete. Every aspect of it is subject to change. Many things might be incorrect. Are you sure you want to read this yet?

Welcome

You don’t understand it, until you have coded it.

It’s a mantra that my mentors instilled in me in graduate school, and one that I propagate on my students today. I believe it in deeply. If you want to understand a statistical model, it is insufficient to interact with the model only through pen and paper, no matter how well you organize the subscripts or how masterfully you interweave elements of the Greek and Roman alphabets. To really understand a statistical model, you need to be able to simulate data from the model, and then take that simulated data to an estimation routine that enables you to recover the parameters of interest.

This is something you learn to do in graduate school. But, unless you have one of the most caring and pedagogical advisers, no one teaches it to you! Instead, it’s a skill that students develop independently and inefficiently between classes and assignment due dates or, in my case, after I was knee-deep in my dissertation research. There is a tremendous imbalance between how immensely important this skill is and the lack of time and instruction dedicated to it.¹ That imbalance is the motivation for this book.

I address this gap in the specific domain of Discrete Choice Modeling. I do so alongside Kenneth Train’s masterful text Discrete Choice Methods with Simulation (Second Edition) freely available online at https://eml.berkeley.edu/books/choice2.html.

Train and I align in our thinking:

“…the true value of the new approach to choice modeling is the ability to create tailor-made models. The computation and programming steps that are needed to implement a new model are usually not difficult. An important goal of the book is to teach these skills as an integral part of the exposition of the models themselves.”²

What is “not that difficult” for an experienced researcher can be immensely difficult for a typical graduate student. The aim of this book is to ease that difficulty. I demonstrate R code alongside the math. You will see Train’s pseudo-code transformed into working R code. No mysteries will remain. The goal, in fact, is to provide you with much of the same instruction that Train’s students received when taking his course. To borrow a quote Quintilian, Cobbett, Cooke, and many others: my goal is to take you through the process of programming the simulation and estimation of discrete choice models not so that you can understand, but so that you cannot possibly misunderstand.

How to read this book

As the title suggests, the structure, topics, and organization of this book parallel Train (2009). You should read — or most likely, re-read — one chapter from Train (2009) and then read and work through the corresponding chapter here, alternating between our texts. Do not simply read this book from start to finish without frequently returning to Train (2009). Doing so only deprives you of the intended experience and likely dramatically reduces the amount you will learn from the process.

In addition, I strongly encourage you write or copy the code as you work though this book. I would mandate this if I could, but I’m not sitting there next to you and so I must settle for simply sharing my encouragement. Write the code. Play with the code. Break the code. Make it your own. That behavior is where deep understanding comes from, not from highlighting the occasional sentence or equation you find important.

Prerequisites

This book is intended for a narrow audience, predominantly graduate students with an interest in discrete choice modelling who will find value from seeing and interacting with the programmatic implementation of the multinomial logit and its extended family of related models. In other words, someone who read Train (2009) and thought how would I code that?

We will simulate data from the statistical models and then estimate the parameters of those models from the simulated data. We will do all of this in R, a freely available software environment for statistical computing and graphics. As a result, I assume you are reasonably familiar with R. If not, there is a tremendous set of free online R resources collected and organized at https://www.bigbookofr.com/. Popular books include Wickham, Cetinka-Rundel, and Grolemund (2023) and Wickham (2019).

I also assume you have taken introductory statistics or econometrics courses, as the concepts and techniques taught there are foundational for understanding and estimating the discrete choice models covered by Train (2009). In particular, you should have no uncertainty about the difference between a model, the estimator, and the estimate. To briefly review:

a model is the set of mathematical assumptions about how data are generated,
the estimator (or equivalently, the estimation routine) is an algorithm or function of the data, and
the estimate is the result of applying the estimator to a particular dataset.

In my experience, students are often given a model and an estimator, after which much time in the classroom is spent deriving properties of the estimator for that particular model, and then a homework assignment asks students to implement the estimator on a dataset to find an estimate. This approach puts almost no emphasis on the specification of the model or the choice of estimator. For example, should part of the model be specified as \(\beta_0 + \beta_1x_1\) or should it be \(\beta_0 + \beta_1x_1 + \beta_2x_2\)? And once we have specified a model, should we derive an estimator via least squares, the method of moments, maximum likelihood, a Bayesian approach, or some other way?

If my description captures your experience in introductory statistics and econometrics courses and you would like to review key ideas, I highly recommend Abramovich and Ritov (2023) on the topic of mathematical statistics, Goldberger (1991) and Kennedy (2008) for econometrics, and McElreath (2018) and Gelman et al. (2013) for Bayesian statistics.

Acknowledgments

I am immensely grateful to the teams that work on the R-Project, those at Posit who provide RStudio and Quarto, and the related communities of developers, academics, and R users. The free tools they provide and the welcoming communities they have established are both exceptional.

I also thank my academic mentors Ella Honka, Peter Rossi, Eric Bradlow, and (although we have yet to meet) Kenneth Train as well as the folks that tolerate me through my consulting work, including Prachi Bhalerao, June Wu, Chris Diener, Keith Chrzan, and the hosts and attendees of Sawtooth Software’s annual Analytics and Insights Summit. I have learned so much from these researcher’s guidance and often their written work.

Special thanks go to generous people who reviewed and assisted with drafts of this book, including Darren Aeillo, Kalyan Rallabandi, and Geoff Zheng.

Unquestionably, my deepest thanks go to Alison, Zach, and Ben for their patience and support.

Colophon

An online version of this book is available at https://dcms-r.danyavorsky.com. The source of the book is available at https://github.com/dyavorsky/dcms-r. The book is authored using Quarto, an open-source scientific and technical publishing system that makes it easy to create articles, presentations, websites, books, and other publications that combine text and executable code.

What I’m talking about is the field of statistical computing. My experience was that graduate students in economics, psychology, and business had little, if any, formal exposure to these ideas. See Rizzo (2019) for a related text that emphasizes computational statistics, and McElreath (2018) and Gelman et al. (2020) for a Bayesian perspective that emphasize a “Bayesian Workflow.”↩︎
Train (2009) pg 2. Train provides pseudo-code throughout his book and generously makes Matlab code available on his website.↩︎