On January 28th, we had the first meeting of Michigan State University’s DH Reading Group. There was a good turnout to discuss topic modeling. Topic modeling refers to a family of algorithmic methods for organizing, sorting, and exploring large corpora. The resulting topics can be modeled over time as well as in relation to one another, and the techniques are not restricted to texts: they have also been applied to images, sounds, and other media.

We read and discussed the following articles:

- Megan R. Brett, “Topic Modeling: A Basic Introduction,” Journal of Digital Humanities 2.1 (2012)
- David M. Blei, “Probabilistic Topic Models,” Communications of the ACM 55.4 (2012)
- David M. Blei, “Topic Modeling and Digital Humanities,” Journal of Digital Humanities 2.1 (2012)
- Benjamin M. Schmidt, “Words Alone: Dismantling Topic Models in the Humanities,” Journal of Digital Humanities 2.1 (2012)

Megan Brett’s article offers an easy-to-follow introduction to topic modeling. David Blei’s articles are well written and provide a more in-depth discussion of topic modeling from a statistical perspective. Schmidt’s article offers some words of caution about the use of topic models in the humanities.

There were a number of points we lingered on in the reading group. We considered how topic modeling is based on a conjectured model of what documents consist of, namely a mixture of numerous topics/themes. The algorithms used to discover those latent (“hidden”) variables vary. The most common model in DH work now is latent Dirichlet allocation (LDA), and there is more than one way to fit it: the software package MALLET, for example, uses Gibbs sampling, but variational inference is another common approach. Especially after reading Schmidt’s article, we considered the numerous variables that influence the results of topic modeling.
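To make this concrete, here is a minimal sketch of fitting an LDA model in Python with the gensim library (one alternative to MALLET; gensim’s default trainer uses online variational Bayes rather than Gibbs sampling). The toy documents are invented for illustration only, and the parameter choices are exactly the kind of variables Schmidt cautions about:

```python
# A minimal sketch of fitting an LDA topic model with gensim.
# The toy documents below are invented for illustration only.
from gensim import corpora, models

documents = [
    ["poetry", "meter", "rhyme", "verse", "poetry"],
    ["archive", "metadata", "catalog", "digitization"],
    ["verse", "archive", "digitization", "poetry"],
]

# Map each word to an integer id, then convert the documents
# to bag-of-words (word id, count) vectors.
dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# num_topics, passes, and the alpha/eta priors are among the many
# settings that shape the resulting topics.
lda = models.LdaModel(
    bow_corpus,
    id2word=dictionary,
    num_topics=2,
    passes=10,
    random_state=42,  # without a fixed seed, topics vary run to run
)

# Each topic is a probability distribution over the vocabulary.
for topic_id, top_words in lda.print_topics():
    print(topic_id, top_words)
```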

We all more or less agreed that we do not have a strong enough understanding of Bayesian probability, the statistical basis of topic modeling. We hope to start a reading group on that important topic next Fall in conjunction with the Social Science Data Analytics Initiative here at MSU.
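As a rough sketch of what that statistical basis involves: LDA defines a generative probability model of documents, and fitting it means inverting that model with Bayes’ rule to infer the hidden topic assignments z from the observed words w:

```latex
p(z \mid w) = \frac{p(w \mid z)\, p(z)}{p(w)}
```

The denominator p(w) cannot be computed exactly for realistic corpora, which is why approximate inference methods such as Gibbs sampling (MALLET’s approach) or variational inference are used instead.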

-A. Sean Pue (@seanpue), Associate Professor, Department of Linguistics and Germanic, Slavic, Asian, and African Languages, Michigan State University
