Big Data in Political Science Research
Keith T. Poole, L. Jason Anastasopoulos, James E. Monogan
  • LAST MODIFIED: 27 September 2017
  • DOI: 10.1093/obo/9780199756223-0232


In both applied politics and academic political science research, big data techniques have gained considerable traction for analyzing causal relationships, making useful classifications, and forecasting. Generally speaking, big data methods are techniques that tax existing software or hardware, thereby requiring a degree of ingenuity to deal with the computational demands of the data or the method. Consequently, what counts as big data is a moving target from year to year and generation to generation as computer processing, storage, and software improve. To illustrate this pattern: Admiral Grace Hopper, inventor of the first compiler for a programming language, would routinely hand out 11.8-inch pieces of wire in lectures she gave. She remarked that, in a perfect vacuum, this is how far light could travel in a single nanosecond. Computers had to be small to be fast, because a smaller machine minimizes the distance that signals must travel. As computers have gotten smaller and more efficient, in part due to Hopper’s work, analyzing larger data sets with more complex structures and more complex machine learning methods has become increasingly feasible.

How has big data become a part of political research? In applied politics, campaigns now frequently engage in microtargeting, a type of cluster analysis that takes extensive databases about as many voters as possible and determines logical classifications for them. By tracking known behaviors to standardize the model, campaigns can then forecast what the ideal message would be for the voters they are reaching out to.

In academic political science research, meanwhile, many methods of machine learning are taking off, allowing scholars to answer more complicated questions and use more complicated data. A large body of research treats texts such as written records or floor speeches as data, again using clustering algorithms to determine common speech patterns. A longer-standing stream of research uses Monte Carlo and Markov Chain Monte Carlo techniques to evaluate methods and to estimate Bayesian models. Tree-based methods have emerged as a technique for escaping the curse of dimensionality, the problem that emerges when there are more variables that could potentially affect an outcome than can possibly be included in the model. Measurement techniques that may need to span many years and make comparisons across many different settings have become pivotal to political science. The following bibliography points to several general topics that the potential big data analyst may need to consider and several sources to consult within each topic.
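
As a rough illustration of what it means to use Monte Carlo techniques to evaluate methods, the sketch below repeatedly simulates data from a known distribution and compares two competing estimators of its center, the sample mean and the sample median. It is not taken from any of the works cited here. The function name `simulate_estimators`, the sample size, the number of replications, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_estimators(n_obs=25, n_reps=2000, seed=42):
    """Draw repeated samples from a standard normal distribution and
    record two estimators of its center for each sample.

    All argument defaults are illustrative assumptions.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    means, medians = [], []
    for _ in range(n_reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return means, medians


means, medians = simulate_estimators()

# Both estimators should be centered near the true value of 0;
# the comparison of interest is how much each varies across samples.
print("average of sample means:  ", round(statistics.fmean(means), 3))
print("average of sample medians:", round(statistics.fmean(medians), 3))
print("variance of sample means:  ", round(statistics.variance(means), 4))
print("variance of sample medians:", round(statistics.variance(medians), 4))
```

Because the data-generating process is known, the simulated sampling distributions show directly which estimator is less variable under these conditions. The same pattern extends to evaluating more elaborate methods by swapping in a different data-generating process or estimator.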

Reference Works

The books listed here are all useful references for the analyst engaged in applied big data analysis. Some are sources of R commands that analysts may want to consult when writing their own code (such as Grolemund and Wickham 2017 or Monogan 2015). Programming references that are more specific to big data include James, et al. 2013 (which uses R); Joshi 2016 (which uses Python); and Press, et al. 1996 (which uses C). Johnson, et al. 1994–1995; Johnson, et al. 1997; and Johnson, et al. 2005 all serve as essential background on the probability theory that is critical to big and little data analysis. For background on computational algorithms that are key to big data, readers can consult Steele, et al. 2016 as well as Efron and Hastie 2016. All of these volumes provide information that the applied analyst may need to look up at times.

  • Cheney, Ward, and David Kincaid. Numerical Mathematics and Computing. 7th ed. Boston: Brooks/Cole, 2013.

    This is an updated numerical analysis text for the era of big data. In addition to the inclusion of common numerical optimization routines, numerical differentiation, and integration, it covers mathematical preliminaries and floating point representation, linear systems, Monte Carlo methods and simulations, and a section on linear programming.

  • Efron, Bradley, and Trevor Hastie. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. New York: Cambridge University Press, 2016.

    DOI: 10.1017/CBO9781316576533

    This up-to-date book by two very distinguished statisticians discusses the history of modern statistics since fast digital computers made more advanced statistical methods possible. The book discusses classical frequentist, Bayesian, and mixed methods. The discussions are lucid and very valuable to the understanding of current data science.

  • Golub, Gene H., and Charles F. Van Loan. Matrix Computations. 4th ed. Baltimore: Johns Hopkins University Press, 2013.

    The development of efficient algorithms for high-dimensional data necessarily requires an understanding of advanced linear algebra and high-dimensional matrix computations. To this end, Golub and Van Loan provide an excellent introduction to efficient algorithms for advanced matrix computations. Among the topics covered in this text are a wide variety of matrix factorization techniques, parallel computation algorithms, and optimization routines.

  • Grolemund, Garrett, and Hadley Wickham. R for Data Science. Sebastopol, CA: O’Reilly, 2017.

    For modern data analysis in R, Hadley Wickham’s packages provide essential tools. Wickham and Grolemund cover the most useful R packages for modern data analysis: the popular ggplot2 for data visualization; dplyr for data transformation; tibble, readr, and tidyr for data wrangling; stringr for working with text data; and many others. The text also covers document creation using R Markdown and offers practical programming advice. Available online.

  • James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: With Applications in R. New York: Springer, 2013.

    DOI: 10.1007/978-1-4614-7138-7

    This book provides an excellent, intermediate-level introduction to statistical learning theory and applications in R. The text is the less advanced version of its more mathematically mature relative, The Elements of Statistical Learning, and provides an all-inclusive self-guiding tour through some of the most popular machine learning algorithms and implementations. This text is essential reading for graduate students and faculty in political science interested in learning some of the most basic machine learning algorithms.

  • Joshi, Prateek. Python Machine Learning Cookbook. Birmingham, UK: Packt, 2016.

    This volume walks through a wide range of topics, including classifiers, clustering, text-as-data, image analysis, neural networks, and visualizing data. It focuses on algorithms in Python to complete each of these sorts of tasks. Example Python code is included throughout the text, and there is a dedicated GitHub page with data and code from the book.

  • Johnson, Norman L., Samuel Kotz, and N. Balakrishnan. Continuous Univariate Distributions. 2d ed. 2 vols. New York: Wiley, 1994–1995.

    This and the other two books in the series (Johnson, et al. 1997 and Johnson, et al. 2005) are essential reference works for continuous and discrete distributions used in probability and statistics. These two volumes specifically focus on probability distributions of a single variable that has a continuous metric, such as the normal, logistic, beta, uniform, χ², t, F, and many more. Many underlying distributions like extreme value distributions (a motivator of the logistic) are also included.

  • Johnson, Norman L., Samuel Kotz, and N. Balakrishnan. Discrete Multivariate Distributions. New York: Wiley, 1997.

    This book covers the complex area of discrete multivariate distributions, wherein multiple variables are discrete in nature. These include the multinomial, multivariate Poisson, and others. All of these books by Johnson, Kotz, and Balakrishnan cover the history and mathematical derivations of well-known distributions with references to the original papers by the mathematicians and statisticians who derived them. The authors extensively cross-reference the distributions because so many distributions have obscure variants that were derived to solve specific statistical problems.

  • Johnson, Norman L., Samuel Kotz, and N. Balakrishnan. Univariate Discrete Distributions. 3d ed. New York: Wiley, 2005.

    DOI: 10.1002/0471715816

    This volume lays much of the theoretical foundation behind all probability distributions: gamma and beta functions, Bayes’ theorem, moments of a probability distribution, order statistics, and many other features. It then describes families of discrete distributions, which generally take on values that are some subset of nonnegative integers (such as a count or a binary outcome). Distributions covered include the binomial, Poisson, negative binomial, and several others.

  • Monogan, James E., III. Political Analysis Using R. New York: Springer, 2015.

    DOI: 10.1007/978-3-319-23446-5

    This book offers an introductory-to-intermediate guide to the statistical program R. In particular, chapter 8 introduces how user-written add-on packages allow analysts to apply data-intensive methods in R, such as roll-call scaling or Markov Chain Monte Carlo analysis. Chapters 10 and 11 introduce R’s programming functionality, which gives applied analysts of big data more flexibility than many programs when conducting research.

  • Press, William H., Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge, UK: Cambridge University Press, 1996.

    Press and colleagues provide a foundational text for implementing basic numerical analysis algorithms in C. One of the great benefits of a text that implements numerical analysis algorithms in C is that the simplicity of the language serves well as a conceptual basis for other, more modern programming languages such as C++, R, and Python. This book covers the most relevant numerical algorithms used for applied linear algebra problems, integration, random number generation, sorting, and optimization.

  • Steele, Brian, John Chandler, and Swarna Reddy. Algorithms for Data Science. Cham, Switzerland: Springer, 2016.

    DOI: 10.1007/978-3-319-45797-0

    This text is an essential introduction for anyone planning on effectively analyzing high-dimensional data. It reads like a good introductory statistics book that covers basic visualization and data analysis techniques, but with an eye toward scalability for each method implemented. In addition to covering linear regression, the text covers tools for handling big data, such as Hadoop and MapReduce, and also discusses elementary machine learning algorithms, such as k-means clustering and naïve Bayes.

  • VanderPlas, Jake. Python Data Science Handbook. Sebastopol, CA: O’Reilly, 2017.

    This text is a great reference for those just starting to use Python. The book discusses how to use IPython notebooks effectively; describes the basic elements of Python’s most popular numerical computation packages, NumPy and Pandas; and also discusses Matplotlib for data visualization. This text covers basic machine learning algorithms available through the Scikit-Learn package, including naïve Bayes, linear regression, support vector machines, decision trees and random forests, principal components analysis, and k-means clustering.
