• Introduction
• General Overviews
• Philosophies of EDA

Introduction

Exploratory data analysis (EDA) is a strategy of data analysis that emphasizes keeping an open mind to alternative possibilities. EDA is a philosophy, or an attitude about how data analysis should be carried out, rather than a fixed set of techniques. Because it is difficult to obtain clear-cut answers from “messy” human phenomena, the exploratory character of EDA suits psychological research well. This research tradition was founded by John Tukey, who often likens EDA to detective work. In EDA, the role of the researcher is to explore the data in as many ways as possible until a plausible “story” emerges. A detective does not collect just any information; instead, he or she collects clues related to the central question of the case. By the same token, EDA is not “fishing” or “torturing” the data set until it confesses. Rather, it is a systematic way to investigate relevant information from multiple perspectives. Tukey emphasizes the role of data analysis in research, rather than mathematics, statistics, and probability. Mathematics is secondary in the sense that it is a tool for understanding the data. Classical statistics aims to infer from the sample to the population, based on probability understood as long-run relative frequency. In many stages of inquiry, however, the working questions are non-probabilistic, and the focal point should be the data at hand rather than long-run probabilistic inference. Hence, prematurely adopting a specific statistical model can hinder researchers from considering different possible solutions. Because EDA endorses open-mindedness and triangulation, it is not a standalone approach. Rather, it complements traditional confirmatory data analysis (CDA) by generating working hypotheses, as well as by spotting outliers and assumption violations that might invalidate CDA. It can also operate side by side with Bayesian statistics and resampling methods. 
With the advent of high-powered computers and voluminous data, many exploratory techniques have been developed in data science; these methods are collectively known as data mining. Because it is tedious or even impossible to detect data patterns by hand when the sample size is extremely large or there are too many variables (a problem called the “curse of dimensionality”), some data miners use machine learning to explore alternative routes for understanding the data. There are different taxonomies of EDA. Traditionally, EDA comprises residual analysis, data re-expression, resistant procedures, and data visualization. With the advance of high-powered computing and big data analytics, an alternative taxonomy is goal oriented, comprising clustering, variable screening, and pattern recognition.

General Overviews

There are several concise general overviews of EDA. Behrens 1996; Behrens 2000; and Behrens, et al. 2013 summarize the conceptual aspects and computational tools of EDA, illustrating how EDA can complement hypothesis testing in the context of psychology. Traditionally, data explorers subscribe to the empiricist notion: “Let the data speak for themselves.” In response to this notion, de Mast and Trip 2007 and de Mast and Kemper 2009 go one step further by arguing that even if a surprising feature pops up, the problem being studied may still be far from resolution. Hence, background knowledge is essential to data interpretation. Additionally, they also present other major features of EDA for problem solving in quality management. Sometimes EDA is misunderstood as fishing—trying different procedures until finding a significant result. To debunk this common misconception, Jebb, et al. 2017 explains what EDA is and what EDA is not. The work of the founder of EDA, John Tukey, provides a historical review of the development of EDA in the form of Tukey 1977, Tukey 1980 (cited under Exploratory Data Analysis and Confirmatory Data Analysis), Tukey 1986a, Tukey 1986b, and Tukey 1988. Although some of Tukey’s ideas presented in these books are not entirely new (e.g., Francis Galton proposes nonparametric methods and quantiles during the 19th century; Arthur Lyon Bowley explores a prototypical stemplot and also uses a seven-point summary during the early 20th century), Tukey’s approach is still revolutionary given that computing resources in his time were limited, and thus computing-intensive data exploration and visualization were out of reach for researchers. Accordingly, some of Tukey’s proposed data visualization techniques, such as the stem-and-leaf plot and the boxplot based on the five-number summary, can be done manually without a computer. More importantly, there is one overarching theme in all of Tukey’s works: counteracting confirmation bias. 
Confirmation bias is a psychological flaw: humans tend to pay attention to data favoring their predetermined hypothesis while overlooking counterexamples. Tukey was well aware of this potential pitfall in confirmatory data analysis, though he did not explicitly use the term “confirmation bias.” As a remedy, Tukey proposed an exploratory approach that urges researchers to consider alternatives. Confirmation bias is related to another psychological weakness: a false sense of certainty. The traditional statistical approach, which presents findings in a confirmatory tone, is embraced by audiences who prefer certainty to ambiguity. Tukey created a paradigm shift by asserting that progress in statistics can be made only when analysts move away from certainty.
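Two of the hand-computable techniques mentioned above, the five-number summary (the basis of the boxplot) and the stem-and-leaf plot, can be sketched in a few lines of code. The sketch below uses only the Python standard library; the sample scores are invented for illustration, and the quartile convention shown is one common choice (Tukey’s original “hinges” differ slightly for some sample sizes).

```python
# Minimal sketch of two classic Tukey techniques, using only the
# Python standard library. Data are invented for illustration.
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using a common quartile convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                # lower half of the data
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]  # upper half (median excluded if n is odd)
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

def stem_and_leaf(data):
    """Return a stem-and-leaf display (stems = tens digit, leaves = units digit)."""
    stems = {}
    for x in sorted(data):
        stems.setdefault(x // 10, []).append(x % 10)
    return "\n".join(
        f"{stem:>2} | " + "".join(str(leaf) for leaf in stems[stem])
        for stem in sorted(stems)
    )

scores = [23, 25, 31, 34, 34, 38, 41, 42, 47, 52, 58, 61]
print(five_number_summary(scores))  # (23, 32.5, 39.5, 49.5, 61)
print(stem_and_leaf(scores))
```

The point of both displays is Tukey’s: they expose the shape, center, spread, and outliers of a batch of numbers at a glance, with no probabilistic model assumed.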

• Behrens, J. T. 1996. Principles and procedures of exploratory data analysis. Psychological Methods 2:131–160.

Besides illustrating the computational tools of EDA, Behrens also emphasizes that the proper application of EDA is determined not by computation, but rather by the purpose of the procedure.

• Behrens, J. T. 2000. Exploratory data analysis. In Encyclopedia of psychology. Edited by A. E. Kazdin, 303–305. New York: Oxford Univ. Press.

This article illustrates the 4Rs of classical EDA (resistance, residuals, re-expression, and revelation) using S-Plus and Xlisp-Stat. It emphasizes that the future direction of EDA is tied to computer technology.

• Behrens, J. T., K. E. Dicerbo, N. Yel, and R. Levy. 2013. Exploratory data analysis. In Handbook of psychology. 2d ed. Vol. 2. Edited by J. A. Schinka, W. F. Velicer, and I. B. Weiner, 34–70. Hoboken, NJ: Wiley.

This book chapter is a comprehensive introduction to EDA, including the history and philosophy of EDA, the toolbox of EDA, computer software demonstrations, and future directions. It is noteworthy that the chapter also covers the legacy of John Tukey in the fields of regression diagnostics, robustness studies, and computer graphics for statistical use.

• de Mast, J., and B. Kemper. 2009. Principles of exploratory data analysis in problem solving: What can we learn from a well-known case? Quality Engineering 21:366–375.

De Mast and Kemper use the example of John Snow’s investigation of a cholera outbreak to argue that data visualization alone is insufficient for problem solving. Rather, a comprehensive EDA should go beyond the empirical data by following three main steps: (1) display the data, (2) identify salient features, and (3) interpret the salient features.

• de Mast, J., and A. Trip. 2007. Exploratory data analysis in quality improvement projects. Journal of Quality Technology 39:301–311.

De Mast and Trip outline the major principles of EDA, including the purpose of EDA, data visualization, identification and interpretation of salient features, the role of automated procedures, and integration of EDA and confirmatory data analysis (CDA).

• Jebb, A., S. Parrigon, and S. E. Woo. 2017. Exploratory data analysis as a foundation of inductive research. Human Resource Management Review 27:265–276.

This article is concerned with the principles and philosophy of EDA. The authors argue that due to the natural uncertainty of data patterns, EDA should be integrated with replication-based procedures, such as cross-validation. Additionally, they argue against fishing, data dredging, and p-hacking.

• Tukey, J. W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley.

This is the seminal work on EDA. It is noteworthy that Tukey wrote the book before the age of high-powered computing, and thus certain graphing techniques, such as the stem-and-leaf plot, are carried out with pencil and paper. Readers should focus on the conceptual aspects, not the computational procedures, of Tukey’s EDA.

• Tukey, J. W. 1986a. Philosophy and principles of data analysis: 1949–1964. Vol. 3 of The collected works of John W. Tukey. Edited by L. V. Jones. Monterey, CA: Wadsworth & Brooks.

In this volume, Tukey argues that data analysis should be bottom-up (data driven) rather than top-down (model based). However, much of the material is repetitive.

• Tukey, J. W. 1986b. Philosophy and principles of data analysis: 1965–1986. Vol. 4 of The collected works of John W. Tukey. Edited by L. V. Jones. Monterey, CA: Wadsworth & Brooks.

The content of Volume 4 is similar to that of Volume 3. Tukey emphasizes that statistics is an empirically based discipline and therefore data analysts must be mentally prepared for surprising results. Tukey also points out that many statistical models are constructed even when their assumptions are violated.

• Tukey, J. W. 1988. Graphics. Vol. 5 of The collected works of John W. Tukey. Edited by W. S. Cleveland. Pacific Grove, CA: Wadsworth.

This volume focuses on the graphical methods of EDA. It is best used for understanding the history of EDA and data visualization, since far more powerful graphing techniques have become available in various software packages in the early 21st century.