Exploratory Data Analysis
- LAST REVIEWED: 24 April 2023
- LAST MODIFIED: 24 April 2023
- DOI: 10.1093/obo/9780199828340-0200
Introduction
Exploratory data analysis (EDA) is a strategy of data analysis that emphasizes maintaining an open mind to alternative possibilities. EDA is a philosophy or an attitude about how data analysis should be carried out, rather than a fixed set of techniques. It is difficult to obtain a clear-cut answer from “messy” human phenomena, and thus the exploratory character of EDA is well suited to psychological research. This research tradition was founded by John Tukey, who often relates EDA to detective work. In EDA, the role of the researcher is to explore the data in as many ways as possible until a plausible “story” emerges. A detective does not collect just any information. Instead, he or she collects clues related to the central question of the case. By the same token, EDA is not “fishing” or “torturing” the data set until it confesses. Rather, it is a systematic way to investigate relevant information from multiple perspectives. Tukey emphasizes the role of data analysis in research, rather than mathematics, statistics, and probability. Mathematics is secondary in the sense that it is a tool for understanding the data. Classical statistics aims to infer from the sample to the population, with probability defined as relative frequency in the long run. However, in many stages of inquiry, the working questions are non-probabilistic and the focal point should be the data at hand rather than long-run probabilistic inference. Hence, prematurely adopting a specific statistical model would hinder researchers from considering different possible solutions. Because EDA endorses open-mindedness and triangulation, it is not a standalone approach. Rather, it complements traditional confirmatory data analysis (CDA) by generating a working hypothesis, as well as by spotting outliers and assumption violations that might invalidate CDA. It can also be used side by side with Bayesian statistics and resampling.
With the advent of high-power computers and voluminous data, many exploratory techniques have been developed in data science. These methods are known as data mining. Because it is tedious or even impossible to detect data patterns when the sample size is extremely large or there are too many variables (this problem is called the “curse of dimensionality”), some data miners use machine learning to explore alternate routes for understanding the data. There are different taxonomies of EDA. Traditionally, EDA comprises residual analysis, data re-expression, resistant procedures, and data visualization. With the advance of high-power computing and big data analytics, an alternative taxonomy is goal oriented, comprising clustering, variable screening, and pattern recognition.
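As a rough illustration of three of the traditional components named above (resistant procedures, residual analysis, and data re-expression), the following Python sketch may help. It is not drawn from any of the works cited here: the data values are hypothetical, the `mad` helper is introduced only for this example, and the cutoff of three median absolute deviations for flagging a residual is an assumed rule of thumb rather than a prescription.

```python
import math
import statistics

# Hypothetical reaction times in milliseconds; the last value is a
# deliberately extreme observation.
data = [420, 450, 460, 480, 500, 510, 530, 2900]

# Resistant procedure: the median barely moves when one value is extreme,
# whereas the mean is pulled toward it.
mean = statistics.mean(data)
median = statistics.median(data)

def mad(values):
    """Median absolute deviation from the median (a resistant spread)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

spread = mad(data)

# Residual analysis: data = fit + residual, with the median as the fit.
residuals = [v - median for v in data]

# Assumed rule of thumb: flag residuals larger than 3 MADs.
flagged = [v for v, r in zip(data, residuals) if abs(r) > 3 * spread]

# Re-expression: a log transform pulls in the long right tail, so the
# mean and median of the re-expressed data sit closer together.
logs = [math.log10(v) for v in data]
raw_gap = abs(mean - median) / median
log_gap = abs(statistics.mean(logs) - statistics.median(logs)) / statistics.median(logs)

print(f"mean={mean}, median={median}, MAD={spread}")
print(f"flagged={flagged}")
print(f"relative mean-median gap: raw={raw_gap:.3f}, log={log_gap:.3f}")
```

The point of the sketch is the comparison rather than the particular numbers: the resistant summaries describe the bulk of the data, the residuals draw attention to the one unusual case, and the re-expressed values are less skewed.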
General Overviews
There are several concise general overviews of EDA. Behrens 1996; Behrens 2000; and Behrens, et al. 2013 summarize the conceptual aspects and computational tools of EDA, illustrating how EDA can complement hypothesis testing, in the context of psychology. Traditionally, data explorers subscribe to the empiricist notion: “Let the data speak for themselves.” In response to this notion, de Mast and Trip 2007 and de Mast and Kemper 2009 go one step further by arguing that even if a surprising feature pops up, the problem being studied may still be far from resolution. Hence, background knowledge is essential to data interpretation. They also present other major features of EDA for problem solving in quality management. Sometimes EDA is misunderstood as fishing, that is, trying different procedures until finding a significant result. To debunk this common misconception, Jebb, et al. 2017 explains what EDA is and what EDA is not. The work of the founder of EDA, John Tukey, provides a historical review of the development of EDA in the form of Tukey 1977, Tukey 1980 (cited under Exploratory Data Analysis and Confirmatory Data Analysis), Tukey 1986a, Tukey 1986b, and Tukey 1988. Although some of Tukey’s ideas presented in these books are not entirely new (e.g., Francis Galton proposes nonparametric methods and quantiles during the nineteenth century; Arthur Lyon Bowley explores a prototypical stemplot and also uses a seven-point summary during the early twentieth century), Tukey’s approach is still revolutionary given the fact that computing resources at his time were limited and thus computing-intensive data exploration and visualization were out of reach for researchers. Hence, some of Tukey’s proposed data visualization techniques, such as the stem-and-leaf plot and the five-point boxplot, could be done manually without a computer. More importantly, there is one overarching theme in all of Tukey’s works: counteracting confirmation bias.
Confirmation bias is a psychological flaw whereby humans tend to pay attention to data favoring their predetermined hypothesis while overlooking counterexamples. Tukey is well aware of this potential pitfall in confirmatory data analysis, though he did not explicitly use the term “confirmation bias.” As a remedy, Tukey proposes an exploratory approach to urge researchers to consider the alternatives. Confirmation bias is related to another psychological weakness: a false sense of certainty. The traditional statistical approach, which presents findings in a confirmatory tone, is embraced by audiences who prefer certainty to ambiguity. Tukey creates a paradigm shift by asserting that progress in statistics can be made only when analysts move away from certainty. Although this notion is still viewed with suspicion by some academics, today EDA and data visualization are standard practice in industry for business intelligence. Please consult Yu 2022 for more information.
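The two pencil-and-paper displays mentioned above, the stem-and-leaf plot and the five-point summary behind the boxplot, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are hypothetical, the function names are introduced here, the stems are taken to be tens and the leaves ones, and the quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"`, which is not necessarily identical to Tukey's hinges.

```python
import statistics
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative
    integers, using tens as stems and ones as leaves (an assumption)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Empty stems are kept so that gaps in the data remain visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    xs = sorted(values)
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, med, q3, xs[-1]

# Hypothetical test scores.
scores = [52, 55, 61, 63, 63, 68, 70, 74, 77, 95]

display = stem_and_leaf(scores)
summary = five_number_summary(scores)

print("\n".join(display))
print(summary)
```

In the display, the empty stem for the 80s makes the isolated score of 95 stand out, which is the kind of feature a summary statistic alone would hide.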
Behrens, J. T. 1996. Principles and procedures of exploratory data analysis. Psychological Methods 2.2: 131–160.
DOI: 10.1037/1082-989X.2.2.131
Besides illustrating the computational tools of EDA, Behrens also emphasizes that the proper application of EDA is determined not by computation, but rather by the purpose of the procedure.
Behrens, J. T. 2000. Exploratory data analysis. In Encyclopedia of psychology. Edited by A. E. Kazdin, 303–305. New York: Oxford Univ. Press.
This article illustrates the 4Rs of classical EDA using S-Plus and Xlisp-Stat. It emphasizes that the future direction of EDA is tied to computer technology.
Behrens, J. T., K. E. Dicerbo, N. Yel, and R. Levy. 2013. Exploratory data analysis. In Handbook of psychology. 2d ed. Vol. 2. Edited by J. A. Schinka, W. F. Velicer, and I. B. Weiner, 34–70. Hoboken, NJ: Wiley.
This book chapter is a comprehensive introduction to EDA, including the history and philosophy of EDA, the toolbox of EDA, computer software demonstrations, and future directions. It is noteworthy that the chapter also covers the legacy of John Tukey in the fields of regression diagnostics, robustness studies, and computer graphics for statistical use.
de Mast, J., and B. Kemper. 2009. Principles of exploratory data analysis in problem solving: What can we learn from a well-known case? Quality Engineering 21.4: 366–375.
DOI: 10.1080/08982110903188276
De Mast and Kemper use the example of John Snow’s investigation of a cholera outbreak to argue that data visualization alone is insufficient for problem solving. Rather, a comprehensive EDA should go beyond the empirical data by following these main steps: (1) display the data, (2) identify salient features, and (3) interpret salient features.
de Mast, J., and A. Trip. 2007. Exploratory data analysis in quality improvement projects. Journal of Quality Technology 39.4: 301–311.
DOI: 10.1080/00224065.2007.11917697
De Mast and Trip outline the major principles of EDA, including the purpose of EDA, data visualization, identification and interpretation of salient features, the role of automated procedures, and integration of EDA and confirmatory data analysis (CDA).
Jebb, A., S. Parrigon, and S. E. Woo. 2017. Exploratory data analysis as a foundation of inductive research. Human Resource Management Review 27.2: 265–276.
DOI: 10.1016/j.hrmr.2016.08.003
This article is concerned with the principles and philosophy of EDA. The authors argue that, because of the natural uncertainty of data patterns, EDA should be integrated with replication-based procedures, such as cross-validation. They also argue against fishing, data dredging, and p-hacking.
Tukey, J. W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley.
This is the seminal work on EDA. It is noteworthy that Tukey wrote the book before the age of high-power computing, and thus certain graphing techniques, such as the stem-and-leaf plot, are done with pencil and paper. Readers should focus on the conceptual aspect, not the computational procedures, of Tukey’s EDA.
Tukey, J. W. 1986a. Philosophy and principles of data analysis, 1949–1964. Vol. 3 of The collected works of John W. Tukey. Edited by L. V. Jones. Monterey, CA: Wadsworth & Brooks.
In this volume, Tukey argues that data analysis should be bottom-up (data driven) rather than top-down (model based). However, much of the material is repetitive.
Tukey, J. W. 1986b. Philosophy and principles of data analysis, 1965–1986. Vol. 4 of The collected works of John W. Tukey. Edited by L. V. Jones. Monterey, CA: Wadsworth & Brooks.
The content of Volume 4 is similar to that of Volume 3. Tukey emphasizes that statistics is an empirically based discipline and therefore data analysts must be mentally prepared for surprising results. Tukey also points out that many statistical models are constructed even though assumptions are violated.
Tukey, J. W. 1988. Graphics. Vol. 5 of The collected works of John W. Tukey. Edited by W. S. Cleveland. Pacific Grove, CA: Wadsworth.
This volume focuses on graphical methods of EDA. This book should be used for understanding the history of EDA and data visualization. In the early twenty-first century, much more powerful graphing techniques are available in different software packages.
Yu, C. H. 2022. Data mining and exploration: From traditional statistics to modern data science. New York: CRC Press.
Chapter 2 of this book explains how data exploration, data mining, and data visualization are commonly used in the industry for business intelligence.