Text Mining
- LAST REVIEWED: 23 August 2022
- LAST MODIFIED: 23 August 2022
- DOI: 10.1093/obo/9780199772810-0295
Introduction
Text data mining, or simply text mining, encompasses tasks that typically analyze vast amounts of digitized text to detect patterns of use and then extract useful information in the search for knowledge; thus, it is one way of achieving artificial intelligence. In other words, text mining is the process of extracting value from text data. Text mining is grounded in data mining, so the two fields of data science share many similarities, e.g., in the use of machine learning algorithms. However, data mining usually deals with structured data sets containing numerical data, whereas text mining aims to process unstructured or semi-structured data, mainly in the form of text documents. For this reason, pre-processing techniques in text mining focus on identifying and extracting significant features from text data. Moreover, text mining benefits from advances in natural language processing, particularly when transforming unstructured text into structured data suitable for analysis. With the exponential growth of data in the Internet era, text mining has attracted much attention as part of efforts to reduce the problem of information overload. Indeed, Web mining, which aims to discover and analyze relevant information from heterogeneous data on the Web, as in the case of user-generated content from social media, requires significant advances in text mining technologies within a data fusion framework. This article is organized into two main topics: machine learning models and algorithms, which aim to discover knowledge from new data, and text-mining applications, which illustrate various tasks that can extract information from texts.
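To make the step from unstructured text to structured data concrete, the following is a minimal sketch in Python, not a method taken from any of the works cited here. It assumes a very simple pre-processing pipeline (lowercasing and splitting on alphabetic characters) and builds a document-term matrix of raw word counts; the function names and the two-sentence toy corpus are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical toy corpus, for illustration only.
docs = [
    "Text mining extracts information from text.",
    "Data mining deals with structured data.",
]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the resulting matrix is a numerical representation of one document, which is the kind of structured input that data-mining and machine learning algorithms typically expect. Real pipelines usually add further steps, such as stop-word removal, stemming or lemmatization, and term weighting.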
Textbooks
Dozens of books about text mining have been published since the mid-2000s. Key concepts and methods of text-mining technologies come from data mining, machine learning, and natural language processing (NLP). Therefore, most of these books aim to provide readers with a cross-disciplinary understanding of this area, as Feldman and Sanger 2007 and Aggarwal and Zhai 2012 point out. Moreover, for readers who want to engage in experimental research, books such as Hofmann and Chisholm 2016 and Žižka, et al. 2019 adopt a hands-on approach to text mining and include worked examples. However, text-mining textbooks are intended for different types of readers. On the one hand, books such as Bramer 2007, Alpaydin 2016, and Ignatow and Mihalcea 2018 present the fundamentals of data mining, machine learning, and text mining, respectively, making them more accessible to students or researchers in the social sciences and humanities. On the other hand, books such as Jo 2019 and Zong, et al. 2021 are intended for data-science students and researchers who have experience in computer science and some mathematical knowledge (e.g., probabilities, linear algebra, and vector calculus).
Aggarwal, Charu C., and Cheng Xiang Zhai, eds. 2012. Mining text data. New York: Springer.
The book covers classical applications of text mining and explores newer aspects resulting from emerging platforms on the Internet, such as mining text streams, translingual mining from text data, text mining in multimedia, or text analytics in social media. A valuable resource for text-mining students and researchers.
Alpaydin, Ethem. 2016. Machine learning: The new AI. Cambridge, MA: MIT Press.
The book provides an overall idea of machine learning and some examples of its application. The book is intended for a general readership, so mathematical or programming details are not discussed.
Bramer, Max. 2007. Principles of data mining. London: Springer.
The book provides a good grounding in basic data-mining algorithms for classification and clustering, which are central to knowledge discovery. This introductory book is suitable for undergraduate or graduate students in nontechnical disciplines, so it is accessible to readers with only a basic knowledge of mathematics.
Feldman, Ronen, and James Sanger. 2007. The text mining handbook: Advanced approaches in analyzing unstructured data. Cambridge, UK: Cambridge Univ. Press.
The book presents a comprehensive discussion of text-mining models, techniques, and approaches. Of particular interest is the chapter describing prototypical text-mining solutions built to deal with the vast amount of text data generated in real-world industry settings. A book for researchers and professional practitioners interested in text mining.
Hofmann, Markus, and Andrew Chisholm. 2016. Text mining and visualization: Case studies using open-source tools. Boca Raton, FL: CRC Press.
DOI: 10.1201/b19007
An introduction to text mining presenting some of the most popular open-source tools: RapidMiner, KNIME (cited under Software), R, and Python. Each chapter is written so that readers can replicate the implementation of the described use cases.
Ignatow, Gabe, and Rada Mihalcea. 2018. An introduction to text mining: Research design, data collection, and analysis. Thousand Oaks, CA: SAGE.
This introductory guide is a good starting point for students in nontechnical fields (e.g., social sciences and humanities) who want to do research using text-mining tools and data sets.
Jo, Taeho. 2019. Text mining: Concepts, implementation, and big data challenge. Cham, Switzerland: Springer.
DOI: 10.1007/978-3-319-91815-0
Apart from providing fundamental knowledge about text mining, the book is primarily concerned with various aspects related to text classification and clustering. The book, which requires only an elementary level of mathematics, is intended for graduate students and researchers.
Witten, Ian H., Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2017. Data mining: Practical machine learning tools and techniques. London: Morgan Kaufmann.
The book explains a wide variety of machine learning methods and illustrates their working implementations with WEKA (cited under Software).
Žižka, Jan, František Dařena, and Arnošt Svoboda. 2019. Text mining with machine learning: Principles and techniques. Boca Raton, FL: CRC Press.
The book takes a practical approach to familiarizing readers with text mining through machine learning methods, presenting examples written in the R language. Readers should have programming skills and mathematical knowledge to understand the most common algorithms, use them, and interpret their results.
Zong, Chengqing, Rui Xia, and Jiajun Zhang. 2021. Text data mining. Singapore: Springer.
DOI: 10.1007/978-981-16-0100-2
After introducing fundamental text-mining models and methods, the book describes the main applications. It includes a clear introduction to neural-learning methods for text mining, focusing on distributional semantics through word embeddings, pre-training and fine-tuning transformers, and text classification with deep-learning methods. A valuable book for graduate students and practitioners working on text mining or NLP.