Linguistics Text Mining
by
Carlos Periñán-Pascual
  • LAST REVIEWED: 23 August 2022
  • LAST MODIFIED: 23 August 2022
  • DOI: 10.1093/obo/9780199772810-0295

Introduction

Text data mining, or simply text mining, encompasses tasks that typically analyze vast amounts of digitized text to detect patterns of use and then extract useful information in the search for knowledge; thus, it is one way of achieving artificial intelligence. In other words, text mining is the process of extracting value from text data. Text mining is grounded in data mining, so the two fields of data science share many similarities, e.g., in the use of machine learning algorithms. However, data mining usually deals with structured data sets containing numerical data, whereas text mining aims to process unstructured or semi-structured data, mainly in the form of text documents. For this reason, pre-processing techniques in text mining focus on identifying and extracting significant features from text data. Moreover, text mining benefits from advances in natural language processing, particularly when transforming unstructured text into structured data suitable for analysis. With the exponential growth of data in the Internet era, text mining has attracted much attention as part of efforts to reduce the problem of information overload. Indeed, Web mining, which aims to discover and analyze relevant information from heterogeneous data on the Web, such as user-generated content from social media, requires significant advances in text-mining technologies within a data fusion framework. This article is organized into two main topics: machine learning models and algorithms, which aim to discover knowledge from new data, and text-mining applications, which illustrate various tasks that can extract information from texts.
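
The step of transforming unstructured text into structured data can be illustrated with a minimal sketch. The example below is an illustration only, not a method taken from any of the works cited here: it assumes a simple bag-of-words representation, and the helper names `tokenize` and `document_term_matrix` as well as the two sample sentences are hypothetical. Each document becomes a row of term counts, which is the kind of numerical table that data-mining algorithms usually expect.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # One Counter of term frequencies per document.
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sample documents, used only for illustration.
docs = [
    "Text mining extracts patterns from text.",
    "Data mining works on structured data.",
]
vocabulary, matrix = document_term_matrix(docs)

for term, column in zip(vocabulary, zip(*matrix)):
    print(f"{term:12s} {column}")
```

Real pipelines typically add further pre-processing (for instance, stop-word removal or term weighting), which this sketch deliberately leaves out.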

Textbooks

Dozens of books about text mining have been published since the mid-2000s. Key concepts and methods of text-mining technologies come from data mining, machine learning, and natural language processing (NLP). Therefore, most of these books aim to provide readers with a cross-disciplinary understanding of this area, as Feldman and Sanger 2007 and Aggarwal and Zhai 2012 point out. Moreover, books that aim to engage the reader in experimental research, such as Hofmann and Chisholm 2016 and Žižka, et al. 2019, adopt a hands-on approach to text mining and include examples. However, text-mining textbooks are intended for different types of readers. On the one hand, books such as Bramer 2007, Alpaydin 2016, and Ignatow and Mihalcea 2018 present the fundamentals of data mining, machine learning, and text mining, respectively, making them more accessible to students or researchers in the social sciences and humanities. On the other hand, books such as Jo 2019 and Zong, et al. 2021 are intended for data-science students and researchers who have experience in computer science and some mathematical knowledge (e.g., probability, linear algebra, and vector calculus).

  • Aggarwal, Charu C., and Cheng Xiang Zhai, eds. 2012. Mining text data. New York: Springer.

    The book covers classical applications of text mining and explores newer aspects resulting from emerging platforms on the Internet, such as mining text streams, translingual mining from text data, text mining in multimedia, or text analytics in social media. A valuable resource for text-mining students and researchers.

  • Alpaydin, Ethem. 2016. Machine learning: The new AI. Cambridge, MA: MIT Press.

    The book provides a general overview of machine learning and some examples of its application. The book is intended for a general readership, so mathematical or programming details are not discussed.

  • Bramer, Max. 2007. Principles of data mining. London: Springer.

    The book provides a good grounding in basic data-mining algorithms for classification and clustering, which are central to knowledge discovery. This introductory book is suitable for undergraduate or graduate students in nontechnical disciplines, so it is accessible to readers with only a basic knowledge of mathematics.

  • Feldman, Ronen, and James Sanger. 2007. The text mining handbook: Advanced approaches in analyzing unstructured data. Cambridge, UK: Cambridge Univ. Press.

    The book presents a comprehensive discussion of text-mining models, techniques, and approaches. Of particular interest is the chapter describing prototypical text-mining solutions built to deal with the vast amount of text data generated in real-world industry settings. A book for researchers and professional practitioners interested in text mining.

  • Hofmann, Markus, and Andrew Chisholm. 2016. Text mining and visualization: Case studies using open-source tools. Boca Raton, FL: CRC Press.

    DOI: 10.1201/b19007

    An introduction to text mining presenting some of the most popular open-source tools: RapidMiner, KNIME (cited under Software), R, and Python. Each chapter is written so that readers can replicate the implementation of the described use cases.

  • Ignatow, Gabe, and Rada Mihalcea. 2018. An introduction to text mining: Research design, data collection, and analysis. Thousand Oaks, CA: SAGE.

    DOI: 10.4135/9781506336985

    This introductory guide is a good starting point for students in nontechnical fields (e.g., social sciences and humanities) who want to do research using text-mining tools and data sets.

  • Jo, Taeho. 2019. Text mining: Concepts, implementation, and big data challenge. Cham, Switzerland: Springer.

    DOI: 10.1007/978-3-319-91815-0

    Apart from providing fundamental knowledge about text mining, the book is primarily concerned with various aspects related to text classification and clustering. The book, which requires only an elementary level of mathematics, is intended for graduate students and researchers.

  • Witten, Ian H., Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2017. Data mining: Practical machine learning tools and techniques. London: Morgan Kaufmann.

    The book explains a wide variety of machine learning methods and illustrates their working implementations with WEKA (cited under Software).

  • Žižka, Jan, František Dařena, and Arnošt Svoboda. 2019. Text mining with machine learning: Principles and techniques. Boca Raton, FL: CRC Press.

    DOI: 10.1201/9780429469275

    The book aims to familiarize readers with text mining through machine learning methods, taking a practical approach with examples written in the R language. Readers should have programming skills and mathematical knowledge to understand the most common algorithms, use them, and interpret their results.

  • Zong, Chengqing, Rui Xia, and Jiajun Zhang. 2021. Text data mining. Singapore: Springer.

    DOI: 10.1007/978-981-16-0100-2

    After introducing fundamental text-mining models and methods, the book describes the main applications. It includes a clear introduction to neural-learning methods for text mining, focusing on distributional semantics through word embeddings, pre-training and fine-tuning transformers, and text classification with deep-learning methods. A valuable book for graduate students and practitioners working on text mining or NLP.
