Text Analytics: A Primer

By Kevin Gray and Bing Liu

Marketing scientist Kevin Gray asks Professor Bing Liu to give us a quick snapshot of text analytics. The interview was first published in GreenBook on January 24, 2017.

_____________________________________________________________________

KG: I see “text analytics” and “text mining” used in various ways by marketing researchers and often used interchangeably. What do these terms mean to you?

BL: My understanding is that the two terms mean the same thing. People from academia use the term text mining, especially data mining researchers, while text analytics is mainly used in industry. I seldom see academics use the term text analytics. There is another closely related term, called natural language processing (NLP). Text mining and text analytics usually refer to the application of data mining and machine learning algorithms to text data. NLP covers that and also other more traditional natural language tasks such as machine translation, syntax, semantics, etc. But there is really no clear demarcation between the terms.

KG: Can you give us a brief history of text analytics/mining and how it has evolved over time?

BL: It comes from three research areas: information retrieval, data mining, and natural language processing (NLP). Information retrieval started in the 1970s. It mainly deals with text retrieval. That is, given a query, which can be a few keywords or a full document, we want to find related documents from a text collection or corpus. Web search engines are giant information retrieval systems.
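To make the retrieval task concrete, here is a minimal sketch of ranking documents against a keyword query. It assumes a TF-IDF weighting with cosine similarity, which is one common textbook approach and not something specified in the interview; the function names `tokenize` and `rank`, and the smoothed IDF formula, are illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens (no stemming; an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, corpus):
    """Return (score, doc_index) pairs, best match first."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in d)

    def idf(term):
        # Smoothed inverse document frequency (illustrative formula).
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vec(counts):
        return {t: c * idf(t) for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(Counter(tokenize(query)))
    scored = [(cosine(q, vec(d)), i) for i, d in enumerate(docs)]
    # Highest score first; ties broken by original document order.
    return sorted(scored, key=lambda s: (-s[0], s[1]))


corpus = [
    "Cats chase mice",
    "Stock markets fell today",
    "The cat sat on the mat",
]
results = rank("cat mat", corpus)
```

In this toy corpus only the third document shares terms with the query, so it ranks first. Because the sketch matches exact tokens, "cats" in the first document does not match "cat"; real systems typically add stemming or similar normalization.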

Traditional data mining uses structured data such as database tables. In the late 1990s, researchers started to use text as data, which gave rise to text mining. Early text mining basically applied data mining and machine learning algorithms on text data without using NLP techniques such as parsing, part-of-speech tagging, summarization, etc.

NLP has a much longer history. It started in the 1950s and its objective is to make computers understand human language. As text mining research expanded its scope in the past 10 years or so, it started to use natural language processing techniques such as parsing, part-of-speech tagging, coreference resolution, etc. Judging from topics covered in natural language processing conferences, text mining has now become a part of natural language processing. My own research started with traditional data mining. I then worked on sentiment analysis or opinion mining, which led me to natural language processing.

KG: How is it used in marketing research and other fields?

BL: Text analytics has been used widely in marketing and many other fields. I am most familiar with the application of sentiment analysis. In marketing, marketers often want to know consumer opinions about their company’s products and their competitors’ products. Such opinions can be obtained by analyzing online reviews or other forms of social media postings about those products. Based on these opinions, marketers can formulate their marketing messages to suit different segments of the market. Public opinions are also very useful in many other application domains, e.g., stock market prediction, consumer sentiment prediction, political election prediction, etc.
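As a rough illustration of how opinions might be scored in review text, the sketch below counts positive and negative words and flips the polarity of a word that directly follows a negator. This is a deliberately simple lexicon-based approach and is not presented as the method used in Professor Liu's research; the word lists and the function name `sentiment_score` are hypothetical.

```python
import re

# Tiny hand-made lexicons, purely illustrative (hypothetical word lists).
POSITIVE = {"good", "great", "love", "excellent", "reliable"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}
NEGATORS = {"not", "never", "no"}


def sentiment_score(text):
    """Return a signed count: above zero leans positive, below zero negative."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        # Flip polarity when the previous token is a negator ("not good").
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


reviews = [
    "The battery is great and reliable",
    "Not good, the screen is terrible",
    "It arrived on Tuesday",
]
scores = [sentiment_score(r) for r in reviews]
```

A word-counting approach like this misses sarcasm, comparisons, and context-dependent wording, which is part of why accuracy on such tasks remains limited.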

KG: What are the major technical challenges text analytics faces?

BL: It all depends on what task one is interested in. Some tasks are done reasonably well, e.g., named entity recognition. But many other tasks still need a lot of improvement in accuracy. The ultimate challenge is natural language understanding. Although researchers have worked on it for a long time, progress has not been great. Current text analytics techniques are still mainly based on traditional linguistic rules and statistical machine learning and data mining algorithms. These methods are still not able to achieve true understanding. Due to this problem, most text analytics tasks still have relatively low accuracy.

KG: What role does Artificial Intelligence play in text analytics?

BL: Advanced text analytics is a part of artificial intelligence (AI). Progress in other AI areas such as machine learning and data mining is making a big impact on text analytics. I would say that the main progress of text analytics in the past twenty years has come from better machine learning techniques.

KG: Are there misperceptions or misunderstandings many people seem to have about text analytics?

BL: I am not aware of any big misperceptions or misunderstandings about text analytics in academia. I am not sure about industry. The only thing that I know is that people can have very high expectations about text analytics, but it is a very challenging problem if you want to do it well and accurately.  

KG: Lastly, looking ahead ten years, what do you think text analytics will be able to do that it cannot do now? Are there some things that will be impossible for text analytics for the foreseeable future?

BL: Let’s talk about natural language processing rather than text analytics, as advanced text analytics requires natural language processing. As machine learning techniques such as deep learning progress, we will certainly see better text analytics algorithms with much better accuracy than we can achieve today. But understanding natural language the way we humans do is very unlikely in the foreseeable future because natural language is highly abstract. Every sentence we write has a great deal of commonsense knowledge behind it that we assume the reader knows. Clearly, a computer program does not know this. Learning, representing, and reasoning about commonsense knowledge is a major challenge.

KG: Thank you, Bing!

__________________________________________________________________

Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy. 

Bing Liu is a full professor of Computer Science at the University of Illinois at Chicago (UIC). He received his PhD in Artificial Intelligence from the University of Edinburgh and is the author of numerous books and articles on sentiment analysis and opinion mining, machine learning, and related subjects.