How Automotive Companies Can Use Text Analytics to Stay Ahead Of The Competition

Tarash Jain
May 2, 2021

Abstract: Using NLP and Lift Value Analysis to extract insights from online review platforms to drive branding and marketing decisions.

Introduction to Text Mining

With more than 58% of the world’s population having daily access to the internet, and with data storage now cheap, the volume of data being generated, especially unstructured data, has grown exponentially. In business settings, roughly 80% of information originates in unstructured form, primarily as text. It is, therefore, crucial for companies to understand how to extract value and insights from such data sources and leverage these insights to drive business decisions. This is where Text Analytics comes into play.

Text mining, or Text Analytics, refers to a range of interdisciplinary artificial intelligence (AI) approaches and technologies used to extract information from a collection of free (unstructured) text documents. Natural Language Processing (NLP) is how a program interprets this data and identifies trends and patterns, leading to the discovery of new knowledge. For businesses, NLP can automate the process of understanding customer feedback and reviews at scale, and the findings can inform business process improvement strategies.

In this study, we explore how Automotive companies can utilize text analytics to examine comments posted in online discussion forums, such as Edmunds.com. We explore and pre-process the text data using NLP systems, then apply Machine Learning and Statistical methods (such as Lift Value Analysis) to interpret the data.

Data Sources

Edmunds.com (https://forums.edmunds.com) is an online resource for automotive information. It covers new and used cars, including car prices, dealer and inventory listings, and tips and advice on all aspects of car purchase and ownership. As a hub for all things cars, Edmunds attracts many visitors who leave comments and opinions.

We focused our analysis on the mid-sized sedan market and scraped 5,000 reviews using BeautifulSoup.

Pre-Processing

Data pre-processing occurred in two stages. First, the data was cleaned; then each word was tagged with its corresponding part of speech. The details of all pre-processing steps are as follows:

1. Contraction Expansion: A contraction is a word made by shortening and combining two words (e.g. ‘can’t’ = cannot, ‘don’t’ = do not). These contractions were transformed into their expanded form.

2. Tokenization: This is a way of separating a piece of text into smaller units called ‘tokens’. We used the NLTK package to split each review into tokens, using whitespace as the delimiter, and stored the tokens in a list.

3. Lower Casing: Since Python string comparison is case-sensitive (‘Honda’ and ‘honda’ are different strings), all words were converted to lower case for ease of analysis.

4. Replacing Model Names: We aimed to group cars into distinct categories by brand. In reviews, cars were often referred to by model name (e.g. Prius) rather than by brand name (e.g. Toyota). We therefore replaced each model name with its brand name to maintain consistency.

5. Removing Punctuation and Stopwords: Stopwords are words that add little meaning to a sentence (e.g., ‘the’, ‘a’, ‘in’). Both punctuation and stopwords were removed.

6. POS Tagging and Lemmatization: Lemmatization considers the context of a word/sentence and converts the word to its meaningful base form (e.g., ‘driving’ becomes ‘drive’). Part-Of-Speech Tagging should be performed prior to lemmatization because it increases the conversion accuracy. For instance, if we only lemmatize, the word ‘forgot’ could potentially remain as ‘forgot’. However, if POS Tagging is applied beforehand, Lemmatization will more likely convert ‘forgot’ to its accurate base form ‘forget’.

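7. Removing Duplicate Words: Duplicate words were removed from each review, so that every word counts at most once per review. This matters because the metric used to evaluate the relationship between brands and attributes is based on the number of reviews in which a word appears, not on how many times it is repeated within a single review.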

Figure 1: Data Pre-Processing
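The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the lookup tables `CONTRACTIONS`, `MODEL_TO_BRAND` and `STOPWORDS` are small hypothetical stand-ins, plain `str.split` stands in for the NLTK tokenizer, and step 6 (POS tagging and lemmatization) is left out because it depends on NLTK's tagger and lemmatizer. Lower-casing is applied first here so that the contraction lookup is case-insensitive.

```python
import string

# Hypothetical lookup tables for illustration; the study's real lists are not published.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is"}
MODEL_TO_BRAND = {"prius": "toyota", "camry": "toyota", "accord": "honda"}
STOPWORDS = {"the", "a", "an", "in", "is", "it", "and", "but", "to", "of", "i", "my"}


def preprocess(review):
    """Return the cleaned, de-duplicated tokens of one review."""
    # Step 3: lower-case; also normalise curly apostrophes to straight ones.
    text = review.lower().replace("\u2019", "'")
    # Step 1: expand contractions.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Step 2: tokenise on whitespace (stand-in for the NLTK tokenizer).
    tokens = text.split()
    # Step 5a: strip leading and trailing punctuation from each token.
    tokens = [t.strip(string.punctuation) for t in tokens]
    # Step 4: replace model names with brand names.
    tokens = [MODEL_TO_BRAND.get(t, t) for t in tokens]
    # Step 5b: drop empty tokens and stopwords.
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    # Step 7: remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(tokens))


print(preprocess("I can't decide: the Prius is small, but the Accord is a small Honda!"))
# ['cannot', 'decide', 'toyota', 'small', 'honda']
```

Note how ‘Accord’ and ‘Honda’ collapse into a single `honda` token once model names are replaced and duplicates are removed.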

Analytical Approach and Results

1 — Co-occurrence Matrix, Lift Value and MDS

To analyze the relationship between brands, a useful starting point is the co-occurrence matrix, which counts the number of times each brand is mentioned in the same comment as another brand. Table 1 shows the co-occurrence matrix of the top 10 most frequently occurring brands in the dataset.

Table 1: The values represent the number of times the brands are mentioned together
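As a rough sketch of how such counts can be produced, the snippet below tallies brand pairs over tokenized reviews. The function name `cooccurrence`, the `BRANDS` subset and the three toy reviews are assumptions made for illustration, not the study's data.

```python
from collections import Counter
from itertools import combinations

BRANDS = {"honda", "toyota", "ford"}  # illustrative subset of brands


def cooccurrence(reviews, terms):
    """Count, for each unordered pair of terms, the reviews mentioning both."""
    counts = Counter()
    for tokens in reviews:
        # Sort so that each pair is always stored in the same (alphabetical) order.
        present = sorted(terms & set(tokens))
        counts.update(combinations(present, 2))
    return counts


# Toy tokenized reviews (made up for this example).
reviews = [
    ["honda", "toyota", "price"],
    ["honda", "toyota", "ford"],
    ["ford", "engine"],
]
pairs = cooccurrence(reviews, BRANDS)
print(pairs[("honda", "toyota")])  # 2
print(pairs[("ford", "honda")])    # 1
```

Because each review is reduced to a set of terms, a brand mentioned several times in one review still contributes only one co-occurrence.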

One disadvantage of the co-occurrence matrix is that raw counts can be misleading: they do not take into account how often each term appears in the dataset as a whole. For example, suppose ‘Brand A’ and ‘Word A’ co-occur in 3,500 comments while ‘Brand B’ and ‘Word A’ co-occur in 2,000. We might initially assume that ‘Word A’ is more closely associated with ‘Brand A’. However, if Brand A appears in 300,000 reviews and Brand B in only 10,000, then ‘Word A’ accompanies about 1.2% of Brand A’s reviews (3,500 / 300,000) but 20% of Brand B’s (2,000 / 10,000). Relative to how often each brand is discussed, ‘Word A’ is far more closely tied to ‘Brand B’.

To tackle the issue and normalize the data, we use the concept of Lift Values. Lift is the ratio of the actual co-occurrence of two terms to the frequency with which we would expect to see them together.

The lift between terms A and B can be calculated as:

$$\mathrm{Lift}(A, B) = \frac{P(A, B)}{P(A)\,P(B)}$$

where P(A) is the probability that term A occurs in a given message, P(B) is defined in the same way for term B, and P(A,B) is the probability that both A and B appear in a given message.
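In code, lift is the joint probability of two terms divided by the product of their individual probabilities, with each probability estimated as the share of reviews containing the term. The sketch below assumes each review has already been reduced to a set of unique tokens; the function name `lift` and the toy reviews are illustrative only.

```python
def lift(reviews, a, b):
    """Lift(a, b) = P(a, b) / (P(a) * P(b)), estimated from review counts."""
    n = len(reviews)
    n_a = sum(a in r for r in reviews)              # reviews mentioning a
    n_b = sum(b in r for r in reviews)              # reviews mentioning b
    n_ab = sum(a in r and b in r for r in reviews)  # reviews mentioning both
    if n_a == 0 or n_b == 0:
        return 0.0  # lift is undefined when a term never occurs; report 0 here
    return (n_ab / n) / ((n_a / n) * (n_b / n))


# Toy reviews as sets of unique tokens (made up for this example).
reviews = [
    {"honda", "toyota"},
    {"honda", "toyota"},
    {"honda", "ford"},
    {"ford"},
]
print(round(lift(reviews, "honda", "toyota"), 3))  # 1.333
print(round(lift(reviews, "honda", "ford"), 3))    # 0.667
```

Here `honda` and `toyota` appear together more often than their individual frequencies would suggest, while `honda` and `ford` appear together less often.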

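The higher the lift value, the stronger the relationship between the two terms. A lift above 1 means the terms appear together more often than would be expected if they were independent; a lift below 1 means they appear together less often. Table 2 shows the lift values of the top 10 most frequently occurring brands: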

Table 2: The brand pair with the highest lift value is: Chevrolet and Saturn (5.83)

To further explore the mid-size sedan market structure, the lift values were plotted onto a graph in an effort to identify brand clusters. We employed Multidimensional Scaling (MDS), which is a traditional market-structure analysis and visualization tool. Figure 2 depicts the MDS plot of the top 10 car brands in our dataset.

Figure 2: MDS Plot
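To give a feel for how lift values become coordinates on a map, here is a small, dependency-free sketch of classical (Torgerson) MDS. Two things are assumptions rather than details from the study: the conversion of lift into a dissimilarity as 1 / lift, and the use of classical MDS with power iteration. In practice a library routine, such as the MDS implementation in scikit-learn, would normally be used. The brand names and lift values are toy inputs.

```python
import math


def classical_mds(dist, dims=2, iters=200):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [1.0 + i for i in range(n)]  # arbitrary non-constant start vector
        for _ in range(iters):           # power iteration towards the top eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for v.
        eig = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if eig <= 0:
            break  # no further positive eigenvalue; remaining coordinates stay 0
        for i in range(n):
            coords[i][k] = v[i] * math.sqrt(eig)
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - eig * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Toy lift values for three hypothetical brands (not the article's data).
names = ["brand_a", "brand_b", "brand_c"]
lifts = [
    [None, 4.0, 1.0],
    [4.0, None, 1.0],
    [1.0, 1.0, None],
]
# Assumed transformation: dissimilarity = 1 / lift, with 0 on the diagonal.
dist = [[0.0 if i == j else 1.0 / lifts[i][j] for j in range(3)] for i in range(3)]
coords = classical_mds(dist)
for name, (x, y) in zip(names, coords):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```

With these inputs, `brand_a` and `brand_b` (high lift) land close together, while `brand_c` sits farther from both. The orientation of the map is arbitrary; only the distances between points carry meaning.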

In an MDS plot, items that are close together are more similar than items that are farther apart. Using this analysis, brand managers can identify the automotive brands that are often associated with their brands and develop a strategy to distinguish themselves from the competitors within their cluster in order to capture substantial market share.

Two brands worth noting are Saturn and Chevrolet. They have the highest lift value (5.83) and sit close together in the MDS plot. Both brands belong to the same parent company, General Motors. Because online reviewers treat the two as similar, it is important for their brand managers to establish distinct messaging and promotional strategies for each brand to avoid confusion and cannibalization.

2 — Identifying the most distinctive feature of each brand.

To obtain the vehicle attributes most frequently mentioned in reviews, we calculated the frequency distribution of every word in the entire dataset. We then identified the most discussed attributes and grouped similar descriptors into more general categories for easier analysis.

• Performance -> [engine, power, hp, speed, run, transmission]

• Look -> [interior, design, sporty, pretty, quality]

• Cost -> [price, value, warranty]

• Size -> [big, small, midsize, large]

• Gas -> [mileage, mpg, fuel, mile]
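The grouping above maps naturally onto a dictionary lookup. The sketch below uses the categories and descriptors listed in this section; the names `ATTRIBUTE_GROUPS`, `WORD_TO_ATTRIBUTE` and `map_attributes` are hypothetical, and the de-duplication at the end is an assumption that mirrors the earlier pre-processing step.

```python
# Categories and descriptors as listed above.
ATTRIBUTE_GROUPS = {
    "performance": ["engine", "power", "hp", "speed", "run", "transmission"],
    "look": ["interior", "design", "sporty", "pretty", "quality"],
    "cost": ["price", "value", "warranty"],
    "size": ["big", "small", "midsize", "large"],
    "gas": ["mileage", "mpg", "fuel", "mile"],
}

# Invert the grouping into a descriptor -> category lookup.
WORD_TO_ATTRIBUTE = {
    word: category
    for category, words in ATTRIBUTE_GROUPS.items()
    for word in words
}


def map_attributes(tokens):
    """Replace each descriptor with its category, then drop repeats."""
    mapped = [WORD_TO_ATTRIBUTE.get(t, t) for t in tokens]
    return list(dict.fromkeys(mapped))


print(map_attributes(["honda", "price", "warranty", "mpg"]))
# ['honda', 'cost', 'gas']
```

After mapping, a review that mentions both ‘price’ and ‘warranty’ counts once towards the Cost category.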

The top 5 brands and the top 5 attributes mentioned in the Edmunds.com reviews were:

Table 3: Top 5 brands and top 5 attributes in the data

We further conducted a Lift Analysis to explore the relationships between each brand and the top attributes. Each brand was mainly associated with the following attribute:

o Honda — Cost

o Ford — Look

o Toyota — Size

o Hyundai — Cost

o Mazda — Size

Table 4: The values represent the lift values of the top 5 brands and features
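Picking each brand's most distinctive attribute amounts to taking the attribute with the highest lift for that brand. The sketch below is illustrative only: the helper names `lift` and `top_attribute` are assumptions, and the toy reviews are made up and unrelated to the values in Table 4.

```python
def lift(reviews, a, b):
    """Lift from review counts: n_ab * n / (n_a * n_b)."""
    n = len(reviews)
    n_a = sum(a in r for r in reviews)
    n_b = sum(b in r for r in reviews)
    n_ab = sum(a in r and b in r for r in reviews)
    return n_ab * n / (n_a * n_b) if n_a and n_b else 0.0


def top_attribute(reviews, brand, attributes):
    """Return the attribute with the highest lift for the given brand."""
    return max(attributes, key=lambda attr: lift(reviews, brand, attr))


# Toy reviews as sets of unique tokens, with descriptors already mapped to categories.
reviews = [
    {"honda", "cost"},
    {"honda", "cost", "size"},
    {"toyota", "size"},
    {"toyota", "size", "cost"},
    {"toyota", "size"},
]
attributes = ["cost", "size"]
for brand in ["honda", "toyota"]:
    print(brand, "->", top_attribute(reviews, brand, attributes))
# honda -> cost
# toyota -> size
```

When two attributes tie, `max` returns the first one in the list, so ties should be checked by hand before drawing conclusions.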

Please note that this analysis does not explore the sentiment attached to each attribute. As such, we cannot conclude at this stage whether an attribute is associated with a brand in a positive or a negative way.

Conclusion

With the rapid creation and availability of text data in online platforms, businesses across all industries will benefit greatly from learning how to employ text analytics methodologies to extract meaning from customer feedback in order to better tailor their business processes and strategies.

By using NLP, brand managers can quickly and systematically gather and process large amounts of customer data. This can sharpen their marketing strategies and, in turn, help differentiate their brands from the competition.

As a next step for this study, we plan to explore more advanced text mining techniques, such as Sentiment Analysis and Topic Modelling, to extract additional insights from text data.

References

[1] http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/

[2] http://www.nactem.ac.uk

[3] https://forums.edmunds.com/discussion/7526/general/x/midsize-sedans-2-0

[4] Akiva N, Greitzer E, Krichman Y, Schler J (2008) Mining and visualizing online Web content using BAM: Brand Association Map. Proc. Second Internat. Conf. Weblogs Soc. Media 2008 (Association for the Advancement of Artificial Intelligence, Seattle),170–171.

[5] Netzer O, Feldman R, Goldenberg J, Fresko M (2012) Mine your own business: Market-structure surveillance through text mining. Marketing Sci. http://dx.doi.org/10.1287/mksc.1120.0713
