Emerging Trends and Technologies

Analyzing Amazon Reviews
with Machine Learning

Using machine learning to extract business insights from Amazon reviews

Rob Quin

July 15, 2023

Project Overview

The purpose of this project is to draw insights from customer reviews to provide Amazon sellers and marketers ways to improve their products and marketing.

This project uses a specific type of machine learning called Natural Language Processing (NLP) to assign categories to reviews so that sellers can understand how customers feel about their products without having to read every review.

Three major NLP tools are used in this project: spaCy, OpenAI, and DistilBART.

Using NLP to Process Reviews

Using machine learning tools requires balancing output quality, speed, and cost. OpenAI's models are among the best, but using them is expensive at scale, so I use free NLP tools for the simpler tasks. spaCy preprocesses the reviews before they are sent to OpenAI's models to generate categories, and DistilBART then applies those categories to the reviews. Both spaCy and DistilBART are free.
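As a sketch of the final step above (DistilBART applying categories to reviews), Hugging Face's zero-shot-classification pipeline can score a review against a list of candidate labels. The model name and labels below are illustrative assumptions, not necessarily this project's exact choices.

```python
# Hedged sketch: zero-shot classification with a DistilBART model via
# Hugging Face transformers. The model name and candidate labels are
# assumptions for illustration, not this project's exact configuration.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="valhalla/distilbart-mnli-12-3")

review = "The battery died after two days and support never replied."
labels = ["battery life", "customer service", "shipping", "price"]

result = classifier(review, candidate_labels=labels)
print(result["labels"][0])  # highest-scoring category
```

Because the model never saw these labels during training, you can swap in whatever categories OpenAI proposes without retraining anything.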

NLP Process Flow Diagram

Data Sources

I sourced these Amazon reviews from RapidAPI, a marketplace offering APIs for many kinds of data, such as Google Maps data, Yelp data, news data, and social media data, and all sorts of data you might otherwise have to scrape.

The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world. It's basically a giant database of political events that is updated every 15 minutes with links to news articles. I've been thinking about applying these machine learning ideas to news data, and this might be an interesting project for anyone who wants to dive into machine learning.

https://rapidapi.com/

https://www.gdeltproject.org/

Natural Language Processing - https://www.ibm.com/topics/natural-language-processing

The field of artificial intelligence concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

Speech recognition

Part-of-speech tagging - nouns, verbs, etc.

Named entity recognition - locations, people's names

Sentiment analysis - positive, negative, angry, sad, etc.

Text Summarization

Spacy - https://spacy.io/

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

I use spaCy to remove stop words and lemmatize words. Stop words are words like "and", "or", and "the" that carry little meaning on their own. Lemmatization distills related word forms down to a base word, for example, reducing "builds", "building", or "built" to the lemma "build".

Lemmatization

Reducing words to their base form. For example, "changing", "changes", "changed" all become "change".

Stop Words

Common words like "the", "a", "in" that are filtered out during text processing as they add little semantic value.

Tokenization

Breaking text into smaller units (tokens) like words, phrases, or symbols for analysis.

LangChain & Hugging Face

LangChain and Hugging Face do most of the heavy lifting by providing abstractions that distill complex machine learning code into a few lines of readable Python.

I use OpenAI's text-davinci-003 (Davinci 3) model later in this project, which costs around 2 cents per 1,000 tokens, counting both input and output. Per OpenAI, a token averages about 4 characters.

So if you send a 400-character instruction to OpenAI and it returns a 200-character response, you will be charged for about 150 tokens ((400 + 200) / 4), or around 0.3 cents. That sounds small, but it adds up if you run this over millions of reviews.
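The arithmetic above is worth wrapping in a small helper so you can estimate a job's cost before running it. The defaults below use the ~4 characters-per-token and $0.02-per-1,000-tokens figures from the text; the function name is my own.

```python
# Back-of-envelope cost estimate for OpenAI calls, using the
# approximations above: ~4 chars per token, $0.02 per 1,000 tokens.
def estimate_cost_usd(prompt_chars, response_chars,
                      chars_per_token=4, usd_per_1k_tokens=0.02):
    tokens = (prompt_chars + response_chars) / chars_per_token
    return tokens / 1000 * usd_per_1k_tokens

cost = estimate_cost_usd(400, 200)
print(f"${cost:.4f} per review")         # $0.0030, i.e. about 0.3 cents
print(f"${cost * 1_000_000:,.0f} for a million reviews")
```

At a million reviews the same call costs about $3,000, which is exactly why the free tools handle the simpler steps.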

Model Evaluation

You'll want to split the data that you build your models with into a training set, a validation set, and a test set.

Training Set

The subset of the data used to fit the model.

Validation Set

Used to evaluate the model's results and guide further tuning.

Test Set

Used to measure the final performance of the model.

The difference between the validation set and the test set is that you may tweak your model after viewing its results on the validation set, but not after viewing its results on the test set.

The test set provides an unbiased estimate of the model's performance on new, unseen data. It assesses the model's ability to generalize beyond what it has already been exposed to.
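A common way to carve out the three sets described above is to call scikit-learn's `train_test_split` twice. The 60/20/20 split and the stand-in data below are illustrative assumptions; the right ratios depend on how much data you have.

```python
# Split data into train / validation / test (60% / 20% / 20% here)
# by applying scikit-learn's train_test_split twice.
from sklearn.model_selection import train_test_split

data = list(range(100))  # stand-in for 100 reviews

# First split: 60% train, 40% held out.
train, holdout = train_test_split(data, test_size=0.4, random_state=42)
# Second split: divide the holdout evenly into validation and test.
val, test = train_test_split(holdout, test_size=0.5, random_state=42)

print(len(train), len(val), len(test))  # 60 20 20
```

Fixing `random_state` makes the split reproducible, so reruns of the notebook evaluate on the same held-out reviews.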

Evaluating the Model

You want to evaluate not just how often the model is right, but how badly it is wrong when it misses. A 95% success rate sounds good until we're talking about the probability of an airplane landing safely. Loss functions are useful for penalizing your model in proportion to how bad its errors are.
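The point above can be made concrete with log loss (cross-entropy), a standard loss function for classifiers: a confident wrong prediction is penalized far more heavily than a mildly wrong one, which plain accuracy cannot express. This is a generic illustration, not necessarily the loss used in this project.

```python
# Log loss for a single binary prediction: accuracy treats every miss
# equally, but log loss punishes confident mistakes much harder.
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Cross-entropy for one example; y_true is 0 or 1,
    p_pred is the predicted probability of class 1."""
    p = min(max(p_pred, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(round(log_loss(1, 0.9), 3))  # right and confident: ~0.105
print(round(log_loss(1, 0.6), 3))  # right but unsure:    ~0.511
print(round(log_loss(1, 0.1), 3))  # confidently wrong:   ~2.303
```

Averaged over a dataset, this is the metric the model is trained to minimize, so an "airplane-landing" style mistake shows up loudly instead of hiding inside a 95% accuracy figure.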

Demo


Here's the Code

https://github.com/nothing-special20/neon-temple-ml-demo/blob/main/amz_ml/Review%20Quality%20Extraction.ipynb