This project is from UCI MSBA - NLP & Applications Class.
Team: Taeho Lee, Jacob Linao, Jiwon Ko, Tiffany Dinh, Chloe Ko
Tool: Python
The content of this post is excerpted from Jacob’s medium page. You can find our full project’s information on here.
Project Description and Objectives
For this project we will use NLP and machine learning to understand the sentiments surrounding the different areas of discussion at Disney Parks. In addition to comprehensive exploratory data analytics, we intend to gain insight into both the granular aspects of Disneyland and the park as a whole in which customers feel strongly about, to the extent of posting a review about it. With this, we hope to develop a data driven story about how Disney leaders can not only fix areas of negative sentiment but also continually improve areas that their visitors love.
A possible pitfall to avoid is treating topic modeling and consumer sentiments the same across all Disneyland segments. Due to three different geographic locations, there are many factors which may contribute to varying insights in consumer reviews. For example, we may see that different cultural and societal norms value offerings may result in this. It would be beneficial to conduct additional external research into Disney’s business strategies to better align with these insights.
We will evaluate our classification results for sentiment based on accuracy scores in order to ensure our model is predicting both positive and negative correctly. For topic modeling, we will focus on ensuring that our topics are interpretable and insightful.
Data
a. Data Acquisition
For this project, we have a valuable and rich dataset of customer reviews from all of the Disneyland parks webscraped from Tripadvisor using Python. The dataset includes 5333 reviews from Disneyland California Adventure, 12312 reviews from Disneyland Hong Kong, and 3230 reviws from Disneyland Tokyo.
b. Cleaning Process
To begin our analysis, we have to first clean the text. Within the reviews, there are a lot of unnecessary filler that we have to remove such as punctuation, emojis, extra whitespace, etc. Essentially anything that does not benefit text and sentiment analysis has to be filtered out. Additionally, this is why removing stopwords are important to remove.
c. Exploratory Descriptive Analysis
The notable aspect of our dataset is that the distribution of customer sentiment is very imbalanced. We derived “Positive” and “Negative” sentiment from the review rating as the following: (ratings of 4 or higher will be classified as positive reviews while reviews rating 3 and under will be negative)
With this engineered criteria, we see that the reviews are significantly more positive than negative. The disparity of sentiment ratings also drastically differ between the regions. The international regions have a lower ratio of negative to positive reviews.
In the charts above we see that, we draw basic insights from review counts and average ratings across time for all the Disneyland locations. It seems that with more reviews correlates with higher average rating. It is quite notable that the pandemic has drastically affected Disneyland. This incites changes to be made using to use review data a source of truth to discover what is wrong. Furthermore, text analysis can be used to investigate areas of negative sentiment, moreover what is making customers unhappy, potentially causing a drop in ratings.
Additionally, from a bird’s eye view we see the overall differences and similarities between positive and negative reviews using review ratings as a means to distinguish. In the wordcloud, we provide a visual representation of text data in which the size of each word indicates its frequency or importance. For the positives we see shows, fireworks, and attractions. On the other hand the negative wordcloud shows people (too much), hour, wait, and staff. Finally the similarities between negative and positive are:
Food: may be a conflicted area of interest, most likely due to varying preferences
Lines: dealbreaker for many, but not for some. Some people are willing to wait in line for rides, others cannot bear the inevitable lines Disneyland is known for
Topic Modeling
LDA or Latent Dirichlet Allocation, is an example of topic model which is utilized to cluster document text to a specific topic. It is a method designed to aid the discover the topics that exist in a corpus. For our analysis, we conducted topic modeling for all the parks and for each of the parks, separately. Our focus was centered on the negatively labeled reviews in order to extract common areas of discussion that Disneyland visitors found universally problematic. By identifying and inferring insight from these topics, we can create business recommendations to mitigate negative sentiment in these areas.
For our topic modeling, we extracted bigrams as features within the topics. Bi-grams are pair of consecutive written units such as letters, syllables, or words. For example, a bi-gram as an extension of the uni-gram, “good”, can be “good-service”. The use of bi-grams helps to contribute better context to the features that make up topics, making it easier to interpret. To also better visualize our LDA topic modeling results, we used a visualization tool called PyLDAvis. In this Python package, each bubble represents a topic. Larger bubble signifies that a higher percentage of the number of reviews in the corpus is about that topic.
Additionally, blue bars represent the overall frequency of each word in the corpus. On the other hand, red bars express the estimated number times a term was generated by a topic. The distance between the bubbles, or topics, represent how distinct the topics are from each other. A greater distance is a indication of a better topics, since there is no overlap of features between them, signifying that the model did an efficient job of clustering topics.
Disneyland Anaheim - Negative Reviews
Taking the first granular look into an individual park, we see the 4 negative topics in the California location. 3 of these topics are rides/attractions in Indiana Jones, Star Wars, and Haunted Mansion. The last topic pertains to Customer Service. Both Indiana Jones and Haunted Mansion are relatively old rides in the Anaheim park, so it might be a good idea to either explore what about these rides customers are having issues with or conduct research into what better options can replace these rides. Especially in the case for Haunted Mansion, it extremely outdated and most fans do not care for it as much because it isn’t the most kid friendly. Instead, a newer theme or upscale renovation may bring more fans to its liking. Lastly, if customer service is a rampant issue at this park, management would need to take further action and evaluate how to train staff better to improve on areas that are perceived by visitors as lackluster.
Disneyland Hong Kong - Negative Reviews
In this park, we see that all 4 topics pertain to entertainment and attractions. Because this is an entirely different location with a unique customer base and market preferences, it would be beneficial to dive deeper into segmenting these reviews based on these topics and further explore what their visitors are saying.
Disneyland Tokyo - Negative Reviews
Finally for Japan, the topics we extracted are Attractions, Speak English, School Holidays, and Walking around. One suggestion for this particular location may be to address tourist guides or staff to better accommodate English speaking guests. One interesting insight may be to point out the difference between Eastern and Western cultures. From the topic modeling, it seems that Western culture may be less focused on attractions and more on inconveniences with the park experience. Western cultures may feel more inclined to voice their troubles with customer service than Eastern cultures.
Conclusion
- Implementation of sentiment analysis and topic modeling will help Disneyland to understand customer’s choices and decision-making. From here Disneyland can utilize this to predict customers purchasing behavior.
- Improve the quality of Disneyland marketing strategies for a more productive and efficient marketing campaign. The insight from sentiment analysis help Disneyland determine which marketing message to deliver to which type of customers.
- Improve the guest experience at Disney’s park to retaining guests, building loyalty, and maintaining loyalty.