Modeling Coffee Reviews with Linear Models

Will Rodman | Final Project | MATH-6040 Linear Models

This project studies data from coffeereview.com, which was obtained through web scraping and published on kaggle.com. The goal is to develop a linear regression model that predicts individual ratings for coffee beans. Additional questions about the data are answered along the way.

Project Sections:

  1. Loading and cleaning the dataset.
  2. Visualizing numerical features.
  3. Visualizing categorical features.
  4. Transforming and testing numerical features.
  5. Fitting linear regression models.

Data source: Link

Sprint 1: Loading and cleaning the dataset.

The CSV files coffee_df and coffee_analysis contain two different versions of the scraped data. To use features from both, I merge the two datasets and drop duplicate rows.
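A minimal sketch of this step; the file names and the join key (a shared coffee name column) are assumptions, and the actual Kaggle files may differ:

```python
import pandas as pd

# Hypothetical file names and join key; the actual Kaggle files may differ.
coffee_df = pd.read_csv("coffee_df.csv")
coffee_analysis = pd.read_csv("coffee_analysis.csv")

# Join on the coffee name so features from both versions are available,
# then drop rows duplicated across the two scrapes.
df = coffee_df.merge(coffee_analysis, on="name", how="inner")
df = df.drop_duplicates()
```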

Each row in the dataset represents an individual user's rating for a coffee bean; here is a description of each feature:

Since the roast feature is an ordinal data type, its categories need to be placed in order. Ordered levels for the roast feature:

| Roast        | Level |
|--------------|-------|
| Light        | 1     |
| Medium-Light | 2     |
| Medium       | 3     |
| Medium-Dark  | 4     |
| Dark         | 5     |
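One way to encode this ordering in pandas (a sketch, assuming the merged dataframe is named `df` and has a string-valued `roast` column):

```python
import pandas as pd

roast_levels = ["Light", "Medium-Light", "Medium", "Medium-Dark", "Dark"]

# Ordered categorical so sorting and comparisons follow roast intensity
# rather than alphabetical order.
df["roast"] = pd.Categorical(df["roast"], categories=roast_levels, ordered=True)

# Optional numeric level (1-5) matching the table above.
df["roast_level"] = df["roast"].cat.codes + 1
```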

Sprint 2: Visualizing numerical features.

Now that the dataset is clean, all of its features can be analyzed. This analysis will determine which features can be used in the linear regression model. Specifically, I will compute Pearson correlations, print summary statistics, plot the feature distributions, and create box plot comparisons.

I will start by printing summary statistics for the distributions of the numerical features.
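A sketch of this step, assuming the numerical columns are named as they appear throughout this report:

```python
num_cols = ["rating", "100g_USD", "acid", "body", "flavor", "aftertaste"]

# Count, mean, standard deviation, quartiles, and extremes per feature.
print(df[num_cols].describe().round(2))
```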

The table shows that 100g_USD is highly positively skewed, and that acid, body, and flavor have similar means and standard deviations. Next, I will plot the distributions of these same features as histograms.
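The histograms might be produced along these lines (a matplotlib sketch; the panel layout is an assumption):

```python
import matplotlib.pyplot as plt

num_cols = ["rating", "100g_USD", "acid", "body", "flavor", "aftertaste"]

# One histogram per numerical feature, six panels in a 2x3 grid.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, col in zip(axes.flat, num_cols):
    ax.hist(df[col].dropna(), bins=30)
    ax.set_title(col)
fig.tight_layout()
plt.show()
```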

Looking at the six histograms, 100g_USD appears to be exponentially distributed, while the other five distributions appear to be centered around their means. The rating feature appears closest to a normal distribution. Next, I will visualize the numerical features using a box plot comparison, where the distributions of five numerical features are plotted against the range of rating values.
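A sketch of the box plot comparison using pandas' built-in grouped boxplot (the exact styling is an assumption):

```python
import matplotlib.pyplot as plt

features = ["acid", "body", "flavor", "aftertaste", "100g_USD"]

# One panel per feature; within each panel, one box per observed rating value.
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, col in zip(axes, features):
    df.boxplot(column=col, by="rating", ax=ax)
    ax.set_title(col)
fig.suptitle("")  # drop the automatic "Boxplot grouped by rating" title
fig.tight_layout()
plt.show()
```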

Looking at the five box plot comparisons, 100g_USD and rating appear to have an exponential relationship, while the other four comparisons appear roughly linear. However, these four box plots trace an S-shaped trend, which suggests the body, flavor, acid, and aftertaste distributions are closer to uniform.
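The correlations reported below might be computed as follows (sketch):

```python
num_cols = ["rating", "100g_USD", "acid", "body", "flavor", "aftertaste"]

# Pearson correlation of every numerical feature with rating.
corr = df[num_cols].corr(method="pearson")
print(corr["rating"].sort_values(ascending=False).round(2))
```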

Most and Least Correlated Features with Rating: The acid feature has the highest correlation with rating, with a coefficient of 75%. The 100g_USD feature has the lowest correlation with rating, with a coefficient of 28%.

High Correlations Among Other Features: Aftertaste and acid have a relatively high correlation coefficient of 45%.

Sprint 3: Visualizing the categorical features.

For the categorical features, I will start by printing the number of observations per category.
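A sketch of this step, assuming the categorical columns are `roast` and `loc_country`:

```python
# Observations per category for each categorical feature.
for col in ["roast", "loc_country"]:
    print(df[col].value_counts(), "\n")
```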

Looking at the value counts, the Medium-Light category in roast and the United States category in loc_country both have more observations than all other categories in their respective features combined. Next, I will visualize the categorical features using a box plot comparison, where the distributions for rating are plotted against each category for a feature.
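The categorical box plots might look like this seaborn sketch (the library choice and styling are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Rating distribution within each roast level and within each country.
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.boxplot(data=df, x="roast", y="rating", ax=axes[0])
sns.boxplot(data=df, x="loc_country", y="rating", ax=axes[1])
axes[1].tick_params(axis="x", labelrotation=90)
fig.tight_layout()
plt.show()
```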

Looking at the rating and roast box plot comparison, there appears to be a relationship where the mean rating decreases as the level of roast increases.

Looking at the rating and loc_country box plot comparison, there does not appear to be significant variance in the means by loc_country.

Sprint 4: Transforming and testing numerical features.

Looking at the distribution plots, most of the numerical features do not appear to be normally distributed. I will use the Shapiro-Wilk test to check for normality and the Kolmogorov-Smirnov test to check for uniformity.
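A sketch of both tests with scipy; for the uniformity test, each feature is compared against a uniform distribution rescaled to its observed range, which is an assumption about how the test was set up:

```python
from scipy import stats

num_cols = ["rating", "100g_USD", "acid", "body", "flavor", "aftertaste"]

for col in num_cols:
    x = df[col].dropna()
    # Shapiro-Wilk: null hypothesis is that x is normally distributed.
    _, p_norm = stats.shapiro(x)
    # Kolmogorov-Smirnov against Uniform(min, max) on the observed range.
    _, p_unif = stats.kstest(x, "uniform", args=(x.min(), x.max() - x.min()))
    print(f"{col}: Shapiro p={p_norm:.4f}, KS-uniform p={p_unif:.4f}")
```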

Looking at the results of the Shapiro-Wilk test, none of the features come from a normal distribution except rating. Since 100g_USD appears to be exponentially distributed, I will transform it toward normality to reduce the influence of outlier values. This should produce a better fit for the models in Sprint 5.
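The report does not state which transform was used; a log transform is a common choice for a positively skewed, price-like variable, so this sketch assumes one:

```python
import numpy as np
from scipy import stats

# Assumed transform: natural log of the price per 100g.
df["100g_USD"] = np.log(df["100g_USD"])

# Re-check normality after the transform.
print(stats.shapiro(df["100g_USD"].dropna()))
```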

This feature is now more normally distributed.

Sprint 5: Fitting linear regression models.

Here, I fit linear models on a variety of feature sets to find the model with the highest F-statistic, R-squared, and log-likelihood values. First, I will fit a linear regression where rating is the dependent variable and the remaining numerical features are the independent variables; the fit uses the least squares method. The linear equation is written as follows:

$ \widehat{\text{rating}} = \beta_0 + \beta_1 \text{acid} + \beta_2 \text{body} + \beta_3 \text{flavor} + \beta_4 \text{aftertaste} + \beta_5 \text{100g\_USD} $

Model 1:
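A sketch of this fit with statsmodels' formula API; `Q()` quotes the `100g_USD` column, whose name is not a valid Python identifier:

```python
import statsmodels.formula.api as smf

# Model 1: rating regressed on the five numerical features via least squares.
m1 = smf.ols('rating ~ acid + body + flavor + aftertaste + Q("100g_USD")',
             data=df).fit()
print(m1.summary())
```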

After fitting this model, the R-squared value is 95% and the F-statistic is 7277. The p-value for the 100g_USD coefficient ($P>|t| = 0.096$) is above the 5% significance level, so there is insufficient evidence that 100g_USD has a meaningful effect on rating. Considering this, I will refit the model without the 100g_USD feature.

Model 2:

After fitting the updated model, the R-squared value remains at 95% and the F-statistic increases to 9086. The simpler model therefore explains the rating just as well while using one fewer feature. Next, I will create a design matrix from the roast feature and fit a model to use for ANOVA.

Model 3:
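A sketch of Model 3, letting statsmodels build the dummy-coded design matrix via `C(roast)` and then running ANOVA:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Model 3: rating regressed on the roast categories alone.
m3 = smf.ols("rating ~ C(roast)", data=df).fit()

# Type II ANOVA table for the roast effect.
print(sm.stats.anova_lm(m3, typ=2))
print(f"log-likelihood: {m3.llf:.2f}")
```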

Based on the results, all roast categories significantly influence the rating, with Light having the highest coefficient and t-statistic. I'll now combine the second and third models in an attempt to increase the R-squared value.

Model 4:
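Model 4 combines the numerical features from Model 2 with the roast dummies from Model 3 (sketch):

```python
import statsmodels.formula.api as smf

# Model 4: Model 2's numerical features plus the roast categories.
m4 = smf.ols("rating ~ acid + body + flavor + aftertaste + C(roast)",
             data=df).fit()
print(m4.summary())
print(f"log-likelihood: {m4.llf:.2f}")
```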

This combined model has the same R-squared value of 95% but a lower F-statistic of 5323. Comparing the log-likelihoods of all four models:

  1. Model 1: -606.38
  2. Model 2: -607.78
  3. Model 3: -3214.0
  4. Model 4: -585.20

Model 4 has the highest log-likelihood, so among the four models it provides the best fit.