Introduction

In this case study, we analyzed two data sets provided by our client Budweiser, the Beers dataset contains a list of 2410 US craft beers and Breweries dataset contains 558 US breweries. The purpose of the code is to find statistical characters of IBU and ABV data, also the correlation between them to address questions from Budweriser.

Repository Structure

Case study Repository in GitHub: https://github.com/48120778/EDAforProject1.git
Case study RShiny App: https://data-science-hw.shinyapps.io/BeerShiny/

Files:
Reers.csv: Source beer data.
Breweries.csv: Source breweries data.
EDAforProject1_R5.RMD: R Markdown file with all R code.
EDAforProject1_R5.HTML: HTML file created with Knit.

Initial data profiling on Beers data set

Initial data profiling on Breweries data set

str(BreweriesDF)

## 'data.frame':    558 obs. of  4 variables:
##  $ Brew_ID: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name   : Factor w/ 551 levels "10 Barrel Brewing Company",..: 355 12 266 319 201 136 227 477 59 491 ...
##  $ City   : Factor w/ 384 levels "Abingdon","Abita Springs",..: 228 200 122 299 300 62 91 48 152 136 ...
##  $ State  : Factor w/ 51 levels " AK"," AL"," AR",..: 24 18 20 5 5 41 6 23 23 23 ...

Analysis

1. How many breweries are present in each state?

Number of breweries in 51 states has been calculated and provided as the below table, together with a plot with numbers in order.

# Count breweries in each state
table(BreweriesDF$State)

## 
##  AK  AL  AR  AZ  CA  CO  CT  DC  DE  FL  GA  HI  IA  ID  IL  IN  KS  KY 
##   7   3   2  11  39  47   8   1   2  15   7   4   5   5  18  22   3   4 
##  LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH 
##   5  23   7   9  32  12   9   2   9  19   1   5   3   3   4   2  16  15 
##  OK  OR  PA  RI  SC  SD  TN  TX  UT  VA  VT  WA  WI  WV  WY 
##   6  29  25   5   4   1   3  28   4  16  10  23  20   1   4

2. Merge beer data with the breweries data. Print the first 6 observations and the last six observations to check the merged file.

We merged the two data set into one and named it as ‘mergeDF’, the primary key being used is ‘Brewery_id’ from Beer data set, and ‘Brew_ID’ from Breweries data set. We also changed the two columns’ name for clear understanding. The first and last 6 observations were showed there with head/tail command.

The first 6 observations of the merged file:

##   Brewery_id     Beer_name Beer_ID   ABV IBU
## 1          1  Get Together    2692 0.045  50
## 2          1 Maggie's Leap    2691 0.049  26
## 3          1    Wall's End    2690 0.048  19
## 4          1       Pumpion    2689 0.060  38
## 5          1    Stronghold    2688 0.060  25
## 6          1   Parapet ESB    2687 0.056  47
##                                 Style Ounces       Brewery_name
## 1                        American IPA     16 NorthGate Brewing 
## 2                  Milk / Sweet Stout     16 NorthGate Brewing 
## 3                   English Brown Ale     16 NorthGate Brewing 
## 4                         Pumpkin Ale     16 NorthGate Brewing 
## 5                     American Porter     16 NorthGate Brewing 
## 6 Extra Special / Strong Bitter (ESB)     16 NorthGate Brewing 
##          City State
## 1 Minneapolis    MN
## 2 Minneapolis    MN
## 3 Minneapolis    MN
## 4 Minneapolis    MN
## 5 Minneapolis    MN
## 6 Minneapolis    MN

The last 6 observations of the merged file:

##      Brewery_id                 Beer_name Beer_ID   ABV IBU
## 2405        556             Pilsner Ukiah      98 0.055  NA
## 2406        557  Heinnieweisse Weissebier      52 0.049  NA
## 2407        557           Snapperhead IPA      51 0.068  NA
## 2408        557         Moo Thunder Stout      50 0.049  NA
## 2409        557         Porkslap Pale Ale      49 0.043  NA
## 2410        558 Urban Wilderness Pale Ale      30 0.049  NA
##                        Style Ounces                  Brewery_name
## 2405         German Pilsener     12         Ukiah Brewing Company
## 2406              Hefeweizen     12       Butternuts Beer and Ale
## 2407            American IPA     12       Butternuts Beer and Ale
## 2408      Milk / Sweet Stout     12       Butternuts Beer and Ale
## 2409 American Pale Ale (APA)     12       Butternuts Beer and Ale
## 2410        English Pale Ale     12 Sleeping Lady Brewing Company
##               City State
## 2405         Ukiah    CA
## 2406 Garrattsville    NY
## 2407 Garrattsville    NY
## 2408 Garrattsville    NY
## 2409 Garrattsville    NY
## 2410     Anchorage    AK

3. Address the missing values in each column.

There are 62 observations where both ABV and IBU are NA’s, 943 observations where the IBU only are NA’s. We replaced 62 NA’s in ABV with state level median. Replacing 1005 NA’s in IBU with state level median led to an 18% reduction in the accuracy of the correlation model. Predicted values from simple linear regression model are used to replace missing values in IBU to improve the accuracy of the model.

4. Compute the median alcohol content(ABV) and international bitterness unit (IBU) for each state. Plot a bar chart to compare.

Median ABV and IBU data have been sorted and plotted.
The median ABV per state appears fairly consistent with an overall ABV median of 0.056.
Kentucky has the highest median ABV at 0.062 and Utah has the lowest at 0.04.
The median IBU per state appears to change considerably between states with an overall IBU median of 37.
West Virginia has the highest median IBU at 57.5 and Kansas has the lowest at 22.

5. Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?

Maximum ABV and IBU data have been sorted and plotted. The maximum ABV per state appears to vary between states with an overall median of all max values at 0.09. Colorado has the highest Max ABV at 0.128 and Delaware has the lowest at 0.055. The maximum IBU per state appears to vary between states with an overall median of all max values at 92.82. Oregon has the highest Max IBU at 138 and Arkansas has the lowest at 44.11

6. Comment on the summary statistics and distribution of the ABV variable.

The ABV clearly illustrates that the IPA variety has more Alcohol By Volume than other varieties. This is across all can sizes with one exception (can size=19.2) in the “other” type. The ABV per size of can (oz) clearly illustrates that the IPA variety has more Alcohol By Volume than other varieties. This across all can sizes with one exception (can size=19.2).

IPA style beer is the predominant variety when ABV exceeds 0.06. This again indicates that IPA beer tends to have higher alcohol content than other variety.

7. Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.

The point plot below illustrates what appears to a positive linear relationship between the ABB and IBU. The addition of line trend to the point plot below confirms the presence of a positive linear relationship between the ABV and IBU.

The correlation between IBU and ABV is 0.757878, with a p-value < 2e-16, Multiple R-squared is 0.5744, meaning that 57% changes in exploratory variable ABV can explain the changes in response variable IBU.

## 
##  Pearson's product-moment correlation
## 
## data:  tidyDF$IBU and tidyDF$ABV
## t = 57.005, df = 2408, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7403535 0.7743731
## sample estimates:
##      cor 
## 0.757878

## 
## Call:
## lm(formula = tidyDF$IBU ~ tidyDF$ABV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.718  -4.712  -0.035   2.762  93.442 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -33.818      1.372  -24.65   <2e-16 ***
## tidyDF$ABV  1277.567     22.411   57.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.72 on 2408 degrees of freedom
## Multiple R-squared:  0.5744, Adjusted R-squared:  0.5742 
## F-statistic:  3250 on 1 and 2408 DF,  p-value: < 2.2e-16

GGPairs Plot below shows a strong correlation between ABV and IBU across all styles of beers. The strongest correlation is in other beer style category at 0.787 followed by IPA at 0.689.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

8. Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN clustering to investigate this relationship. Provide statistical evidence one way or the other. You can of course assume your audience is comfortable with percentages … KNN is very easy to understand.

A category variable IPA_Ale is created to split all the 2410 beers into three groups: Ale (1007), IPA (571) and Other (832), we only use the first two groups for analysis in this part. A KNN classification model with a 75% split and 50 iteration results in an 80% classification accuracy as shown in the confusion matrix. This was achieve by trying 30 different K values resulting in K=5 as the optimal value as shown.

## 
##   Ale   IPA Other 
##  1007   571   832

## Confusion Matrix and Statistics
## 
##                 
## classifications5 Ale IPA
##              Ale 199  41
##              IPA  42 112
##                                           
##                Accuracy : 0.7893          
##                  95% CI : (0.7457, 0.8286)
##     No Information Rate : 0.6117          
##     P-Value [Acc > NIR] : 3.385e-14       
##                                           
##                   Kappa : 0.5571          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8257          
##             Specificity : 0.7320          
##          Pos Pred Value : 0.8292          
##          Neg Pred Value : 0.7273          
##              Prevalence : 0.6117          
##          Detection Rate : 0.5051          
##    Detection Prevalence : 0.6091          
##       Balanced Accuracy : 0.7789          
##                                           
##        'Positive' Class : Ale             
##

9. Knock their socks off! Find one other useful inference from the data that you feel Budweiser may be able to find value in. You must convince them why it is important and back up your conviction with appropriate statistical evidence.

Based on visual Geo-positional analysis of Breweries presence by state (count) and Alcohol abuse by State (source: US Dept. of Health & Human Services) there appears to be a correlation. The correlation analysis reveals the presence of positive linear relationship between the alcohol abuse and the count of the breweries. Based on this, Budweiser may want to establish an out reach program to educate people about the dangers of alcohol abuse especially in state where Budweiser had significant presence.

## Warning: Column `State` joining factors with different levels, coercing to
## character vector

Correlation between breweries count and alcohol abuse cases:

## [1] 0.570742

Summary

Through this case study, we dealt with the missing values by replacing with median values in ABV, and by predicted values from a simple linear regression model. We performed the statistical analysis on ABV and IBU values, and found the linear correlation between the IBU and ABV. We built a KNN classification model and find the best k value through 50 iterations, with this model we can predict the IPA or Ale beers in 79% accuacy. Finally we run a reach out analysis with external alcohol abuse data set, and found the correlation between the breweries count and alcohol abuse count. Recommendations have been provided to Budweiser.

Case Study: US Craft Beer and Brewery

Hao Wang, Alex Salamah

October 2019