ISS608-VAA
  • ✌️ Hands-on Exercises
    • Hands-on Exercise 1
    • Hands-on Exercise 2
    • Hands-on Exercise 3a
    • Hands-on Exercise 3b
    • Hands-on Exercise 4a
    • Hands-on Exercise 4b
    • Hands-on Exercise 4c
    • Hands-on Exercise 4d
    • Hands-on Exercise 5
    • Hands-on Exercise 6
    • Hands-on Exercise 7a
    • Hands-on Exercise 7b
    • Hands-on Exercise 7c
    • Hands-on Exercise 8
    • Hands-on Exercise 9
    • Hands-on Exercise 10
  • 👨🏻‍🏫In-class Exercises
    • In-class Exercise 1
    • In-class Exercise 6
    • In-class Exercise 7
    • In-class Exercise 9
  • 🏠 Take-home Exercises
    • Take-home Exercise 1
    • Take-home Exercise 2
    • Take-home Exercise 3
    • Take-home Exercise 4
  • Home
  • My VAA Journey
  • Dictionary

On this page

  • 1 Overview
  • 2 Getting Started
    • 2.1 Installing and loading the required libraries
    • 2.2 Data Set
      • 2.2.1 Loading the Data Set
      • 2.2.2 Checking Data Health
      • 2.2.3 Descriptive Statistics
  • 3 Exploratory Data Analysis
    • 3.1 Univariate Analysis: Looking at BMI
    • 3.2 Looking at BMI through individual dimensions
      • 3.2.1 Country dimension with Violin+Boxplot
      • 3.2.2 Year dimension with Violin+Boxplot
    • 3.3 Looking at BMI through both dimensions
      • 3.3.1 Plotting a Coplot plot to determine if there is a third variable influence
  • 4 Time-series Analysis of Big Mac Index through time and countries
    • 4.1 Looking at overall trends with newsslopegraph()
    • 4.2 Looking at variations in patterns with ggHoriPlot()
    • 4.3 Adding GDP per capita to track the movement of BMI through the years, countries (colour) and population (size) with a Bubble Plot
  • 5 Multivariate Analysis of Big Mac Index through time and countries
    • 5.1 Plotting a Parallel Coordinates plot to see relationships between variables
    • 5.2 Plotting a Correlation plot to see correlations between variables
  • 6 Panel Data Regression
    • 6.1 Homogenous Panel Data model
    • 6.2 Heterogeneous Panel Data models
      • 6.2.1 Testing for Heterogeneity across Year and Country
      • 6.2.2 Choosing which model is appropriate
      • 6.2.3 Testing between FE and OLS models
      • 6.2.4 FE Model Diagnostics: Plotting predicted vs expected values
  • 7 Conclusion
  • 8 Storyboard
    • 8.1 Dashboard (Summary)
    • 8.2 Panel Data Analysis (Data Exploration)
    • 8.3 Panel Data Regression (Modelling)
    • 8.4 Conclusion

Take-home Exercise 4

  • Show All Code
  • Hide All Code

  • View Source

Take-home Exercise 4: Prototyping Modules for Visual Analytics Shiny Application

Author

Victoria Neo

Published

February 29, 2024

Modified

March 16, 2024

1 Overview

As mentioned in my group’s proposal,

The Big Mac Index (BMI) was invented by The Economist in 1986 as a light-hearted guide to whether currencies are at their “correct” level. It is based on the theory of purchasing-power parity (PPP), with the notion that in the long run, exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries. The BMI has become a global standard, included in several economic textbooks and the subject of dozens of academic studies. Thus, we would like to visually examine the phenomenon of inflation co-movement by looking at the similarities in BMI prices between countries through time, isolate groups with similar characteristics, and determine if the BMI is a suitable proxy to inflation prices.

There is a significant gap in making complex economic data accessible to the general public and providing advanced analysis tools for professionals. Existing visualization tools, including government dashboards, often fail to cater effectively to both audiences. To address this, we propose an R Shiny-based interactive tool that leverages publicly available data, the Big Mac Index to simplify the understanding of global financial trends for laypersons while offering in-depth data exploration and modelling capabilities for analysts.

2 Getting Started

2.1 Installing and loading the required libraries

  • AER: for functions used in Applied Econometrics with R

  • CGPfunctions: for using newggslopegraph() for creating slopegraphs

  • corrplot: provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.

  • dplyr: for manipulating, concatenating dataframes

  • gplot: for using coplot() for creating coplots

  • ggExtra: for adding marginal histograms to ggplot2

  • ggHoriPlot: for plotting horizon plots

  • ggiraphExtra: for using ggPredict() for creating interactive multiple linear regression plots

  • ggpmisc: for using stat_poly_eq() for creating linear regression equations

  • ggstatsplot: is an extension of {ggplot2} package for creating graphics with details from statistical tests

  • kableExtra, for creating and manipulating complex tables

  • lubridate: for using dmy() function to parse the Date field into appropriate Date data type in R

  • packHV: for using hist_boxplot() for creating histograms with boxplots

  • patchwork: for combining multiple ggplot2 graphs into one figure

  • parallelPlot: for plotting Parallel Coordinates plots

  • plm: for creating linear regression plots for panel data

  • purrr: for handling lists and functional programming

  • naniar: for using miss_vis() function to check data for missing values

  • plotly: to make interactive graphs

  • tidyverse, a family of modern R packages specially designed to support data science, analysis and communication task including creating static statistical graphs

code block
pacman::p_load(AER, CGPfunctions, 
               corrplot, dplyr, 
               DT, gplots,
               ggExtra, ggHoriPlot, 
               ggiraphExtra, ggpmisc,
               ggthemes, ggplot2,
               ggstatsplot,
               kableExtra, lubridate,
               packHV,
               patchwork, parallelPlot, plm,
               purrr, naniar,
               plotly, tidyverse
               ) 

2.2 Data Set

2.2.1 Loading the Data Set

We load the dataset for Big Mac Index that has been combined with other data (e.g. GDP, Employment rates etc.) and the country dataset as seen here in our data preparation. We left join the country dataset to the Big Mac Index dataset.

code block
indicator_data <- read_csv("data/countries_with_complete_data.csv")
code block
country_data <- read_csv("data/country_all.csv")
code block
bmi_data <- left_join(indicator_data, country_data, by = c("country"))

2.2.2 Checking Data Health

We now check the health of our dataset by using:

  • datatable() from the DT package to view the dataframe more interactively,

  • glimpse() to look at the structure of the dataframe, data types of the columns, and some values of the dataframe,

  • duplicate() to check the dataframe for any duplicated entries using duplicate(), and

  • vis_miss() to check the state of missing values in the dataset.

  • describe() to produce a summary of all the variables.

  • datatable()
  • glimpse()
  • duplicate()
  • vis_miss()
  • describe()
code block
datatable(bmi_data, 
          class= "compact",
          rownames = FALSE,
          width="100%", 
          options = list(pageLength = 10,scrollX=T))
code block
glimpse(bmi_data)
Rows: 560
Columns: 16
$ country        <chr> "Argentina", "Australia", "Brazil", "United Kingdom", "…
$ year           <dbl> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2…
$ currency_code  <chr> "ARS", "AUD", "BRL", "GBP", "CAD", "CLP", "CNY", "CZK",…
$ bmi_localprice <dbl> 2.50, 3.00, 3.60, 1.99, 3.33, 1400.00, 10.50, 56.28, 24…
$ bmi_usd_price  <dbl> 0.7987220, 1.6129032, 1.5384615, 2.8855000, 2.1210191, …
$ bot            <dbl> 17889369.4, -13999280.9, 699372.1, -74417775.4, 2565609…
$ gdp            <dbl> 9.772400e+10, 3.960000e+11, 5.100000e+11, 1.790000e+12,…
$ gdp_per_capita <dbl> 2579.4888, 20301.8432, 2824.7154, 30077.7414, 24255.338…
$ inflation      <dbl> 25.8700000, 2.9815745, 8.4501644, 1.5204025, 2.2583944,…
$ unemployment   <dbl> 19.590, 6.380, 10.640, 5.040, 7.660, 10.170, 4.240, 7.2…
$ hdi            <dbl> 0.785, 0.902, 0.692, 0.873, 0.894, 0.775, 0.604, 0.825,…
$ population     <dbl> 37885024, 19475842, 180476690, 59355690, 31322400, 1569…
$ continent      <chr> "South America", "Oceania", "South America", "Europe", …
$ g20            <chr> "g20", "g20", "g20", "g20", "g20", "non_g20", "g20", "n…
$ eu             <chr> "non_eu", "non_eu", "non_eu", "non_eu", "non_eu", "non_…
$ brics          <chr> "non_brics", "non_brics", "brics", "non_brics", "non_br…
code block
bmi_data[duplicated(bmi_data),]
# A tibble: 0 × 16
# ℹ 16 variables: country <chr>, year <dbl>, currency_code <chr>,
#   bmi_localprice <dbl>, bmi_usd_price <dbl>, bot <dbl>, gdp <dbl>,
#   gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
#   population <dbl>, continent <chr>, g20 <chr>, eu <chr>, brics <chr>
code block
vis_miss(bmi_data)

code block
Hmisc::describe(bmi_data)
bmi_data 

 16  Variables      560  Observations
--------------------------------------------------------------------------------
country 
       n  missing distinct 
     560        0       28 

lowest : Argentina      Australia      Brazil         Canada         Chile         
highest: Switzerland    Thailand       Turkey         United Kingdom United States 
--------------------------------------------------------------------------------
year 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     560        0       20    0.998     2012    6.662     2003     2004 
     .25      .50      .75      .90      .95 
    2007     2012     2016     2019     2020 
                                                                           
Value      2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Frequency    28   28   28   28   28   28   28   28   28   28   28   28   28
Proportion 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
                                             
Value      2015 2016 2017 2018 2019 2020 2021
Frequency    28   28   28   28   28   28   28
Proportion 0.05 0.05 0.05 0.05 0.05 0.05 0.05

For the frequency table, variable is rounded to the nearest 0
--------------------------------------------------------------------------------
currency_code 
       n  missing distinct 
     560        0       28 

lowest : ARS AUD BRL CAD CHF, highest: SGD THB TRY USD ZAR
--------------------------------------------------------------------------------
bmi_localprice 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     560        0      325        1    22010    43701    3.200    3.995 
     .25      .50      .75      .90      .95 
   6.300   17.975   82.750 1607.000 4305.000 

lowest : 1.88    1.94    1.99    2.29    2.35   
highest: 34000   35750   3750000 3950000 4e+06  
--------------------------------------------------------------------------------
bmi_usd_price 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     560        0      556        1     3.22    1.419    1.478    1.705 
     .25      .50      .75      .90      .95 
   2.298    2.992    4.004    4.931    5.508 

lowest : 0.798722 0.889908 1.19565  1.23435  1.2381  
highest: 7.14246  7.29067  7.54367  7.63985  8.06302 
--------------------------------------------------------------------------------
bot 
         n    missing   distinct       Info       Mean        Gmd        .05 
       560          0        560          1     -3e+07   2.04e+08 -555062993 
       .10        .25        .50        .75        .90        .95 
-159597229  -29797699    5005506   27140101   92751378  168155924 

lowest : -1587781424 -1313373911 -1289591960 -1245669866 -1131717011
highest:   859877503   870672209   874903278   936513897   963828693
--------------------------------------------------------------------------------
gdp 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      560         0       434         1 1.664e+12 2.364e+12 1.149e+11 1.467e+11 
      .25       .50       .75       .90       .95 
2.450e+11 4.765e+11 1.335e+12 3.095e+12 9.616e+12 

lowest : 5.477755e+10 5.873103e+10 6.662773e+10 6.676870e+10 6.760892e+10
highest: 1.950000e+13 2.050000e+13 2.110000e+13 2.140000e+13 2.330000e+13
--------------------------------------------------------------------------------
gdp_per_capita 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     560        0      560        1    24770    23377     2429     3571 
     .25      .50      .75      .90      .95 
    7200    14139    42106    56294    62549 

lowest : 888.901 1037.17 1049.39 1052.41 1122.86
highest: 85897.8 87304.3 88725   90476.8 93446.4
--------------------------------------------------------------------------------
inflation 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     560        0      559        1    3.824    4.166  -0.2345   0.3521 
     .25      .50      .75      .90      .95 
  1.3878   2.5262   4.3918   7.9808  10.7854 

lowest : -2.98314 -2.6738  -1.35284 -1.14391 -1.1387 
highest: 34.28    42.02    44.9641  48.41    53.55   
--------------------------------------------------------------------------------
unemployment 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     560        0      417        1    6.353    3.923    2.550    3.205 
     .25      .50      .75      .90      .95 
   3.850    5.160    7.532   10.332   13.671 

lowest : 0.25  0.58  0.6   0.62  0.66 , highest: 24.02 24.22 24.34 25.54 28.77
--------------------------------------------------------------------------------
hdi 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     560        0      264        1   0.8294   0.1051   0.6650   0.6947 
     .25      .50      .75      .90      .95 
  0.7540   0.8455   0.9113   0.9331   0.9420 

lowest : 0.604 0.612 0.615 0.619 0.621, highest: 0.954 0.956 0.957 0.959 0.962
--------------------------------------------------------------------------------
population 
        n   missing  distinct      Info      Mean       Gmd       .05       .10 
      560         0       560         1 113915841 160414963   5161892   5851385 
      .25       .50       .75       .90       .95 
 10266192  38623703 106775573 229104377 319508109 

lowest :    3948582    4019944    4081415    4132782    4176799
highest: 1410276000 1417069400 1421864100 1424929800 1425893500
--------------------------------------------------------------------------------
continent 
       n  missing distinct 
     560        0        6 
                                                                  
Value             Africa          Asia        Europe North America
Frequency             20           200           160            60
Proportion         0.036         0.357         0.286         0.107
                                      
Value            Oceania South America
Frequency             40            80
Proportion         0.071         0.143
--------------------------------------------------------------------------------
g20 
       n  missing distinct 
     560        0        2 
                          
Value          g20 non_g20
Frequency      260     300
Proportion   0.464   0.536
--------------------------------------------------------------------------------
eu 
       n  missing distinct 
     560        0        2 
                        
Value          eu non_eu
Frequency     100    460
Proportion  0.179  0.821
--------------------------------------------------------------------------------
brics 
       n  missing distinct 
     560        0        2 
                              
Value          brics non_brics
Frequency         80       480
Proportion     0.143     0.857
--------------------------------------------------------------------------------

2.2.3 Descriptive Statistics

We now summarize our dataset by using:

  • a function for categorical variables

  • summary() for continuous variables

  • fun_cat for categorical variables
  • Categorical variables
  • Continuous variables
  • bmi_usdprice by year
  • bmi_usdprice by country
code block
fun_cat <- function(x) {
  # Count the number of missing values
  nmiss <- sum(is.na(x))
  # Frequency
  n     <- table(x)
  # Proportion
  p     <- prop.table(n)
  # Putting it together
  OUT   <- cbind(n, p)
  # Add nmiss, but first pad to have the right number of rows
  nmiss <- c(nmiss, rep(NA, nrow(OUT)-1))
  OUT   <- cbind(OUT, nmiss)
  return(OUT)
}
code block
fun_cat(bmi_data$country)
                n          p nmiss
Argentina      20 0.03571429     0
Australia      20 0.03571429    NA
Brazil         20 0.03571429    NA
Canada         20 0.03571429    NA
Chile          20 0.03571429    NA
China          20 0.03571429    NA
Czech Rep.     20 0.03571429    NA
Denmark        20 0.03571429    NA
Hong Kong      20 0.03571429    NA
Hungary        20 0.03571429    NA
Indonesia      20 0.03571429    NA
Japan          20 0.03571429    NA
Korea          20 0.03571429    NA
Malaysia       20 0.03571429    NA
Mexico         20 0.03571429    NA
New Zealand    20 0.03571429    NA
Peru           20 0.03571429    NA
Philippines    20 0.03571429    NA
Poland         20 0.03571429    NA
Russia         20 0.03571429    NA
Singapore      20 0.03571429    NA
South Africa   20 0.03571429    NA
Sweden         20 0.03571429    NA
Switzerland    20 0.03571429    NA
Thailand       20 0.03571429    NA
Turkey         20 0.03571429    NA
United Kingdom 20 0.03571429    NA
United States  20 0.03571429    NA
code block
fun_cat(bmi_data$currency_code)
     n          p nmiss
ARS 20 0.03571429     0
AUD 20 0.03571429    NA
BRL 20 0.03571429    NA
CAD 20 0.03571429    NA
CHF 20 0.03571429    NA
CLP 20 0.03571429    NA
CNY 20 0.03571429    NA
CZK 20 0.03571429    NA
DKK 20 0.03571429    NA
GBP 20 0.03571429    NA
HKD 20 0.03571429    NA
HUF 20 0.03571429    NA
IDR 20 0.03571429    NA
JPY 20 0.03571429    NA
KRW 20 0.03571429    NA
MXN 20 0.03571429    NA
MYR 20 0.03571429    NA
NZD 20 0.03571429    NA
PEN 20 0.03571429    NA
PHP 20 0.03571429    NA
PLN 20 0.03571429    NA
RUB 20 0.03571429    NA
SEK 20 0.03571429    NA
SGD 20 0.03571429    NA
THB 20 0.03571429    NA
TRY 20 0.03571429    NA
USD 20 0.03571429    NA
ZAR 20 0.03571429    NA
code block
fun_cat(bmi_data$year)
      n    p nmiss
2002 28 0.05     0
2003 28 0.05    NA
2004 28 0.05    NA
2005 28 0.05    NA
2006 28 0.05    NA
2007 28 0.05    NA
2008 28 0.05    NA
2009 28 0.05    NA
2010 28 0.05    NA
2011 28 0.05    NA
2012 28 0.05    NA
2013 28 0.05    NA
2014 28 0.05    NA
2015 28 0.05    NA
2016 28 0.05    NA
2017 28 0.05    NA
2018 28 0.05    NA
2019 28 0.05    NA
2020 28 0.05    NA
2021 28 0.05    NA
code block
fun_cat(bmi_data$continent)
                n          p nmiss
Africa         20 0.03571429     0
Asia          200 0.35714286    NA
Europe        160 0.28571429    NA
North America  60 0.10714286    NA
Oceania        40 0.07142857    NA
South America  80 0.14285714    NA
code block
fun_cat(bmi_data$g20)
          n         p nmiss
g20     260 0.4642857     0
non_g20 300 0.5357143    NA
code block
fun_cat(bmi_data$brics)
            n         p nmiss
brics      80 0.1428571     0
non_brics 480 0.8571429    NA
code block
fun_cat(bmi_data$eu)
         n         p nmiss
eu     100 0.1785714     0
non_eu 460 0.8214286    NA
code block
cont_data <- data.frame(bmi_data$bmi_localprice, bmi_data$bmi_usd_price, bmi_data$bot, bmi_data$gdp, bmi_data$gdp_per_capita, bmi_data$inflation, bmi_data$hdi, bmi_data$population)

lapply(cont_data, summary)
$bmi_data.bmi_localprice
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2       6      18   22010      83 4000000 

$bmi_data.bmi_usd_price
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7987  2.2978  2.9915  3.2198  4.0036  8.0630 

$bmi_data.bot
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-1.588e+09 -2.980e+07  5.006e+06 -3.000e+07  2.714e+07  9.638e+08 

$bmi_data.gdp
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.478e+10 2.450e+11 4.765e+11 1.664e+12 1.335e+12 2.330e+13 

$bmi_data.gdp_per_capita
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  888.9  7199.9 14138.8 24770.2 42106.0 93446.4 

$bmi_data.inflation
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -2.983   1.388   2.526   3.824   4.392  53.550 

$bmi_data.hdi
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6040  0.7540  0.8455  0.8294  0.9113  0.9620 

$bmi_data.population
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
3.949e+06 1.027e+07 3.862e+07 1.139e+08 1.068e+08 1.426e+09 
code block
# Assuming 'bmi_data' is your data frame
result_list <- lapply(
  split(bmi_data, bmi_data$year),
  function(sub_data) summary(sub_data$bmi_usd_price)
)

result_list
$`2002`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7987  1.3816  1.6921  1.9122  2.3643  3.7952 

$`2003`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.196   1.480   1.956   2.142   2.298   4.599 

$`2004`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.234   1.606   2.157   2.288   2.577   4.884 

$`2005`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.269   1.615   2.440   2.466   2.660   5.050 

$`2006`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.302   1.593   2.579   2.541   2.990   4.931 

$`2007`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.415   2.097   2.697   2.754   3.090   5.046 

$`2008`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.699   2.465   3.278   3.469   4.378   6.375 

$`2009`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.716   2.139   3.047   3.089   3.621   5.985 

$`2010`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.831   2.315   3.184   3.309   3.838   6.304 

$`2011`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.938   2.772   3.713   3.952   4.519   8.063 

$`2012`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.124   2.576   3.628   3.676   4.278   6.806 

$`2013`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.191   2.906   3.769   3.886   4.453   7.125 

$`2014`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.160   2.884   3.518   3.723   4.498   7.142 

$`2015`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.364   2.882   3.348   3.618   4.331   7.544 

$`2016`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.527   2.632   3.084   3.246   3.786   6.443 

$`2017`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.789   2.629   3.301   3.431   4.195   6.350 

$`2018`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.276   2.789   3.766   3.833   4.535   6.765 

$`2019`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.652   2.549   3.657   3.555   4.296   6.622 

$`2020`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.154   2.770   3.560   3.651   4.420   6.713 

$`2021`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.809   2.886   3.863   3.854   4.547   7.291 
code block
# Assuming 'bmi_data' is your data frame
result_list <- lapply(
  split(bmi_data, bmi_data$country),
  function(sub_data) summary(sub_data$bmi_usd_price)
)

result_list
$Argentina
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7987  1.7920  2.9321  2.8030  3.6693  4.8397 

$Australia
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.613   2.667   4.127   3.711   4.532   4.985 

$Brazil
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.482   2.935   4.639   4.061   5.140   6.162 

$Canada
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.121   3.069   4.327   4.046   5.031   5.394 

$Chile
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.955   2.972   3.268   3.304   3.916   4.346 

$China
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.196   1.387   2.356   2.215   2.784   3.459 

$`Czech Rep.`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.655   2.556   3.461   3.200   3.769   4.558 

$Denmark
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.953   4.464   4.862   4.846   5.375   5.994 

$`Hong Kong`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.400   1.547   2.031   2.039   2.476   2.643 

$Hungary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.688   2.686   3.045   3.149   3.666   4.644 

$Indonesia
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.530   1.823   2.241   2.166   2.409   2.860 

$Japan
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.015   2.337   3.200   3.074   3.522   4.160 

$Korea
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.377   2.715   3.301   3.270   3.704   4.115 

$Malaysia
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.326   1.545   1.980   1.931   2.290   2.579 

$Mexico
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.069   2.470   2.656   2.626   2.750   3.350 

$`New Zealand`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.607   3.149   3.980   3.709   4.346   4.874 

$Peru
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.283   2.761   3.189   3.112   3.400   3.907 

$Philippines
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.234   1.690   2.657   2.315   2.791   3.670 

$Poland
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.460   2.247   2.533   2.537   2.949   3.514 

$Russia
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.250   1.515   1.945   1.958   2.361   2.702 

$Singapore
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.813   2.306   3.398   3.205   3.787   4.433 

$`South Africa`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.8899  2.0453  2.1739  2.1354  2.2983  2.8735 

$Sweden
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.500   4.478   5.353   5.266   6.164   7.640 

$Switzerland
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.795   5.049   6.401   6.191   6.885   8.063 

$Thailand
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.270   1.695   2.405   2.516   3.155   4.248 

$Turkey
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.997   2.706   3.148   3.198   3.762   4.777 

$`United Kingdom`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.885   3.614   3.863   3.908   4.379   4.627 

$`United States`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.350   2.835   3.725   3.666   4.343   4.890 

3 Exploratory Data Analysis

code block
flowchart TB
  A[Year]
  B[Country]
  C[Big Mac Index price]
  D[Inflation rate]
  E[GDP per capita]
  F[Balance of Trade]
  G[Unemployment rate]
  H[Human Development Index]
  I[Population size]
  
  A --> C
  B --> C
  C --> D
  C --> E
  C --> F
  C --> G
  C --> H
  C --> I

flowchart TB
  A[Year]
  B[Country]
  C[Big Mac Index price]
  D[Inflation rate]
  E[GDP per capita]
  F[Balance of Trade]
  G[Unemployment rate]
  H[Human Development Index]
  I[Population size]
  
  A --> C
  B --> C
  C --> D
  C --> E
  C --> F
  C --> G
  C --> H
  C --> I

Some questions of interest regarding the problem statement of inflation co-movement by BMI are:

  • what type of variation occurs within BMI prices,

  • what type of covariation occurs between BMI prices by dimensions.

To do this, we will look at the following variables:

  1. bmi_usd_price (in order to compare across currencies, we use the BMI prices converted to USD)

by dimensions

  1. country

  2. year

3.1 Univariate Analysis: Looking at BMI

code block
par(mfrow = c(1, 2))

hist_boxplot(bmi_data$bmi_usd_price,col="#69b3a2",freq=TRUE,  xlab = "Prices in USD", main = "Histogram of Big Mac Index prices in USD")
hist_boxplot(bmi_data$bmi_usd_price,col="#69b3a2",freq=FALSE,density=TRUE, xlab = "Prices in USD", main = "")

Insights

BMI is asymmetrically distributed and skewed to the right. Most of the Big Mac prices (adjusted to USD) are close to 3USD - this makes sense as McDonald’s positions itself as an ultra-affordable fast food chain. Its mean (3.2198) is higher than its median (2.9915). But this also means that it has outliers with very expensive Big Mac prices.

Questions:

  • Why are the Big Mac prices close to 3USD?

  • Given McDonald’s strategy to be ultra-affordable, why is there a large spread in Big Mac prices and outliers on the right?

3.2 Looking at BMI through individual dimensions

3.2.1 Country dimension with Violin+Boxplot

code block
bmi_data %>% 
  
  # Calculate mean of bmi_usd_price for each country
  group_by(country) %>%
  mutate(median_bmi = median(bmi_usd_price)) %>%
  ungroup() %>%
  
  # Reorder the countries based on the mean of bmi_usd_price
  mutate(country = reorder(country, median_bmi)) %>%
  
  # Add a column called 'type': do we want to highlight the group or not?
  mutate(type = ifelse(country=="United States", "Highlighted", "Normal")) %>%
  
  # Build the boxplot. 
  ggplot(aes(x = country, y = bmi_usd_price, fill = type, alpha = type)) + 
    geom_violin(width=1.0) +
    geom_boxplot(width=0.3, color="black", alpha=0.8) +
    geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
    scale_fill_manual(values = c("#69b3a2", "grey")) +
    scale_alpha_manual(values = c(1, 0.1)) +
    theme(legend.position = "none") +
    labs(y = "Prices in USD",    
       x = "",                  
       title = "Boxplot of Big Mac Index Prices in USD across countries") +
    coord_flip() +
  # Add mean and median lines
    geom_hline(aes(yintercept = mean(bmi_usd_price), linetype = "Mean"), 
               color = "darkblue", linetype = "dashed", size = 0.5) +
    geom_hline(aes(yintercept = median(bmi_usd_price), linetype = "Median"), 
               color = "darkred", linetype = "dashed", size = 0.5) +
  
    # Add text labels for mean and median
    geom_text(aes(x = mean(bmi_usd_price), label = paste("Mean: ", round(mean(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust =3.5, hjust = 1, color = "darkblue") +
    geom_text(aes(x = median(bmi_usd_price), label = paste("Median: ", round(median(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust = 4, hjust = 1, color = "darkred") +
  
    # Customize legend
    scale_linetype_manual(name = "Lines", values = c("Mean" = "dashed", "Median" = "dashed"),
                          labels = c("Mean", "Median")) +
    guides(fill = guide_legend(override.aes = list(alpha = c(1, 0.1)))) 

Insights

Some of the outliers pointed out in the previous data visualisations can be explained by the country dimension. Highly developed countries such as Switzerland, Sweden and Denmark have the top 3 most expensive Big Mac prices. However, Hong Kong, with also a highly developed economy, has the third lowest Big Mac price. In the boxplot above, there is no clear difference between developed and developing countries (even Philippines is more expensive than Hong Kong).

The above boxplot also shows that there is variation within and between countries - there are some countries with greater variability e.g. Argentina and a number of countries with median BMI prices higher than the overall median BMI price such as Switzerland, Brazil and Canada while countries like Mexico and South Africa (somewhat bimodal) did not see much fluctuations in their BMI prices.

Questions:

  • Why do some countries have greater variability in their BMI prices?

  • Why is there a difference in BMI prices between developed and developing countries?

    • especially for Hong Kong

3.2.2 Year dimension with Violin+Boxplot

code block
bmi_data %>% 
  mutate(year = factor(year)) %>%
  # Add a column called 'type': do we want to highlight the group or not?
  mutate( type=ifelse(year %in% c("2002", "2021"),"Highlighted","Normal")) %>%
  
  # Build the boxplot. In the 'fill' argument, give this column
  ggplot( aes(x=year, y=bmi_usd_price, fill=type, alpha=type)) + 
    geom_violin(width=1.0) +
    geom_boxplot(width=0.3, color="black", alpha=0.8) +
    geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
    scale_fill_manual(values=c("#69b3a2", "grey")) +
    scale_alpha_manual(values=c(1,0.1)) +
    theme(legend.position = "none") +
    labs(y = "Prices in USD",    
       x = "",                  
       title = "Boxplot of Big Mac Index Prices in USD across years") +
    coord_flip() +
 # Add mean and median lines
    geom_hline(aes(yintercept = mean(bmi_usd_price), linetype = "Mean"), 
               color = "darkblue", linetype = "dashed", size = 0.5) +
    geom_hline(aes(yintercept = median(bmi_usd_price), linetype = "Median"), 
               color = "darkred", linetype = "dashed", size = 0.5) +
  
    # Add text labels for mean and median
    geom_text(aes(x = mean(bmi_usd_price), label = paste("Mean: ", round(mean(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust =3.5, hjust = 1, color = "darkblue") +
    geom_text(aes(x = median(bmi_usd_price), label = paste("Median: ", round(median(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust = 4, hjust = 1, color = "darkred") +
  
    # Customize legend
    scale_linetype_manual(name = "Lines", values = c("Mean" = "dashed", "Median" = "dashed"),
                          labels = c("Mean", "Median")) +
    guides(fill = guide_legend(override.aes = list(alpha = c(1, 0.1)))) 

Insights

From the boxplot above, we can see that Big Mac prices have risen through the years - 2021 is the most expensive year for Big Mac prices. Moreover from 2002 - 2007, according to the violin plot, it seemed that the prices were mostly concentrated but there is more spread since 2008. Despite the blip from 2013 - 2016 and 2018 - 2020, which saw falling Big Mac prices, as a whole, they have increased since 2002 which saw the lowest prices for Big Mac.

Questions:

  • Why was there a dip in Big Mac prices from 2013 to 2016 and 2018 to 2020?

  • Why are Big Mac prices rising as a whole?

3.3 Looking at BMI through both dimensions

3.3.1 Plotting a Coplot plot to determine if there is a third variable influence

The BMI price in USD is plotted on the y axis and the years are on the x axis. The 28 plots show the relationship between these two variables by each country. The bars at the top indicate the countries position from left to right, starting from the bottom row.

code block
coplot(bmi_usd_price ~ year|country, data =  bmi_data,         
       panel = function(x, y, ...) {         
         points(x, y, ...)         
         abline(lm(y ~ x), col = "blue")})

Note

Insights

From the coplots above, there is an interaction between years and countries. Conditionally on country, the relationship between BMI price in USD and year looks roughly linear - the regression results for most countries indicate a positive relationship throughout the years. However, countries like Hong Kong(11), Mexico(15), Russia(20) and South Africa(22) have almost flat regression lines that suggest a weak or almost no relationship between the variables. Moreover, Turkey(26) is the only country with a negative relationship.

4 Time-series Analysis of Big Mac Index through time and countries

4.1 Looking at overall trends with newsslopegraph()

code block
bmi_data %>% 
  mutate(year = factor(year)) %>%
  filter(year %in% c(2002, 2011, 2021 )) %>%
  newggslopegraph(year, bmi_usd_price, country,
                Title = "Big Mac Index (USD) across 28 countries",
                SubTitle = "At 10 Years Intervals from 2002-2021",
                Caption = "Source: The Economist",
                YTextSize = 2.5,
                LineThickness = .6,
                LineColor = c(rep("gray",8), "red", rep("gray",18), "#69b3a2"),
    WiderLabels = TRUE)

Insights

From the slopegraph above, we can see that Big Mac prices increased for all countries in 2011 but later increased for some and decreased for others. Switzerland consistently had the most expensive Big Mac prices through the years while Argentina had the cheapest Big Mac in 2002 but is in 15th position (out of 28) in 2021. Most interesting is Hong Kong, that saw a consistent rise in Big Mac process, did not experience sharp changes, and had actually the cheapest Big Mac in 2011.

Moreover, in 2002, the prices between countries were much closer in spread but this spread increased through the years.

Questions:

  • Why did the spread in Big Mac prices between countries increase through the years?

  • Why did Big Mac prices increase in 2011 for all countries?

  • Why did some countries experience a consistent increase in Big Mac prices while others experienced fluctuations?

4.2 Looking at variations in patterns with ggHoriPlot()

Toggle the different tabsets to see the different horizon plots.

  • using United States mean bmi_usd_price as the baseline
  • using mean bmi_usd_price as the baseline
  • using median bmi_usd_price as the baseline
code block
mean_us_bmi_price <- bmi_data %>%
  filter(country == "United States") %>%
  summarise(mean_bmi_usd_price = mean(bmi_usd_price, na.rm = TRUE))

mean_us_bmi_price
# A tibble: 1 × 1
  mean_bmi_usd_price
               <dbl>
1               3.67
code block
bmi_data %>% 
  filter(year >= "2002") %>%
  ggplot() +
  geom_horizon(aes(x = year, y=bmi_usd_price), 
               origin = 3.666, 
               horizonscale = 6)+
  facet_grid(`country`~.) +
    theme_few() +
  scale_fill_hcl(palette = 'RdBu') +
  theme(panel.spacing.y=unit(0, "lines"), strip.text.y = element_text(
    size = 5, angle = 0, hjust = 0),
    legend.position = 'none',
    axis.text.y = element_blank(),
    axis.text.x = element_text(size=7),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
    ) 

If US mean bmi_usd_price is used as the origin, Switzerland (and shortly after) Denmark have always had stronger prices than the overall mean Big Mac Index price for USA. China, Hong Kong, Indonesian, Malaysia, Mexico, Philippines, Poland, Russia, South Africa and Turkey have always been weaker in Big Mac prices than the USA.

code block
bmi_data %>% 
  filter(year >= "2002") %>%
  ggplot() +
  geom_horizon(aes(x = year, y=bmi_usd_price), 
               origin = "mean", 
               horizonscale = 6)+
  facet_grid(`country`~.) +
    theme_few() +
  scale_fill_hcl(palette = 'RdBu') +
  theme(panel.spacing.y=unit(0, "lines"), strip.text.y = element_text(
    size = 5, angle = 0, hjust = 0),
    legend.position = 'none',
    axis.text.y = element_blank(),
    axis.text.x = element_text(size=7),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
    ) 

code block
bmi_data %>% 
  filter(year >= "2002") %>%
  ggplot() +
  geom_horizon(aes(x = year, y=bmi_usd_price), 
               origin = "median", 
               horizonscale = 6)+
  facet_grid(`country`~.) +
    theme_few() +
  scale_fill_hcl(palette = 'RdBu') +
  theme(panel.spacing.y=unit(0, "lines"), strip.text.y = element_text(
    size = 5, angle = 0, hjust = 0),
    legend.position = 'none',
    axis.text.y = element_blank(),
    axis.text.x = element_text(size=7),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
    ) 

Insights

The horizon plot (using median bmi_usd_price as the baseline) showed that all countries were initially below the median Big Mac price and all experienced some increase in price. China, Hong Kong, Korea, Singapore, Thailand and United States had a huge increase in Big Mac prices around 2018 to 2022. However, Turkey experienced a large drop in recent years while Russia experienced huge increase and than a large drop around 2015 and only had a smaller increase later on.

Questions:

  • Why do some countries experience a significant increase in Big Mac prices together over the same period?

  • Why are there small sharp peaks for some years and long peaks for others?

  • What other factors account for this co-movement in Big Mac prices? GDP per capita?

4.3 Adding GDP per capita to track the movement of BMI through the years, countries (colour) and population (size) with a Bubble Plot

code block
bmi_data %>%
  plot_ly(x = ~gdp_per_capita, 
          y = ~bmi_usd_price, 
          size = ~population, 
          color = ~country,
          sizes = c(20, 150),
          frame = ~year, 
          text = ~paste(country, "<br>Population: ", gdp_per_capita, "<br>Big Mac Price: USD$", bmi_usd_price,"<br>GDP per Capita: $", gdp_per_capita, "<br>Inflation: ", inflation, "%"), 
          hoverinfo = "text",
          type = 'scatter',
          mode = 'markers'
          ) %>%
  layout(showlegend = FALSE,
    xaxis = list(title = "GDP per Capita"),
    yaxis = list(title = "Price in USD"),
    title = "A Bubbleplot of Big Mac Index (USD) across 28 countries (colour) with population (size) through the years")
Insights

At the start, it seemed like there was a relationship between GDP per capita and BMI prices - the higher one’s GDP per capita was, the higher the Big Mac price in USD. Switzerland with the highest GDP per capita also saw the most expensive Big Mac prices. The only exception is Hong Kong that had a small inflation rate throughout. For those with higher GDP per capita, their inflation rates were almost the same. While countries with higher GDP per capita above 40k saw their GDP per capita increasing with Big Mac prices consistently above USD$3 since 2010, countries with GDP per capita below 20k saw small increases in their GDP per capita, still remained below 20k while Big Mac prices had a wider variability. Also, most of the countries in our study are of the similar population size.

Questions:

  • There seems to be two different trajectory of BMI prices for countries with different GDP per capita. Are there other factors influencing this? Are there other trajectories that can be seen?

  • We have been mostly looking at country and year effects. Are there other variables that affect BMI as a whole?

5 Multivariate Analysis of Big Mac Index through time and countries

5.1 Plotting a Parallel Coordinates plot to see relationships between variables

code block
pc <- bmi_data %>%
  select( gdp_per_capita, bot, inflation, unemployment, hdi, population, bmi_usd_price)
histoVisibility <- rep(TRUE, ncol(pc))
parallelPlot(pc,
             histoVisibility = histoVisibility)
Note

Insights

Based on the parallel coordinates plot, there appears to be interesting patterns. For example, countries lower gdp_per-capita, higher balance of trade (net export - net import) with low inflation and unemployment, and big population sizes had low Big Mac Index prices. Countries with high gdp_per_capita with high human development index (HDI) and small population sizes had very high Big Mac Index prices.

Questions:

  • What are the correlations between variables with BMI in USD prices?

5.2 Plotting a Correlation plot to see correlations between variables

code block
columns_of_interest <- c("gdp_per_capita", "bot", "inflation", "unemployment", "hdi", "population", "bmi_usd_price")
bmi.cor <- cor(bmi_data[, columns_of_interest])

bmi.sig = cor.mtest(bmi.cor, conf.level= .95)
corrplot(bmi.cor,
         method = "number",
         type = "lower",
         diag = FALSE,
         tl.col = "black",
         tl.srt = 45,
         p.mat = bmi.sig$p,
         sig.level = .05)

Note

Insights

GDP per capita, inflation and Human Development Index are significantly correlated to Big Mac Index prices. Interestingly, even though we had initially posited that inflation and Big Mac prices are correlated, it appears that they are not - inflation is not correlated to Big Mac Index prices. However, we cannot tell if there is a year/country effect - there is no information on the correlation even though though these variables are constant in the time dimension and across countries.

6 Panel Data Regression

Given the time-series (by 20 years) and cross sectionality (by 28 countries) of our data, panel data regression is a good technique to analyze the relationship between our independent variables and Big Mac Index price.

6.1 Homogenous Panel Data model

From the article Introduction to the Fundamentals of Panel Data,

Homogeneous (or pooled) panel data models assume that the model parameters are common across individuals.

Thus, the basic OLS regression model does not consider heterogeneity across countries or years.

code block
ols <- lm(bmi_usd_price ~ gdp_per_capita + hdi, data = bmi_data)
summary(ols)

Call:
lm(formula = bmi_usd_price ~ gdp_per_capita + hdi, data = bmi_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8586 -0.5952 -0.0552  0.5463  3.3407 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.401e+00  5.434e-01   6.259 7.74e-10 ***
gdp_per_capita  4.921e-05  3.152e-06  15.608  < 2e-16 ***
hdi            -1.688e+00  7.302e-01  -2.312   0.0211 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8813 on 557 degrees of freedom
Multiple R-squared:  0.528, Adjusted R-squared:  0.5263 
F-statistic: 311.6 on 2 and 557 DF,  p-value: < 2.2e-16
code block
ggPredict(ols,interactive=TRUE)
Note

Insights

Each point represents observations of GDP per capita, HDI and BMI price in USD - the multiple regression results indicate a positive relationship throughout the years.. However, this cannot account for omitted unobservable factors that exist from country to country and year to year.

6.2 Heterogeneous Panel Data models

From the article Introduction to the Fundamentals of Panel Data,

Heterogeneous models allow for any or all of the model parameters to vary across individuals. Fixed effects and random effects models are both examples of heterogeneous panel data models.

6.2.1 Testing for Heterogeneity across Year and Country

By Year

code block
plotmeans(bmi_usd_price~year, data = bmi_data, n.label=FALSE,
          xlab="Years",
          ylab="Big Mac Index price (USD)")

By Country

code block
plotmeans(bmi_usd_price~country, data = bmi_data, n.label=FALSE,
          xlab="Years",
          ylab="Big Mac Index price (USD)")

Note

Insights

We can see from the above plots that there is no homogeneity across years and countries. Hence, we should use heterogeneous panel data models.

6.2.2 Choosing which model is appropriate

Definitions of different models taken from the articles A Guide to Panel Data Regression: Theoretics and Implementation with Pythonand Panel Data Using R: Fixed-effects and Random-effects:

  1. Fixed Effects model

    The FE-model determines individual effects of unobserved, independent variables as constant (“fix“) over time. Within FE-models, the relationship between unobserved, independent variables and the IVs (i.e. endogeneity) can be existent.

    Moreover,

    FE remove the effect of those time-invariant characteristics so we can assess the net effect of the predictors on the outcome variable.

    Another important assumption of the FE model is that those time-invariant characteristics are unique to the individual and should not be correlated with other individual characteristics. Each entity is different therefore the entity’s error term and the constant (which captures individual characteristics) should not be correlated with the others. If the error terms are correlated, then FE is no suitable since inferences may not be correct and you need to model that relationship (probably using random-effects).

  2. Random Effects model

    RE-models determine individual effects of unobserved, independent variables as random variables over time. They are able to “switch” between OLS and FE and hence, can focus on both dependencies between and within individuals. 

    Moreover,

    An advantage of random effects is that you can include time invariant variables (i.e. gender). In the fixed effects model these variables are absorbed by the intercept. The cost is the possibility of inconsistent estimators, of the assumption is inappropriate.

Using the two statistically significant correlated variables of GDP per capita and HDI, we will create the panel regression to analyze the relationship with Big Mac Index prices.

  • Fixed effects model
  • Random effects model
code block
fixed <- plm(bmi_usd_price ~ gdp_per_capita+hdi, data=bmi_data, index=c("country", "year"), model="within")  #fixed model
summary(fixed)
Oneway (individual) effect Within Model

Call:
plm(formula = bmi_usd_price ~ gdp_per_capita + hdi, data = bmi_data, 
    model = "within", index = c("country", "year"))

Balanced Panel: n = 28, T = 20, N = 560

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-1.9226068 -0.2728133  0.0086767  0.2729549  1.7410149 

Coefficients:
                 Estimate Std. Error t-value  Pr(>|t|)    
gdp_per_capita 7.6110e-05 3.7399e-06 20.3507 < 2.2e-16 ***
hdi            6.1167e+00 8.9549e-01  6.8305 2.328e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    346.67
Residual Sum of Squares: 131.29
R-Squared:      0.62129
Adj. R-Squared: 0.60057
F-statistic: 434.742 on 2 and 530 DF, p-value: < 2.22e-16
code block
random <- plm(bmi_usd_price ~ gdp_per_capita+hdi, data=bmi_data, index=c("country", "year"), model="random")  #fixed model
summary(random)
Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = bmi_usd_price ~ gdp_per_capita + hdi, data = bmi_data, 
    model = "random", index = c("country", "year"))

Balanced Panel: n = 28, T = 20, N = 560

Effects:
                 var std.dev share
idiosyncratic 0.2477  0.4977  0.36
individual    0.4405  0.6637  0.64
theta: 0.8346

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-1.651317 -0.330083 -0.001211  0.268572  2.165632 

Coefficients:
                  Estimate  Std. Error z-value  Pr(>|z|)    
(Intercept)    -1.8743e+00  7.0452e-01 -2.6604  0.007804 ** 
gdp_per_capita  6.9024e-05  3.7777e-06 18.2714 < 2.2e-16 ***
hdi             4.0807e+00  8.9766e-01  4.5459  5.47e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Total Sum of Squares:    362.26
Residual Sum of Squares: 157.1
R-Squared:      0.56634
Adj. R-Squared: 0.56478
Chisq: 727.402 on 2 DF, p-value: < 2.22e-16

As seen above, FE and RE models’ independent variables (GDP per capita and HDI) both have significant influence on BMI prices. Moreover, both models are good to use as their p-values are below alpha = 0.05. The Adjusted R-square variables are high with FE model having a higher value at 60% which means it represents the portion of the total variation in BMI prices that is explained by GDP per capita and HDI.

To determine between FE and RE models, we can use the Hausman-Test:

By running the Hausman-Test, the null hypothesis is that the covariance between IV(s) and alpha is zero. If this is the case, then RE is preferred over FE. If the null hypothesis is not true, we must go with the FE-model.

code block
phtest(fixed,random) #Hausman test

    Hausman Test

data:  bmi_usd_price ~ gdp_per_capita + hdi
chisq = 140.39, df = 2, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent
Note

Insights

Given that the p-value is significant (<0.05), we reject the null hypothesis and can conclude that the fixed effects model is better.

6.2.3 Testing between FE and OLS models

To determine between FE and OLS (simple regression) models, we can use the pFtest to see which one is a better fitted model.

code block
pFtest(fixed, ols)

    F test for individual effects

data:  bmi_usd_price ~ gdp_per_capita + hdi
F = 45.056, df1 = 27, df2 = 530, p-value < 2.2e-16
alternative hypothesis: significant effects
Note

Given that the null hypothesis is that OLS is better fit model than Fixed Effect, the p-value is really small which means we reject the null hypothesis and can conclude that a FE model is a better fit.

6.2.4 FE Model Diagnostics: Plotting predicted vs expected values

After conducting panel data regression, plotting predicted vs expected values allows us to visually assess the accuracy of your model’s predictions. The plot helps us identify any patterns or discrepancies between the predicted values from the Fixed Effects regression model and the actual observed values. It helps us validate the performance of the model and ensure it captures the underlying relationships effectively.

code block
## produce a dataset of prediction, added to the group means
bmi_means <- bmi_data %>% 
  mutate(y = bmi_usd_price) %>% 
  group_by(country) %>% 
  transmute(y_mean = mean(y),
            y = y, 
            year = year) %>% 
  ungroup() %>% 
  mutate(y_pred = predict(fixed) + y_mean) %>% 
  select(-y_mean)

## plot it
bmi_means %>% 
  gather(type, value, y, y_pred) %>% 
  ggplot(aes(x = year, y = value, linetype = type))+
  geom_line() +
  facet_wrap(~country) +
  ggtitle("Visualising in-sample predictions across 28 countries")

Note

Insights

As we can see from the plots above, even though the Adjusted Rsquared for this model is 0.60, its predicted values are consistently higher than the observed y values. This means the model is not effectively capturing the relationships at each panel. Perhaps this poor model performance can be improved with clustering the countries - two different types of clusters (network clustering and time-series clustering) will be provided by my groupmates Firdaus and Jiayi respectively and added into the final Shiny app.

7 Conclusion

In conclusion, we used visualizations to explore the Big Mac Index prices for 28 countries across 20 years of 2002 to 2021. Visually, using univariate analysis, our plots depicted different trajectories in BMI prices across countries and a general upward trend across the years. When both dimensions were combined, we realised that there was an interaction effect between years and countries, and that there were 3 types of relationships ( positive, weak/zero, negative) between BMI prices and years when conditioned on countries.

This was further expanded in the visual time-series analysis that showed several outlier countries such as Hong Kong, Mexico, Russia, South Africa and Turkey. We could also see that when GDP per Capita was plotted against BMI prices, there seemed to be two groups of countries with different trajectories.

Thus, we conducted multivariate analysis to check which variables influenced Big Mac Index prices (putting aside year and country effects). There were interesting patterns e.g. those with large population sizes tended to have lower Big Mac Index prices. However, when statistically tested in a correlation plot, only GDP per capita and Human Development Index were statistically significantly correlated with BMI prices.

Despite our initial claim that the Big Mac Index is a proxy for inflation, our visual and statistical multivariate analyses did not support this claim.

With these variables, we conducted panel data regression analysis to confirm our analysis. Given the heterogeneity of data across years and countries, we cannot use homogeneous panel data models like OLS regression. Instead, through statistical analysis, Fixed Effect regression model is most suitable to explain the variation in BMI prices is explained by GDP per capita and HDI when we account for year and country effects.

8 Storyboard

For my part that comprised exploratory data analysis with univariate and time-series analysis, and confirmatory data analysis with panel data regression, my part in the Shiny web application is developed into the main dashboard, panel data analysis page under the Data Exploration analysis, and panel data regression under the Modelling page.

8.1 Dashboard (Summary)

The first section will consist of a homepage tab that allows users to select a country and year that the user is interested in. With a chloropleth map as the main focus (please refer to my groupmate Firdaus’s Take-home Exercise 4), a lollipop graph and violin+boxplot graph will be placed next to the chloropleth graph. Through this, the user is able to obtain a basic understanding on the individual country’s economic information, their BMI prices in relation to other countries and their BMI prices in relation to other years giving them the overall feel on how the BMI prices vary between countries and years and other information on their economic measures (that are later used for analysis).

Also, upload and export buttons are included for users to import data and export data of a fixed format. With the upload feature, any data uploaded by users will be automatically transformed and immediately analysed in the application. The export button allows the analysis and forecasted results to be exported in data or report format.

Proposed layout of Dashboard

The figure below shows some of the interactive features that will be incorporated into the dashboard. Users will have an overview of the BMI prices during their selected year with four key indicators.

Proposed interactive features of Dashboard

8.2 Panel Data Analysis (Data Exploration)

This section will be created using the ExPanD package that allows for interactive panel data exploration and analyses. Throughout the application, users will be able to select what variables they are interested in and tweak the plots to explore different parameters. After this section, the user will have a clearer understanding on the variables that influence BMI prices and the effects of time and countries on these relationships.

8.3 Panel Data Regression (Modelling)

This section will be also be created using the ExPanD package that allows for interactive panel data exploration and analyses. In this section, it showcases the statistical analysis on the variables that can explain the variation in BMI prices. Users will be able to vary the (multiple) variables in the panel regression analysis and see the results in the table. There will also be options to change the type of panel data regression and to select different clusters e.g. clusters from my groupmates Firdaus’ and Jiayi’s clustering analysis. At the bottom, ExPanD does not allow for custom addition of other plots so I will manually paste the predicted vs observed values plots according to no cluster, network cluster and time-series cluster.p;[

8.4 Conclusion

This section shows the initial prototype of the web application that we are planning to build to analyse and predict Big Mac Index prices.

  1. Dashboard for Analysis and Prediction: The web application will serve as a dashboard where users can perform exploratory analysis, modelling and predicting of Big Mac Index prices.

  2. Customizable Filters: The dashboard will include various filters such as selected periods and countries used for panel data analysis. These filters will allow users to customize their analysis based on their specific needs and preferences.

  3. Personalized User Experience: The inclusion of customizable filters adds a personal touch to the user experience, allowing users to tailor their analysis to suit their individual requirements.

Overall, the web application aims to provide users with a powerful tool for analyzing and predicting Big Mac Index prices, leveraging customizable analysis features to enhance user experiences and provide valuable insights.

Source Code
---
title: "Take-home Exercise 4"
subtitle: "Take-home Exercise 4: [Prototyping Modules for Visual Analytics Shiny Application](https://isss608-ay2023-24jan.netlify.app/take-home_ex/th_ex04)" 
author: "Victoria Neo"
date: 02/29/2024
date-modified: last-modified
format:
  html:
    code-fold: true
    code-summary: "code block"
    code-tools: true
    code-copy: true
execute:
  eval: true
  echo: true
  freeze: true
  warning: false
  message: false
---

![](images/bigmac.gif){fig-align="left" width="630"}

# 1 Overview

As mentioned in my group's [proposal](https://isss608-24jan-group1.netlify.app/proposal/proposal),

The [Big Mac Index](https://en.wikipedia.org/wiki/Big_Mac_Index) (BMI) was invented by [The Economist](https://www.economist.com/big-mac-index) in 1986 as a light-hearted guide to whether currencies are at their “correct” level. It is based on the theory of purchasing-power parity (PPP), with the notion that in the long run, exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries. The BMI has become a global standard, included in several economic textbooks and the subject of dozens of academic studies. Thus, we would like to visually examine the phenomenon of inflation co-movement by looking at the similarities in BMI prices between countries through time, isolate groups with similar characteristics, and determine if the BMI is a suitable proxy to inflation prices.

There is a significant gap in making complex economic data accessible to the general public and providing advanced analysis tools for professionals. Existing visualization tools, including government dashboards, often fail to cater effectively to both audiences. To address this, we propose an R Shiny-based interactive tool that leverages publicly available data, the Big Mac Index to simplify the understanding of global financial trends for laypersons while offering in-depth data exploration and modelling capabilities for analysts.

# 2 **Getting Started**

## 2.1 Installing and loading the required libraries

-   [**AER**](https://cran.r-project.org/web/packages/AER/AER.pdf): for functions used in Applied Econometrics with R

-   **CGPfunctions**: for using `newggslopegraph()` for creating slopegraphs

-   [**corrplot**](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html): provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.

-   [**dplyr**](https://dplyr.tidyverse.org/): for manipulating, concatenating dataframes

-   [**gplot**](https://cran.r-project.org/web/packages/gplots/gplots.pdf): for using `coplot()` for creating coplots

-   [**ggExtra**](https://cran.r-project.org/web/packages/ggExtra/vignettes/ggExtra.html): for adding marginal histograms to ggplot2

-   [**ggHoriPlot**](https://rivasiker.github.io/ggHoriPlot/): for plotting horizon plots

-   [**ggiraphExtra**](https://cran.r-project.org/web/packages/ggiraphExtra/index.html): for using ggPredict`()` for creating interactive multiple linear regression plots

-   [**ggpmisc**](https://cran.r-project.org/web/packages/ggpmisc/vignettes/model-based-annotations.html)**:** for using `stat_poly_eq()` for creating linear regression equations

-   [**ggstatsplot**](https://indrajeetpatil.github.io/ggstatsplot/): is an extension of [`{ggplot2}`](https://github.com/tidyverse/ggplot2) package for creating graphics with details from statistical tests

-   [**kableExtra**](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html)**,** for creating and manipulating complex tables

-   [**lubridate**](https://lubridate.tidyverse.org/): for using [`dmy()`](https://lubridate.tidyverse.org/reference/ymd.html) function to parse the Date field into appropriate Date data type in R

-   [**packHV**](https://cran.r-project.org/web/packages/packHV/packHV.pdf): for using `hist_boxplot()` for creating histograms with boxplots

-   [**patchwork**](https://cran.r-project.org/web/packages/patchwork/vignettes/patchwork.html)**:** for combining multiple ggplot2 graphs into one figure

-   [**parallelPlot**](https://cran.r-project.org/web/packages/parallelPlot/parallelPlot.pdf): for plotting Parallel Coordinates plots

-   [**plm**](https://cran.r-project.org/web/packages/plm/plm.pdf): for creating linear regression plots for panel data

-   [**purrr**](https://purrr.tidyverse.org/): for handling lists and functional programming

-   [**naniar**](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html): for using `miss_vis()` function to check data for missing values

-   [**plotly**](https://plotly.com/r/): to make interactive graphs

-   [**tidyverse**](https://www.tidyverse.org/), a family of modern R packages specially designed to support data science, analysis and communication task including creating static statistical graphs

```{r}
pacman::p_load(AER, CGPfunctions, 
               corrplot, dplyr, 
               DT, gplots,
               ggExtra, ggHoriPlot, 
               ggiraphExtra, ggpmisc,
               ggthemes, ggplot2,
               ggstatsplot,
               kableExtra, lubridate,
               packHV,
               patchwork, parallelPlot, plm,
               purrr, naniar,
               plotly, tidyverse
               ) 
```

## 2.2 **Data Set**

### 2.2.1 Loading the Data Set

We load the dataset for Big Mac Index that has been combined with other data (e.g. GDP, Employment rates etc.) and the country dataset as seen here in our [data preparation](https://isss608-24jan-group1.netlify.app/prototype/data_preparation/data_preparation). We left join the country dataset to the Big Mac Index dataset.

```{r}
indicator_data <- read_csv("data/countries_with_complete_data.csv")
```

```{r}
country_data <- read_csv("data/country_all.csv")
```

```{r}
bmi_data <- left_join(indicator_data, country_data, by = c("country"))
```

### 2.2.2 Checking Data Health

We now check the health of our dataset by using:

-   `datatable()` from the DT package to view the dataframe more interactively,

-   `glimpse()` to look at the structure of the dataframe, data types of the columns, and some values of the dataframe,

-   `duplicate()` to check the dataframe for any duplicated entries using *duplicate()*, and

-   `vis_miss()` to check the state of missing values in the dataset.

-   `describe()` to produce a summary of all the variables.

::: panel-tabset
## **datatable()**

```{r}
datatable(bmi_data, 
          class= "compact",
          rownames = FALSE,
          width="100%", 
          options = list(pageLength = 10,scrollX=T))
```

## **glimpse()**

```{r}
glimpse(bmi_data)
```

## **duplicate()**

```{r}
bmi_data[duplicated(bmi_data),]
```

## **vis_miss()**

```{r}
vis_miss(bmi_data)
```

## **describe()**

```{r}
Hmisc::describe(bmi_data)
```
:::

### 2.2.3 Descriptive Statistics

We now summarize our dataset by using:

-   a function for categorical variables

-   `summary()` for continuous variables

::: panel-tabset
## **fun_cat for categorical variables**

```{r}
fun_cat <- function(x) {
  # Count the number of missing values
  nmiss <- sum(is.na(x))
  # Frequency
  n     <- table(x)
  # Proportion
  p     <- prop.table(n)
  # Putting it together
  OUT   <- cbind(n, p)
  # Add nmiss, but first pad to have the right number of rows
  nmiss <- c(nmiss, rep(NA, nrow(OUT)-1))
  OUT   <- cbind(OUT, nmiss)
  return(OUT)
}
```

## **Categorical variables**

```{r}
fun_cat(bmi_data$country)
fun_cat(bmi_data$currency_code)
fun_cat(bmi_data$year)
fun_cat(bmi_data$continent)
fun_cat(bmi_data$g20)
fun_cat(bmi_data$brics)
fun_cat(bmi_data$eu)
```

## **Continuous variables**

```{r}
cont_data <- data.frame(bmi_data$bmi_localprice, bmi_data$bmi_usd_price, bmi_data$bot, bmi_data$gdp, bmi_data$gdp_per_capita, bmi_data$inflation, bmi_data$hdi, bmi_data$population)

lapply(cont_data, summary)

```

## **bmi_usdprice by year**

```{r}
# Assuming 'bmi_data' is your data frame
result_list <- lapply(
  split(bmi_data, bmi_data$year),
  function(sub_data) summary(sub_data$bmi_usd_price)
)

result_list


```

## **bmi_usdprice by country**

```{r}
# Assuming 'bmi_data' is your data frame
result_list <- lapply(
  split(bmi_data, bmi_data$country),
  function(sub_data) summary(sub_data$bmi_usd_price)
)

result_list


```
:::

# 3 **Exploratory Data Analysis**

```{mermaid}
flowchart TB
  A[Year]
  B[Country]
  C[Big Mac Index price]
  D[Inflation rate]
  E[GDP per capita]
  F[Balance of Trade]
  G[Unemployment rate]
  H[Human Development Index]
  I[Population size]
  
  A --> C
  B --> C
  C --> D
  C --> E
  C --> F
  C --> G
  C --> H
  C --> I

```

Some questions of interest regarding the problem statement of inflation co-movement by BMI are:

-   what type of variation occurs within BMI prices,

-   what type of covariation occurs between BMI prices by dimensions.

To do this, we will look at the following variables:

1.  bmi_usd_price (in order to compare across currencies, we use the BMI prices converted to USD)

by dimensions

1.  country

2.  year

## 3.1 Univariate Analysis: Looking at BMI

```{r}
par(mfrow = c(1, 2))

hist_boxplot(bmi_data$bmi_usd_price,col="#69b3a2",freq=TRUE,  xlab = "Prices in USD", main = "Histogram of Big Mac Index prices in USD")
hist_boxplot(bmi_data$bmi_usd_price,col="#69b3a2",freq=FALSE,density=TRUE, xlab = "Prices in USD", main = "")

```

::: callout-note
### **Insights**

BMI is asymmetrically distributed and skewed to the right. Most of the Big Mac prices (adjusted to USD) are close to 3USD - this makes sense as McDonald's positions itself as an ultra-affordable fast food chain. Its mean (3.2198) is higher than its median (2.9915). But this also means that it has outliers with very expensive Big Mac prices.

Questions:

-   Why are the Big Mac prices close to 3USD?

-   Given McDonald's strategy to be ultra-affordable, why is there a large spread in Big Mac prices and outliers on the right?
:::

## 3.2 Looking at BMI through individual dimensions

### 3.2.1 Country dimension **with Violin+Boxplot**

```{r, fig.height = 8}

bmi_data %>% 
  
  # Calculate mean of bmi_usd_price for each country
  group_by(country) %>%
  mutate(median_bmi = median(bmi_usd_price)) %>%
  ungroup() %>%
  
  # Reorder the countries based on the mean of bmi_usd_price
  mutate(country = reorder(country, median_bmi)) %>%
  
  # Add a column called 'type': do we want to highlight the group or not?
  mutate(type = ifelse(country=="United States", "Highlighted", "Normal")) %>%
  
  # Build the boxplot. 
  ggplot(aes(x = country, y = bmi_usd_price, fill = type, alpha = type)) + 
    geom_violin(width=1.0) +
    geom_boxplot(width=0.3, color="black", alpha=0.8) +
    geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
    scale_fill_manual(values = c("#69b3a2", "grey")) +
    scale_alpha_manual(values = c(1, 0.1)) +
    theme(legend.position = "none") +
    labs(y = "Prices in USD",    
       x = "",                  
       title = "Boxplot of Big Mac Index Prices in USD across countries") +
    coord_flip() +
  # Add mean and median lines
    geom_hline(aes(yintercept = mean(bmi_usd_price), linetype = "Mean"), 
               color = "darkblue", linetype = "dashed", size = 0.5) +
    geom_hline(aes(yintercept = median(bmi_usd_price), linetype = "Median"), 
               color = "darkred", linetype = "dashed", size = 0.5) +
  
    # Add text labels for mean and median
    geom_text(aes(x = mean(bmi_usd_price), label = paste("Mean: ", round(mean(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust =3.5, hjust = 1, color = "darkblue") +
    geom_text(aes(x = median(bmi_usd_price), label = paste("Median: ", round(median(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust = 4, hjust = 1, color = "darkred") +
  
    # Customize legend
    scale_linetype_manual(name = "Lines", values = c("Mean" = "dashed", "Median" = "dashed"),
                          labels = c("Mean", "Median")) +
    guides(fill = guide_legend(override.aes = list(alpha = c(1, 0.1)))) 

```

::: callout-note
### **Insights**

Some of the outliers pointed out in the previous data visualisations can be explained by the country dimension. Highly developed countries such as Switzerland, Sweden and Denmark have the top 3 most expensive Big Mac prices. However, Hong Kong, with also a highly developed economy, has the third lowest Big Mac price. In the boxplot above, there is no clear difference between developed and developing countries (even Philippines is more expensive than Hong Kong).

The above boxplot also shows that there is variation within and between countries - there are some countries with greater variability e.g. Argentina and a number of countries with median BMI prices higher than the overall median BMI price such as Switzerland, Brazil and Canada while countries like Mexico and South Africa (somewhat bimodal) did not see much fluctuations in their BMI prices.

Questions:

-   Why do some countries have greater variability in their BMI prices?

-   Why is there a difference in BMI prices between developed and developing countries?

    -   especially for Hong Kong
:::

### 3.2.2 Year dimension with Violin+Boxplot

```{r, fig.height = 8}
bmi_data %>% 
  mutate(year = factor(year)) %>%
  # Add a column called 'type': do we want to highlight the group or not?
  mutate( type=ifelse(year %in% c("2002", "2021"),"Highlighted","Normal")) %>%
  
  # Build the boxplot. In the 'fill' argument, give this column
  ggplot( aes(x=year, y=bmi_usd_price, fill=type, alpha=type)) + 
    geom_violin(width=1.0) +
    geom_boxplot(width=0.3, color="black", alpha=0.8) +
    geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
    scale_fill_manual(values=c("#69b3a2", "grey")) +
    scale_alpha_manual(values=c(1,0.1)) +
    theme(legend.position = "none") +
    labs(y = "Prices in USD",    
       x = "",                  
       title = "Boxplot of Big Mac Index Prices in USD across years") +
    coord_flip() +
 # Add mean and median lines
    geom_hline(aes(yintercept = mean(bmi_usd_price), linetype = "Mean"), 
               color = "darkblue", linetype = "dashed", size = 0.5) +
    geom_hline(aes(yintercept = median(bmi_usd_price), linetype = "Median"), 
               color = "darkred", linetype = "dashed", size = 0.5) +
  
    # Add text labels for mean and median
    geom_text(aes(x = mean(bmi_usd_price), label = paste("Mean: ", round(mean(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust =3.5, hjust = 1, color = "darkblue") +
    geom_text(aes(x = median(bmi_usd_price), label = paste("Median: ", round(median(bmi_usd_price), 2))),
              y = max(bmi_data$bmi_usd_price, na.rm = TRUE), size = 3, vjust = 4, hjust = 1, color = "darkred") +
  
    # Customize legend
    scale_linetype_manual(name = "Lines", values = c("Mean" = "dashed", "Median" = "dashed"),
                          labels = c("Mean", "Median")) +
    guides(fill = guide_legend(override.aes = list(alpha = c(1, 0.1)))) 
```

::: callout-note
### **Insights**

From the boxplot above, we can see that Big Mac prices have risen through the years - 2021 is the most expensive year for Big Mac prices. Moreover from 2002 - 2007, according to the violin plot, it seemed that the prices were mostly concentrated but there is more spread since 2008. Despite the blip from 2013 - 2016 and 2018 - 2020, which saw falling Big Mac prices, as a whole, they have increased since 2002 which saw the lowest prices for Big Mac.

Questions:

-   Why was there a dip in Big Mac prices from 2013 to 2016 and 2018 to 2020?

-   Why are Big Mac prices rising as a whole?
:::

## 3.3 Looking at BMI through both dimensions

### 3.3.1 Plotting a Coplot plot to determine if there is a third variable influence

The BMI price in USD is plotted on the y axis and the years are on the x axis. The 28 plots show the relationship between these two variables by each country. The bars at the top indicate the countries position from left to right, starting from the bottom row.

```{r, fig.width = 12}
coplot(bmi_usd_price ~ year|country, data =  bmi_data,         
       panel = function(x, y, ...) {         
         points(x, y, ...)         
         abline(lm(y ~ x), col = "blue")})
```

::: callout-note
**Insights**

From the coplots above, there is an interaction between years and countries. Conditionally on country, the relationship between BMI price in USD and year looks roughly linear - the regression results for most countries indicate a positive relationship throughout the years. However, countries like Hong Kong(11), Mexico(15), Russia(20) and South Africa(22) have almost flat regression lines that suggest a weak or almost no relationship between the variables. Moreover, Turkey(26) is the only country with a negative relationship.
:::

# 4 **Time-series Analysis of Big Mac Index through time and countries**

## 4.1 Looking at overall trends with `newsslopegraph()`

```{r, fig.width= 10, fig.height = 15}
bmi_data %>% 
  mutate(year = factor(year)) %>%
  filter(year %in% c(2002, 2011, 2021 )) %>%
  newggslopegraph(year, bmi_usd_price, country,
                Title = "Big Mac Index (USD) across 28 countries",
                SubTitle = "At 10 Years Intervals from 2002-2021",
                Caption = "Source: The Economist",
                YTextSize = 2.5,
                LineThickness = .6,
                LineColor = c(rep("gray",8), "red", rep("gray",18), "#69b3a2"),
    WiderLabels = TRUE)
```

::: callout-note
### **Insights**

From the slopegraph above, we can see that Big Mac prices increased for all countries in 2011 but later increased for some and decreased for others. Switzerland consistently had the most expensive Big Mac prices through the years while Argentina had the cheapest Big Mac in 2002 but is in 15th position (out of 28) in 2021. Most interesting is Hong Kong, that saw a consistent rise in Big Mac process, did not experience sharp changes, and had actually the cheapest Big Mac in 2011.

Moreover, in 2002, the prices between countries were much closer in spread but this spread increased through the years.

Questions:

-   Why did the spread in Big Mac prices between countries increase through the years?

-   Why did Big Mac prices increase in 2011 for all countries?

-   Why did some countries experience a consistent increase in Big Mac prices while others experienced fluctuations?
:::

## 4.2 Looking at variations in patterns with `ggHoriPlot()`

Toggle the different tabsets to see the different horizon plots.

::: panel-tabset
## **using United States mean bmi_usd_price as the baseline**

```{r}
mean_us_bmi_price <- bmi_data %>%
  filter(country == "United States") %>%
  summarise(mean_bmi_usd_price = mean(bmi_usd_price, na.rm = TRUE))

mean_us_bmi_price

bmi_data %>% 
  filter(year >= "2002") %>%
  ggplot() +
  geom_horizon(aes(x = year, y=bmi_usd_price), 
               origin = 3.666, 
               horizonscale = 6)+
  facet_grid(`country`~.) +
    theme_few() +
  scale_fill_hcl(palette = 'RdBu') +
  theme(panel.spacing.y=unit(0, "lines"), strip.text.y = element_text(
    size = 5, angle = 0, hjust = 0),
    legend.position = 'none',
    axis.text.y = element_blank(),
    axis.text.x = element_text(size=7),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
    ) 
```

If US mean bmi_usd_price is used as the origin, Switzerland (and shortly after) Denmark have always had stronger prices than the overall mean Big Mac Index price for USA. China, Hong Kong, Indonesian, Malaysia, Mexico, Philippines, Poland, Russia, South Africa and Turkey have always been weaker in Big Mac prices than the USA.

## **using mean bmi_usd_price as the baseline**

```{r}
bmi_data %>% 
  filter(year >= "2002") %>%
  ggplot() +
  geom_horizon(aes(x = year, y=bmi_usd_price), 
               origin = "mean", 
               horizonscale = 6)+
  facet_grid(`country`~.) +
    theme_few() +
  scale_fill_hcl(palette = 'RdBu') +
  theme(panel.spacing.y=unit(0, "lines"), strip.text.y = element_text(
    size = 5, angle = 0, hjust = 0),
    legend.position = 'none',
    axis.text.y = element_blank(),
    axis.text.x = element_text(size=7),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
    ) 
```

## **using median bmi_usd_price as the baseline**

```{r}
bmi_data %>% 
  filter(year >= "2002") %>%
  ggplot() +
  geom_horizon(aes(x = year, y=bmi_usd_price), 
               origin = "median", 
               horizonscale = 6)+
  facet_grid(`country`~.) +
    theme_few() +
  scale_fill_hcl(palette = 'RdBu') +
  theme(panel.spacing.y=unit(0, "lines"), strip.text.y = element_text(
    size = 5, angle = 0, hjust = 0),
    legend.position = 'none',
    axis.text.y = element_blank(),
    axis.text.x = element_text(size=7),
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
    ) 
```
:::

::: callout-note
### **Insights**

The horizon plot (**using median bmi_usd_price as the baseline)** showed that all countries were initially below the median Big Mac price and all experienced some increase in price. China, Hong Kong, Korea, Singapore, Thailand and United States had a huge increase in Big Mac prices around 2018 to 2022. However, Turkey experienced a large drop in recent years while Russia experienced huge increase and than a large drop around 2015 and only had a smaller increase later on.

Questions:

-   Why do some countries experience a significant increase in Big Mac prices together over the same period?

-   Why are there small sharp peaks for some years and long peaks for others?

-   What other factors account for this co-movement in Big Mac prices? GDP per capita?
:::

## 4.3 Adding GDP per capita to track the movement of BMI through the years, countries (colour) and population (size) with a Bubble Plot

```{r}
bmi_data %>%
  plot_ly(x = ~gdp_per_capita, 
          y = ~bmi_usd_price, 
          size = ~population, 
          color = ~country,
          sizes = c(20, 150),
          frame = ~year, 
          text = ~paste(country, "<br>Population: ", gdp_per_capita, "<br>Big Mac Price: USD$", bmi_usd_price,"<br>GDP per Capita: $", gdp_per_capita, "<br>Inflation: ", inflation, "%"), 
          hoverinfo = "text",
          type = 'scatter',
          mode = 'markers'
          ) %>%
  layout(showlegend = FALSE,
    xaxis = list(title = "GDP per Capita"),
    yaxis = list(title = "Price in USD"),
    title = "A Bubbleplot of Big Mac Index (USD) across 28 countries (colour) with population (size) through the years")
```

::: callout-note
### **Insights**

At the start, it seemed like there was a relationship between GDP per capita and BMI prices - the higher one's GDP per capita was, the higher the Big Mac price in USD. Switzerland with the highest GDP per capita also saw the most expensive Big Mac prices. The only exception is Hong Kong that had a small inflation rate throughout. For those with higher GDP per capita, their inflation rates were almost the same. While countries with higher GDP per capita above 40k saw their GDP per capita increasing with Big Mac prices consistently above USD\$3 since 2010, countries with GDP per capita below 20k saw small increases in their GDP per capita, still remained below 20k while Big Mac prices had a wider variability. Also, most of the countries in our study are of the similar population size.

Questions:

-   There seems to be two different trajectory of BMI prices for countries with different GDP per capita. Are there other factors influencing this? Are there other trajectories that can be seen?

-   We have been mostly looking at country and year effects. Are there other variables that affect BMI as a whole?
:::

# 5 **Multivariate Analysis of Big Mac Index through time and countries**

## 5.1 Plotting a Parallel Coordinates plot to see relationships between variables

```{r, fig.width=12, fig.height=10}
pc <- bmi_data %>%
  select( gdp_per_capita, bot, inflation, unemployment, hdi, population, bmi_usd_price)
histoVisibility <- rep(TRUE, ncol(pc))
parallelPlot(pc,
             histoVisibility = histoVisibility)
```

::: callout-note
**Insights**

Based on the parallel coordinates plot, there appears to be interesting patterns. For example, countries lower gdp_per-capita, higher balance of trade (net export - net import) with low inflation and unemployment, and big population sizes had low Big Mac Index prices. Countries with high gdp_per_capita with high human development index (HDI) and small population sizes had very high Big Mac Index prices.

Questions:

-   What are the correlations between variables with BMI in USD prices?
:::

## 5.2 Plotting a Correlation plot to see correlations between variables

```{r, fig.width = 12}
columns_of_interest <- c("gdp_per_capita", "bot", "inflation", "unemployment", "hdi", "population", "bmi_usd_price")
bmi.cor <- cor(bmi_data[, columns_of_interest])

bmi.sig = cor.mtest(bmi.cor, conf.level= .95)
corrplot(bmi.cor,
         method = "number",
         type = "lower",
         diag = FALSE,
         tl.col = "black",
         tl.srt = 45,
         p.mat = bmi.sig$p,
         sig.level = .05)
```

::: callout-note
**Insights**

GDP per capita, inflation and Human Development Index are significantly correlated to Big Mac Index prices. Interestingly, even though we had initially posited that inflation and Big Mac prices are correlated, it appears that they are not - inflation is not correlated to Big Mac Index prices. However, we cannot tell if there is a year/country effect - there is no information on the correlation even though though these variables are constant in the time dimension and across countries.
:::

# 6 Panel Data Regression

Given the time-series (by 20 years) and cross sectionality (by 28 countries) of our data, panel data regression is a good technique to analyze the relationship between our independent variables and Big Mac Index price.

## 6.1 Homogenous Panel Data model

From the article [**Introduction to the Fundamentals of Panel Data**](https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/),

> Homogeneous (or pooled) panel data models assume that the model parameters are common across individuals.

Thus, the basic OLS regression model does not consider heterogeneity across countries or years.

```{r}
ols <- lm(bmi_usd_price ~ gdp_per_capita + hdi, data = bmi_data)
summary(ols)
```

```{r, fig.width = 12, fig.height = 10}
ggPredict(ols,interactive=TRUE)
```

::: callout-note
**Insights**

Each point represents observations of GDP per capita, HDI and BMI price in USD - the multiple regression results indicate a positive relationship throughout the years.. However, this cannot account for omitted unobservable factors that exist from country to country and year to year.
:::

## 6.2 Heterogeneous Panel Data models

From the article [**Introduction to the Fundamentals of Panel Data**](https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/),

> Heterogeneous models allow for any or all of the model parameters to vary across individuals. Fixed effects and random effects models are both examples of heterogeneous panel data models.

### 6.2.1 Testing for Heterogeneity across Year and Country

**By Year**

```{r}
plotmeans(bmi_usd_price~year, data = bmi_data, n.label=FALSE,
          xlab="Years",
          ylab="Big Mac Index price (USD)")
```

**By Country**

```{r}
plotmeans(bmi_usd_price~country, data = bmi_data, n.label=FALSE,
          xlab="Years",
          ylab="Big Mac Index price (USD)")
```

::: callout-note
**Insights**

We can see from the above plots that there is no homogeneity across years and countries. Hence, we should use heterogeneous panel data models.
:::

### 6.2.2 Choosing which model is appropriate

Definitions of different models taken from the articles [**A Guide to Panel Data Regression: Theoretics and Implementation with Python**](https://towardsdatascience.com/a-guide-to-panel-data-regression-theoretics-and-implementation-with-python-4c84c5055cf8)and [**Panel Data Using R: Fixed-effects and Random-effects**](https://libguides.princeton.edu/R-Panel):

1.  Fixed Effects model

    > The FE-model determines individual effects of unobserved, independent variables as constant (“fix“) over time. Within FE-models, the relationship between unobserved, independent variables and the IVs (i.e. endogeneity) can be existent.
    >
    > ![](images/clipboard-2889706925.png)
    >
    > Moreover,
    >
    > **FE remove the effect of those time-invariant characteristics so we can assess the net effect of the predictors on the outcome variable.**
    >
    > Another important assumption of the FE model is that those time-invariant characteristics are unique to the individual and should not be correlated with other individual characteristics. Each entity is different therefore the entity’s error term and the constant (which captures individual characteristics) should not be correlated with the others. If the error terms are correlated, then FE is no suitable since inferences may not be correct and you need to model that relationship (probably using random-effects).

2.  Random Effects model

    > RE-models determine individual effects of unobserved, independent variables as random variables over time. They are able to “switch” between OLS and FE and hence, can focus on both dependencies **between** and **within** **individuals**. 
    >
    > ![](images/clipboard-493286578.png)
    >
    > Moreover,
    >
    > An advantage of random effects is that you can include time invariant variables (i.e. gender). In the fixed effects model these variables are absorbed by the intercept. The cost is the possibility of inconsistent estimators, of the assumption is inappropriate.

Using the two statistically significant correlated variables of GDP per capita and HDI, we will create the panel regression to analyze the relationship with Big Mac Index prices.

::: panel-tabset
## **Fixed effects model**

```{r}
fixed <- plm(bmi_usd_price ~ gdp_per_capita+hdi, data=bmi_data, index=c("country", "year"), model="within")  #fixed model
summary(fixed)
```

## **Random effects model**

```{r}
random <- plm(bmi_usd_price ~ gdp_per_capita+hdi, data=bmi_data, index=c("country", "year"), model="random")  #fixed model
summary(random)
```
:::

As seen above, FE and RE models' independent variables (GDP per capita and HDI) both have significant influence on BMI prices. Moreover, both models are good to use as their p-values are below alpha = 0.05. The Adjusted R-square variables are high with FE model having a higher value at 60% which means it represents the portion of the total variation in BMI prices that is explained by GDP per capita and HDI.

To determine between FE and RE models, we can use the Hausman-Test:

> By running the Hausman-Test, the null hypothesis is that the covariance between IV(s) and *alpha* is zero. If this is the case, then RE is preferred over FE. If the null hypothesis is not true, we must go with the FE-model.

```{r}
phtest(fixed,random) #Hausman test
```

::: callout-note
**Insights**

Given that the p-value is significant (\<0.05), we reject the null hypothesis and can conclude that the fixed effects model is better.
:::

### 6.2.3 Testing between FE and OLS models

To determine between FE and OLS (simple regression) models, we can use the pFtest to see which one is a better fitted model.

```{r}
pFtest(fixed, ols)
```

::: callout-note
Given that the null hypothesis is that OLS is better fit model than Fixed Effect, the p-value is really small which means we reject the null hypothesis and can conclude that a FE model is a better fit.
:::

### 6.2.4 FE Model Diagnostics: Plotting predicted vs expected values

After conducting panel data regression, plotting predicted vs expected values allows us to visually assess the accuracy of your model's predictions. The plot helps us identify any patterns or discrepancies between the predicted values from the Fixed Effects regression model and the actual observed values. It helps us validate the performance of the model and ensure it captures the underlying relationships effectively.

```{r, fig.width=12}
## produce a dataset of prediction, added to the group means
bmi_means <- bmi_data %>% 
  mutate(y = bmi_usd_price) %>% 
  group_by(country) %>% 
  transmute(y_mean = mean(y),
            y = y, 
            year = year) %>% 
  ungroup() %>% 
  mutate(y_pred = predict(fixed) + y_mean) %>% 
  select(-y_mean)

## plot it
bmi_means %>% 
  gather(type, value, y, y_pred) %>% 
  ggplot(aes(x = year, y = value, linetype = type))+
  geom_line() +
  facet_wrap(~country) +
  ggtitle("Visualising in-sample predictions across 28 countries")
```

::: callout-note
**Insights**

As we can see from the plots above, even though the Adjusted Rsquared for this model is 0.60, its predicted values are consistently higher than the observed y values. This means the model is not effectively capturing the relationships at each panel. Perhaps this poor model performance can be improved with clustering the countries - two different types of clusters (network clustering and time-series clustering) will be provided by my groupmates Firdaus and Jiayi respectively and added into the final Shiny app.
:::

# 7 Conclusion

In conclusion, we used visualizations to explore the Big Mac Index prices for 28 countries across 20 years of 2002 to 2021. Visually, using univariate analysis, our plots depicted different trajectories in BMI prices across countries and a general upward trend across the years. When both dimensions were combined, we realised that there was an interaction effect between years and countries, and that there were 3 types of relationships ( positive, weak/zero, negative) between BMI prices and years when conditioned on countries.

This was further expanded in the visual time-series analysis that showed several outlier countries such as Hong Kong, Mexico, Russia, South Africa and Turkey. We could also see that when GDP per Capita was plotted against BMI prices, there seemed to be two groups of countries with different trajectories.

Thus, we conducted multivariate analysis to check which variables influenced Big Mac Index prices (putting aside year and country effects). There were interesting patterns e.g. those with large population sizes tended to have lower Big Mac Index prices. However, when statistically tested in a correlation plot, only GDP per capita and Human Development Index were statistically significantly correlated with BMI prices.

Despite our initial claim that the Big Mac Index is a proxy for inflation, our visual and statistical multivariate analyses did not support this claim.

With these variables, we conducted panel data regression analysis to confirm our analysis. Given the heterogeneity of data across years and countries, we cannot use homogeneous panel data models like OLS regression. Instead, through statistical analysis, Fixed Effect regression model is most suitable to explain the variation in BMI prices is explained by GDP per capita and HDI when we account for year and country effects.

# 8 Storyboard

For my part that comprised exploratory data analysis with univariate and time-series analysis, and confirmatory data analysis with panel data regression, my part in the Shiny web application is developed into the main dashboard, panel data analysis page under the Data Exploration analysis, and panel data regression under the Modelling page.

## 8.1 Dashboard (Summary)

The first section will consist of a homepage tab that allows users to select a country and year that the user is interested in. With a chloropleth map as the main focus (please refer to my groupmate Firdaus's Take-home Exercise 4), a lollipop graph and violin+boxplot graph will be placed next to the chloropleth graph. Through this, the user is able to obtain a basic understanding on the individual country's economic information, their BMI prices in relation to other countries and their BMI prices in relation to other years giving them the overall feel on how the BMI prices vary between countries and years and other information on their economic measures (that are later used for analysis).

Also, upload and export buttons are included for users to import data and export data of a fixed format. With the upload feature, any data uploaded by users will be automatically transformed and immediately analysed in the application. The export button allows the analysis and forecasted results to be exported in data or report format.

![Proposed layout of Dashboard](images/clipboard-3472528936.png){fig-align="left"}

The figure below shows some of the interactive features that will be incorporated into the dashboard. Users will have an overview of the BMI prices during their selected year with four key indicators.

![Proposed interactive features of Dashboard](images/clipboard-2052266158.png)

## 8.2 Panel Data Analysis (Data Exploration)

This section will be created using the ExPanD package that allows for interactive panel data exploration and analyses. Throughout the application, users will be able to select what variables they are interested in and tweak the plots to explore different parameters. After this section, the user will have a clearer understanding on the variables that influence BMI prices and the effects of time and countries on these relationships.

![](images/clipboard-578988398.png)

![](images/clipboard-1648077134.png)

![](images/clipboard-745088425.png)

![](images/clipboard-1057339020.png)

![](images/clipboard-3733328162.png)

## 8.3 Panel Data Regression (Modelling)

This section will be also be created using the ExPanD package that allows for interactive panel data exploration and analyses. In this section, it showcases the statistical analysis on the variables that can explain the variation in BMI prices. Users will be able to vary the (multiple) variables in the panel regression analysis and see the results in the table. There will also be options to change the type of panel data regression and to select different clusters e.g. clusters from my groupmates Firdaus' and Jiayi's clustering analysis. At the bottom, ExPanD does not allow for custom addition of other plots so I will manually paste the predicted vs observed values plots according to no cluster, network cluster and time-series cluster.p;\[

![](images/clipboard-1735006144.png)

![](images/clipboard-3838105402.png)

## 8.4 Conclusion

This section shows the initial prototype of the web application that we are planning to build to analyse and predict Big Mac Index prices.

1.  **Dashboard for Analysis and Prediction**: The web application will serve as a dashboard where users can perform exploratory analysis, modelling and predicting of Big Mac Index prices.

2.  **Customizable Filters**: The dashboard will include various filters such as selected periods and countries used for panel data analysis. These filters will allow users to customize their analysis based on their specific needs and preferences.

3.  **Personalized User Experience**: The inclusion of customizable filters adds a personal touch to the user experience, allowing users to tailor their analysis to suit their individual requirements.

Overall, the web application aims to provide users with a powerful tool for analyzing and predicting Big Mac Index prices, leveraging customizable analysis features to enhance user experiences and provide valuable insights.

##