ANALYSIS OF BELLABEAT – A WELLNESS TECHNOLOGY COMPANY

(A CASE STUDY USING R)

COMPANY OVERVIEW

Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.

Bellabeat was founded by Urška Sršen and Sando Mur in 2013. It is a high-tech company that manufactures health-focused smart products.

Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.

Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively, using Google Search, maintaining active Facebook and Instagram pages, and consistently engaging consumers on Twitter.

The Products

  • Bellabeat App: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

PHASE ONE (ASK)

Problem Definition

The company wants to unlock new growth opportunities with their products.

The Stakeholders

The main stakeholders for this project are:

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: a mathematician and Bellabeat’s cofounder, and a key member of the Bellabeat executive team
  • Bellabeat marketing analytics team: a team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy

Deliverable

Analyze smart device data to gain insight into how consumers are using their smart devices. The insights discovered should help guide marketing strategy for the company.

An analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. The task is to analyze smart device usage data to understand how people already use their smart devices and then, based on those trends, to provide high-level recommendations for Bellabeat’s marketing strategy.

PHASE TWO (PREPARE)

Dataset Used

This analysis uses the Kaggle dataset FitBit Fitness Tracker Data, a public domain dataset made available by Mobius. It contains personal fitness tracker data from thirty Fitbit users.

Data Credibility

Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

The dataset appears reliable, though a sample of about thirty users may be too small to support firm conclusions. It is split into several tables, each covering a different aspect of the device data.

Data Storage, Organization & Sorting

There are 18 CSV files. Each file holds a different kind of data tracked by Fitbit and contains multiple rows. Each user has a unique ID.

For this analysis, I will focus on the daily timeframe, since the aim is to detect high-level trends in usage rather than to assess the performance of individual users. The daily activity and sleep data are the most suitable for this. The files used are: dailyActivity_merged, hourlySteps_merged, and sleepDay_merged.

Tools To Use

I will use R for this project.

So I need to install and load these R libraries: tidyverse, skimr, janitor, scales, reshape2, lubridate, ggpubr

# Installing Libraries

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("scales")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("reshape2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggpubr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
#Importing Libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggpubr)
#Importing the datasets
daily_activities <- read_csv("dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv("sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv("hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Inspect each dataset
head(daily_activities)
head(daily_sleep)
head(hourly_steps)
str(daily_activities)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/16" "4/13/16" "4/14/16" "4/15/16" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(daily_sleep)
## spc_tbl_ [413 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(hourly_steps)
## spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id          : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : num [1:22099] 373 160 151 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityHour = col_character(),
##   ..   StepTotal = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
#Find out how many unique users there are in each dataset
n_unique(daily_activities$Id)
## [1] 33
n_unique(daily_sleep$Id)
## [1] 24
n_unique(hourly_steps$Id)
## [1] 33
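
The user counts differ between tables (33 versus 24), so some users have activity data but no sleep records. A minimal base R sketch, using made-up Ids rather than the real data, of how one might list those users and measure the overlap:

```r
# Hypothetical Ids for illustration only (not taken from the Fitbit data)
activity_ids <- c(101, 102, 103, 104)
sleep_ids    <- c(101, 103)

# Users tracked for activity but with no sleep records
no_sleep <- setdiff(activity_ids, sleep_ids)
no_sleep

# Share of activity users who also logged sleep
sleep_coverage <- length(intersect(activity_ids, sleep_ids)) / length(activity_ids)
sleep_coverage
```

With the real tables, the same calls could be applied to unique(daily_activities$Id) and unique(daily_sleep$Id).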

PHASE THREE (PROCESS)

Data Cleaning

#check for duplicates
sum(duplicated(daily_activities))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 3
sum(duplicated(hourly_steps))
## [1] 0

The daily_sleep dataset has 3 duplicate rows among its 413 records, which need to be removed.

#removing duplicates and rows with missing values
daily_activities <- daily_activities %>%
  distinct() %>%
  drop_na()

daily_sleep <- daily_sleep %>%
  distinct() %>%
  drop_na()

hourly_steps <- hourly_steps %>%
  distinct() %>%
  drop_na()
#reconfirm duplicate
sum(duplicated(daily_sleep))
## [1] 0
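
Before dropping duplicates it can help to look at the affected rows. The following is a small base R sketch on made-up records (the values are hypothetical) that flags every copy of a duplicated row and then keeps one copy of each:

```r
# Hypothetical sleep records for illustration only
sleep <- data.frame(
  Id = c(1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 327, 412, 340)
)

# Flag every copy of a duplicated row, including the first occurrence
is_dupe <- duplicated(sleep) | duplicated(sleep, fromLast = TRUE)
sleep[is_dupe, ]

# Keep one copy of each row, which is the effect of distinct() on all columns
sleep_clean <- sleep[!duplicated(sleep), ]
nrow(sleep_clean)
```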

Data Formatting

#Ensure the date format is consistent
#ActivityDate uses a two-digit year ("4/12/16"), which mdy() reads as 2016
daily_activities <- daily_activities %>%
  mutate(ActivityDate = mdy(ActivityDate))
#SleepDay uses a four-digit year followed by a time ("4/12/2016 12:00:00 AM")
daily_sleep <- daily_sleep %>%
  mutate(SleepDay = as_date(SleepDay, format = "%m/%d/%Y"))
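
The format string has to match how the year is written. The str() output earlier shows ActivityDate values such as "4/12/16" (two-digit year) and SleepDay values such as "4/12/2016 12:00:00 AM" (four-digit year). A small base R sketch of the difference between %y and %Y on those two sample strings:

```r
# Sample strings copied from the str() output shown earlier
activity_date <- "4/12/16"
sleep_day     <- "4/12/2016 12:00:00 AM"

# %y reads a two-digit year; %Y reads the year digits exactly as written
parsed_y <- as.Date(activity_date, format = "%m/%d/%y")
parsed_Y <- as.Date(activity_date, format = "%m/%d/%Y")
as.numeric(format(parsed_y, "%Y"))  # 2016
as.numeric(format(parsed_Y, "%Y"))  # 16, i.e. the year 16 AD

# Trailing time text is ignored when the format only covers the date
parsed_sleep <- as.Date(sleep_day, format = "%m/%d/%Y")
as.numeric(format(parsed_sleep, "%Y"))  # 2016
```

A wrong year does not raise an error, so checking range() of the parsed column is a cheap safeguard.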

#Consistent Date column name will be helpful
daily_activities <- daily_activities %>%
  rename(date = ActivityDate)
daily_sleep <- daily_sleep %>%
  rename(date = SleepDay)
#Split date and time in daily_sleep
daily_sleep$date <- as.Date(daily_sleep$date, "%m/%d/%Y")
#a date-only value converts to midnight UTC, so Time is "12:00:00 AM" for every row
daily_sleep$Time <- format(as.POSIXct(daily_sleep$date), format = "%I:%M:%S %p", tz = "UTC")
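#We want to see the correlation between daily activities and daily sleep, so we merge the datasets.
#Merging on Id alone pairs each activity day with every sleep day of the same user;
#date.x is the activity date and date.y is the sleep date.
daily_data <- merge(daily_activities, daily_sleep, by=c("Id"))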
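
The choice of merge keys changes the shape of the result. As a hedged illustration on made-up data (one hypothetical user, two days), the sketch below compares merging on Id alone with merging on Id and date. It assumes both tables hold a date column of the same class:

```r
# Hypothetical data for one user over two days (illustration only)
activity <- data.frame(
  Id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalSteps = c(13162, 10735)
)
sleep <- data.frame(
  Id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327, 384)
)

# Joining on Id alone pairs every activity day with every sleep day
by_id <- merge(activity, sleep, by = "Id")
nrow(by_id)       # 4 rows: 2 activity days x 2 sleep days
names(by_id)      # includes date.x and date.y

# Joining on Id and date keeps one row per user per day
by_id_date <- merge(activity, sleep, by = c("Id", "date"))
nrow(by_id_date)  # 2 rows
```

Per-user averages come out the same either way, but day-level comparisons such as steps versus sleep on the same day only make sense with the second form.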

PHASE FOUR (ANALYZE & SHARE)

We don’t have any demographic variables from our sample dataset to work with, so we may want to determine the category of users from our available data.

With reference to https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/

We will classify the users by activity considering the daily amount of steps taken as follows:

  • Sedentary Active - fewer than 5,000 steps a day
  • Low Active - between 5,000 and 7,499 steps a day
  • Relatively Active - between 7,500 and 9,999 steps a day
  • Very Active - between 10,000 and 12,500 steps a day
  • Highly Active - more than 12,500 steps a day

Let’s calculate the daily average steps per user:

#Get Averages
daily_average <- daily_data %>%
  group_by(Id) %>%
  summarise(ave_daily_steps = mean(TotalSteps), ave_daily_calories = mean(Calories), ave_daily_sleep = mean(TotalMinutesAsleep))
 
head(daily_average)
#classifying the users
#an average of exactly 12500 falls in "Very Active", so no user is left unclassified
user_category <- daily_average %>%
  mutate(user_category = case_when(
    ave_daily_steps < 5000 ~ "Sedentary Active",
    ave_daily_steps >= 5000 & ave_daily_steps < 7500 ~ "Low Active",
    ave_daily_steps >= 7500 & ave_daily_steps < 10000 ~ "Relatively Active",
    ave_daily_steps >= 10000 & ave_daily_steps <= 12500 ~ "Very Active",
    ave_daily_steps > 12500 ~ "Highly Active"
  ))

head(user_category)
#we will now get the percentages of each category into a new data frame

user_category_percent <- user_category %>%
  group_by(user_category) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(user_category) %>%
  summarise(total_percent = total / totals) %>%
  mutate(labels = scales::percent(total_percent))

user_category_percent$user_category <- factor(user_category_percent$user_category, levels = c("Highly Active", "Very Active", "Relatively Active", "Low Active", "Sedentary Active"))

head(user_category_percent)
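
The share of users per category can also be computed with base R alone. A minimal sketch on made-up category labels (six hypothetical users), shown as an alternative and not as the method used above:

```r
# Hypothetical category labels for six users (illustration only)
user_category <- c("Low Active", "Very Active", "Low Active",
                   "Sedentary Active", "Low Active", "Very Active")

# Count users per category, then convert counts to shares of the total
counts <- table(user_category)
shares <- prop.table(counts)

# Format as whole-number percentage labels without extra packages
labels <- paste0(round(100 * shares), "%")
data.frame(category = names(shares), share = as.numeric(shares), label = labels)
```

table() sorts the categories alphabetically, so a factor with explicit levels is still needed for a meaningful order in a chart.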
#Here is the distribution of category 
user_category_percent %>%
  ggplot(aes(x=user_category,y=total_percent, fill=user_category)) +
  geom_col() +
  labs(title="User Category Distribution")

#Calories burned by Steps per User Category
#(the fill aesthetic is omitted: a continuous fill cannot be applied to a boxplot and only triggers a warning)
user_category %>%
  ggplot(aes(x=ave_daily_steps,y=ave_daily_calories, colour =user_category)) +
  geom_boxplot() +
  facet_wrap(~user_category) +
  labs(title="Calories burned by Steps per User Category")

#To see what day of the week the users are most active and what day of the week they sleep more.

weekdays_activities <- daily_data %>%
  mutate(weekday = weekdays(as.Date(date.x)))

weekdays_activities$weekday <- ordered(weekdays_activities$weekday, levels = c("Monday","Tuesday","Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

weekdays_activities <- weekdays_activities %>%
  group_by(weekday) %>%
  summarise(daily_steps = mean(TotalSteps), daily_sleep = mean(TotalMinutesAsleep))

head(weekdays_activities)
ggarrange(
  ggplot(weekdays_activities) +
    geom_col(aes(weekday, daily_steps, fill = weekday)) +
    geom_hline(yintercept = 7500) +
    labs(title = "Daily Steps per Weekday", x="", y="") +
    theme(axis.text.x = element_text(angle = 45, vjust=0.5, hjust= 1)),
  ggplot(weekdays_activities, aes(weekday, daily_sleep/60, fill = weekday)) +
    geom_col() +
    geom_hline(yintercept = 8) +
    labs(title = "Hours of Sleep per Weekday", x="", y="") +
    theme(axis.text.x = element_text(angle = 45, vjust=0.5, hjust= 1))
)

From the above charts, we notice that users walk the recommended 7500 steps daily except on Tuesdays and Fridays. Also, on average, users sleep less than the recommended 8 hours per day.
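
The weekday ordering above relies on weekdays() returning English day names, which depends on the session locale. A possible locale-independent variant, sketched in base R on a sample week, derives the order from the ISO weekday number instead:

```r
# Sample dates from the week of 11 April 2016 (a Monday)
dates <- as.Date("2016-04-11") + 0:6

# %u gives the ISO weekday number (1 = Monday ... 7 = Sunday) in any locale
weekday_number <- as.integer(format(dates, "%u"))

# Attach English labels in calendar order via the number, not the locale name
weekday <- factor(weekday_number, levels = 1:7,
                  labels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                             "Friday", "Saturday", "Sunday"), ordered = TRUE)
weekday
```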

#We need to determine if there is any correlation between daily steps and daily sleep hours, as well as between daily steps and calories burnt

ggarrange(
  ggplot(daily_data, aes(x=TotalSteps, y=TotalMinutesAsleep/60)) +
    geom_point() +
    geom_smooth(method="loess", formula="y ~ x", color= "red") +
    labs(title = "Daily Steps vs Hours of Sleep", x = "Daily Steps", y="Hours of Sleep") +
    theme(panel.background = element_blank(),
          plot.title = element_text(size=13)),
  ggplot(daily_data, aes(x=TotalSteps, y=Calories)) +
    geom_point() +
    geom_smooth(method="loess", formula="y ~ x",color = "red") +
    labs(title = "Daily Steps vs Calories", x = "Daily Steps", y= "Calories") +
    theme(plot.title = element_text(size=13))
)

The first chart shows no clear correlation between daily steps and hours of sleep, while the second shows a correlation between daily steps and calories burnt. In this sample, hours of sleep do not appear to be related to the number of steps recorded by the Fitbit devices.
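
A chart can suggest a relationship, but a correlation coefficient puts a number on it. A minimal sketch using made-up values (not results from the Fitbit data) of how the two relationships could be quantified with base R:

```r
# Hypothetical daily values for illustration only (not from the Fitbit data)
steps    <- c(2000, 4500, 6000, 8000, 10000, 12000, 15000)
calories <- c(1500, 1750, 1900, 2100, 2300, 2600, 3000)
sleep_h  <- c(7.1, 6.2, 7.8, 6.5, 7.4, 6.0, 7.0)

# Pearson correlation coefficient: values near 0 suggest no linear relationship,
# values near +1 or -1 a strong one
r_calories <- cor(steps, calories)
r_sleep    <- cor(steps, sleep_h)
round(c(steps_vs_calories = r_calories, steps_vs_sleep = r_sleep), 2)

# cor.test() adds a p-value and confidence interval for the same coefficient
cor.test(steps, calories)$p.value
```

On the real data, the same calls could be applied to daily_data$TotalSteps, daily_data$Calories, and daily_data$TotalMinutesAsleep.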

PHASE FIVE (ACT)

The analysis produced some insights that could help shape Bellabeat’s marketing strategy.

Recommendations:

  • The Bellabeat app could recommend sleep hours to users who want to improve their sleep.
  • Bellabeat could suggest ideas for low-calorie lunches to users.
  • Users who work full-time jobs and spend a lot of time at the computer or in meeting-focused work.