STAT 19000: Project 5 — Fall 2021
Motivation: As briefly mentioned in project 4, R differs from other programming languages in that typically you will want to avoid using for loops, and instead use vectorized functions and the "apply" suite. In this project we will use vectorized functions to solve a variety of data-driven problems.
Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.
Scope: r, data.frames, recycling, factors, if/else, for loops, apply suite
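To make the contrast between loops and vectorized functions concrete, here is a minimal sketch on made-up numbers. The vector toy_views is hypothetical and not taken from the YouTube data. Both versions compute the same result; the vectorized one relies on R applying the division to every element and recycling the single value 1000.

```r
# Toy data (hypothetical values, not from the YouTube dataset)
toy_views <- c(1500, 32000, 870)

# Loop version: compute one element at a time
thousands_loop <- numeric(length(toy_views))
for (i in seq_along(toy_views)) {
  thousands_loop[i] <- toy_views[i] / 1000
}

# Vectorized version: the division is applied element-wise in one call,
# and the single value 1000 is recycled across the whole vector
thousands_vec <- toy_views / 1000

# Check that both approaches agree
identical(thousands_loop, thousands_vec)
```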
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/youtube/*.{csv,json}
Questions
Question 1
Read the dataset USvideos.csv into a data.frame called us_youtube. The dataset contains YouTube trending videos between 2017 and 2018.
The dataset has two columns that refer to time: The column |
When working with dates, it is important to use tools built specifically for this purpose (rather than, for example, string manipulation). We’ve provided you with the code below. The provided code uses the lubridate package, an excellent package that hides away many common issues that occur when working with dates. Feel free to check out the official cheatsheet in case you’d like to learn more about the package.
Run the code below to create two new columns: trending_year and publish_year.
library(lubridate)
# convert columns to date formats
us_youtube$trending_date <- ydm(us_youtube$trending_date)
us_youtube$publish_time <- ymd_hms(us_youtube$publish_time)
# extract the trending_year and publish_year
us_youtube$trending_year <- year(us_youtube$trending_date)
us_youtube$publish_year <- year(us_youtube$publish_time)
unique(us_youtube$trending_year)
unique(us_youtube$publish_year)
Take a look at our newly created columns. What type are the new columns? In the provided code, which (if any) of the 4 functions are vectorized?
Now, duplicate the functionality of the provided code using only the following functions: as.numeric, substr, and regular vectorized operations like +, -, *, and /. Which was easier?
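As a minimal sketch of the substr and as.numeric approach, the example below works on two made-up strings per column. The vectors toy_trending and toy_publish are hypothetical, and the formats (a two-digit year first in the trending date, a four-digit year first in the publish time) are assumptions; check the actual values in us_youtube before relying on these character positions.

```r
# Hypothetical strings that mimic the assumed date formats
toy_trending <- c("17.14.11", "18.01.06")
toy_publish <- c("2017-11-13T17:13:01.000Z", "2018-05-30T08:00:00.000Z")

# substr is vectorized: it extracts the same positions from every string.
# The two-digit year needs 2000 added to become a four-digit year.
toy_trending_year <- as.numeric(substr(toy_trending, 1, 2)) + 2000

# The publish time is assumed to start with a four-digit year
toy_publish_year <- as.numeric(substr(toy_publish, 1, 4))

toy_trending_year
toy_publish_year
```

Note that this approach only works while every string has the year in exactly the same position, which is one reason a date-aware package is usually the safer choice.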
Relevant topics: read.csv, typeof
- Code used to solve this problem.
- Output from running the code.
Question 2
While some great content certainly comes out of the United States, we have a lot of other great content from other countries. Plus, the size of the data is reasonable to combine into a single data.frame.
Look in the following directory: /depot/datamine/data/youtube. You will find files that look like this:
CAvideos.csv DEvideos.csv USvideos.csv ...
You will notice how each dataset follows the same naming convention. Each file starts with the country code (US, DE, CA, etc.), followed immediately by "videos.csv".
Use a loop and the given vector to systematically combine the data into a new data.frame called yt.
countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')
Loop through each of the values in countries. Use the paste0 function to create a string that is the absolute path to each of the files. For example, the following steps would be performed in the first iteration of the loop.
- In the first iteration, we have the value CA.
- We would use paste0 to create a string containing the absolute path of the corresponding dataset: /depot/datamine/data/youtube/CAvideos.csv.
- Then, we would use that string as an argument to the read.csv function to read the data into a data.frame.
- Then, we would add the new column country_code to the data.frame, with the value CA repeated for each row.
- Finally, we would use the rbind function to combine the new data.frame with the previous data.frame.
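The steps above can be sketched on toy files as follows. This is only an illustration under stated assumptions: the country codes AA and BB, the columns title and views, and the temporary directory are all hypothetical stand-ins, so the sketch runs without access to /depot.

```r
# Hypothetical setup: write two tiny CSV files to a temporary directory
tmp_dir <- tempdir()
toy_codes <- c("AA", "BB")
for (code in toy_codes) {
  toy <- data.frame(title = paste0(code, "_video_", 1:2), views = c(10, 20))
  write.csv(toy, file.path(tmp_dir, paste0(code, "videos.csv")), row.names = FALSE)
}

# Start with nothing, then append one country's data per iteration
combined <- NULL
for (code in toy_codes) {
  # Build the path from the directory, the country code, and the shared suffix
  path <- paste0(tmp_dir, "/", code, "videos.csv")
  one_country <- read.csv(path)
  # The single value in code is recycled to fill every row
  one_country$country_code <- code
  # rbind ignores the initial NULL and stacks the rows on later iterations
  combined <- rbind(combined, one_country)
}

dim(combined)
colnames(combined)
```

Growing a data.frame with rbind inside a loop is easy to read but can be slow for many large files, because each iteration copies the accumulated rows.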
In the end, you will have a single data.frame called yt that contains the data for every country in the dataset. yt will also have a column called country_code that contains the country code for each row, so we know where the data originated.
When combining data, it is important that we don’t lose any data in the process. If we slapped together all of the data from each of the datasets into a single file named |
In order to prevent this loss of data, create a new column called country_code that includes this information in the dataset rather than in the filename.
Print a list of the columns in yt and, in addition, print the dimensions of yt. Finally, create the trending_year and publish_year columns for yt.
# Dr Ward summarizes how to perform Question 2 in the video.
# Here is the analogous code for this question.
# We know that all of this is new for you.
# That is why we are guiding you through this question!
getdataframe <- function(mycountry) {
  myDF <- read.csv(paste0("/depot/datamine/data/youtube/", mycountry, "videos.csv"))
  myDF$country_code <- mycountry
  return(myDF)
}
countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')
myresults <- lapply(countries, getdataframe)
yt <- do.call(rbind, myresults)
Relevant topics: read.csv, paste0, rbind, dim, colnames
- Code used to solve this problem.
- Output from running the code.
Question 3
From this point on, unless specified, use the |
Which YouTube video took the longest time to trend from the time it was published? How many years did it take to trend?
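A minimal sketch of how which.max and indexing fit together, using a made-up data.frame. The name toy_videos and its values are hypothetical; only the column names trending_year and publish_year mirror the ones created earlier.

```r
# Hypothetical toy data with the two year columns
toy_videos <- data.frame(
  title = c("a", "b", "c"),
  publish_year = c(2015, 2017, 2010),
  trending_year = c(2017, 2018, 2018),
  stringsAsFactors = FALSE
)

# Vectorized subtraction gives the gap for every row at once
gap_years <- toy_videos$trending_year - toy_videos$publish_year

# which.max returns the position of the (first) largest value
longest <- which.max(gap_years)

# Use that position to index into other columns
toy_videos$title[longest]
gap_years[longest]
```

Keep in mind that which.max returns only the first position when several rows tie for the maximum.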
Relevant topics: which.max, indexing
- Code used to solve this problem.
- Output from running the code.
- Name of the YouTube video, and how long it took to trend.
- (Optional) Did you watch the video prior to the project? If so, what do you think about it?
Question 4
We are interested in seeing whether or not there is a difference in views between videos with ratings enabled vs. those with ratings disabled.
Calculate the average number of views for videos with ratings enabled and those with ratings disabled. Anecdotally, does it look like disabling the ratings helps or hurts the views?
You can use |
You may need to take a careful look at the |
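As a hedged illustration of computing a group-wise average with tapply, here is a sketch on made-up data. The data.frame toy_ratings and its column ratings_off are hypothetical names chosen for this example; the real column in yt may be named and encoded differently, so inspect its unique values first.

```r
# Hypothetical toy data: views plus a flag for whether ratings are turned off
toy_ratings <- data.frame(
  views = c(100, 300, 50, 150),
  ratings_off = c(FALSE, FALSE, TRUE, TRUE)
)

# tapply splits the first argument by the groups in the second,
# then applies the function (here, mean) to each group
avg_views <- tapply(toy_ratings$views, toy_ratings$ratings_off, mean)

avg_views

# The same flag column can be checked for unexpected values before grouping
unique(toy_ratings$ratings_off)
```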
Relevant topics: mean, tapply, indexing
- Code used to solve this problem.
- Output from running the code.
Question 5
Create two new columns in yt:
- balance: the difference between likes and dislikes for a given video.
- positive_balance: an indicator variable that is TRUE if balance is greater than zero, and FALSE otherwise.
How many videos have a positive balance?
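A minimal sketch of the idea on made-up numbers, assuming a hypothetical data.frame called toy_likes. It shows that a comparison on a whole column yields a logical vector, and that sum counts the TRUE values because TRUE is treated as 1 and FALSE as 0.

```r
# Hypothetical toy data
toy_likes <- data.frame(
  likes = c(100, 5, 30, 40),
  dislikes = c(10, 50, 30, 1)
)

# Vectorized subtraction: one balance per row
toy_likes$balance <- toy_likes$likes - toy_likes$dislikes

# Vectorized comparison: TRUE where the balance is strictly positive
toy_likes$positive_balance <- toy_likes$balance > 0

# sum() on a logical vector counts the TRUE values
n_positive <- sum(toy_likes$positive_balance)
n_positive
```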
Relevant topics: sum
- Code used to solve this problem.
- Output from running the code.
Question 6
Compare videos with a positive balance (positive_balance is TRUE) to those with a non-positive balance (positive_balance is FALSE). Make this comparison based on the comment_count and the views of the videos.
To make a comparison, pick a statistic to summarize and compare comment_count and views. Examples of statistics include: mean, median, max, min, var, and sd.
You can pick more than one statistic to compare, if you want, and each column may have its own statistic(s) to summarize it.
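One possible way to summarize each column with its own statistic is sketched below on made-up values. The data.frame toy_compare is hypothetical, and the choice of median for comment_count and mean for views is only an example, not a recommendation.

```r
# Hypothetical toy data with the grouping flag and the two columns to compare
toy_compare <- data.frame(
  positive_balance = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  comment_count = c(10, 30, 5, 7, 20),
  views = c(1000, 3000, 400, 600, 2000)
)

# A different statistic can be used for each column
median_comments <- tapply(toy_compare$comment_count, toy_compare$positive_balance, median)
mean_views <- tapply(toy_compare$views, toy_compare$positive_balance, mean)

median_comments
mean_views
```

The median is less sensitive to a handful of extreme values than the mean, which may matter for heavily skewed columns.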
Relevant topics: tapply, mean, sum, var, sd, max, min, median
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences explaining what statistic you chose to summarize each column, and why.
- 1-2 sentences comparing videos with positive balance and non-positive balance based on comment_count and views. Is the result surprising to you?
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.