Ch. 16: Dates and times
Key questions:
- 16.2.4. #3
- 16.3.4. #1, 4, 5
- 16.4.5. #4
- today() gets the current date
- now() gets the current date-time
- ymd_hms() is one example of a straightforward set of functions that take either strings or unquoted numbers and output dates or date-times
- make_datetime() creates a date-time from individual components, e.g. make_datetime(year, month, day, hour, minute)
- as_datetime() and as_date() let you switch between date-times and dates, e.g. as_datetime(today()) or as_date(now())
- Accessor functions let you pull out components from an existing date-time: year(), month(), mday(), yday(), wday(), hour(), minute(), second()
  - month() and wday() take label = TRUE to return the abbreviated name rather than the number; add abbr = FALSE to return the full name
- You can also use these to set particular components, e.g. year(datetime) <- 2020
  - update() lets you set multiple values at once, e.g. update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
  - When values are too big they roll over, e.g. update(ymd("2015-02-01"), mday = 30) becomes "2015-03-02"
- Rounding functions round to a nearby unit of time: floor_date(), round_date(), ceiling_date()
- as.duration() converts a difftime to a duration
- Durations (can add and multiply): dseconds(), dhours(), ddays(), dweeks(), dyears()
- Periods (can add and multiply), more likely to do what you expect than durations: seconds(), minutes(), hours(), days(), weeks(), months()
- An interval is a duration with a starting point, making it precise and possible to determine its EXACT length
  - e.g. (today() %--% next_year) / ddays(1) to find the exact duration
- Sys.timezone() shows what R thinks your current time zone is
- The tz = argument in ymd_hms() sets the time zone the input is read in (lubridate assumes UTC unless told otherwise); once a date-time exists, its time zone only affects printing, not the underlying instant
- with_tz() displays an existing date-time in another time zone without changing the instant
- force_tz() is for when an object has been labeled with the wrong time zone and you need to fix it (this changes the underlying instant)
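The difference between with_tz() and force_tz() is easiest to see on a fixed instant. The following is a minimal sketch, assuming lubridate is installed; the date and the "America/New_York" zone are chosen only for illustration.

```r
library(lubridate)

# A fixed instant, parsed as UTC (lubridate's default when tz is not given)
x <- ymd_hms("2015-06-01 12:00:00")

# with_tz(): same instant, shown on a different clock
x_ny <- with_tz(x, tzone = "America/New_York")
x_ny          # 08:00 on New York's clock
x_ny - x      # time difference of 0

# force_tz(): same clock reading, different instant
x_forced <- force_tz(x, tzone = "America/New_York")
x_forced      # 12:00 on New York's clock
difftime(x_forced, x, units = "hours")  # 4 hours
```

Because with_tz() leaves the instant alone, arithmetic and comparisons are unaffected; force_tz() is the one to reach for only when the original label was wrong.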
16.2: Creating date/times
Note that date-times count seconds, so a value of 1 means 1 second in numeric contexts; the example below therefore sets binwidth = 86400 to specify 1 day
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
  filter(!is.na(dep_time), !is.na(arr_time)) %>%
  mutate_at(c("dep_time", "arr_time", "sched_dep_time", "sched_arr_time"),
            ~make_datetime_100(year, month, day, .)) %>%
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt %>%
  ggplot(aes(dep_time)) +
  geom_freqpoly(binwidth = 86400)
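To see what make_datetime_100() does with an HHMM-style integer, and why 86400 corresponds to one day, here is a small worked sketch. The value 517 and the date are made up for illustration; the helper is copied from above so the block runs on its own.

```r
library(lubridate)

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

time <- 517
time %/% 100   # 5  (integer division gives the hour)
time %% 100    # 17 (remainder gives the minute)

dt <- make_datetime_100(2013, 1, 1, time)
dt             # 2013-01-01 05:17:00 UTC

# Date-times count seconds, so one day is 60 * 60 * 24 = 86400
as.numeric(dt + days(1)) - as.numeric(dt)
```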
16.2.4
What happens if you parse a string that contains invalid dates?
ymd(c("2010-10-10", "bananas"))
## Warning: 1 failed to parse.
## [1] "2010-10-10" NA
- Returns NA for each string that cannot be parsed and warns with the number that failed
What does the tzone argument to today() do? Why is it important?
- Lets you specify the time zone; the current date can differ depending on location
today(tzone = "MST")
## [1] "2019-06-05"
now(tzone = "MST")
## [1] "2019-06-05 16:27:06 MST"
Use the appropriate lubridate function to parse each of the following dates:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
mdy(d1)
## [1] "2010-01-01"
ymd(d2)
## [1] "2015-03-07"
dmy(d3)
## [1] "2017-06-06"
mdy(d4)
## [1] "2015-08-19" "2015-07-01"
mdy(d5)
## [1] "2014-12-30"
16.3: Date-time components
This allows you to plot the number of flights per week
flights_dt %>%
count(week = floor_date(dep_time, "week")) %>%
ggplot(aes(week, n)) +
geom_line()
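The weekly count above works because floor_date() maps every departure to the start of its week. A minimal sketch of the three rounding functions on a single date-time follows; the date is arbitrary, and the week boundaries assume lubridate's default of weeks starting on Sunday.

```r
library(lubridate)

dt <- ymd_hms("2013-07-10 14:45:00")  # a Wednesday
wday(dt)                              # 4 (Sunday = 1 by default)

floor_date(dt, "week")    # 2013-07-07, the preceding Sunday
ceiling_date(dt, "week")  # 2013-07-14, the following Sunday
round_date(dt, "week")    # whichever week boundary is closer
floor_date(dt, "hour")    # 2013-07-10 14:00:00
```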
16.3.4
How does the distribution of flight times within a day change over the course of the year?
Median flight time by day
flights_dt %>%
  transmute(quarter_dep = quarter(dep_time) %>% factor(),
            day_dep = as_date(dep_time),
            dep_time = as_hms(dep_time)) %>%
  group_by(quarter_dep, day_dep) %>%
  summarise(day_median = median(dep_time)) %>%
  ungroup() %>%
  ggplot(aes(x = day_dep, y = day_median)) +
  geom_line(aes(colour = quarter_dep, group = 1)) +
  labs(title = "Median flight times by day, coloured by quarter",
       subtitle = "Typical flight times change with daylight savings times") +
  geom_vline(xintercept = ymd("20130310"), linetype = 2) +
  geom_vline(xintercept = ymd("20131103"), linetype = 2)
- The first couple and last couple of months of the year tend to have slightly earlier median departure times
Quantiles of flight times by month
flights_dt %>%
  transmute(month_dep = month(dep_time, label = TRUE),
            quarter_dep = quarter(dep_time) %>% factor(),
            wk_dep = week(dep_time),
            dep_time = as_hms(dep_time)) %>%
  group_by(month_dep, wk_dep) %>%
  ungroup() %>%
  ggplot(aes(x = month_dep, y = dep_time, group = month_dep)) +
  geom_boxplot()
Reinforces the prior plot: the first couple and last couple of months of the year tend to have a slightly higher proportion of flights earlier in the day
The last week of the year has a lower proportion of late flights and a higher proportion of morning flights
See 16.3.4.1 for a few other plots I looked at.
Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.
flights_dt %>%
  mutate(dep_delay_check = (dep_time - sched_dep_time) / dminutes(1),
         same = dep_delay == dep_delay_check,
         difference = dep_delay_check - dep_delay) %>%
  filter(abs(difference) > 0)
## # A tibble: 1,205 x 12
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>
##  1 JFK    BWI         853       851 2013-01-01 08:48:00 2013-01-01 18:35:00
##  2 JFK    SJU          43        36 2013-01-02 00:42:00 2013-01-02 23:59:00
##  3 JFK    SYR         156       154 2013-01-02 01:26:00 2013-01-02 22:50:00
##  4 JFK    SJU          33        22 2013-01-03 00:32:00 2013-01-03 23:59:00
##  5 JFK    BUF         185       172 2013-01-03 00:50:00 2013-01-03 21:45:00
##  6 JFK    BQN         156       143 2013-01-03 02:35:00 2013-01-03 23:59:00
##  7 JFK    SJU          26        23 2013-01-04 00:25:00 2013-01-04 23:59:00
##  8 JFK    PWM         141       125 2013-01-04 01:06:00 2013-01-04 22:45:00
##  9 JFK    PSE          15        18 2013-01-05 00:14:00 2013-01-05 23:59:00
## 10 JFK    FLL         127       130 2013-01-05 00:37:00 2013-01-05 22:30:00
## # ... with 1,195 more rows, and 6 more variables: arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>, dep_delay_check <dbl>,
## #   same <lgl>, difference <dbl>
- They are consistent except when a delayed departure rolls past midnight: the date on dep_time is not pushed forward to the next day, so the computed delay is off by 24 hours (1,440 minutes)
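The 24-hour discrepancy can be reproduced by hand with row 2 of the output above (scheduled 23:59, departed 00:42, recorded delay 43). This is a sketch that rebuilds just that one row rather than using flights_dt.

```r
library(lubridate)

# Both date-times carry the same calendar date, as in flights_dt
sched_dep_time <- ymd_hms("2013-01-02 23:59:00")
dep_time       <- ymd_hms("2013-01-02 00:42:00")
dep_delay      <- 43

dep_delay_check <- (dep_time - sched_dep_time) / dminutes(1)
dep_delay_check              # -1397
dep_delay_check - dep_delay  # -1440, exactly one day in minutes

# Pushing the departure forward one day reconciles the two
fixed <- ((dep_time + days(1)) - sched_dep_time) / dminutes(1)
fixed                        # 43
```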
Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)
flights_dt %>%
  mutate(air_time_check = (arr_time - dep_time) / dminutes(1)) %>%
  select(air_time_check, air_time, dep_time, arr_time, everything())
## # A tibble: 328,063 x 10
##    air_time_check air_time dep_time            arr_time            origin
##             <dbl>    <dbl> <dttm>              <dttm>              <chr>
##  1            193      227 2013-01-01 05:17:00 2013-01-01 08:30:00 EWR
##  2            197      227 2013-01-01 05:33:00 2013-01-01 08:50:00 LGA
##  3            221      160 2013-01-01 05:42:00 2013-01-01 09:23:00 JFK
##  4            260      183 2013-01-01 05:44:00 2013-01-01 10:04:00 JFK
##  5            138      116 2013-01-01 05:54:00 2013-01-01 08:12:00 LGA
##  6            106      150 2013-01-01 05:54:00 2013-01-01 07:40:00 EWR
##  7            198      158 2013-01-01 05:55:00 2013-01-01 09:13:00 EWR
##  8             72       53 2013-01-01 05:57:00 2013-01-01 07:09:00 LGA
##  9            161      140 2013-01-01 05:57:00 2013-01-01 08:38:00 JFK
## 10            115      138 2013-01-01 05:58:00 2013-01-01 07:53:00 LGA
## # ... with 328,053 more rows, and 5 more variables: dest <chr>,
## #   dep_delay <dbl>, arr_delay <dbl>, sched_dep_time <dttm>,
## #   sched_arr_time <dttm>
- The initial check is off because arrival times are recorded in the destination's local time, so join the airports table to get each destination's offset from UTC and adjust by its difference from NYC
flights_dt %>%
  left_join(select(nycflights13::airports, dest = faa, tz), by = "dest") %>%
  mutate(arr_time_new = arr_time - dhours(tz + 5)) %>%
  mutate(air_time_tz = (arr_time_new - dep_time) / dminutes(1),
         diff_Airtime = air_time_tz - air_time) %>%
  select(origin, dest, tz, contains("time"), -(contains("sched")))
## # A tibble: 328,063 x 9
##    origin dest     tz dep_time            arr_time            air_time
##    <chr>  <chr> <dbl> <dttm>              <dttm>                 <dbl>
##  1 EWR    IAH      -6 2013-01-01 05:17:00 2013-01-01 08:30:00      227
##  2 LGA    IAH      -6 2013-01-01 05:33:00 2013-01-01 08:50:00      227
##  3 JFK    MIA      -5 2013-01-01 05:42:00 2013-01-01 09:23:00      160
##  4 JFK    BQN      NA 2013-01-01 05:44:00 2013-01-01 10:04:00      183
##  5 LGA    ATL      -5 2013-01-01 05:54:00 2013-01-01 08:12:00      116
##  6 EWR    ORD      -6 2013-01-01 05:54:00 2013-01-01 07:40:00      150
##  7 EWR    FLL      -5 2013-01-01 05:55:00 2013-01-01 09:13:00      158
##  8 LGA    IAD      -5 2013-01-01 05:57:00 2013-01-01 07:09:00       53
##  9 JFK    MCO      -5 2013-01-01 05:57:00 2013-01-01 08:38:00      140
## 10 LGA    ORD      -6 2013-01-01 05:58:00 2013-01-01 07:53:00      138
## # ... with 328,053 more rows, and 3 more variables: arr_time_new <dttm>,
## #   air_time_tz <dbl>, diff_Airtime <dbl>
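The adjustment arr_time - dhours(tz + 5) can be checked by hand on row 1 of the output above (EWR to IAH, tz = -6, air_time = 227). This sketch rebuilds that single row; the names tz_dest and tz_nyc are introduced here for illustration, and New York's offset is assumed to be a fixed -5 (daylight saving is ignored, as in the code above).

```r
library(lubridate)

dep_time <- ymd_hms("2013-01-01 05:17:00")  # New York clock time
arr_time <- ymd_hms("2013-01-01 08:30:00")  # Houston clock time
tz_dest  <- -6   # destination offset from UTC, in hours
tz_nyc   <- -5   # New York offset, assumed constant for this sketch

# Express the arrival on New York's clock
arr_time_nyc <- arr_time - dhours(tz_dest - tz_nyc)
arr_time_nyc       # 2013-01-01 09:30:00

air_time_tz <- (arr_time_nyc - dep_time) / dminutes(1)
air_time_tz        # 253
air_time_tz - 227  # 26 minutes still unexplained by time zones
```

Even after the time-zone correction the gate-to-gate duration is longer than air_time, so the two columns are not measuring exactly the same thing.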
How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
flights_dt %>%
  mutate(sched_dep_time = as_hms(floor_date(sched_dep_time, "30 mins"))) %>%
  group_by(sched_dep_time) %>%
  summarise(delay_mean = mean(arr_delay, na.rm = TRUE),
            n = n(),
            n_na = sum(is.na(arr_delay)) / n,
            delay_median = median(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = sched_dep_time, y = delay_mean, size = n)) +
  geom_point()
- The average delay goes up throughout the day
- Use sched_dep_time because it has the correct day
On what day of the week should you leave if you want to minimise the chance of a delay?
flights_dt %>%
  mutate(weekday = wday(sched_dep_time, label = TRUE)) %>%
  group_by(weekday) %>%
  summarise(prop_delay = sum(dep_delay > 0) / n())
## # A tibble: 7 x 2
##   weekday prop_delay
##   <ord>        <dbl>
## 1 Sun          0.383
## 2 Mon          0.401
## 3 Tue          0.364
## 4 Wed          0.372
## 5 Thu          0.431
## 6 Fri          0.425
## 7 Sat          0.348
- Saturday has the lowest proportion of delayed flights (0.348), with Tuesday next (0.364); Thursday has the worst (0.431)
What makes the distribution of diamonds$carat and flights$sched_dep_time similar?
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 500) +
  labs(title = "Distribution of carat in diamonds dataset")
ggplot(flights, aes(x = as.hms(sched_dep_time))) +
  geom_histogram(bins = 24 * 6) +
  labs(title = "Distribution of scheduled departure times in flights dataset")
- Both have gaps and peaks at ‘attractive’ values
Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.
mutate(flights_dt,
       mins_dep = minute(dep_time),
       mins_sched = minute(sched_dep_time),
       delayed = dep_delay > 0) %>%
  group_by(mins_dep) %>%
  summarise(prop_delayed = sum(delayed) / n()) %>%
  ggplot(aes(x = mins_dep, y = prop_delayed)) +
  geom_line()
- Consistent with above hypothesis
16.4: Time spans
- durations, which represent an exact number of seconds.
- periods, which represent human units like weeks and months.
- intervals, which represent a starting and ending point.
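The practical difference between the three shows up around a daylight-saving change. A minimal sketch, assuming lubridate is installed; it uses the 2016 US spring-forward date as the illustration.

```r
library(lubridate)

# US clocks sprang forward on 2016-03-13
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

plus_duration <- one_pm + ddays(1)  # exactly 86400 seconds later: 14:00 on the 13th
plus_period   <- one_pm + days(1)   # same clock time on the next day: 13:00

# An interval has real start and end points, so its length is exact
hours_elapsed <- (one_pm %--% plus_period) / dhours(1)
hours_elapsed  # 23
```

Adding a duration moves the clock reading by an hour because the day was only 23 hours long; adding a period keeps the clock reading, which is usually what people mean by "tomorrow at the same time".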
Periods example: using periods to fix the oddity that arises when a flight arrives overnight
flights_dt <- flights_dt %>%
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )
Intervals example to get precise number of days dependent on specific time
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 366
To find out how many periods fall in an interval, need to use integer division
(today() %--% next_year) %/% days(1)
## Note: method with signature 'Timespan#Timespan' chosen for function '%/%',
## target signature 'Interval#Period'.
## "Interval#ANY", "ANY#Period" would also be valid
## [1] 366
16.4.5
Why is there months() but no dmonths()?
- The length of a month varies from month to month, so it has no fixed number of seconds
Explain days(overnight * 1) to someone who has just started learning R. How does it work?
- overnight is a logical (TRUE/FALSE) value, and multiplying it by 1 turns it into 1 or 0. In the example above, if overnight is TRUE this adds a period of one day, returning the same clock time one day ahead; if FALSE it adds 0 days and changes nothing
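The logical-to-number coercion behind days(overnight * 1) can be seen on a two-element vector. This is a small sketch with made-up arrival times, not rows from flights_dt.

```r
library(lubridate)

overnight <- c(TRUE, FALSE)
overnight * 1        # 1 0: TRUE becomes 1, FALSE becomes 0
days(overnight * 1)  # a period of one day and a period of zero days

arr_time <- ymd_hms(c("2013-01-01 00:30:00", "2013-01-01 10:00:00"))
shifted  <- arr_time + days(overnight * 1)
shifted  # first element moves to 2013-01-02 00:30:00, second is unchanged
```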
- Create a vector of dates giving the first day of every month in 2015.
x <- ymd("2015-01-01")
mons <- c(0:11)
(x + months(mons)) %>%
  wday(label = TRUE)
## [1] Thu Sun Sun Wed Fri Mon Wed Sat Tue Thu Sun Tue
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
- Create a vector of dates giving the first day of every month in the current year.
x <- today() %>%
  update(month = 1, mday = 1)
mons <- c(0:11)
(x + months(mons)) %>%
  wday(label = TRUE)
## [1] Tue Fri Fri Mon Wed Sat Mon Thu Sun Tue Fri Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Write a function that given your birthday (as a date), returns how old you are in years.
birthday_age <- function(birthday) {
  (ymd(birthday) %--% today()) %/% years(1)
}
birthday_age("1989-09-07")
## [1] 29
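Because birthday_age() depends on today(), its result changes over time. One way to make it reproducible is to pass the reference date explicitly. The variant below is a sketch: birthday_age_at and its as_of argument are hypothetical names added here, not part of the original solution.

```r
library(lubridate)

# as_of is a hypothetical extra argument; it defaults to the current date
birthday_age_at <- function(birthday, as_of = today()) {
  (ymd(birthday) %--% as_date(as_of)) %/% years(1)
}

age_2019 <- birthday_age_at("1989-09-07", as_of = "2019-06-05")
age_2020 <- birthday_age_at("1989-09-07", as_of = "2020-01-01")
age_2019  # 29, matching the output above
age_2020  # 30
```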
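Why can’t (today() %--% (today() + years(1)) / months(1) work?
- As written, the expression has an unmatched opening parenthesis, so R cannot parse it. Beyond that, a month is not a fixed length, so dividing an interval by months(1) is not an exact operation; use integer division (%/%) to count whole months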
Appendix
16.3.4.1
Weekly flight proportions by 4 hour blocks
flights_dt %>%
  transmute(month_dep = month(dep_time, label = TRUE),
            wk_dep = week(dep_time),
            dep_time_4hrs = floor_date(dep_time, "4 hours"),
            hour_dep_4hrs = hour(dep_time_4hrs) %>% factor) %>%
  count(wk_dep, hour_dep_4hrs) %>%
  group_by(wk_dep) %>%
  mutate(wk_tot = sum(n),
         wk_prop = round(n / wk_tot, 3)) %>%
  ungroup() %>%
  ggplot(aes(x = wk_dep, y = wk_prop)) +
  geom_col(aes(fill = hour_dep_4hrs))
Weekly median flight time
flights_dt %>%
  transmute(quarter_dep = quarter(dep_time) %>% factor(),
            day_dep = as_date(dep_time),
            wk_dep = floor_date(dep_time, "1 week") %>% as_date,
            dep_time = as_hms(dep_time)) %>%
  group_by(quarter_dep, wk_dep) %>%
  summarise(wk_median = median(dep_time)) %>%
  ungroup() %>%
  mutate(wk_median = as_hms(wk_median)) %>%
  ggplot(aes(x = wk_dep, y = wk_median)) +
  geom_line(aes(colour = quarter_dep, group = 1))
Proportion of flights in each hour, by quarter
flights_dt %>%
  transmute(quarter_dep = quarter(dep_time) %>% factor(),
            hour_dep = hour(dep_time)) %>%
  count(quarter_dep, hour_dep) %>%
  group_by(quarter_dep) %>%
  mutate(quarter_tot = sum(n),
         quarter_prop = round(n / quarter_tot, 3)) %>%
  ungroup() %>%
  ggplot(aes(x = hour_dep, y = quarter_prop)) +
  geom_line(aes(colour = quarter_dep))
- Q1 seems to be a little more extreme at the local maxima
Look at proportion of flights by hour faceted by each month
flights_dt %>%
  transmute(month_dep = month(dep_time, label = TRUE),
            hour_dep = hour(dep_time)) %>%
  count(month_dep, hour_dep) %>%
  group_by(month_dep) %>%
  mutate(month_tot = sum(n),
         month_prop = round(n / month_tot, 3)) %>%
  ungroup() %>%
  ggplot(aes(x = hour_dep, y = month_prop)) +
  geom_line() +
  facet_wrap(~ month_dep)
16.3.4.3
- Perhaps these are flights that landed at a different location…
flights_dt %>%
  mutate(arr_delay_test = (arr_time - sched_arr_time) / dminutes(1)) %>%
  select(origin, dest, dep_delay, arr_delay, arr_delay_test, contains("time")) %>%
  filter(is.na(arr_delay))
## # A tibble: 717 x 10
## origin dest dep_delay arr_delay arr_delay_test dep_time
## <chr> <chr> <dbl> <dbl> <dbl> <dttm>
## 1 LGA XNA -5 NA 89 2013-01-01 15:25:00
## 2 EWR STL 29 NA 195 2013-01-01 15:28:00
## 3 LGA XNA -5 NA 98 2013-01-01 17:40:00
## 4 EWR SAN 29 NA 108 2013-01-01 18:07:00
## 5 JFK DFW 59 NA -1282 2013-01-01 19:39:00
## 6 EWR TUL 22 NA 111 2013-01-01 19:52:00
## 7 EWR XNA 43 NA 148 2013-01-02 09:05:00
## 8 LGA GRR 120 NA 179 2013-01-02 11:25:00
## 9 JFK DFW 8 NA 102 2013-01-02 18:48:00
## 10 EWR MCI 85 NA 177 2013-01-02 18:49:00
## # ... with 707 more rows, and 4 more variables: sched_dep_time <dttm>,
## # arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
16.3.4.4
Below, I started looking at proportions…
mutate(flights_dt,
       dep_old = dep_time,
       sched_old = sched_dep_time,
       dep_time = floor_date(dep_time, "5 minutes"),
       sched_dep_time = floor_date(sched_dep_time, "5 minutes"),
       mins_dep = minute(dep_time),
       mins_sched = minute(sched_dep_time),
       delayed = dep_delay > 0) %>%
  group_by(mins_dep, mins_sched) %>%
  summarise(num_delayed = sum(delayed),
            num = n(),
            prop_delayed = num_delayed / num) %>%
  group_by(mins_dep) %>%
  mutate(num_tot = sum(num),
         prop_sched = num / num_tot,
         sched_dep_diff = mins_dep - mins_sched) %>%
  ungroup() %>%
  ggplot(aes(x = mins_dep, y = prop_sched, fill = factor(mins_sched))) +
  geom_col() +
  labs(title = "Proportion of early flights by minute scheduled v. minute departed")
mutate(flights_dt,
       dep_old = dep_time,
       sched_old = sched_dep_time,
       # dep_time = floor_date(dep_time, "5 minutes"),
       # sched_dep_time = floor_date(sched_dep_time, "5 minutes"),
       mins_dep = minute(dep_time),
       mins_sched = minute(sched_dep_time),
       early_less10 = dep_delay >= -10) %>%
  filter(dep_delay < 0) %>%
  group_by(mins_dep) %>%
  summarise(num = n(),
            sum_recent10 = sum(early_less10),
            prop_recent10 = sum_recent10 / num) %>%
  ungroup() %>%
  ggplot(aes(x = mins_dep, y = prop_recent10)) +
  geom_line() +
  labs(title = "proportion of early flights that were scheduled to leave within 10 mins of when they did")