Ch. 11: Data import
Key questions:
- 11.2.2 # 1, 5
read_csv()
reads comma delimited files,read_csv2()
reads semicolon separated files (common in countries where,
is used as the decimal place),read_tsv()
reads tab delimited files, andread_delim()
reads in files with any delimiter.read_fwf()
reads fixed width files. You can specify fields either by their widths withfwf_widths()
or their position withfwf_positions()
.read_table()
reads a common variation of fixed width files where columns are separated by white space.data.table::fread
, good for raw speedread_log()
reads Apache style log files. (But also check out webreadr which is built on top ofread_log()
and provides many more helpful tools.)
read_log(readr_example("example.log"))
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character(),
## X4 = col_character(),
## X5 = col_character(),
## X6 = col_integer(),
## X7 = col_integer()
## )
## # A tibble: 2 x 7
## X1 X2 X3 X4 X5 X6 X7
## <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 172.21.~ <NA> "Microsoft~ 08/Apr/2001~ GET /scripts/iisadmi~ 200 3401
## 2 127.0.0~ <NA> frank 10/Oct/2000~ GET /apache_pb.gif H~ 200 2326
# readr_example finds the correct path associated with the package by doing the following:
## system.file("extdata", path, package = "readr", mustWork = TRUE)
parse_*()
: take character vector and return more specialized vectorparse_logical
,parse_integer
,parse_double
,parse_number
24,parse_character
,parse_factor
(haslevels
as an argument),parse_datetime
,parse_date
,parse_time
(these last three have an arg offormat`)
locale
argument for use in parse functions to affect formatting and to pass into argumentlocale = locale(<arg> = "<value>"))
- for double, e.g.
locale = locale(decimal_mark = ",")
- for number, e.g.
locale = locale(grouping_mark = ".")
- for character, e.g.
locale = locale(encoding = "Latin1")
- for dates, e.g.
locale = locale(lang = "fr")
(to see built-in language options usedate_name_langs
and can create own withdate_names
)
- for double, e.g.
problems
: returns problems on importcharToRaw
will show underlying representation of a character25guess_encoding
: can guess encoding – generally would use this withcharToRaw
and helps avoid figuring out encoding by hand
x1 <- "El Ni\xf1o was particularly bad this year"
guess_encoding(charToRaw(x1))
## # A tibble: 2 x 2
## encoding confidence
## <chr> <dbl>
## 1 ISO-8859-1 0.46
## 2 ISO-8859-9 0.23
- If defaults don’t work (primarily for dates, times, numbers) can use following to specify parsing:
Year
:
%Y
(4 digits). :%y
(2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
- Month
%m
(2 digits).%b
(abbreviated name, like “Jan”).%B
(full name, “January”).- Day
%d
(2 digits).%e
(optional leading space).- Time
%H
0-23 hour.%I
0-12, must be used with%p
.%p
AM/PM indicator.%M
minutes.%S
integer seconds.%OS
real seconds.%Z
Time zone (as name, e.g.America/Chicago
). Beware of abbreviations: if you’re American, note that “EST” is a Canadian time zone that does not have daylight savings time. It is not Eastern Standard Time! We’ll come back to this [time zones].%z
(as offset from UTC, e.g.+0800
).- Non-digits
%.
skips one non-digit character.%*
skips any number of non-digits.
guess_parser
: returns what readr would think the character vector you provide it should be parsed intoparse_guess
: uses readr’s guess of the vector type to parse the columncol_*
: counterpoint toparse_*
functions except for use when data is in a file rather than a string already loaded in R (as needed forparse_*
)cols
: use this to pass in thecol_*
types,col_types = cols( x = col_double(), y = col_date() )
- to read in all columns as character use
col_types = cols(.default = col_character())
- can set
n_max
to smallish number if reading in large file and still debugging parsing issues
- Recommend always input
cols
, if you want to be strict when loading in data setstop_for_problems
read_lines
read into character vector of lines (use when having major issues)read_file
read in as character vector of length 1 (use when having major issues)read_rds
reads in R’s custom binary format26feather::read_feather
: fast binary file format shared across languages27- writing files^[readr functions will encodes strings in UTF-8 and saves dates and date-times in ISO8601):
write_csv
,write_tsv
,write_excel_csv
,write_rds
28, *feather::read_feather
- other packages for reading-in / writing data:
haven
,readxl
,DBI
,odbc
,jsonlite
,xml2
,rio
11.2: Getting started
11.2.2.
1. What function would you use to read a file where fields were separated with “|”?
read_delim
for example:
read_delim("a|b|c\n1|2|3", delim = "|")
## # A tibble: 1 x 3
## a b c
## <int> <int> <int>
## 1 1 2 3
2. Apart from file
, skip
, and comment
, what other arguments do read_csv()
and read_tsv()
have in common?29
col_names
, col_types
, locale
, na
, quoted_na
, quote
, trim_ws
, skip
, n_max
, guess_max
, progress
3. What are the most important arguments to read_fwf()
?
widths
4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or ’. By convention, read_csv()
assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim()
instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"
## [1] "x,y\n1,'a,b'"
read_delim("x,y\n1,'a,b'", delim = ",", quote = "'")
## # A tibble: 1 x 2
## x y
## <int> <chr>
## 1 1 a,b
5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
- needs 3rd column header, skips 3rd argument on each line, corrected:
read_csv("a,b\n1,2\n3,4\n5,6")
- needs 3rd column header, skips 3rd argument on each line, corrected:
read_csv("a,b,c\n1,2\n1,2,3,4")
- missing 3rd value on 2nd line so currently makes NA, corrected:
read_csv("a,b,c\n1,2, 1\n2,3,4")
- missing 3rd value on 2nd line so currently makes NA, corrected:
read_csv("a,b\n\"1")
- 2nd value missing and 2nd quote mark missing (though quotes are unnecessary), corrected:
read_csv("a,b\n\"1\",\"2\"")
- 2nd value missing and 2nd quote mark missing (though quotes are unnecessary), corrected:
read_csv("a,b\n1,2\na,b")
- Have character and numeric types,
read_csv("a;b\n1;3")
- need to make read_csv2() because is seperated by semicolons, corrected:
read_csv2("a;b\n1;3")
- need to make read_csv2() because is seperated by semicolons, corrected:
11.3: Parsing a vector
11.3.5.
1. What are the most important arguments to locale()
?
- It depends on the
parse_*
type, e.g.- for double, e.g.
locale = locale(decimal_mark = ",")
- for number, e.g.
locale = locale(grouping_mark = ".")
- for character, e.g.
locale = locale(encoding = "Latin1")
- for dates, e.g.
locale = locale(lang = "fr")
- for double, e.g.
- Below are a few examples for double and number
parse_double("1.23")
## [1] 1.23
parse_double("1,23", locale = locale(decimal_mark=","))
## [1] 1.23
parse_number("the cost is $125.34, it's a good deal") #Slightly different than book, captures decimal
## [1] 125.34
parse_number("$123,456,789")
## [1] 123456789
parse_number("$123.456.789")
## [1] 123.456
parse_number("$123.456.789", locale = locale(grouping_mark = "."))#used in europe
## [1] 123456789
parse_number("$123'456'789", locale = locale(grouping_mark = "'"))#used in Switzerland
## [1] 123456789
2. What happens if you try and set decimal_mark
and grouping_mark
to the same character? What happens to the default value of grouping_mark
when you set decimal_mark
to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
- can’t set both to be same–if you change one, other automatically changes
parse_number("$135.435,45", locale = locale(grouping_mark = ".", decimal_mark = ","))
## [1] 135435.4
parse_number("$135.435,45", locale = locale(grouping_mark = "."))
## [1] 135435.4
3. I didn’t discuss the date_format
and time_format
options to locale()
. What do they do? Construct an example that shows when they might be useful.
date_format
andtime_format
inlocale()
let you set the default date and time formats
parse_date("31 january 2015", format = "%d %B %Y")
## [1] "2015-01-31"
parse_date("31 january 2015", locale = locale(date_format = "%d %B %Y"))
## [1] "2015-01-31"
#let's you change it in locale()
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
- I live in the US.
5. What’s the difference between read_csv()
and read_csv2()
?
* Second expects semicolons
6. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
- Europe tends to use “%d-%m-%Y”
- Asia tends to use “%d.%m.%Y”
7. Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
t3 <- "11:::15:10.12 PM"
Solutions:
parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
parse_date(d3, "%d%.%b-%Y") #could use this alternatively
## [1] "2017-06-06"
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
parse_time(t1, "%H%M")
## 17:05:00
parse_time(t2, "%I:%M:%OS %p")
## 23:15:10.12
parse_time(t3, "%I%*%M:%OS %p")
## 23:15:10.12
can be helpful for dealing with parsing currencies or percentages for example…↩
e.g. helpful for use w/
parse_char
↩readRDS
is base R version↩though does not support list-cols↩
base R version is
saveRDS
↩*
skip = n
,comment = #
any line that starts w/ input to comment will be skipped,col_names = FALSE
or perhapsc("x", "y", "z")
,na = "."
↩