Ch. 14: Strings
Key questions:
- 14.2.5. #3, 6
- 14.3.2.1. #2
- 14.3.3.1 #1, #2
- writeLines: see raw contents of a string (prints each string in a vector on a new line)
- str_length: number of characters in a string
- str_c: combine two or more strings
  - use collapse arg to turn a vector of strings into a single string
  - use str_replace_na to print NA as “NA”
- str_sub: start and end args specify the positions to extract (or replace); negative numbers count from the back
- str_to_lower, str_to_upper, str_to_title: for changing string case
  - locale arg (to handle slight differences in characters)
- str_order, str_sort: more robust versions of order and sort which take a locale argument
- str_view, str_view_all: show how a character vector and a regular expression match
- \d: matches any digit.
- \s: matches any whitespace (e.g. space, tab, newline).
- [abc]: matches a, b, or c.
- [^abc]: matches anything except a, b, or c.
- {n}: exactly n
- {n,}: n or more
- {,m}: at most m
- {n,m}: between n and m
- str_detect: returns logical vector of TRUE/FALSE values
- str_subset: subset of the elements that are TRUE under str_detect
- str_count: number of matches in a string
- str_extract: extract actual text of a match
- str_extract_all: returns list with all matches
  - simplify = TRUE returns a matrix
- str_match: similar to str_extract but gives each individual component of the match in a matrix, rather than a character vector (there is also a str_match_all)
- tidyr::extract: like str_match, but you name the matches, which are moved into new columns
- str_replace, str_replace_all: replace matches with new strings
- str_split: split a string into pieces at each match of a pattern (returns list)
  - simplify = TRUE again will return a matrix
- boundary: use to specify level of split, e.g. str_view_all(x, boundary("word"))
- str_locate, str_locate_all: give the starting and ending positions of each match
- regex: use in a match to specify more options, e.g. str_view(bananas, regex("banana", ignore_case = TRUE))
  - multiline = TRUE allows ^ and $ to match start and end of each line (rather than of the whole string)
  - comments = TRUE allows you to add comments to a complex regular expression
  - dotall = TRUE allows . to match everything, including \n
- fixed, coll: related alternatives to regex
- apropos: searches all objects available from the global environment (e.g. when you can’t remember a function name)
- dir: lists all files in a directory
  - pattern arg takes a regex
- stringi: more comprehensive package than stringr (~5x as many funs)
14.2: String basics
Use writeLines to show what the string ‘This string has a \n new line’ looks like printed.
string_exp <- 'This string has a \n new line'
print(string_exp)
## [1] "This string has a \n new line"
writeLines(string_exp)
## This string has a
## new line
To see the full list of special characters:
?'"'
Objects of length 0 are silently dropped. This is particularly useful in conjunction with if:
name <- "Bryan"
time_of_day <- "morning"
birthday <- FALSE
str_c(
"Good ", time_of_day, " ", name,
if (birthday) " and HAPPY BIRTHDAY",
"."
)
## [1] "Good morning Bryan."
Collapse vectors into single string
str_c(c("x", "y", "z"), c("a", "b", "c"), collapse = ", ")
## [1] "xa, yb, zc"
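To make the roles of sep and collapse concrete, here is a minimal sketch with made-up vectors (the names pairs and one_string are only for illustration): sep is placed between the arguments element by element, and collapse then joins the resulting elements.

```r
library(stringr)

x <- c("x", "y", "z")
y <- c("a", "b", "c")

# sep goes between the arguments, element by element: still 3 strings
pairs <- str_c(x, y, sep = "-")
print(pairs)
## [1] "x-a" "y-b" "z-c"

# collapse then joins those 3 strings into one string
one_string <- str_c(x, y, sep = "-", collapse = ", ")
print(one_string)
## [1] "x-a, y-b, z-c"
```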
Can use assignment form of str_sub()
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
## [1] "apple" "banana" "pear"
str_pad looks interesting:
str_pad("the dogs come for you.", width = 40, pad = ",", side = "both") # must specify width; side defaults to "left"
## [1] ",,,,,,,,,the dogs come for you.,,,,,,,,,"
14.2.5
In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions?
- paste() separates its arguments with sep (a space by default); paste0() has no sep argument and always joins with no separator, so any sep value you pass is treated as just another string to append.
- Both differ from str_c() in that they automatically convert NA values to the string "NA".
paste("a", "b", "c", c("x", "y"), sep = "-")
## [1] "a-b-c-x" "a-b-c-y"
paste0("a", "b", "c", c("x", "y"), sep = "-")
## [1] "abcx-" "abcy-"
What stringr function are they equivalent to?
paste() and paste0() are similar to str_c(), though they differ in how they handle NAs (see below). In the stringr version used here, str_c() also returns a warning when recycling vectors whose lengths are not multiples of one another, whereas paste() recycles silently.
paste(c("a", "b", "x"), c("x", "y"), sep = "-")
## [1] "a-x" "b-y" "x-x"
str_c(c("a", "b", "x"), c("x", "y"), sep = "-")
## Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
## longer object length is not a multiple of shorter object length
## [1] "a-x" "b-y" "x-x"
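How do the functions differ in their handling of NA?
paste() converts NA to the string "NA", while str_c() propagates it, so any NA input makes that element of the result NA.
paste(c("a", "b"), c(NA, "y"), sep = "-")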
## [1] "a-NA" "b-y"
str_c(c("a", "b"), c(NA, "y"), sep = "-")
## [1] NA "b-y"
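If the paste()-style behaviour is wanted from str_c(), str_replace_na() can be applied first. A small sketch, reusing the vectors from the example above (the result names are illustrative):

```r
library(stringr)

x <- c("a", "b")
y <- c(NA, "y")

# default: a missing input makes the combined element missing
na_kept <- str_c(x, y, sep = "-")
print(na_kept)
## [1] NA    "b-y"

# str_replace_na() turns NA into the literal string "NA" before combining
na_as_text <- str_c(x, str_replace_na(y), sep = "-")
print(na_as_text)
## [1] "a-NA" "b-y"
```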
In your own words, describe the difference between the sep and collapse arguments to str_c().
sep is inserted between the arguments being combined, element by element; collapse is inserted between the elements of the resulting vector when it is collapsed into a single string.
Use str_length() and str_sub() to extract the middle character from a string.
x <- "world"
str_sub(x, start = ceiling(str_length(x) / 2), end = ceiling(str_length(x) / 2))
## [1] "r"
What will you do if the string has an even number of characters?
In this circumstance the above solution would take the first of the two middle characters. The first call below takes the second of the two instead, and the call after it returns both.
x <- "worlds"
str_sub(x, start = str_length(x) / 2 + 1, end = str_length(x) / 2 + 1)
## [1] "l"
str_sub(x, start = ifelse(str_length(x) %% 2 == 0, floor(str_length(x) / 2), ceiling(str_length(x) / 2 )), end = floor(str_length(x) / 2) + 1)
## [1] "rl"
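The odd and even cases can also be folded into one vectorised helper. This is only a sketch; middle_chars is a hypothetical name, and it relies on ceiling(n / 2) and floor(n / 2) + 1 coinciding for odd n and being adjacent for even n.

```r
library(stringr)

# hypothetical helper: one middle character for odd lengths, two for even lengths
middle_chars <- function(x) {
  n <- str_length(x)
  str_sub(x, start = ceiling(n / 2), end = floor(n / 2) + 1)
}

mids <- middle_chars(c("world", "worlds", "a"))
print(mids)
## [1] "r"  "rl" "a"
```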
What does str_wrap() do? When might you want to use it?
It wraps text into lines of roughly a given width.
- Use indent for first line, exdent for others
- could use str_wrap() for editing of documents etc., setting width = 1 will give each word its own line
str_wrap("Tonight, we dine in Hell.", width = 10, indent = 0, exdent = 3) %>% writeLines()
## Tonight,
##    we dine in
##    Hell.
What does str_trim() do? What’s the opposite of str_trim()?
Removes whitespace from the beginning and end of a string; the side argument specifies which side. The opposite is str_pad(), which adds whitespace.
str_trim(" so much white space ", side = "right") # (default is 'both')
## [1] " so much white space"
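A short round trip shows str_pad() and str_trim() undoing each other; the input string and variable names here are made up for illustration.

```r
library(stringr)

# pad "abc" to width 7 on both sides: two spaces are added to each side
padded <- str_pad("abc", width = 7, side = "both")
print(padded)
## [1] "  abc  "

# str_trim() (side = "both" by default) removes the padding again
trimmed <- str_trim(padded)
print(trimmed)
## [1] "abc"
```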
Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
vec_to_string <- function(x) {
  # a vector of length 0 or 1 is returned unchanged
  if (length(x) < 2) return(x)
  comma <- ifelse(length(x) > 2, ", ", " ")
  b <- str_c(x, collapse = comma)
  # replace the space in front of the last element with " and "
  str_sub(b, -(str_length(x)[length(x)] + 1), -(str_length(x)[length(x)] + 1)) <- " and "
  return(b)
}
x <- c("a", "b", "c", "d")
vec_to_string(x)
## [1] "a, b, c, and d"
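An alternative that handles each length explicitly may be easier to read. This is a sketch only: vec_to_string2 is a hypothetical name, and returning "" for an empty vector is an assumption (the version above returns the empty vector unchanged).

```r
library(stringr)

# hypothetical alternative: branch on the length instead of editing the collapsed string
vec_to_string2 <- function(x) {
  n <- length(x)
  if (n == 0) return("")                               # assumption: empty input gives ""
  if (n == 1) return(x)
  if (n == 2) return(str_c(x, collapse = " and "))
  # 3 or more: comma-separate all but the last, then add ", and <last>"
  str_c(str_c(x[-n], collapse = ", "), ", and ", x[n])
}

print(vec_to_string2(c("a", "b", "c")))
## [1] "a, b, and c"
print(vec_to_string2(c("a", "b")))
## [1] "a and b"
```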
14.3: Matching patterns w/ regex
x <- c("apple", "banana", "pear")
str_view(x, "an")
To match a literal \ you need \\\\ because both the string and the regex will escape it.
x <- "a\\b"
writeLines(x)
## a\b
str_view(x,"\\\\")
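The two layers of escaping can be checked by counting characters. In this sketch (variable names are illustrative), the four-backslash R string holds two characters, which the regex engine then reads as one literal backslash.

```r
library(stringr)

x <- c("a\\b", "ab")   # the first element is the 3 characters a, \, b
pattern <- "\\\\"      # R string of 2 characters: \\ (regex for one literal backslash)

print(nchar(x[1]))
## [1] 3
print(nchar(pattern))
## [1] 2

has_backslash <- str_detect(x, pattern)
print(has_backslash)
## [1]  TRUE FALSE
```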
Using \b to set a boundary between words (not used often):
apropos("\\bsum\\b")
## [1] "contr.sum" "sum"
apropos("^(sum)$")
## [1] "sum"
Other special characters:
- \d: matches any digit.
- \s: matches any whitespace (e.g. space, tab, newline).
- [abc]: matches a, b, or c.
- [^abc]: matches anything except a, b, or c.
Controlling number of times:
- ?: 0 or 1
- +: 1 or more
- *: 0 or more
- {n}: exactly n
- {n,}: n or more
- {,m}: at most m
- {n,m}: between n and m
By default these matches are “greedy”: they will match the longest string possible. You can make them “lazy”, matching the shortest string possible, by putting a ? after them. This is an advanced feature of regular expressions, but it’s useful to know that it exists:
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, 'C{2,3}')
str_view(x, 'C{2,3}?')
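Since str_view() only renders in a viewer, the same greedy/lazy contrast can be seen as plain text with str_extract(). A small sketch using the string from the example above:

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# greedy: takes as many C's as the quantifier allows
greedy <- str_extract(x, "C{2,3}")
print(greedy)
## [1] "CCC"

# lazy: stops as soon as the minimum of two C's is satisfied
lazy <- str_extract(x, "C{2,3}?")
print(lazy)
## [1] "CC"
```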
14.3.1.1
Explain why each of these strings don’t match a \: "\", "\\", "\\\".
- "\" -> the backslash escapes the closing quote, so the string is left open
- "\\" -> the string holds a single \, which the regex reads as an escape with nothing after it
- "\\\" -> the first two characters give a literal \, but the third \ escapes the closing quote, so the string is left open as well
How would you match the sequence "'\?
x <- "alfred\"'\\goes"
writeLines(x)
## alfred"'\goes
str_view(x, "\\\"'\\\\")
What patterns will the regular expression \..\..\.. match?
It would match any 6-character string of the following form: “(dot)(anychar)(dot)(anychar)(dot)(anychar)”
x <- c("alf.r.e.dd.ss..lsdf.d.kj")
str_view(x, pattern = "\\..\\..\\..")
How would you represent it as a string?
x_pattern <- "\\..\\..\\.."
writeLines(x_pattern)
## \..\..\..
14.3.2.1
How would you match the literal string "$^$"?
x <- "so it goes $^$ here"
str_view(x, "\\$\\^\\$")
Given the corpus of common words in stringr::words, create regular expressions that find all words that:
- Start with “y”.
str_view(stringr::words, "^y", match = TRUE)
- End with “x”
str_view(stringr::words, "x$", match = TRUE)
- Are exactly three letters long. (Don’t cheat by using str_length()!)
str_view(stringr::words, "^...$", match = TRUE)
- Have seven letters or more.
str_view(stringr::words, ".......", match = TRUE)
Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
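The anchor patterns above can be tried on a handful of words with str_subset(), which returns text rather than an HTML view. The toy vector is made up for illustration, and ^.{7,}$ is a stricter alternative to "......." that anchors both ends.

```r
library(stringr)

toy <- c("yes", "box", "cat", "yellow", "relax", "yesterday")

starts_y   <- str_subset(toy, "^y")       # "yes" "yellow" "yesterday"
ends_x     <- str_subset(toy, "x$")       # "box" "relax"
three_long <- str_subset(toy, "^...$")    # "yes" "box" "cat"
seven_plus <- str_subset(toy, "^.{7,}$")  # "yesterday"
```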
14.3.3.1
Create regular expressions to find all words that:
- Start with a vowel.
str_view(stringr::words, "^[aeiou]", match = TRUE)
- That only contain consonants. (Hint: thinking about matching “not”-vowels.)
str_view(stringr::words, "^[^aeiou]*[^aeiouy]$", match = TRUE)
- End with ed, but not with eed.
str_view(stringr::words, "(^|[^e])ed$", match = TRUE)
- End with ing or ise.
str_view(stringr::words, "(ing|ise)$", match = TRUE)
Empirically verify the rule “i before e except after c”.
str_view(stringr::words, "(^(ei))|cie|[^c]ei", match = TRUE)
Is “q” always followed by a “u”?
str_view(stringr::words, "q[^u]", match = TRUE)
Of the words in the list, yes.
Write a regular expression that matches a word if it’s probably written in British English, not American English.
str_view(stringr::words, "(l|b)our|parat", match = TRUE)
Create a regular expression that will match telephone numbers as commonly written in your country.
x <- c("dkl kls. klk. _", "(425) 591-6020", "her number is (581) 434-3242", "442", " dsi")
str_view(x, "\\(\\d\\d\\d\\)\\s\\d\\d\\d-\\d\\d\\d\\d")
The above is not a good way to solve this; better methods come in the next section.
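With the {n} quantifier the repeated \\d can be written more compactly. A sketch on the same vector, assuming the "(ddd) ddd-dddd" layout is the only format of interest:

```r
library(stringr)

x <- c("dkl kls. klk. _", "(425) 591-6020", "her number is (581) 434-3242", "442", " dsi")

# \\d{3} means exactly three digits, \\d{4} exactly four
phone <- "\\(\\d{3}\\)\\s\\d{3}-\\d{4}"

has_phone <- str_detect(x, phone)
print(has_phone)
## [1] FALSE  TRUE  TRUE FALSE FALSE

numbers <- str_extract(x, phone)
print(numbers)
## [1] NA               "(425) 591-6020" "(581) 434-3242" NA               NA
```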
14.3.4.1
Describe the equivalents of ?, +, * in {m,n} form.
- ?: {0,1}
- +: {1,}
- *: {0,}
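These equivalences can be checked directly by comparing the short and long forms on a few strings (the test words are made up for illustration):

```r
library(stringr)

x <- c("color", "colour", "colouur")

# ? is {0,1}: zero or one u
opt_short <- str_detect(x, "colou?r")
opt_long  <- str_detect(x, "colou{0,1}r")

# + is {1,}: one or more u
plus_short <- str_detect(x, "colou+r")
plus_long  <- str_detect(x, "colou{1,}r")

# * is {0,}: any number of u, including none
star_short <- str_detect(x, "colou*r")
star_long  <- str_detect(x, "colou{0,}r")

print(rbind(opt_short, plus_short, star_short))
```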
Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
^.*$: starts with anything and ends with anything, i.e. matches the whole string
str_view(x, "^.*$")
"\\{.+\\}": matches a pair of curly braces with at least one character between them
x <- c("test", "some in {brackets}", "just {} no match")
str_view(x, "\\{.+\\}")
\d{4}-\d{2}-\d{2}: 4 digits - 2 digits - 2 digits
x <- c("4444-22-22", "test", "333-4444-22")
str_view(x, "\\d{4}-\\d{2}-\\d{2}")
"\\\\{4}": 4 backslashes
x <- c("\\\\\\\\", "\\\\\\", "\\\\", "\\")
writeLines(x)
## \\\\
## \\\
## \\
## \
str_view(x, "\\\\{4}")
x <- c("\\\\\\\\", "\\\\\\", "\\\\", "\\")
str_view(x, "\\\\\\\\")
Create regular expressions to find all words that:
- start with three consonants
str_view(stringr::words, "^[^aeiouy]{3}", match = TRUE)
Include y in the excluded set, because when it shows up in this position it is in vowel form.
- have three or more vowels in a row
str_view(stringr::words, "[aeiou]{3}", match = TRUE)
In this case, do not include the y.
- have 2 or more vowel-consonant pairs in a row
str_view(stringr::words, "([aeiou][^aeiou]){2,}", match = TRUE)
Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
14.3.5.1
Describe, in words, what these expressions will match:
- I changed questions 1 and 3 to what I think they were meant to be written as: (.)\\1\\1 and (..)\\1 respectively.
- (.)\\1\\1: the same character three times in a row
- "(.)(.)\\2\\1": 1st char, 2nd char, followed by the 2nd char, then the 1st char
- (..)\\1: a pair of characters immediately repeated
- "(.).\\1.\\1": a character that shows up 3 times with one character between each
- "(.)(.)(.).*\\3\\2\\1": 3 characters, then any number of characters, then the same 3 characters in reverse order
x <- c("steefddff", "ssdfsdfffsdasdlkd", "DLKKJIOWdkl", "klnlsd", "t11", "(.)\1\1")
str_view_all(x, "(.)\\1\\1", match = TRUE) #xxx
str_view_all(fruit, "(.)(.)\\2\\1", match = TRUE) #xyyx
str_view_all(fruit, "(..)\\1", match = TRUE) #xyxy
str_view(stringr::words, "(.).\\1.\\1", match = TRUE) #x.x.x
str_view(stringr::words, "(.)(.)(.).*\\3\\2\\1", match = TRUE) #xyz.*zyx
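A tiny made-up vector makes the first three backreference patterns easy to tell apart, since each pattern picks out exactly one element:

```r
library(stringr)

toy <- c("aaa", "abba", "abab", "abcde")

triple   <- str_subset(toy, "(.)\\1\\1")      # same character three times in a row
mirrored <- str_subset(toy, "(.)(.)\\2\\1")   # xyyx
doubled  <- str_subset(toy, "(..)\\1")        # xyxy

print(triple)
## [1] "aaa"
print(mirrored)
## [1] "abba"
print(doubled)
## [1] "abab"
```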
Construct regular expressions to match words that:
- Start and end with the same character.
str_view(stringr::words, "^(.).*\\1$", match = TRUE)
- Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(stringr::words, "(..).*\\1", match = TRUE)
- Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(stringr::words, "(.).*\\1.*\\1", match = TRUE)
14.4 Tools
noun <- "(a|the) ([^ \\.]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract_all(noun, simplify = TRUE)
# splits each match into separate pieces
has_noun %>%
str_match_all(noun)
# can make a data frame with tidyr::extract(), but need to name every group
tibble(has_noun = has_noun) %>%
extract(has_noun, into = c("article", "noun"), regex = noun)
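To see what the grouped match looks like on one input, here is a sketch with a made-up sentence: str_match() returns a matrix whose first column is the full match and whose remaining columns are the parenthesised groups.

```r
library(stringr)

noun <- "(a|the) ([^ \\.]+)"

# one row per input string; columns: full match, group 1, group 2
m <- str_match("I saw the cat.", noun)
print(m)
##      [,1]      [,2]  [,3]
## [1,] "the cat" "the" "cat"
```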
- When using boundary() with str_split, you can set it to “character”, “line”, “sentence”, or “word”, which gives an alternative to splitting by pattern.
14.4.2
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
- Find all words that start or end with x.
str_subset(words, "^x|x$")
## [1] "box" "sex" "six" "tax"
- Find all words that start with a vowel and end with a consonant.
str_subset(words, "^[aeiou].*[^aeiouy]$")
## [1] "about" "accept" "account" "across" "act"
## [6] "actual" "add" "address" "admit" "affect"
## [11] "afford" "after" "afternoon" "again" "against"
## [16] "agent" "air" "all" "allow" "almost"
## [21] "along" "alright" "although" "always" "amount"
## [26] "and" "another" "answer" "apart" "apparent"
## [31] "appear" "appoint" "approach" "arm" "around"
## [36] "art" "as" "ask" "at" "attend"
## [41] "awful" "each" "east" "eat" "effect"
## [46] "egg" "eight" "either" "elect" "electric"
## [51] "eleven" "end" "english" "enough" "enter"
## [56] "environment" "equal" "especial" "even" "evening"
## [61] "ever" "exact" "except" "exist" "expect"
## [66] "explain" "express" "if" "important" "in"
## [71] "indeed" "individual" "inform" "instead" "interest"
## [76] "invest" "it" "item" "obvious" "occasion"
## [81] "odd" "of" "off" "offer" "often"
## [86] "old" "on" "open" "or" "order"
## [91] "original" "other" "ought" "out" "over"
## [96] "own" "under" "understand" "union" "unit"
## [101] "unless" "until" "up" "upon" "usual"
Counted y as a vowel when it ends a word, but not when it starts one. This does not work perfectly. For example, words like ygritte would still be included even though y is acting as a vowel there, whereas words like boy would be excluded even though y is acting as a consonant there. From here on out I am going to always exclude y.
- Are there any words that contain at least one of each different vowel?
vowels <- c("a","e","i","o","u")
words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u")]
## character(0)
No.
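The exercise also asks for a version built from several str_detect() calls. For the "start or end with x" case, a minimal sketch could look like this (single_regex and multi_detect are illustrative names); both routes should select the same words.

```r
library(stringr)

# single regular expression with alternation
single_regex <- str_subset(words, "^x|x$")

# the same condition as two str_detect() calls combined with a logical OR
multi_detect <- words[str_detect(words, "^x") | str_detect(words, "x$")]

print(identical(single_regex, multi_detect))
```

The multiple-call form is often easier to extend, since each condition stays a simple pattern and the combination logic lives in ordinary R operators.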
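What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
vowel_counts <- tibble(words = words,
                       n_string = str_length(words),
                       n_vowel = str_count(words, vowels),
                       prop_vowel = n_vowel / n_string)
Caveat: vowels is a length-5 vector, so str_count(words, vowels) recycles it and checks each word against a single vowel only. The counts below are therefore too low (experience really contains five vowels); str_count(words, "[aeiou]") would count every vowel.
‘Experience’ has the most vowels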
vowel_counts %>% arrange(desc(n_vowel))
## # A tibble: 980 x 4
##    words      n_string n_vowel prop_vowel
##    <chr>         <int>   <int>      <dbl>
##  1 experience       10       4      0.4
##  2 individual       10       3      0.3
##  3 achieve           7       2      0.286
##  4 actual            6       2      0.333
##  5 afternoon         9       2      0.222
##  6 against           7       2      0.286
##  7 already           7       2      0.286
##  8 america           7       2      0.286
##  9 benefit           7       2      0.286
## 10 choose            6       2      0.333
## # ... with 970 more rows
‘a’ has the highest proportion
vowel_counts %>% arrange(desc(prop_vowel))
## # A tibble: 980 x 4
##    words n_string n_vowel prop_vowel
##    <chr>    <int>   <int>      <dbl>
##  1 a            1       1      1
##  2 too          3       2      0.667
##  3 wee          3       2      0.667
##  4 feed         4       2      0.5
##  5 in           2       1      0.5
##  6 look         4       2      0.5
##  7 need         4       2      0.5
##  8 room         4       2      0.5
##  9 so           2       1      0.5
## 10 soon         4       2      0.5
## # ... with 970 more rows
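str_count() is vectorised over the pattern as well as the string, which matters when the pattern is a vector. A two-word sketch (made-up input) contrasts a recycled vector of patterns with a single character class:

```r
library(stringr)

toy <- c("experience", "individual")

# a pattern vector is paired up with the strings: "e" for the first word, "i" for the second
paired <- str_count(toy, c("e", "i"))
print(paired)
## [1] 4 3

# a single character class counts every vowel in every word
all_vowels <- str_count(toy, "[aeiou]")
print(all_vowels)
## [1] 5 5
```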
14.4.3.1
In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
Add a space in front of each colour (note that this misses a colour at the very start of a sentence):
colours <- c("red", "orange", "yellow", "green", "blue", "purple") %>%
  paste0(" ", .)
colour_match <- str_c(colours, collapse = "|")
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
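Another option is to wrap the alternation in word boundaries (\\b) instead of requiring a leading space. This is a sketch on a made-up sentence, not the sentences data; loose and strict are illustrative names.

```r
library(stringr)

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
loose  <- str_c(colours, collapse = "|")
strict <- str_c("\\b(", loose, ")\\b")

s <- "The lamp flickered red and blue."

# without boundaries, the "red" inside "flickered" is matched too
loose_hits <- str_extract_all(s, loose)[[1]]
print(loose_hits)
## [1] "red"  "red"  "blue"

# with boundaries, only whole words match
strict_hits <- str_extract_all(s, strict)[[1]]
print(strict_hits)
## [1] "red"  "blue"
```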
From the Harvard sentences data, extract:
- The first word from each sentence.
str_extract(sentences, "[A-Za-z]*")
- All words ending in ing.
# ends in "ing" or "ing."
sent_ing <- str_subset(sentences, ".*ing(\\.|\\s)")
str_extract_all(sent_ing, "[A-Za-z]+ing\\b", simplify = TRUE)
- All plurals.
str_subset(sentences, "[A-Za-z]*s(\\.|\\s)") %>% # take all sentences that have a word ending in s
  str_extract_all("[A-Za-z]*s\\b", simplify = TRUE) %>%
  .[str_length(.) > 3] %>% # get rid of the short words
  str_subset(".*[^s]s$") %>% # get rid of words ending in 'ss'
  str_subset(".*[^i]s$") # get rid of 'this'
14.4.4.1
Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
# create the regular expression
nums <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
nums_c <- str_c(nums, collapse = "|")
# see stringr cheatsheet: "(?<![:alpha:])" means not preceded by
re <- str_c("(", "(?<![:alpha:])", "(", nums_c, "))", " ", "([^ \\.]+)", sep = "")
sentences %>%
  str_subset(regex(re, ignore_case = TRUE)) %>%
  str_extract_all(regex(re, ignore_case = TRUE)) %>%
  unlist() %>%
  tibble::enframe(name = NULL) %>%
  separate(col = "value", into = c("num", "following"), remove = FALSE)
## # A tibble: 30 x 3
##    value       num   following
##    <chr>       <chr> <chr>
##  1 Four hours  Four  hours
##  2 Two blue    Two   blue
##  3 seven books seven books
##  4 two met     two   met
##  5 two factors two   factors
##  6 three lists three lists
##  7 Two plus    Two   plus
##  8 seven is    seven is
##  9 two when    two   when
## 10 Eight miles Eight miles
## # ... with 20 more rows
- I’d initially appended "\\b" in front of each number to prevent things like “someone” being captured. However, this didn’t work in cases where a sentence started with a number, so I switched to the “not preceded by” method from the stringr cheatsheet.
Find all contractions. Separate out the pieces before and after the apostrophe.
# note the () facilitate the split with functions
contr <- "([^ \\.]+)'([^ \\.]*)"
sentences %>%
  # note the improvement this word definition is over the above [^ ]+
  str_subset(contr) %>%
  str_match_all(contr)
## [[1]]
##      [,1]   [,2] [,3]
## [1,] "It's" "It" "s"
##
## [[2]]
##      [,1]    [,2]  [,3]
## [1,] "man's" "man" "s"
##
## [[3]]
##      [,1]    [,2]  [,3]
## [1,] "don't" "don" "t"
##
## [[4]]
##      [,1]      [,2]    [,3]
## [1,] "store's" "store" "s"
##
## [[5]]
##      [,1]        [,2]      [,3]
## [1,] "workmen's" "workmen" "s"
##
## [[6]]
##      [,1]    [,2]  [,3]
## [1,] "Let's" "Let" "s"
##
## [[7]]
##      [,1]    [,2]  [,3]
## [1,] "sun's" "sun" "s"
##
## [[8]]
##      [,1]      [,2]    [,3]
## [1,] "child's" "child" "s"
##
## [[9]]
##      [,1]     [,2]   [,3]
## [1,] "king's" "king" "s"
##
## [[10]]
##      [,1]   [,2] [,3]
## [1,] "It's" "It" "s"
##
## [[11]]
##      [,1]    [,2]  [,3]
## [1,] "don't" "don" "t"
##
## [[12]]
##      [,1]      [,2]    [,3]
## [1,] "queen's" "queen" "s"
##
## [[13]]
##      [,1]    [,2]  [,3]
## [1,] "don't" "don" "t"
##
## [[14]]
##      [,1]       [,2]     [,3]
## [1,] "pirate's" "pirate" "s"
##
## [[15]]
##      [,1]         [,2]       [,3]
## [1,] "neighbor's" "neighbor" "s"
14.4.5.1
Replace all forward slashes in a string with backslashes.
x <- c("test/dklsk/")
str_replace_all(x, "/", "\\\\") %>% writeLines()
## test\dklsk\
Implement a simple version of str_to_lower() using str_replace_all().
x <- c("BIdklsKOS")
str_replace_all(x, "([A-Z])", tolower)
## [1] "bidklskos"
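str_replace_all() also accepts a named vector of pattern = replacement pairs, which gives a version that does not lean on tolower(). A sketch, assuming only the 26 ASCII letters need handling (lower_map is an illustrative name):

```r
library(stringr)

# named vector: names are the patterns (A..Z), values the replacements (a..z)
lower_map <- setNames(letters, LETTERS)

x <- c("BIdklsKOS")
lowered <- str_replace_all(x, lower_map)
print(lowered)
## [1] "bidklskos"
```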
Switch the first and last letters in words. Which of those strings are still words?
str_replace(words, "(^.)(.*)(.$)", "\\3\\2\\1")
Any word that starts and ends with the same letter, e.g. ‘treat’, as well as a few other examples such as war -> raw.
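To list the survivors rather than eyeball them, the swapped strings can be intersected with the original list. A sketch under the assumption that "still a word" means "still appears in stringr::words":

```r
library(stringr)

# swap first and last letters; single-letter words do not match and stay unchanged
swapped <- str_replace(words, "^(.)(.*)(.)$", "\\3\\2\\1")

# keep only swapped strings that are themselves in the word list
still_words <- intersect(swapped, words)
print(head(still_words))
```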
14.4.6.1
Split up a string like "apples, pears, and bananas" into individual components.
x <- "apples, pears, and bananas"
str_split(x, ",* ") # note that the regular expression handles the commas as well
## [[1]]
## [1] "apples" "pears" "and" "bananas"
Why is it better to split up by boundary("word") than " "?
Handles commas and punctuation (see the footnote at the end of these notes).
str_split(x, boundary("word"))
## [[1]]
## [1] "apples" "pears" "and" "bananas"
What does splitting with an empty string ("") do? Experiment, and then read the documentation.
Splitting by an empty string splits up each character.
str_split(x, "")
## [[1]]
## [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n"
## [18] "d" " " "b" "a" "n" "a" "n" "a" "s"
- splits each character into an individual element (and creates elements for spaces between strings)
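Putting the space split next to the word-boundary split shows where the punctuation ends up. A small sketch on the same string:

```r
library(stringr)

x <- "apples, pears, and bananas"

# splitting on a literal space leaves the commas attached to the words
by_space <- str_split(x, " ")[[1]]
print(by_space)
## [1] "apples," "pears,"  "and"     "bananas"

# splitting on word boundaries drops the punctuation
by_word <- str_split(x, boundary("word"))[[1]]
print(by_word)
## [1] "apples"  "pears"   "and"     "bananas"
```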
14.5: Other types of patterns
regex args to know:
- ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale.
- multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.
- comments = TRUE allows you to use comments and white space to make complex regular expressions more understandable. Spaces are ignored, as is everything after #. To match a literal space, you’ll need to escape it: "\\ ".
- dotall = TRUE allows . to match everything, including \n.
Alternatives to regex():
- fixed(): matches exactly the specified sequence of bytes. It ignores all special regular expressions and operates at a very low level. This allows you to avoid complex escaping and can be much faster than regular expressions.
- coll(): compares strings using standard collation rules. This is useful for doing case insensitive matching. Note that coll() takes a locale parameter that controls which rules are used for comparing characters.
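A short sketch of how the wrappers change what counts as a match, using a made-up vector in which the dot matters:

```r
library(stringr)

x <- c("a.b", "axb", "A.B")

# as a regex, "." matches any character
as_regex <- str_detect(x, "a.b")
print(as_regex)
## [1]  TRUE  TRUE FALSE

# fixed() treats "." literally
as_fixed <- str_detect(x, fixed("a.b"))
print(as_fixed)
## [1]  TRUE FALSE FALSE

# regex() with ignore_case = TRUE also accepts the uppercase form
no_case <- str_detect(x, regex("a.b", ignore_case = TRUE))
print(no_case)
## [1] TRUE TRUE TRUE
```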
14.5.1
How would you find all strings containing \ with regex() vs. with fixed()?
With fixed() the pattern would be "\\" instead of "\\\\".
str_view_all("so \\ the party is on\\ right?", fixed("\\"))
What are the five most common words in sentences?
str_extract_all(sentences, boundary("word"), simplify = TRUE) %>%
  as_tibble() %>%
  gather(V1:V12, value = "words", key = "order") %>%
  mutate(words = str_to_lower(words)) %>%
  filter(!words == "") %>%
  count(words, sort = TRUE) %>%
  head(5)
## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.
## # A tibble: 5 x 2
##   words     n
##   <chr> <int>
## 1 the     751
## 2 a       202
## 3 of      132
## 4 to      123
## 5 and     118
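The same count can be sketched without reshaping a matrix, by flattening the list of words and tabulating it with base R. This is an alternative route rather than the method used above; all_words and word_counts are illustrative names.

```r
library(stringr)

# extract every word from every sentence, flatten, and lower-case
all_words <- str_to_lower(unlist(str_extract_all(sentences, boundary("word"))))

# tabulate and sort from most to least frequent
word_counts <- sort(table(all_words), decreasing = TRUE)
print(head(word_counts, 5))
```

Skipping the matrix step also avoids having to know the maximum number of words per sentence (the V1:V12 range above).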
14.7: stringi
Other functions:
apropos searches all objects available from the global environment; useful if you can’t remember a function name.
Check those that start with replace:
apropos("^(replace)")
## [1] "replace" "replace_na"
Check those that start with str, but not stri:
apropos("^(str)[^i]")
## [1] "str_c" "str_conv" "str_count"
## [4] "str_detect" "str_dup" "str_extract"
## [7] "str_extract_all" "str_flatten" "str_glue"
## [10] "str_glue_data" "str_interp" "str_length"
## [13] "str_locate" "str_locate_all" "str_match"
## [16] "str_match_all" "str_order" "str_pad"
## [19] "str_remove" "str_remove_all" "str_replace"
## [22] "str_replace_all" "str_replace_na" "str_sort"
## [25] "str_split" "str_split_fixed" "str_squish"
## [28] "str_sub" "str_sub<-" "str_subset"
## [31] "str_to_lower" "str_to_title" "str_to_upper"
## [34] "str_trim" "str_trunc" "str_view"
## [37] "str_view_all" "str_which" "str_wrap"
## [40] "strcapture" "strftime" "strheight"
## [43] "strOptions" "strptime" "strrep"
## [46] "strsplit" "strtoi" "strtrim"
## [49] "StructTS" "structure" "strwidth"
## [52] "strwrap"
14.7.1
Find the stringi functions that:
- Count the number of words: stri_count_words
- Find duplicated strings: stri_duplicated
- Generate random text: stri_rand_strings (random strings) or stri_rand_lipsum (random lorem ipsum text)
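How do you control the language that stri_sort() uses for sorting?
Through the locale argument, which is passed on to stri_opts_collator() (or by supplying opts_collator directly). The decreasing argument only controls the sort direction.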
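A sketch of locale-dependent sorting, using the two-word example from the stringi documentation. It assumes the installed ICU data includes Polish and Slovak collation rules; in Slovak, "ch" is treated as a letter that sorts after "h".

```r
library(stringi)

x <- c("hladny", "chladny")

# Polish rules: plain alphabetical order, so "chladny" comes first
sorted_pl <- stri_sort(x, locale = "pl")
print(sorted_pl)
## [1] "chladny" "hladny"

# Slovak rules: "ch" sorts after "h", so the order may differ
sorted_sk <- stri_sort(x, locale = "sk")
print(sorted_sk)
```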
Appendix
14.4.2.3
One way of doing this using iteration methods:
vowels <- c("a","e","i","o","u")
tibble(vowels = vowels, words = list(words)) %>%
mutate(detect_vowels = purrr::map2(words, vowels, str_detect)) %>%
spread(key = vowels, value = detect_vowels) %>%
unnest() %>%
mutate(unique_vowels = rowSums(.[2:6])) %>%
arrange(desc(unique_vowels))
## # A tibble: 980 x 7
## words a e i o u unique_vowels
## <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <dbl>
## 1 absolute TRUE TRUE FALSE TRUE TRUE 4
## 2 appropriate TRUE TRUE TRUE TRUE FALSE 4
## 3 associate TRUE TRUE TRUE TRUE FALSE 4
## 4 authority TRUE FALSE TRUE TRUE TRUE 4
## 5 colleague TRUE TRUE FALSE TRUE TRUE 4
## 6 continue FALSE TRUE TRUE TRUE TRUE 4
## 7 encourage TRUE TRUE FALSE TRUE TRUE 4
## 8 introduce FALSE TRUE TRUE TRUE TRUE 4
## 9 organize TRUE TRUE TRUE TRUE FALSE 4
## 10 previous FALSE TRUE TRUE TRUE TRUE 4
## # ... with 970 more rows
#seems that nothing gets over 4
Footnote (on splitting with boundary("word")): I still sometimes prefer to use patterns where possible over the boundary function. Regex is also more generally applicable outside of R.