Ch. 15: Factors
Key questions:
- 15.3.1. #1, 3 (make the visualization and table)
- 15.5.1. #1
Functions and notes:
- factor: makes a variable a factor based on the levels provided
- fct_rev: reverses the order of the levels
- fct_infreq: orders levels by frequency, most frequent first (combine with fct_rev to get increasing frequency)
- fct_relevel: lets you move levels to the front of the order
- fct_inorder: orders the levels of an existing factor by the order in which values first show up in the data
- fct_reorder: orders the levels of the input factor by the value of another specified variable (median by default), 3 inputs:
  - .f: factor to modify
  - .x: input variable to order by
  - .fun: function to apply to .x; there is also a .desc option
- fct_reorder2: orders the levels of the input factor by the y values associated with the largest x values (good for making legends align as expected)
- fct_recode: lets you change the value of each level
- fct_collapse: a variant of fct_recode that lets you provide multiple old levels as a vector
- fct_lump: lets you lump together small groups; use n to specify the number of groups to keep (excluding "Other")
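A small sketch of several of these functions on a toy factor may help fix the differences. The vectors f and v below are made up for illustration and are not part of the chapter's data.

```r
library(forcats)

# Toy data, made up for illustration.
f <- factor(c("b", "b", "a", "c", "b", "c"))
v <- c(5, 7, 10, 9, 3, 2)   # hypothetical values paired with f

levels(f)                        # "a" "b" "c" (alphabetical default)
levels(fct_inorder(f))           # "b" "a" "c" (order of first appearance)
levels(fct_infreq(f))            # "b" "c" "a" (most frequent first)
levels(fct_rev(fct_infreq(f)))   # "a" "c" "b" (least frequent first)
levels(fct_relevel(f, "c"))      # "c" "a" "b" ("c" moved to the front)

# Median of v per level: a = 10, b = 5, c = 5.5
levels(fct_reorder(f, v))                 # "b" "c" "a"
levels(fct_reorder(f, v, .desc = TRUE))   # "a" "c" "b"

# Renaming one level vs. merging several old levels into one
levels(fct_recode(f, A = "a"))                # "A" "b" "c"
levels(fct_collapse(f, ab = c("a", "b")))     # "ab" "c"
```

Only the levels change in each call; the underlying values of f stay where they are.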
Create factors in the order the values appear:
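The code for this note did not survive, so the following is only a sketch of two common ways to do it. The vector x is a made-up example.

```r
library(forcats)

# Hypothetical input; any character vector would do.
x <- c("Dec", "Apr", "Jan", "Mar")

# Option 1: fix the level order when the factor is created
f1 <- factor(x, levels = unique(x))

# Option 2: build the factor first, then reorder by first appearance
f2 <- fct_inorder(factor(x))

levels(f1)   # "Dec" "Apr" "Jan" "Mar"
levels(f2)   # same order as f1
```

Without either step, factor(x) would sort the levels alphabetically.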
Avoid dropping unused levels with drop = FALSE:
gss_cat %>%
  ggplot(aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
15.4: Modifying factor order
15.4.1
There are some suspiciously high numbers in tvhours. Is the mean a good summary?
gss_cat %>%
  mutate(tvhours_fct = factor(tvhours)) %>%
  ggplot(aes(x = tvhours_fct)) +
  geom_bar()
- The distribution is right-skewed, with some values showing up as 24 hours, which seems impossible. In addition, there are a lot of NA values, which may skew results. With this shape, the median is likely a more robust summary than the mean.
- Given the high number of missing values, tvhours may also just not be reliable. Do the NAs associate with other variables? Perhaps we could try to impute them.
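To put numbers on the question, a quick sketch comparing the mean and median of tvhours, along with the share of missing values. The names tv_mean, tv_median and tv_na are introduced here for illustration.

```r
library(forcats)  # provides the gss_cat data set

# Mean and median, ignoring missing values
tv_mean   <- mean(gss_cat$tvhours, na.rm = TRUE)
tv_median <- median(gss_cat$tvhours, na.rm = TRUE)

# Share of rows where tvhours is missing
tv_na <- mean(is.na(gss_cat$tvhours))

c(mean = tv_mean, median = tv_median, prop_na = tv_na)
```

A mean sitting above the median is what a long right tail would produce, so the gap between the two is a rough check on how much the high values pull the mean up.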
For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
gss_cat %>%
  purrr::keep(is.factor) %>%
  purrr::map(levels)
## $marital
## [1] "No answer"     "Never married" "Separated"     "Divorced"
## [5] "Widowed"       "Married"
##
## $race
## [1] "Other"          "Black"          "White"          "Not applicable"
##
## $rincome
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999"
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
##
## $partyid
##  [1] "No answer"          "Don't know"         "Other party"
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"
## [10] "Strong democrat"
##
## $relig
##  [1] "No answer"               "Don't know"
##  [3] "Inter-nondenominational" "Native american"
##  [5] "Christian"               "Orthodox-christian"
##  [7] "Moslem/islam"            "Other eastern"
##  [9] "Hinduism"                "Buddhism"
## [11] "Other"                   "None"
## [13] "Jewish"                  "Catholic"
## [15] "Protestant"              "Not applicable"
##
## $denom
##  [1] "No answer"            "Don't know"           "No denomination"
##  [4] "Other"                "Episcopal"            "Presbyterian-dk wh"
##  [7] "Presbyterian, merged" "Other presbyterian"   "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which"    "Evangelical luth"
## [13] "Other lutheran"       "Wi evan luth synod"   "Lutheran-mo synod"
## [16] "Luth ch in america"   "Am lutheran"          "Methodist-dk which"
## [19] "Other methodist"      "United methodist"     "Afr meth ep zion"
## [22] "Afr meth episcopal"   "Baptist-dk which"     "Other baptists"
## [25] "Southern baptist"     "Nat bapt conv usa"    "Nat bapt conv of am"
## [28] "Am bapt ch in usa"    "Am baptist asso"      "Not applicable"
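rincome is principled (ordered by income band); the rest are mostly arbitrary, although partyid also appears to follow an ordering, running from "Strong republican" to "Strong democrat" after the non-response levels.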
Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
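- Because it makes “Not applicable” the first level in the order, and ggplot2 places the first level nearest the origin, which is the bottom of a vertical axis.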
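A short sketch of what fct_relevel does to the level order of rincome, which is what drives the axis position. The name rincome_front is introduced here for illustration.

```r
library(forcats)  # provides gss_cat and fct_relevel

# Move "Not applicable" to the front of the level order
rincome_front <- fct_relevel(gss_cat$rincome, "Not applicable")

head(levels(gss_cat$rincome), 3)   # original order starts with "No answer"
head(levels(rincome_front), 3)     # now starts with "Not applicable"
```

The data values are unchanged; only the level order differs, and the plot follows that order from the origin outwards.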
15.5: Modifying factor levels
Example with fct_recode
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
## # A tibble: 10 x 2
##    partyid                   n
##    <fct>                 <int>
##  1 No answer               154
##  2 Don't know                1
##  3 Other party             393
##  4 Republican, strong     2314
##  5 Republican, weak       3032
##  6 Independent, near rep  1791
##  7 Independent            4119
##  8 Independent, near dem  2499
##  9 Democrat, weak         3690
## 10 Democrat, strong       3490
15.5.1
How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
As a line plot:
gss_cat %>%
  mutate(partyid = fct_collapse(
    partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(year, partyid) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(
    x = year, y = prop,
    colour = fct_reorder2(partyid, year, prop)
  )) +
  geom_line() +
  labs(colour = "partyid")
As a bar plot:
gss_cat %>%
  mutate(partyid = fct_collapse(
    partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(year, partyid) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  ggplot(aes(
    x = year, y = prop,
    fill = fct_reorder2(partyid, year, prop)
  )) +
  geom_col() +
  # the legend belongs to the fill aesthetic here, so label fill rather than colour
  labs(fill = "partyid")
- This suggests the proportion of Republicans has gone down, with the proportions of independents and other going up.
How could you collapse rincome into a small set of categories?
other <- c("No answer", "Don't know", "Refused", "Not applicable")
high <- c("$25000 or more", "$20000 - 24999", "$15000 - 19999", "$10000 - 14999")
med <- c("$8000 to 9999", "$7000 to 7999", "$6000 to 6999", "$5000 to 5999")
low <- c("$4000 to 4999", "$3000 to 3999", "$1000 to 2999", "Lt $1000")
mutate(gss_cat, rincome = fct_collapse(
  rincome,
  other = other,
  high = high,
  med = med,
  low = low
)) %>%
  count(rincome)
## # A tibble: 4 x 2
##   rincome     n
##   <fct>   <int>
## 1 other    8468
## 2 high    10862
## 3 med       970
## 4 low      1183
Appendix
Viewing all levels
A few ways to get an initial look at the levels or counts across a dataset:
gss_cat %>%
  purrr::map(unique)

gss_cat %>%
  purrr::map(table)

gss_cat %>%
  purrr::map(table) %>%
  purrr::map(plot)

gss_cat %>%
  mutate_if(is.factor, ~fct_lump(., 14)) %>%
  sample_n(1000) %>%
  GGally::ggpairs()
Proportion of NA values in each column:
gss_cat %>%
  purrr::map(~(sum(is.na(.x)) / length(.x))) %>%
  as_tibble()

# essentially equivalent...
gss_cat %>%
  summarise_all(~(sum(is.na(.)) / length(.)))
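The same calculation can also be sketched with across(), which newer dplyr versions offer in place of summarise_all. The name na_share is introduced here for illustration, and mean(is.na(.x)) is used as a shorthand for sum(is.na(.x)) / length(.x).

```r
library(dplyr)
library(forcats)  # provides gss_cat

# One row, one column per variable, each holding the share of NA values
na_share <- gss_cat %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

na_share
```

Taking the mean of a logical vector gives the share of TRUE values, so this matches the sum-over-length version above.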
Print all rows of a tibble (here, every value of age with its count):
gss_cat %>%
  count(age) %>%
  print(n = Inf)