crosstabs

library(pollster)
library(dplyr)
library(knitr)
library(ggplot2)

Crosstabs can come in wide or long format. Each is useful, depending on your purpose. Wide data is best for display tables; long data is usually better for making plots.

Here is a wide table.

crosstab(df = illinois, x = sex, y = educ6, weight = weight) %>%
  kable()
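|sex    |    LT HS|       HS| Some Col|       AA|       BA|  Post-BA|        n|
|:------|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
|Male   | 10.61114| 31.08939| 20.66614| 7.498318| 19.58311| 10.55190| 49108796|
|Female | 10.37899| 30.13231| 21.64697| 8.526414| 19.26636| 10.04895| 53569718|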

And here is long format.

crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long")
#> # A tibble: 12 × 4
#>    sex    educ6      pct         n
#>    <fct>  <fct>    <dbl>     <dbl>
#>  1 Male   LT HS    10.6  49108796.
#>  2 Male   HS       31.1  49108796.
#>  3 Male   Some Col 20.7  49108796.
#>  4 Male   AA        7.50 49108796.
#>  5 Male   BA       19.6  49108796.
#>  6 Male   Post-BA  10.6  49108796.
#>  7 Female LT HS    10.4  53569718.
#>  8 Female HS       30.1  53569718.
#>  9 Female Some Col 21.6  53569718.
#> 10 Female AA        8.53 53569718.
#> 11 Female BA       19.3  53569718.
#> 12 Female Post-BA  10.0  53569718.

By default, row percentages are used. You can also explicitly choose cell or column percentages using the pct_type argument. I discourage the use of column percentages, because it is usually clearer to flip the x and y variables and report row percentages, but the option is included to match functionality provided by other standard statistical software.

# cell percentages
crosstab(df = illinois, x = sex, y = educ6, weight = weight, pct_type = "cell")
#> # A tibble: 2 × 8
#>   sex    `LT HS`    HS `Some Col`    AA    BA `Post-BA`          n
#>   <fct>    <dbl> <dbl>      <dbl> <dbl> <dbl>     <dbl>      <dbl>
#> 1 Male      5.08  14.9       9.88  3.59  9.37      5.05 102678514.
#> 2 Female    5.41  15.7      11.3   4.45 10.1       5.24 102678514.

# column percentages
crosstab(df = illinois, x = sex, y = educ6, weight = weight, pct_type = "column")
#> # A tibble: 3 × 7
#>   sex       `LT HS`         HS `Some Col`        AA         BA  `Post-BA`
#>   <chr>       <dbl>      <dbl>      <dbl>     <dbl>      <dbl>      <dbl>
#> 1 Male         48.4       48.6       46.7      44.6       48.2       49.0
#> 2 Female       51.6       51.4       53.3      55.4       51.8       51.0
#> 3 n      10770999.  31409418.  21745113.  8249909.  19937965.  10565110.
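
To make the three percentage types concrete, here is a small base-R sketch that does not use pollster at all. The toy data frame and its weights are made up for illustration, and xtabs() with prop.table() stands in for whatever crosstab() does internally. It also shows why flipping x and y is a substitute for column percentages: the row percentages of the flipped table are the column percentages of the original, transposed.

```r
# Toy weighted data (hypothetical values, for illustration only)
toy <- data.frame(
  sex    = c("Male", "Male", "Female", "Female"),
  educ   = c("HS", "BA", "HS", "BA"),
  weight = c(30, 10, 20, 40)
)

# Weighted counts: rows are sex, columns are educ
tab <- xtabs(weight ~ sex + educ, data = toy)

row_pct  <- prop.table(tab, margin = 1) * 100  # each row sums to 100
cell_pct <- prop.table(tab) * 100              # whole table sums to 100
col_pct  <- prop.table(tab, margin = 2) * 100  # each column sums to 100

# Flip x and y, then take row percentages
flipped         <- xtabs(weight ~ educ + sex, data = toy)
flipped_row_pct <- prop.table(flipped, margin = 1) * 100

print(row_pct)
print(col_pct)
print(flipped_row_pct)
```

In the flipped table each education level is a row, so the share of men and women within that level reads across the row rather than down a column.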

To make a graph, just feed your tibble output to a ggplot2 function.

crosstab(df = illinois, x = sex, y = educ6, weight = weight, format = "long") %>%
  ggplot(aes(x = educ6, y = pct, fill = sex)) +
  geom_col(position = position_dodge()) +
  labs(title = "Educational attainment of the Illinois adult population by gender")

Margin of error

How the margin of error is calculated

The margin of error is calculated including the design effect of the sample weights, using the following formula:

sqrt(design effect)*zscore*sqrt((pct*(1-pct))/(n-1))*100

The design effect is calculated using the formula length(weights)*sum(weights^2)/(sum(weights)^2).
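
As a rough illustration, the two formulas above can be written as plain R functions. This is a sketch, not the package's internal code: the function names design_effect and margin_of_error are hypothetical, pct is assumed to be a proportion between 0 and 1 (the formula multiplies by 100 at the end), and n is assumed to be the unweighted number of respondents.

```r
# Design effect from a vector of weights, following the formula in the text.
# Zero weights are dropped first (an assumption about ordering on my part).
design_effect <- function(weights) {
  weights <- weights[weights != 0]
  length(weights) * sum(weights^2) / (sum(weights)^2)
}

# Margin of error in percentage points.
# pct: proportion between 0 and 1 (assumed); n: unweighted sample size (assumed)
margin_of_error <- function(pct, n, deff, zscore = 1.96) {
  sqrt(deff) * zscore * sqrt((pct * (1 - pct)) / (n - 1)) * 100
}

w_equal   <- rep(1, 4)        # equal weights: no design effect
w_unequal <- c(1, 1, 2, 2)    # unequal weights inflate the variance

deff_equal   <- design_effect(w_equal)    # 4 * 4 / 16 = 1
deff_unequal <- design_effect(w_unequal)  # 4 * 10 / 36

moe_equal   <- margin_of_error(pct = 0.5, n = 4, deff = deff_equal)
moe_unequal <- margin_of_error(pct = 0.5, n = 4, deff = deff_unequal)

print(c(deff_equal = deff_equal, deff_unequal = deff_unequal))
print(c(moe_equal = moe_equal, moe_unequal = moe_unequal))
```

With equal weights the design effect is exactly 1 and the formula reduces to the usual simple-random-sample margin of error; the more the weights vary, the larger the design effect and the wider the margin.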


Get a crosstab with the margin of error in a separate column using the moe_crosstab function. By default, a z-score of 1.96 (a 95% confidence interval) is used. Supply your own desired z-score using the zscore argument. Only row and cell percents are supported. By default, the table format is long because I anticipate that making visualizations will be the most common use case for this function.

moe_crosstab(illinois, educ6, voter, weight)
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> # A tibble: 12 × 5
#>    educ6    voter       pct   moe         n
#>    <fct>    <fct>     <dbl> <dbl>     <dbl>
#>  1 LT HS    Voted      42.5  1.75  8999310.
#>  2 LT HS    Not voted  57.5  1.75  8999310.
#>  3 HS       Voted      56.5  1.03 26638087.
#>  4 HS       Not voted  43.5  1.03 26638087.
#>  5 Some Col Voted      63.6  1.20 18697544.
#>  6 Some Col Not voted  36.4  1.20 18697544.
#>  7 AA       Voted      67.6  1.89  7196039.
#>  8 AA       Not voted  32.4  1.89  7196039.
#>  9 BA       Voted      73.6  1.13 17447907.
#> 10 BA       Not voted  26.4  1.13 17447907.
#> 11 Post-BA  Voted      83.1  1.32  9322214.
#> 12 Post-BA  Not voted  16.9  1.32  9322214.
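
If you want a confidence level other than 95%, one way to get the matching z-score is base R's qnorm(). The sketch below computes z-scores for 90% and 99% two-sided intervals and passes one to the zscore argument described above. The requireNamespace() guard is only there so the snippet runs harmlessly on a machine without pollster installed; the variable names are made up.

```r
# z-scores for two-sided intervals, from the standard normal quantile function
z_90 <- qnorm(0.95)   # 90% confidence interval
z_99 <- qnorm(0.995)  # 99% confidence interval

# The pollster call runs only if the package is available
if (requireNamespace("pollster", quietly = TRUE)) {
  library(pollster)
  moe_99 <- moe_crosstab(illinois, educ6, voter, weight, zscore = z_99)
  print(moe_99)
}
```

A larger z-score widens every margin of error in the table by the same factor, since zscore enters the formula as a simple multiplier.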

A wide format table looks like this.

moe_crosstab(illinois, educ6, voter, weight, format = "wide")
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> # A tibble: 6 × 6
#>   educ6    pct_Voted `pct_Not voted` moe_Voted `moe_Not voted`         n
#>   <fct>        <dbl>           <dbl>     <dbl>           <dbl>     <dbl>
#> 1 LT HS         42.5            57.5      1.75            1.75  8999310.
#> 2 HS            56.5            43.5      1.03            1.03 26638087.
#> 3 Some Col      63.6            36.4      1.20            1.20 18697544.
#> 4 AA            67.6            32.4      1.89            1.89  7196039.
#> 5 BA            73.6            26.4      1.13            1.13 17447907.
#> 6 Post-BA       83.1            16.9      1.32            1.32  9322214.

ggplot2 offers multiple ways to visualize the margin of error. Here is one good option. (Please note, if you don’t have ggplot2 >= 3.3.0 you’ll get an error message.)

illinois %>%
  filter(year == 2016) %>%
  moe_crosstab(educ6, voter, weight) %>%
  ggplot(aes(x = pct, y = educ6, xmin = (pct - moe), xmax = (pct + moe),
             color = voter)) +
  geom_pointrange(position = position_dodge(width = 0.2))

Special case: the x-variable identifies survey waves

If the x-variable in your crosstab uniquely identifies survey waves for which the weights were independently generated, it is best practice to calculate the design effect independently for each wave. moe_wave_crosstab does just that. All of the arguments remain the same as in moe_crosstab.

moe_wave_crosstab(df = illinois, x = year, y = rv, weight = weight)
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Your data includes weights equal to zero. These are removed before calculating the design effect.
#> Joining with `by = join_by(year)`
#> # A tibble: 24 × 5
#>     year rv               pct   moe        n
#>    <dbl> <fct>          <dbl> <dbl>    <dbl>
#>  1  1996 Registered      77.7  1.49 7485319.
#>  2  1996 Not Registered  22.3  1.49 7485319.
#>  3  1998 Registered      75.1  1.57 7364191.
#>  4  1998 Not Registered  24.9  1.57 7364191.
#>  5  2000 Registered      81.2  1.44 7276876.
#>  6  2000 Not Registered  18.8  1.44 7276876.
#>  7  2002 Registered      77.7  1.56 7185545.
#>  8  2002 Not Registered  22.3  1.56 7185545.
#>  9  2004 Registered      83.4  1.38 7719084.
#> 10  2004 Not Registered  16.6  1.38 7719084.
#> # … with 14 more rows
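
To see why the per-wave calculation matters, here is a base-R sketch comparing a pooled design effect with one computed separately for each wave. It does not reproduce the internals of moe_wave_crosstab; the toy data and the design_effect helper are hypothetical, and the helper simply applies the design effect formula given earlier.

```r
# Design effect formula from the text; zero weights are dropped first
design_effect <- function(weights) {
  weights <- weights[weights != 0]
  length(weights) * sum(weights^2) / (sum(weights)^2)
}

# Two made-up waves: the first has equal weights, the second does not
toy <- data.frame(
  wave   = c(2012, 2012, 2012, 2016, 2016, 2016),
  weight = c(1, 1, 1, 1, 2, 3)
)

# One design effect per wave
deff_by_wave <- tapply(toy$weight, toy$wave, design_effect)

# A single design effect across all rows, ignoring the wave structure
deff_pooled <- design_effect(toy$weight)

print(deff_by_wave)
print(deff_pooled)
```

In this toy case the pooled value is larger than either per-wave value, because pooling treats differences in weights between waves as if they were variation within a single sample.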