---
title: "An Overview of gssr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{An Overview of gssr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  )

## Three quick-and-dirty functions, one to help clean some labels, one to define some custom colors for our plot, and one to title-case variable labels. 

convert_agegrp <- function(x){
    x <- gsub("\\(", "", x)
    x <- gsub("\\[", "", x)
    x <- gsub("\\]", "", x)
    x <- gsub(",", "-", x)
    x <- gsub("-89", "+", x)
    regex <- "^(.*$)"
    x <- gsub(regex, "Age \\1", x)
    x
}

my_colors <- function (palette = "cb") 
{
    cb_palette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", 
        "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
    rcb_palette <- rev(cb_palette)
    bly_palette <- c("#E69F00", "#0072B2", "#000000", "#56B4E9", 
        "#009E73", "#F0E442", "#D55E00", "#CC79A7")
    if (palette == "cb") 
        return(cb_palette)
    else if (palette == "rcb") 
        return(rcb_palette)
    else if (palette == "bly") 
        return(bly_palette)
    else stop("Choose cb, rcb, or bly only.")
}

# from help(chartr)
capwords <- function(x, strict = FALSE) {
    cap <- function(x) paste(toupper(substring(x, 1, 1)),
                  {x <- substring(x, 2); if(strict) tolower(x) else x},
                             sep = "", collapse = " " )
    sapply(strsplit(x, split = " "), cap, USE.NAMES = !is.null(names(x)))
}

```

## Introduction

The [General Social Survey](http://gss.norc.org), or GSS, is one of the cornerstones of American social science and one of the most-analyzed datasets in Sociology. It is routinely used in research, in teaching, and as a reference point in discussions about changes in American society since the early 1970s. It is also a model of open, public data. The [National Opinion Research Center](http://norc.org) already provides many excellent tools for working with the data, and has long made it freely available to researchers. Casual users of the GSS can examine the [GSS Data Explorer](https://gssdataexplorer.norc.org), and social scientists can [download complete datasets](http://gss.norc.org/Get-The-Data) directly. At present, the GSS is provided to researchers in a variety  of commercial formats: Stata (`.dta`), SAS, and SPSS (`.sav`). It's not too difficult to get the data into R using the [Haven](http://haven.tidyverse.org) package, but it can be a little annoying to have to do it repeatedly, or across projects. After doing it one too many times, I got tired of it and I made a package instead. Currently, the `gssr` package provides Release 2a the GSS Cumulative Data File (1972-2022); three GSS Three Wave Panel Data Files (for panels beginning in 2006, 2008, and 2010, respectively); and the 2020 panel file. The package also integrates codebook information about variables into R's help system, allowing them to be accessed via the help browser or from the console with `?`.  The `gssr` package makes the GSS a little more accessible to users of R, the free software environment for statistical computing, and thus helps in a small way to make the GSS even more open than it already is.

### Packages

This article makes use of some additional packages beyond `gssr` itself. My assumption is that users of `gssr` will most likely use and analyze the data in conjunction with some combination of [Tidyverse](https://tidyverse.org) tools and the [survey](http://r-survey.r-forge.r-project.org/survey/), [srvyr](https://github.com/gergness/srvyr), and [panelr](http://panelr.jacob-long.com) packages.

```{r packages}
library(dplyr)
library(ggplot2)
library(haven)
library(tibble)
library(survey)
library(srvyr)
```

## Load the gssr package and its data

```{r setup}
library(gssr)
```

We will begin with the Cumulative Data file (1972-2022). As the startup message notes, the data objects are not automatically loaded. That is, we do not use R's "lazy loading" functionality. This is because the main GSS dataset is rather large. Instead we load it manually with `data()`. For the purposes of this vignette, because the full Cumulative Data object is big, we will use just a few columns of it stored in an object called `gss_sub`. But all the code here will also work with the full dataset object, `gss_all`, which you can load with the command `data(gss_all)`. 

```{r load}
data(gss_sub)
```

The GSS data comes in a *labelled* format, mirroring the way it is encoded for Stata and SPSS platforms. The numeric codes are the content of the column cells. The labeling information is stored as an attribute of the column.

```{r labels}
gss_sub
```

We will use the label information later when recoding the variables into, say, character or factor variables. 

## Descriptive analysis of the data: an example

The GSS is a complex survey. When working with it, we need to take its structure into account in order to properly calculate statistics such as the population mean for a variable in some year, its standard error, and so on. For these tasks we use the `survey` and `srvyr` packages. For details on `survey`, see Lumley (2010). We will also do some recoding prior to analyzing the data, so we load several additional `tidyverse` packages to assist us.

We will examine a topic that was the subject of recent media attention, in the *New York Times* and elsewhere, regarding the beliefs of young men about gender roles. Some surveys seemed to point to some recent increasing conservatism on this front amongst young men. As it happens, the GSS has a longstanding question named `fefam`, where respondents are asked to give their opinion on the following statement:

>  It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.

Respondents may answer that they Strongly Agree, Agree, Disagree, or Strongly Disagree with the statement (as well as refusing to answer, or saying they don't know). 

### Look at the Variable's summary

Variable summaries for the cumulative data file are built in to `gssr` and are integrated into R's help system. For example, we can type `?fefam` at the console and have the variable's documentation appear in the documentation browser.

<img src="figures/fefam_help.png" />

Scrolling further down will give a summary of the values the variable can take.

### Subset the Data

The GSS data retains labeling information (as it was originally imported via the `haven` package). When working with the data in an analysis, we will probably want to convert the labeled variables to data types such as factors. This should be done with care (and not on the whole dataset all at once). Typically, we will want to focus on some relatively small subset of variables and examine those. For example, let's say we want to explore the `fefam` question. We will subset the data and then prepare that for analysis. Here we are going to subset `gss_sub` into an object called `gss_fam` containing just the variables we want to examine, along with core measures that identify respondents (such as `id` and `year`) and variables necessary for the survey weighting later (such as `wtssps`).

```{r setup-subset}
cont_vars <- c("year", "id", "ballot", "age")

cat_vars <- c("race", "sex", "fefam")

wt_vars <- c("vpsu",
             "vstrat",
             "oversamp",
             "formwt",              # weight to deal with experimental randomization
             "wtssps",              # weight variable
             "sampcode",            # sampling error code
             "sample")              # sampling frame and method

my_vars <- c(cont_vars, cat_vars, wt_vars)

gss_fam <- gss_sub |>
  select(all_of(my_vars))

gss_fam

```

### Recode the Subsetted Data

Next, we will do some recoding and create some new variables. We also create some new variables: age quintiles, a variable flagging whether a respondent is 25 or younger, recoded `fefam` to binary "Agree" or "Disagree" (with non-responses dropped). 

We begin by figuring out the cutpoints for age quintiles.

```{r recodes_1}
qrts <- quantile(as.numeric(gss_fam$age), 
                 na.rm = TRUE)
qrts

quintiles <- quantile(as.numeric(gss_fam$age), 
                      probs = seq(0, 1, 0.2), na.rm = TRUE)

quintiles
```


Next, we clean up `gss_fam` a bit, discarding some of the label and missing value information we don't need. The data in `gss_all` retains the labeling structure provided by the GSS. Variables are stored numerically with labels attached to them. Often, when using the data in R, it will be convenient to convert the categorical variables we are interested in to character or factor type instead.  

```{r recodes_2}
## Recoding
## The convert_agegrp() and capwords() functions seen here are defined
## at the top of the Rmd file used to produce this document.

gss_fam <- gss_fam |>
  mutate(
    # Convert all missing to NA
    across(everything(), haven::zap_missing), 
    # Convert all weight vars to numeric
    across(all_of(wt_vars), as.numeric),
    # Make all categorical variables factors and relabel nicely
    across(all_of(cat_vars), forcats::as_factor),
    across(all_of(cat_vars), \(x) forcats::fct_relabel(x, capwords, strict = TRUE)),
    # Create the age groups
    ageq = cut(x = age, breaks = unique(qrts), include.lowest = TRUE),
    ageq =  forcats::fct_relabel(ageq, convert_agegrp), 
    agequint = cut(x = age, breaks = unique(quintiles), include.lowest = TRUE),
    agequint = forcats::fct_relabel(agequint, convert_agegrp),
    year_f = droplevels(factor(year)),
    young = ifelse(age < 26, "Yes", "No"),
    # Dichotomize fefam
    fefam = forcats::fct_recode(fefam, NULL = "IAP", NULL = "DK", NULL = "NA"),
    fefam_d = forcats::fct_recode(fefam,
                                  Agree = "Strongly Agree",
                                  Disagree = "Strongly Disagree"),
    fefam_n = recode(fefam_d, "Agree" = 0, "Disagree" = 1),
    samplerc = case_when(sample %in% c(3:4) ~ 3, 
                              sample %in% c(6:7) ~ 6,
                              .default = sample))

## Some of the recoded variables
gss_fam |>
  filter(year > 1976) |> 
  select(race:fefam, agequint:fefam_d) |> 
  relocate(year_f)
```

### Integrate the Survey Weights 

At this point we can calculate some percentages, such as the percent of male and female respondents disagreeing with the `fefam` proposition each year. 

```{r}
gss_fam |> 
  group_by(year, sex, young, fefam_d) |> 
  tally() |> 
  drop_na() |> 
  mutate(pct = round((n/sum(n))*100, 1)) |> 
  select(-n) 
```

However, these calculations do not take the GSS survey design into account. We set up the data so we can properly calculate population means and errors and so on. We use the `svyr` package's wrappers to `survey` for this.

```{r weights}

options(survey.lonely.psu = "adjust")
options(na.action="na.pass")

gss_svy <- gss_fam |>
    filter(year > 1974) |>
    tidyr::drop_na(fefam_d, young) |>
    mutate(stratvar = interaction(year, vstrat)) |>
    as_survey_design(id = vpsu,
                     strata = stratvar,
                     weights = wtssps,
                     nest = TRUE)
```

The `gss_svy` object contains the same data as `gss_fam`, but incorporates information about the sampling structure in a way that the `survey` package's functions can work with: 

```{r svyobj}
gss_svy
```

### Calculate the survey-weighted statistics

We're now in a position to calculate some properly-weighted summary statistics for the variable we're interested in, for every year it is    in the data. 

```{r summary}
## Get the breakdown for every year
out_ff <- gss_svy |>
    group_by(year, sex, young, fefam_d) |>
    summarize(prop = survey_mean(na.rm = TRUE, vartype = "ci")) |> 
  drop_na(sex)

out_ff
```

### Plot the Results

We finish with a polished plot of the trends in `fefam` over time, for men and women in two (recoded) age groups over time. 

```{r fefamplot, fig.width=8, fig.height=6}
theme_set(theme_minimal())

facet_names <- c("No" = "Age Over 25 when surveyed", 
                 "Yes" = "Age 18-25 when surveyed")
fefam_txt <- "Disagreement with the statement, ‘It is much better for\neveryone involved if the man is the achiever outside the\nhome and the woman takes care of the home and family’"

out_ff |> 
  filter(fefam_d == "Disagree") |>
  ggplot(mapping = 
           aes(x = year, y = prop,
               ymin = prop_low, 
               ymax = prop_upp,
               color = sex, 
               group = sex, 
               fill = sex)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(alpha = 0.3, color = NA) +
  scale_x_continuous(breaks = seq(1978, 2022, 4)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_color_manual(values = my_colors("bly")[2:1],
                     labels = c("Men", "Women"),
                     guide = guide_legend(title=NULL)) +
  scale_fill_manual(values = my_colors("bly")[2:1],
                    labels = c("Men", "Women"),
                    guide = guide_legend(title=NULL)) +
  facet_wrap(~ young, labeller = as_labeller(facet_names),
             ncol = 1) +
  coord_cartesian(xlim = c(1977, 2022)) +
  labs(x = "Year",
       y = "Percent Disagreeing",
       subtitle = fefam_txt,
       caption = "Kieran Healy http://socviz.co.\n
         Data source: General Social Survey") +
  theme(legend.position = "bottom")
```

## The GSS Three Wave Panels 

In addition to the Cumulative Data File, the gssr package also includes the GSS's panel data. The current rotating panel design began in 2006. A panel of respondents were interviewed that year and followed up on for further interviews in 2008 and 2010. A second panel was interviewed beginning in 2008, and was followed up on for further interviews in 2010 and 2012. And a third panel began in 2010, with follow-up interviews in 2012 and 2014. The `gssr` package provides three datasets, one for each of three-wave panels. They are `gss_panel06_long`, `gss_panel08_long`, and `gss_panel10_long`.  The datasets are provided by the GSS in wide format but (as their names suggest) are packaged here in long format. The conversion was carried out using the [`panelr` package](https://panelr.jacob-long.com) and its `long_panel()` function. Conversion from long back to wide format is possible with the tools provided in `panelr`.

We load the panel data as before. For example:

```{r panel}
data(gss_panel06_long)

gss_panel06_long

```

The panel data objects were created by `panelr` but are regular tibbles. The column names in long format do not have wave identifiers. Rather,  `firstid` and `wave` variables track the cases. The `firstid` variable is unique for every row and has no missing values. The `id` variable is from the GSS and tracks individuals within waves.

```{r panel-example}
gss_panel06_long |> select(firstid, wave, id, sex)
```

We can look at attrition across waves with, e.g.:

```{r attrition}
gss_panel06_long |> 
  select(wave, id) |>
  group_by(wave) |>
  summarize(observed = n_distinct(id),
            missing = sum(is.na(id)))
```

## References

Lumley, Thomas (2010). *Complex Surveys: A Guide to Analysis Using R*. New York: Wiley.