R4SME 06 case-control.Rmd

---
title: "R for SME 6: Analysis of an unmatched case-control study"
author: Andrea Mazzella [link](https://github.com/andreamazzella)
output: html_notebook
---

# Part 1: Introduction
## A) Basics

Install these packages if you haven't already
```{r}
install.packages("epiDisplay")
install.packages("pubh")
install.packages("statix")
```

Load packages
- Did you know? library() and require() are equivalent.
```{r}
require(haven)
library(magrittr) # %$% pipe
require(epiDisplay) # Epi functions
library(pubh) # chi-for-trend
require(rstatix) # test-for-trend
library(tidyverse) # %>% pipe, data management...
```


## B) Data exploration & management

Make sure you have the mortality.dta dataset in the same folder as this .rmd, and load it. It contains data on HIV infection among women in Mwanza, Tanzania.
```{r}
# Import the dataset
mwanza <- read_dta("./mwanza.dta")
```

Rather unhelpfully, the values are not labelled, and the Stata metadata file format, .hlp, can only be read in Stata or in Windows 7.
*Thanks, Stata!*
- How many observations does the dataset have?
- How many variables and what do these variables code?
- Which type are these variables?
```{r}

```

Solution:
```{r}
#Familiarise yourself with the data
View(mwanza)
glimpse(mwanza)
summary(mwanza)
```

1. Recategorise the variables "ed" and "age1" into two new variables called ed2 and age2 so that:
- ed2 is binary (1 = none; 2 = any formal education)
- age2 is grouped this way: 1=15-24, 2=25-34, 3=35+ years.

```{r}
# Tabulate all possible values of ed and age1


# Recategorise and label education level


# Recategorise and label age


# Check it worked ok


```

Possible solution:
```{r}
# Tabulate all possible values of ed and age1
mwanza %$% table(ed)
mwanza %$% table(age1)

# Recategorise and label education level
mwanza %<>%
  mutate(ed2 = as.factor(
    case_when(
      ed == 1 ~ "none",
      ed == 2 ~ "any formal",
      ed == 3 ~ "any formal",
      ed == 4 ~ "any formal"
    )
  ))

# Recategorise and label age
mwanza %<>%
  mutate(age2 =  as.factor(
    case_when(
      age1 <= 2 ~ "15-24",
      age1 == 3 | age1 == 4 ~ "25-34",
      age1 == 5 | age1 == 6 ~ "35+"
    )
  ))

# Check it worked ok
mwanza %$% table(ed,ed2)
mwanza %$% table(age1,age2)
```

Also, make "case" a factor.


```{r}
# Make "case" a factor and label it
mwanza %<>%
  mutate(case = factor(case,
    levels = c(0, 1),
    labels = c("Control", "HIV_case")
  ))
summary(mwanza$case)
```


## C) Data analysis

2. Obtain the crude OR for education as a risk factor for HIV. The commands are from {epiDisplay}: tabpct() and cc().
- Note that cc() also calculates the Fisher's exact test automatically, unlike Stata.
```{r}
# 2x2 table with row percentages


# 2x2 table with crude OR


```

Possible solution:
```{r}
# 2x2 table with row percentages
mwanza %$% tabpct(case, ed2, percent= "row", graph = F)

# 2x2 table with crude OR
mwanza %$% cc(case, ed2, graph = F)
```

3. Assess whether age is a confounder or an effect modifier in the association between education and HIV.
- Obtain tables of HIV/education stratified by age
- Estimate ORs of HIV/education by different age groups (epiDisplay::mhor() - you need to specify the package because {pubh} also has a function called mhor)
- What is the Mantel-Haenszel summary estimate of the OR?
- What is the interpretation of the test for interaction?
```{r}
# Obtain tables of HIV/education stratified by age


# Estimate ORs of HIV/education by different age groups


```

Possible solution:
```{r}
# Obtain tables of HIV/education stratified by age
print("15-24 years stratum")
mwanza %>%
  filter(age2 == "15-24") %$%
  tabpct(case, ed2, percent= "row", graph = F)

print("25-34 years stratum")
mwanza %>%
  filter(age2 == "25-34") %$%
  tabpct(case, ed2, percent= "row", graph = F)

print("35+ years stratum")
mwanza %>%
  filter(age2 == "35+") %$%
  tabpct(case, ed2, percent= "row", graph = F)

# Estimate ORs of HIV/education by different age groups
mwanza %$% epiDisplay::mhor(case, ed2, age2, graph = F)

```

4. Assess whether religion is a confounder or an effect modifier between education and HIV infection.
- The "rel" variable is coded as such: 1 = Muslim, 2 = Catholic, 3 = Protestant, 4 = Other, 9 = missing value.

```{r}
# Recode missing values (9) as NA


# Exploratory tabulation


# Obtain tables of HIV/education stratified by religion


# Estimate ORs of HIV/education by different religions


```

Possible solution:
```{r}
# Recode missing values (9) as NA
summary(mwanza$rel)

mwanza %<>% 
  mutate(rel = na_if(rel, 9))

summary(mwanza$rel)

# Exploratory tabulation
mwanza %$% tabpct(case, rel, percent= "row", graph = F)
mwanza %$% tabpct(ed2, rel, percent= "row", graph = F)

# Obtain tables of HIV/education stratified by religion
print("Muslim")
mwanza %>%
  filter(rel == "1") %$%
  tabpct(case, ed2, percent= "row", graph = F)

print("Catholic")
mwanza %>%
  filter(rel == "2") %$%
  tabpct(case, ed2, percent= "row", graph = F)

print("Protestant")
mwanza %>%
  filter(rel == "3") %$%
  tabpct(case, ed2, percent= "row", graph = F)

print("Other")
mwanza %>%
  filter(rel == "4") %$%
  tabpct(case, ed2, percent= "row", graph = F)

# Estimate ORs of HIV/education by different religions
mwanza %$% epiDisplay::mhor(case, ed2, rel, graph = F)
```

5. Dealing with missing values for a potential confounder (npa)
The variable npa contains information on the number of sexual partners. It's coded: 1 (0-1), 2 (2-4), 3 (5-9), 4 (10-19), 9 (missing value).
- Tell R which values are missing. Observations with missing values will automatically be excluded from this analysis.
```{r}

```

Possible solution:
```{r}
mwanza %$% table(npa)

# Replace rel "9" with "NA"
mwanza$npa <- na_if(mwanza$npa, 9)

# 2x2 table with crude OR
mwanza %$% cc(case, ed2, graph = F)

# Estimate ORs of HIV/education by different age groups
mwanza %$% epiDisplay::mhor(case, ed2, npa, graph = F)
```


6. Exploring a dose-response relationship
- Create a new variable, npa2, with values of 0, 3, 7, and 15 instead with the original values. These correspond to an average of partners in each group.
```{r}

```

Possible solution:
```{r}
# Duplicate the npa variable and change its values

mwanza %<>% 
  mutate(npa2 = npa) %>%
  mutate(npa2 = recode(
    npa2,
    `1` = 0,
    `2` = 3,
    `3` = 7,
    `4` = 15
  ))
mwanza %$% table(npa, npa2)
```

- Perform a chi-squared test for trend of odds for the exposure npa2 and outcome.
```{r}
# Odds ratio for each partner-number group compared to those with 0/1 partner.
odds_trend(case ~ npa2, data = mwanza)

# Test for trend - {statix}
case_by_npa2 <- mwanza %$% table(case, npa2)
prop_trend_test(case_by_npa2)
```

- Perform a test for departure from trend for npa2.
```{r}
# Calculate difference between chi2 and trend test
departure_chi <-
  chisq_test(case_by_npa2)$statistic - prop_trend_test(case_by_npa2)$statistic

# Test for departure from linear trend
pchisq(departure_chi, 2)
```