## Abstract

**Background:** In epidemiologic investigations of disease outbreaks, multivariable regression techniques with adjustment for confounding can be applied to assess the association between exposure and outcome. Traditionally, logistic regression has been used in analyses of case-control studies to determine the odds ratio (OR) as the effect measure. For rare outcomes (incidence of 5% to 10%), an adjusted OR can be used to approximate the risk ratio (RR). However, concern has been raised about using logistic regression to estimate RR because how closely the calculated OR approximates the RR depends largely on the outcome rate. The literature shows that when the incidence of outcomes exceeds 10%, ORs greatly overestimate RRs. Consequently, in addition to logistic regression, other regression methods to accurately estimate adjusted RRs have been explored. One method of interest is Poisson regression with robust standard errors. This generalized linear model estimates the RR directly, whereas logistic regression estimates the OR. The purpose of this study was to empirically compare risk estimates obtained from logistic regression and Poisson regression with robust standard errors in terms of effect size and determination of the most likely source in the analysis of a series of simulated single-source disease outbreak scenarios.

**Methods:** We created a prototype dataset to simulate a foodborne outbreak following a public event with 14 food exposures and a 52.0% overall attack rate. Regression methods, including binary logistic regression and Poisson regression with robust standard errors, were applied to analyze the dataset. To further examine how these two models led to different conclusions of the potential outbreak source, a series of 5 additional scenarios with decreasing attack rates were simulated and analyzed using both regression models.

**Results:** For each of the explanatory variables—sex, age, and food types—in both univariable and multivariable models, the ORs obtained from logistic regression were estimated further from 1.0 than their corresponding RRs estimated by Poisson regression with robust standard errors. In the simulated scenarios, the Poisson regression models demonstrated greater consistency in the identification of one food type as the most likely outbreak source.

**Conclusion:** Poisson regression with robust standard errors proved to be a decisive and consistent method to estimate risk associated with a single source in an outbreak when the cohort data collection design was used.

*Keywords:*

## INTRODUCTION

The primary objective of an outbreak investigation is to identify the source to (1) control the epidemic and (2) prevent future occurrences. To control the epidemic, efficiency is typically prioritized in an outbreak investigation to identify the potential origins in a timely fashion. However, to identify the source for the purpose of preventing future occurrences, a more precise analytical approach is required. Risk estimation is one method used to identify the source.^{1} By evaluating different exposures and comparing the risk of developing disease attributed to each exposure, the most likely causative agent can be identified.

After an outbreak has been detected, a typical way to proceed with the investigation is to identify the cases (those with the disease outcome) and the non-cases (those who have not yet developed disease). Past exposures are then ascertained in the same way for both groups. In this practical approach, the case-control design is the applicable design for epidemiologic data collection.^{1} Alternatively, the data collection design can be conceptualized as a retrospective cohort study. In this design, all individuals at risk of developing the disease could be conceptualized as an inception cohort—a group of individuals who gathered at an event where they were potentially exposed to putative risk factors and could be followed to identify whether they developed the disease at the time of the outbreak investigation.^{2} After the data have been compiled through either of these approaches, the association between exposures and outcome can be determined.

In epidemiologic investigations, binomial or dichotomous outcome variables, such as the occurrence and nonoccurrence of disease, are common. For example, in an investigation of a *Staphylococcus aureus* food poisoning outbreak that occurred in Oswego County, NY, the two states for the dichotomous variable (disease) were ill and not-ill.^{3} For the Oswego outbreak, the classic analytic approach of calculating attack rate was used. The food-specific attack rate was calculated by dividing the number of people who ate a specific food and became ill by the total number of people who ate that food. However, attack rate only provides the risk of getting the disease solely among those exposed to a specific factor. The major limitation of attack rate is that it does not allow hypothesis testing of the association between each food with the disease.
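The food-specific attack rate described above is a simple proportion. The following minimal sketch shows the calculation; the food names and counts are hypothetical values chosen for illustration and are not taken from the Oswego investigation or from this study's tables.

```python
def attack_rate(ill_exposed: int, total_exposed: int) -> float:
    """Food-specific attack rate: ill people who ate the food / all who ate it."""
    if total_exposed <= 0:
        raise ValueError("total_exposed must be positive")
    if not 0 <= ill_exposed <= total_exposed:
        raise ValueError("ill_exposed must be between 0 and total_exposed")
    return ill_exposed / total_exposed


# Hypothetical counts (ill among eaters, total eaters); illustration only.
hypothetical_counts = {"food A": (10, 15), "food B": (11, 20)}

for food, (ill, ate) in hypothetical_counts.items():
    print(f"{food}: attack rate = {100 * attack_rate(ill, ate):.1f}%")
```

The calculation describes risk among those exposed to one food at a time; it provides no test of association and no adjustment for other foods eaten by the same people.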

In the contemporary approach using cohort or case-control data collection designs, multivariable regression techniques with adjustment for confounding can be applied to assess the association between exposure and outcome.^{4-7} Multivariable logistic regression techniques are the most commonly used, especially for binomial outcome variables.^{8-10} Traditionally, logistic regression is used in analyses of case-control studies to determine the odds ratio (OR) as the effect measure.^{4,11} Yet logistic regression has also been applied to the dichotomous outcomes in cohort studies and randomized controlled trials (RCTs).^{12} In cohort studies and RCTs, logistic regression can serve as a valuable tool to estimate risk and assess the association between exposure and outcome in certain situations. For rare outcomes (incidence of 5% to 10%), an adjusted OR can reasonably be used to approximate the risk ratio (RR).^{10-12} Therefore, some epidemiologists have advocated for the use of ORs in cohort studies.^{13,14} Nonetheless, concern has been raised about using logistic regression to estimate RR because how closely the calculated OR approximates the RR depends largely on the outcome rate.^{15,16} The literature shows that when the incidence of outcomes exceeds 10%, ORs greatly overestimate RRs.^{8,9} Consequently, in addition to logistic regression, other regression methods to accurately estimate adjusted RRs in cohort studies and RCTs have been explored.^{8,17} Two methods of interest are Poisson regression with robust standard errors and log-binomial regression. These generalized linear models estimate the RR directly, whereas logistic regression estimates the OR.^{8}
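The dependence of the OR on the outcome rate can be seen with a short calculation. The sketch below holds the true RR fixed at 2 and varies the risk in the unexposed group; the baseline risks are hypothetical values chosen for illustration, and the function names are ours.

```python
def odds_ratio(p_exposed: float, p_unexposed: float) -> float:
    """Odds ratio computed from the risks in the exposed and unexposed groups."""
    odds_exposed = p_exposed / (1 - p_exposed)
    odds_unexposed = p_unexposed / (1 - p_unexposed)
    return odds_exposed / odds_unexposed


def or_for_fixed_rr(rr: float, baseline_risk: float) -> float:
    """OR implied by a given RR and a given risk in the unexposed group."""
    p_exposed = rr * baseline_risk
    if not 0 < baseline_risk < 1 or not 0 < p_exposed < 1:
        raise ValueError("risks must lie strictly between 0 and 1")
    return odds_ratio(p_exposed, baseline_risk)


# Hypothetical baseline risks, from a rare outcome to a common one.
for baseline in (0.01, 0.05, 0.10, 0.30, 0.45):
    print(f"baseline risk {baseline:.2f}: RR = 2.00, OR = {or_for_fixed_rr(2.0, baseline):.2f}")
```

With a baseline risk of 1%, the OR stays close to the RR of 2; as the baseline risk rises, the OR moves progressively further from 1.0 even though the RR is unchanged.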

Even though the overestimation of RR by OR in a cohort study has been illustrated,^{8} the extent of the difference between the OR and RR in an outbreak investigation with high-incidence dichotomous outcomes has not been examined. In this current study, estimates of gastroenteritis risk attributed to consumption of certain foods were obtained from analyzing simulated outbreak data using logistic and Poisson regression. The purpose of these analyses was to compare the ORs and RRs and to examine how these two models led to different conclusions regarding the most likely source of the outbreak. In addition, we examined these differences in a set of simulated scenarios with varying attack rates to assess whether there was a situation in which a certain statistical technique was more applicable.

## METHODS

### Dataset

We created a dataset to simulate a hypothetical foodborne disease outbreak following a public event with 75 people in attendance. This scenario included 14 types of food (foods 1 to 14). The dichotomous outcome of interest was gastroenteritis that occurred after ingestion of contaminated food. To model an outbreak (common outcome), the simulated data had an overall attack rate (incidence of disease) of 52.0%. Food 12 was designated as the most likely source of this single-source outbreak.

Theoretically, this situation could be conceptualized either as a retrospective case-control study or as a retrospective cohort study. If the data could be practically collected by a case-control approach, the cases would consist of people who developed gastroenteritis and sought medical treatment, and the controls would be those who did not develop the outcome. These two groups of people would be traced backwards to identify the types of food they ingested. In this case, the OR could be estimated to approximate the RR of the outcome attributed to food ingestion. In contrast, if the situation were considered a retrospective cohort study, the cohort would be people who were at risk of the outcome, and food ingestion at the event would be the exposure that preceded the gastroenteritis occurrence. In this case, RR could be directly estimated.

In our investigation, the aim was to identify the food that most likely contributed to the gastroenteritis outbreak to prevent future occurrences. We used an analytic approach that measured the independent effect of each explanatory variable, controlling for the confounding effect of other factors.

To further investigate the differences in risk estimates when attack rates were altered from the initial dataset, we generated 5 additional datasets with a maintained total of 75 participants. We gradually decreased the number of ill individuals by one for each subsequent scenario by altering their status from ill to not-ill. In an effort to reduce the attack rate for the most likely food source (food 12) and maintain the attack rate for the second most likely source (food 4), we selected ill individuals who initially ate food 12 without eating food 4. Thus, the effect of food 12 on the outcome of gastroenteritis was decreased.
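The overall attack rates of the scenarios follow directly from this procedure. The sketch below assumes that scenario 1 is the prototype dataset and that 52.0% of 75 participants corresponds to 39 ill individuals; it reproduces only the overall rates, not the selection of which individuals were reclassified.

```python
TOTAL_PARTICIPANTS = 75
PROTOTYPE_ATTACK_RATE = 0.52


def scenario_attack_rates(n_scenarios: int = 6) -> list[float]:
    """Overall attack rate per scenario, removing one ill person each step."""
    ill_in_prototype = round(PROTOTYPE_ATTACK_RATE * TOTAL_PARTICIPANTS)  # 39 ill
    rates = []
    for scenario in range(1, n_scenarios + 1):
        ill = ill_in_prototype - (scenario - 1)
        rates.append(ill / TOTAL_PARTICIPANTS)
    return rates


for number, rate in enumerate(scenario_attack_rates(), start=1):
    print(f"scenario {number}: overall attack rate = {100 * rate:.1f}%")
```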

Because these hypothetical datasets did not involve human subjects, no human subject research ethical considerations were applicable, and institutional review board approval was not required.

### Statistical Analysis

Statistical analysis was performed using Stata software, v.15.0 (StataCorp, LLC). Descriptive statistics were used to describe general characteristics of the hypothetical subjects. The food-specific attack rates were calculated and reported as percentages. Univariable logistic regression analysis was used to estimate crude ORs and 95% confidence intervals (CIs). Crude RRs and their CIs were estimated using univariable Poisson regression with robust standard errors. Adjusted ORs and RRs were obtained through multivariable logistic regression and multivariable Poisson regression with robust standard errors.^{18} The Poisson regression is a generalized linear model with a log link function and a Poisson distribution.^{19} Because a binary outcome does not follow a Poisson distribution, the robust standard error is estimated using the sandwich estimation method to account for this misspecified distributional assumption.^{8} Using this approach, Poisson regression can be applied to estimate the risk in prospective studies with binary outcomes.^{18} To examine whether different analytical methods led to different conclusions regarding the food type that most likely contributed to the disease outbreak, attack rate was compared to its corresponding OR and RR that were estimated in the multivariable models. OR, RR, and their corresponding CIs for each explanatory variable were also compared to evaluate the difference in risk estimation. In addition, the feasibility of applying log-binomial regression (a generalized linear model with a log link function and a binomial distribution that also allows direct estimation of RR) to analyze these data was explored.^{8}
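For a single binary exposure, both univariable models have closed-form solutions, which makes the contrast between them easy to illustrate without statistical software. The sketch below is not the Stata analysis used in this study. It assumes a 2×2 table with hypothetical counts, uses the uncorrected sandwich variance (software may apply a small-sample correction, so reported intervals can differ slightly), and uses function names of our own choosing.

```python
import math


def rr_with_robust_ci(a: int, n1: int, c: int, n0: int, z: float = 1.96):
    """RR and 95% CI for one binary exposure.

    a = ill among n1 exposed, c = ill among n0 unexposed.
    In this one-covariate case, Poisson regression with a log link gives
    exp(beta) = (a/n1)/(c/n0), and the sandwich variance of log(RR)
    reduces to 1/a - 1/n1 + 1/c - 1/n0.
    """
    rr = (a / n1) / (c / n0)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)


def or_with_wald_ci(a: int, n1: int, c: int, n0: int, z: float = 1.96):
    """OR and 95% Wald CI, as univariable logistic regression would give."""
    b, d = n1 - a, n0 - c  # not ill among exposed and unexposed
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, odds_ratio * math.exp(-z * se), odds_ratio * math.exp(z * se)


# Hypothetical 2x2 table: 28 of 40 exposed ill, 11 of 35 unexposed ill.
rr, rr_lo, rr_hi = rr_with_robust_ci(28, 40, 11, 35)
or_, or_lo, or_hi = or_with_wald_ci(28, 40, 11, 35)
print(f"RR = {rr:.2f} (95% CI {rr_lo:.2f}, {rr_hi:.2f})")
print(f"OR = {or_:.2f} (95% CI {or_lo:.2f}, {or_hi:.2f})")
```

In the multivariable setting, no closed form exists and the models must be fitted iteratively, but the same pattern applies: the log link targets the RR, and the sandwich estimator corrects the standard error for the binary outcome.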

## RESULTS

In these hypothetical data, a higher sex-specific attack rate was found among females (56.8%). The age-specific attack rate was highest in the elderly age group (56.2%). Four types of food had food-specific attack rates >60%. Food 14 had the highest food-specific attack rate (66.7%) (Table 1).

Crude ORs were estimated further from 1.0 than the crude RR estimates on both sides of the scale—above and below 1.0. All the CIs estimated by the univariable Poisson regression with robust standard errors were narrower than those estimated by the univariable logistic regression (Table 2).

Multivariable logistic regression revealed two food types with adjusted ORs ≥2 and statistically significant *P* values (foods 4 and 12). Multivariable Poisson regression with robust standard errors, in contrast, specifically identified a single food type with an adjusted RR ≥2 and statistically significant *P* value (food 12) (Table 3).

When the overall attack rate of 52.0% in the prototype scenario was reduced to 50.7% and 49.3%, the multivariable logistic regression model still identified two food types—foods 4 and 12—with meaningful ORs and statistical significance (Table 4). However, in these same two scenarios, the multivariable Poisson regression consistently identified a single food type (food 12) with an RR ≥2 and statistical significance. In scenarios 4 and 5, when the overall attack rate was further reduced to 48.0% and 46.7%, respectively, the logistic regression models provided meaningful ORs for both foods 4 and 12; however, only food 4 maintained statistical significance. The Poisson regression model in scenario 4 determined one food type (food 12). Yet in scenario 5, food 12 was no longer statistically significant. Scenario 6 (reduction of the attack rate to 45.3%) diverged from scenarios 1 to 5 because the Poisson regression model suggested multiple sources of the outbreak. Thus, scenario 6 was outside the scope of our investigation into single-source outbreaks.

## DISCUSSION

In classical analyses, food-specific attack rates have been used as epidemiologic evidence to show the probability of foodborne infection following consumption of a certain food.^{1} Nonetheless, analysis of food-specific attack rates in this scenario did not lead to a definitive conclusion regarding the source of the outbreak because four possible food types (foods 4, 5, 12, and 14) had remarkably high attack rates, and the confounding problem still existed (Table 1).

Univariable logistic regression and univariable Poisson regression with robust standard errors produced crude estimates of ORs and RRs (Table 2). The univariable logistic regression model revealed 2 food types (foods 4 and 12) with meaningful ORs (OR ≥2), and food 12 had statistically significant results (*P*<0.05). The univariable Poisson regression model revealed only 1 food type (food 12) with a meaningful RR (RR ≥2) and statistical significance. These crude associations between individual food types and the outcome of gastroenteritis seemed suggestive of causal relationships, but they were not conclusive. The univariable models did not account for the interplay among multiple exposures or individuals having eaten more than one type of food. If a large proportion of ill individuals who ate one innocuous food type also ate the contaminated food type, the crude ORs and RRs would theoretically suggest causal relationships between both the innocuous and contaminated food exposures and the disease outcome. The innocuous food would seem to have had an effect on the outbreak. Although the analysis would actually be capturing the effect of the contaminated food, we would wrongly conclude that the innocuous food also contributed to the outbreak. This problem is known as confounding.

To account for the confounding problem, multivariable logistic regression was initially used to statistically adjust for confounding, estimating the risk of disease attributed to a certain food type independent of the effect from other factors. In this scenario, the multivariable logistic regression model still revealed four food types (foods 4, 11, 12, and 14) with meaningful ORs (OR ≥2), two of which had statistically significant results (*P*<0.05) (Table 3). Therefore, the analysis of food-specific attack rates (Table 1) and multivariable logistic regression analysis (Table 3) similarly pointed to four potential food types that likely contributed to the outbreak. Although three of the four food types (foods 4, 12, and 14) had attack rates ≥60% and ORs ≥2, conclusions regarding which of the four food types most likely contributed to the disease outbreak would differ according to analytic method. Based on the analysis of attack rates, food 5 (with an attack rate of 60.9%) would be considered in addition to foods 4, 12, and 14 (Table 1). However, food 5 failed to produce a meaningful OR in the multivariable logistic regression model. Conversely, food 11, which had an attack rate of 55% but an OR of 2, would be considered a likely source of the outbreak based on the multivariable logistic regression model.

The food-specific attack rates—an epidemiologic measure of disease frequency—indicated that food 14 was the most likely source of the outbreak (Table 1). In contrast, based on the adjusted ORs estimated by multivariable logistic regression—an epidemiologic measure of association—food 12 was identified as the most likely source of the outbreak (Table 3). Based on the use of the epidemiologic measure of association and the deconfounding principle, the adjusted OR provided more reliable epidemiologic evidence than the attack rate for identifying the likely source of outbreak in this situation.

In contrast to the multivariable logistic regression model that revealed two possible food types that potentially contributed to the outbreak, the multivariable Poisson regression with robust standard errors specifically identified food 12 as the single and most likely food type responsible for the outbreak (adjusted RR=3.09, 95% CI=1.23, 7.80). The other food types failed to obtain meaningful RRs and statistical significance.

For each of the explanatory variables in both the univariable and multivariable models, the OR was estimated further away from 1.0 than its corresponding RR. This finding from empirical analysis indicates the overestimation of RR by OR in the analysis of a foodborne outbreak conceptualized as a retrospective cohort study with a common outcome, consistent with the theory proposed in several research methodology articles.^{4,5,9,10}

Although a meaningful measure of effect could be obtained from using logistic regression in a cohort study,^{8,11} seeing an effect in OR without an effect in RR and drawing a different conclusion would also be possible.^{12} This phenomenon can be illustrated by comparing the statistically significant adjusted OR of 5.70 to the statistically nonsignificant adjusted RR of 1.74 for food 4 (Table 3). The adjusted OR leads to the conclusion that food 4 is a strong candidate for contributing to the outbreak. However, the markedly smaller adjusted RR without statistical significance does not support that conclusion. Thus, a regression technique that allows direct estimation of adjusted RR should be considered first when analyzing a cohort study with a relatively common outcome rather than a regression technique that estimates adjusted OR. In addition, the considerably narrower CIs obtained from the Poisson model also improve precision in the parameter estimation of effect. One limitation of the Poisson regression model is the inability to directly estimate probabilities. In this scenario, the estimated means from the Poisson regression model were used as surrogates for probabilities. As a result, obtaining individual predicted probabilities beyond the bounds of 0 and 1.0 was possible. These unrealistic predicted probabilities >1.0 could be problematic when the research objective is to obtain individual predicted probabilities of disease in predictive research—diagnostic and prognostic research.^{4} However, in an etiologic study with the focus on estimating a valid RR, the probabilities >1.0 would not pose a problem.^{4}
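The possibility of predicted values above 1.0 follows from the log link, under which effects multiply on the risk scale. The sketch below uses hypothetical coefficients (a baseline risk of 0.40 and two exposures with an RR of 1.8 each) that are not estimates from this study.

```python
import math


def predicted_mean_log_link(intercept: float, coefs: list[float], exposures: list[int]) -> float:
    """Predicted mean from a log-link model: exp(intercept + sum(coef * x))."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefs, exposures))
    return math.exp(linear_predictor)


# Hypothetical model: baseline risk 0.40, two foods each with RR = 1.8.
intercept = math.log(0.40)
coefs = [math.log(1.8), math.log(1.8)]

for exposures in ([0, 0], [1, 0], [1, 1]):
    mean = predicted_mean_log_link(intercept, coefs, exposures)
    print(f"exposures {exposures}: predicted mean = {mean:.3f}")
```

For a person exposed to both hypothetical foods, the predicted mean is 0.40 × 1.8 × 1.8, which exceeds 1.0 and therefore cannot be read as a probability, although the RRs themselves remain interpretable.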

We simulated additional scenarios to assess the change in risk estimates when attack rates were reduced (Table 4). In scenarios 1 to 4, the Poisson regression models consistently indicated a single food type (food 12), leading to the decisive conclusion that food 12 was the sole source. In contrast, when the attack rates were altered, the risk estimates produced by the logistic regression models were highly variable. In scenarios 1 to 3, a decisive conclusion about the primary source for the outbreak could not be drawn from the logistic regression models because they yielded more than one potential food type, a finding that would prompt additional investigations into the alternative sources. Furthermore, when the logistic regression model indicated one food type in scenario 4, this finding contradicted the results of the Poisson regression model that directly estimated RR. In scenarios 4 and 5, the logistic models indicated food 4 as the most likely single source, while the Poisson regression models indicated food 12 as the primary source based on the meaningful RR; however, in scenario 5, none of the RRs for the food types remained statistically significant. As a result, the Poisson regression model for this scenario could no longer lead to a decisive conclusion about the single food type.

In multiple simulated scenarios, the Poisson regression model led to a more decisive conclusion about the single source of the outbreak. When the attack rates were altered (Table 4), the Poisson regression model consistently indicated food 12 as the most likely source. In terms of generalizability, Poisson regression with robust standard errors should be the statistical method of choice when incidence of disease can be obtained from cohort data collection design^{8,11}; however, this design requires relatively complete data collection that would be burdensome in large-scale outbreaks. For such outbreaks, the case-control data collection design is commonly used. For this data collection approach, the investigation usually starts by encountering a cluster of ill individuals who seek medical care. Then investigators identify a control group of non-cases to estimate the risk. OR is then calculated by logistic regression to estimate risk. In the scenarios presented in this study, the ORs led to inconsistent conclusions regarding the primary food type responsible for the single-source outbreak. Thus, in situations where the outcome is common (>10%), the OR overstates the effect size and potentially leads to misleading conclusions as shown in Table 4 and in the literature.^{8,9,16,20} Sheldrick and colleagues^{16} have shown how OR is mathematically related to RR in the following equation, where $p_A$ and $p_B$ represent the probabilities of events A and B, respectively:

$$\mathrm{RR} = \frac{p_A}{p_B} = \mathrm{OR} \times \frac{1 - p_A}{1 - p_B}$$

In situations where $p_A$ and $p_B$ are close to zero (probability of a rare event), the OR closely estimates the RR. However, in the case of a common outcome when $p_A$ and/or $p_B$ are considerably greater than zero, $(1 - p_A)/(1 - p_B)$ no longer approximates 1.0, and the OR noticeably overestimates the RR.

In this study, we also assessed the feasibility of applying log-binomial regression by applying this regression technique to estimate the adjusted RRs in the multivariable model. However, the multivariable model did not converge to yield any estimates. The literature suggests that this convergence problem could be encountered when the incidence of outcome is high.^{4,8}

## CONCLUSION

This study illustrates that Poisson regression with robust standard errors is a decisive and consistent method to estimate risk associated with a single source in an outbreak when the cohort data collection design is used. However, in outbreak investigations that use the case-control data collection design, ORs obtained from logistic regression could overestimate the risk and potentially influence the conclusion regarding the source. Consequently, maintaining awareness of this overestimation when interpreting ORs is important.

*This article meets the Accreditation Council for Graduate Medical Education and the American Board of Medical Specialties Maintenance of Certification competencies for Patient Care and Medical Knowledge.*

## ACKNOWLEDGMENTS

*The authors have no financial or proprietary interest in the subject matter of this article.*

- © Academic Division of Ochsner Clinic Foundation