Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data

William Dupont’s Statistical Modeling for Biomedical Researchers, Second Edition is ideal for a one-semester graduate course in biostatistics and epidemiology. Dupont assumes only a basic knowledge of statistics, such as that obtained from a standard introductory statistics course. Stata is used extensively throughout the text, making it possible to introduce computationally complex methods with little or no higher-level mathematics. As a result, Dupont focuses on concepts and model assumptions, rather than on the underlying mathematics. The text covers linear regression, logistic regression, Poisson regression, survival analysis, and analysis of variance. Two chapters are devoted to each topic: an introductory chapter that uses simple data to develop the concept and a more advanced chapter devoted to explaining more complex models, case studies, diagnostic measures, etc.

Dupont pays equal attention to the methods and to using Stata to apply them. When Stata output is displayed, the most important elements of the output are highlighted and explained in notes that follow the output. These notes help the reader make sense of the output by providing the appropriate focus for the problem at hand. The notes also include instructions for reproducing the analysis via Stata’s point-and-click user interface. The text, replete with examples featuring real medical data, uses Stata graphics extensively, providing ample explanation and detail for reproduction.


Algebraic notation
Descriptive statistics

Dot plot
Sample mean
Sample variance
Sample standard deviation
Percentile and median
Box plot
Scatter plot

The Stata Statistical Software Package

Downloading data from my website
Creating histograms with Stata
Stata command syntax
Obtaining interactive help from Stata
Stata log files
Stata graphics and schemes
Stata do files
Stata pulldown menus
Displaying other descriptive statistics with Stata

Inferential statistics

Probability density function
Mean, variance, and standard deviation
Normal distribution
Expected value
Standard error
Null hypothesis, alternative hypothesis, and P-value
95% confidence interval
Statistical power
The z and Student’s t distributions
Paired t test
Performing paired t tests with Stata
Independent t test using a pooled standard error estimate
Independent t test using separate standard error estimates
Independent t tests using Stata
The chi-squared distribution

Overview of methods discussed in this text

Models with one response per patient
Models with multiple responses per patient

Additional reading



Sample covariance
Sample correlation coefficient
Population covariance and correlation coefficient
Conditional expectation
Simple linear regression model
Fitting the linear regression model
Historical trivia: origin of the term regression
Determining the accuracy of linear regression estimates
Ethylene glycol poisoning example
95% confidence interval for y[x] = α + βx evaluated at x
95% prediction interval for the response of a new patient
Simple linear regression with Stata
Lowess regression
Plotting a lowess regression curve in Stata
Residual analyses
Studentized residual analysis using Stata
Transforming the x and y variables

Stabilizing the variance
Correcting for non-linearity
Example: research funding and morbidity for 29 diseases

Analyzing transformed data with Stata
Testing the equality of regression slopes

Example: the Framingham Heart Study

Comparing slope estimates with Stata
Density-distribution sunflower plots
Creating density-distribution sunflower plots with Stata
Additional reading



The model
Confounding variables
Estimating the parameters for a multiple linear regression model
R² statistic for multiple regression models
Expected response in the multiple regression model
The accuracy of multiple regression parameter estimates
Hypothesis tests
95% confidence interval for βi
95% prediction intervals
Example: the Framingham Heart Study

Preliminary univariate analyses

Scatter plot matrix graphs

Producing scatter plot matrix graphs with Stata

Modeling interaction in multiple linear regression

The Framingham example

Multiple regression modeling of the Framingham data
Intuitive understanding of a multiple regression model

The Framingham example

Calculating 95% confidence and prediction intervals
Multiple linear regression with Stata
Automatic methods of model selection

Forward selection using Stata
Backward selection
Forward stepwise selection
Backward stepwise selection
Pros and cons of automated model selection

Residual analyses

Δβ̂ influence statistic
Cook’s distance
The Framingham example

Residual and influence analyses using Stata
Using multiple linear regression for non-linear models
Building non-linear models with restricted cubic splines

Choosing the knots for a restricted cubic spline model

The SUPPORT Study of hospitalized patients

Modeling length-of-stay and MAP using restricted cubic splines
Using Stata for non-linear models with restricted cubic splines

Additional reading



Example: APACHE score and mortality in patients with sepsis
Sigmoidal family of logistic regression curves
The log odds of death given a logistic probability function
The binomial distribution
Simple logistic regression model
Generalized linear model
Contrast between logistic and linear regression
Maximum likelihood estimation

Variance of maximum likelihood parameter estimates

Statistical tests and confidence intervals

Likelihood ratio tests
Quadratic approximations to the log likelihood ratio function
Score tests
Wald tests and confidence intervals
Which test should you use?

Sepsis example
Logistic regression with Stata
Odds ratios and the logistic regression model
95% confidence interval for the odds ratio associated with a unit increase in x

Calculating this odds ratio with Stata

Logistic regression with grouped response data
95% confidence interval for π[x]
Exact 100(1 − α)% confidence intervals for proportions
Example: the Ibuprofen in Sepsis Study
Logistic regression with grouped data using Stata
Simple 2 × 2 case–control studies

Example: the Ille-et-Vilaine study of esophageal cancer and alcohol
Review of classical case–control theory
95% confidence interval for the odds ratio: Woolf’s method
Test of the null hypothesis that the odds ratio equals one
Test of the null hypothesis that two proportions are equal

Logistic regression models for 2 × 2 contingency tables

Nuisance parameters
95% confidence interval for the odds ratio: logistic regression

Creating a Stata data file
Analyzing case–control data with Stata
Regressing disease against exposure
Additional reading



Mantel–Haenszel estimate of an age-adjusted odds ratio
Mantel–Haenszel χ² statistic for multiple 2 × 2 tables
95% confidence interval for the age-adjusted odds ratio
Breslow–Day–Tarone test for homogeneity
Calculating the Mantel–Haenszel odds ratio using Stata
Multiple logistic regression model

Likelihood ratio test of the influence of the covariates on the response variable

95% confidence interval for an adjusted odds ratio
Logistic regression for multiple 2 × 2 contingency tables
Analyzing multiple 2 × 2 tables with Stata
Handling categorical variables in Stata
Effect of dose of alcohol on esophageal cancer risk

Analyzing model (5.25) with Stata

Effect of dose of tobacco on esophageal cancer risk
Deriving odds ratios from multiple parameters
The standard error of a weighted sum of regression coefficients
Confidence intervals for weighted sums of coefficients
Hypothesis tests for weighted sums of coefficients
The estimated variance–covariance matrix
Multiplicative models of two risk factors
Multiplicative model of smoking, alcohol, and esophageal cancer
Fitting a multiplicative model with Stata
Model of two risk factors with interaction
Model of alcohol, tobacco, and esophageal cancer with interaction terms
Fitting a model with interaction using Stata
Model fitting: nested models and model deviance
Effect modifiers and confounding variables
Goodness-of-fit tests

The Pearson χ² goodness-of-fit statistic

Hosmer–Lemeshow goodness-of-fit test

An example: the Ille-et-Vilaine cancer data set

Residual and influence analysis

Standardized Pearson residual
Δβ̂j influence statistic
Residual plots of the Ille-et-Vilaine data on esophageal cancer

Using Stata for goodness-of-fit tests and residual analyses
Frequency matched case–control studies
Conditional logistic regression
Analyzing data with missing values

Imputing data that is missing at random
Cardiac output in the Ibuprofen in Sepsis Study
Modeling missing values with Stata

Logistic regression using restricted cubic splines

Odds ratios from restricted cubic spline models
95% confidence intervals for ψ̂[x]

Modeling hospital mortality in the SUPPORT Study
Using Stata for logistic regression with restricted cubic splines
Regression methods with a categorical response variable

Proportional odds logistic regression
Polytomous logistic regression

Additional reading



Survival and cumulative mortality functions
Right censored data
Kaplan–Meier survival curves
An example: genetic risk of recurrent intracerebral hemorrhage
95% confidence intervals for survival functions
Cumulative mortality function
Censoring and bias
Log-rank test
Using Stata to derive survival functions and the log-rank test
Log-rank test for multiple patient groups
Hazard functions
Proportional hazards
Relative risks and hazard ratios
Proportional hazards regression analysis
Hazard regression analysis of the intracerebral hemorrhage data
Proportional hazards regression analysis with Stata
Tied failure times
Additional reading



Proportional hazards model
Relative risks and hazard ratios
95% confidence intervals and hypothesis tests
Nested models and model deviance
An example: the Framingham Heart Study

Kaplan–Meier survival curves for DBP
Simple hazard regression model for CHD risk and DBP
Restricted cubic spline model of CHD risk and DBP
Categorical hazard regression model of CHD risk and DBP
Simple hazard regression model of CHD risk and gender
Multiplicative model of DBP and gender on risk of CHD
Using interaction terms to model the effects of gender and DBP on CHD
Adjusting for confounding variables
Alternative models

Proportional hazards regression analysis using Stata
Stratified proportional hazards models
Survival analysis with ragged study entry

Kaplan–Meier survival curve and the log-rank test with ragged entry
Age, sex, and CHD in the Framingham Heart Study
Proportional hazards regression analysis with ragged entry
Survival analysis with ragged entry using Stata

Predicted survival, log–log plots, and the proportional hazards assumption

Evaluating the proportional hazards assumption with Stata

Hazard regression models with time-dependent covariates

Testing the proportional hazards assumption
Modeling time-dependent covariates with Stata

Additional reading



Elementary statistics involving rates
Calculating relative risks from incidence data using Stata
The binomial and Poisson distributions
Simple Poisson regression for 2 × 2 tables
Poisson regression and the generalized linear model
Contrast between Poisson, logistic, and linear regression
Simple Poisson regression with Stata
Poisson regression and survival analysis

Recoding survival data on patients as patient–year data
Converting survival records to person–years of follow-up using Stata

Converting the Framingham survival data set to person–time data
Simple Poisson regression with multiple data records
Poisson regression with a classification variable
Applying simple Poisson regression to the Framingham data
Additional reading



Multiple Poisson regression model
An example: the Framingham Heart Study

A multiplicative model of gender, age, and coronary heart disease
A model of age, gender, and CHD with interaction terms
Adding confounding variables to the model

Using Stata to perform Poisson regression
Residual analyses for Poisson regression models

Deviance residuals

Residual analysis of Poisson regression models using Stata
Additional reading



One-way analysis of variance
Multiple comparisons
Reformulating analysis of variance as a linear regression model
Non-parametric methods
Kruskal–Wallis test
Example: a polymorphism in the estrogen receptor gene
User contributed software in Stata
One-way analyses of variance using Stata
Two-way analysis of variance, analysis of covariance, and other models
Additional reading



Example: effect of race and dose of isoproterenol on blood flow
Exploratory analysis of repeated measures data using Stata
Response feature analysis
Example: the isoproterenol data set
Response feature analysis using Stata
The area-under-the-curve response feature
Generalized estimating equations
Common correlation structures
GEE analysis and the Huber–White sandwich estimator
Example: analyzing the isoproterenol data with GEE
Using Stata to analyze the isoproterenol data set using GEE
GEE analyses with logistic or Poisson models
Additional reading



Models for continuous response variables with one response per patient
Models for dichotomous or categorical response variables with one response per patient
Models for survival data (follow-up time plus fate at exit observed on each patient)
Models for response variables that are event rates or the number of events during a specified number of patient–years of follow-up. The event must be rare
Models with multiple observations per patient or matched or clustered patients



Data manipulation and description
Analysis commands
Graph commands
Common options for graph commands (insert after comma)
Post-estimation commands (affected by preceding regression-type command)
Command prefixes
Command qualifiers (insert before comma)
Logical and relational operators and system variables (see Stata User’s Guide)
Functions (see Stata Data Management Manual)

Author: William D. Dupont
Edition: Second Edition
ISBN-13: 978-0-521-61480-1
Copyright: 2009