The Workflow of Data Analysis Using Stata

The Workflow of Data Analysis Using Stata, by J. Scott Long, is an essential productivity tool for data analysts. Aimed at anyone who analyzes data, this book presents an effective strategy for designing and doing data-analytic projects.

In this book, Long presents lessons gained from his experience with numerous academic publications, as a coauthor of the immensely popular Regression Models for Categorical Dependent Variables Using Stata, and as a coauthor of the SPOST routines, which are downloaded over 20,000 times a year.


A workflow of data analysis is a process for managing all aspects of data analysis. Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow.


Long shows how to design and implement efficient workflows for both one-person projects and team projects. Long guides you toward streamlining your workflow, because a good workflow is essential for replicating your work, and replication is essential for good science.


An efficient workflow reduces the time you spend doing data management and lets you produce datasets that are easier to analyze. When you methodically clean your data and carefully choose names and effective labels for your variables, the time you spend doing statistical and graphical analyses will be more productive and more enjoyable.


After introducing workflows and explaining how a better workflow can make it easier to work with data, Long describes planning, organizing, and documenting your work. He then introduces how to write and debug Stata do-files and how to use local and global macros. Long presents conventions that greatly simplify data analysis—conventions for naming, labeling, documenting, and verifying variables. He also covers cleaning, analyzing, and protecting your data.


While describing effective workflows, Long also introduces the concepts of basic data management using Stata and writing Stata do-files. Using real-world examples, Stata commands, and Stata scripts, Long illustrates effective techniques for managing your data and analyses. If you analyze data, this book is recommended for you.

List of tables
List of figures
List of examples
A word about fonts, files, commands, and examples



Replication: The guiding principle for workflow
Steps in the workflow

Cleaning data
Running analysis
Presenting results
Protecting files

Tasks within each step


Criteria for choosing a workflow


Changing your workflow
How the book is organized



The cycle of data analysis

Principles for organization
Organizing files and directories
Creating your directory structure

A directory structure for a small project
A directory structure for a large, one-person project
Directories for collaborative projects
Special-purpose directories
Remembering what directories contain
Planning your directory structure
Naming files
Batch files

Moving into a new directory structure (advanced topic)

Example of moving into a new directory structure


What should you document?
Levels of documentation
Suggestions for writing documentation

Evaluating your documentation

The research log

A sample page from a research log
A template for research logs


A codebook based on the survey instrument

Dataset documentation




Three ways to execute commands

The Command window
Dialog boxes

Writing effective do-files

Making do-files robust

Make do-files self-contained
Use version control
Exclude directory information
Include seeds for random numbers

Making do-files legible

Use lots of comments
Use alignment and indentation
Use short lines
Limit your abbreviations
Be consistent

Templates for do-files

Commands that belong in every do-file
A template for simple do-files
A more complex do-file template

Debugging do-files

Simple errors and how to fix them

Log file is open
Log file already exists
Incorrect command name
Incorrect variable name
Incorrect option
Missing comma before options

Steps for resolving errors

Step 1: Update Stata and user-written programs
Step 2: Start with a clean slate
Step 3: Try other data
Step 4: Assume everything could be wrong
Step 5: Run the program in steps
Step 6: Exclude parts of the do-file
Step 7: Starting over
Step 8: Sometimes it is not your mistake

Example 1: Debugging a subtle syntax error
Example 2: Debugging unanticipated results
Advanced methods for debugging

How to get help




Local and global macros

Local macros
Global macros
Using double quotes when defining macros
Creating long strings

Specifying groups of variables and nested models
Setting options with locals

Information returned by Stata commands

Using returned results with local macros

Loops: foreach and forvalues

The foreach command
The forvalues command

Ways to use loops

Loop example 1: Listing variable and value labels
Loop example 2: Creating interaction variables
Loop example 3: Fitting models with alternative measures of education
Loop example 4: Recoding multiple variables the same way
Loop example 5: Creating a macro that holds accumulated information
Loop example 6: Retrieving information returned by Stata

Counters in loops

Using loops to save results to a matrix

Nested loops
Debugging loops

The include command

Specifying the analysis sample with an include file
Recoding data using include files
Caution when using include files


A simple program to change directories
Loading and deleting ado-files
Listing variable names and labels
A general program to change your working directory
Words of caution

Help files

help me




Posting files
The dual workflow of data management and statistical analysis
Names, notes, and labels
Naming do-files

Naming do-files to re-create datasets
Naming do-files to reproduce statistical analysis
Using master do-files

Master log files

A template for naming do-files

Using subdirectories for complex analysis

Naming and internally documenting datasets

Never name it final!

One time only and temporary datasets
Datasets for larger projects
Labels and notes for datasets
The datasignature command

A workflow using the datasignature command
Changes datasignature does not detect

Naming variables

The fundamental principle for creating and naming variables
Systems for naming variables

Sequential naming systems
Source naming systems
Mnemonic naming systems

Planning names
Principles for selecting names

Anticipate looking for variables
Use simple, unambiguous names
Try names before you decide

Labeling variables

Listing variable labels and other information

Changing the order of variables in your dataset

Syntax for label variable
Principles for variable labels

Beware of truncation
Test labels before you post the file

Temporarily changing variable labels
Creating variable labels that include the variable name

Adding notes to variables

Commands for working with notes

Listing notes
Removing notes
Searching notes

Using macros and loops with notes

Value labels

Creating value labels is a two-step process

Step 1: Defining labels
Step 2: Assigning labels
Why a two-step system?
Removing labels

Principles for constructing value labels

Keep labels short
Include the category number
Avoid special characters
Keeping track of where labels are used

Cleaning value labels
Consistent value labels for missing values
Using loops when assigning value labels

Using multiple languages

Using label language for different written languages
Using label language for short and long labels

A workflow for names and labels

Step 1: Plan the changes
Step 2: Archive, clone, and rename
Step 3: Revise variable labels
Step 4: Revise value labels
Step 5: Verify the changes

Step 1: Check the source data

Step 1a: List the current names and labels
Step 1b: Try the current names and labels

Step 2: Create clones and rename variables

Step 2a: Create clones
Step 2b: Create rename commands
Step 2c: Rename variables

Step 3: Revise variable labels

Step 3a: Create variable-label commands
Step 3b: Revise variable labels

Step 4: Revise value labels

Step 4a: List the current labels
Step 4b: Create label define commands to edit
Step 4c: Revise labels and add them to dataset

Step 5: Check the new names and labels




Importing data

Data formats

ASCII data formats
Binary-data formats

Ways to import data

Stata commands to import data
Using other statistical packages to export data
Using a data conversion program

Verifying data conversion

Converting the ISSP 2002 data from Russia

Verifying variables

Values review

Values review of data about the scientific career
Values review of data on family values

Substantive review

What does time to degree measure?
Examining high-frequency values
Links among variables
Changes in survey questions

Missing-data review

Comparisons and missing values
Creating indicators of whether cases are missing
Using extended missing values
Verifying and expanding missing-data codes
Using include files

Internal consistency review

Consistency in data on the scientific career

Principles for fixing data inconsistencies

Creating variables for analysis

Principles for creating new variables

New variables get new names
Verify that new variables are correct
Document new variables
Keep the source variables

Core commands for creating variables

The generate command
The clonevar command
The replace command

Creating variables with missing values
Additional commands for creating variables

The recode command
The egen command
The tabulate, generate() command

Labeling variables created by Stata
Verifying that variables are correct

Checking the code
Listing variables
Plotting continuous variables
Tabulating variables
Constructing variables multiple ways

Saving datasets

Selecting observations

Deleting cases versus creating selection variables

Dropping variables

Selecting variables for the ISSP 2002 Russian data

Ordering variables
Internal documentation
Compressing variables
Running diagnostics

The codebook, problem command
Checking for unique ID variables

Adding a data signature
Saving the file
After a file is saved

Extended example of preparing data for analysis

Creating control variables
Creating binary indicators of positive attitudes
Creating four-category scales of positive attitudes

Merging files


Sorting the ID variable

One-to-one merging

Combining unrelated datasets

Forgetting to match-merge




Planning and organizing statistical analysis

Planning in the large
Planning in the middle
Planning in the small

Organizing do-files

Using master do-files
What belongs in your do-file?

Documentation for statistical analysis

The research log and comments in do-files
Documenting the provenance of results

Captions on graphs

Analyzing data using automation

Locals to define sets of variables
Loops for repeated analyses

Computing t tests using loops
Loops for alternative model specifications

Matrices to collect and print results

Collecting results of t tests
Saving results from nested regressions
Saving results from different transformations of articles

Creating a graph from a matrix
Include files to load data and select your sample

Baseline statistics

Lost or forgotten files
Software and version control
Unknown seed for random numbers

Bootstrap standard errors
Letting Stata set the seed
Training and confirmation samples

Using a global that is not in your do-file

Presenting results

Creating tables

Using spreadsheets
Regression tables with esttab

Creating graphs

Colors, black, and white
Font size

Tips for papers and presentations


A project checklist




Levels of protection and types of files
Causes of data loss and issues in recovering a file
Murphy’s law and rules for copying files
A workflow for file protection

Part 1: Mirroring active storage
Part 2: Offline backups

Archival preservation





How Stata works

Stata directories
The working directory

Working on a network
Customizing Stata

Fonts and window locations
Commands to change preferences

Options that can be set permanently
Options that need to be set each session

Function keys

Additional resources

Author: J. Scott Long
ISBN-13: 978-1-59718-047-4
Copyright: 2009
e-Book version available
