Stata Users Group Meeting - Florence 2022¶

mdata: Stata package to handle metadata¶

Gustavo Iglésias - Banco de Portugal Microdata Research Laboratory (BPLIM)

Motivation¶

  • At BPLIM we deal with large amounts of data, so it is crucial that data is stored efficiently;
  • Although there are formats far more efficient storage-wise than .dta, Stata provides very nice features to handle metadata
  • The goal is to separate data from metadata, and store the latter in an Excel file, so we can analyze, change, combine and apply it to a dataset

 

Tools¶

  • extract: extracts metadata from the dataset in memory to an Excel file

  • apply: applies metadata from an Excel file to the dataset in memory

  • check: checks for inconsistencies in the Excel metadata file

  • cmp: compares Excel metadata files

  • combine: combines Excel metadata files

  • morph: transforms Excel metadata files to eliminate redundant information

  • uniform: harmonizes information in Excel metadata files

  • clear: removes all metadata from the dataset in memory

 

User Guide

  • On this presentation, we are going to focus on the extraction and application of metadata. So most of the examples presented concern the commands mdata extract and mdata apply. These are, in our view, the most useful tools and should be the main focus of the presentation.

  • For a more in-depth look at the other tools, a user guide is available upon request.

Syntax¶

mdata subcommand [, options]

where subcommand is one of the tools presented in the previous slide.

mdata extract¶

  • mdata extract exports metadata from the dataset in memory to an Excel file, which is organized in sheets
  • Metadata exported to this file includes, but is not limited to:

    • Data labels, notes and characteristics

    • Label languages defined

    • Variables' labels, type and format

mdata extract¶

Lets take as an example the Stata data set nlsw88

In [3]:
%%stata
use "data/nlsw88", clear
describe
. use "data/nlsw88", clear
(NLSW, 1988 extract)

. describe

Contains data from data/nlsw88.dta
 Observations:         2,246                  NLSW, 1988 extract
    Variables:            17                  22 Apr 2022 16:41
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
married         byte    %8.0g      marlbl     Married
never_married   byte    %16.0g     nev_mar    Never married
grade           byte    %8.0g                 Current grade completed
collgrad        byte    %16.0g     gradlbl    College graduate
south           byte    %9.0g      southlbl   Lives in the south
smsa            byte    %9.0g      smsalbl    Lives in SMSA
c_city          byte    %16.0g     ccitylbl   Lives in a central city
industry        byte    %23.0g     indlbl     Industry
occupation      byte    %22.0g     occlbl     Occupation
union           byte    %8.0g      unionlbl   Union worker
wage            float   %9.0g                 Hourly wage
hours           byte    %8.0g                 Usual hours worked
ttl_exp         float   %9.0g                 Total work experience (years)
tenure          float   %9.0g                 Job tenure (years)
-------------------------------------------------------------------------------
Sorted by: idcode

. 

With the describe command, we get the most general features of the dataset in memory. With mdata extract we export this information and a lot more to an Excel file.

mdata extract¶

In [4]:
%%stata 
cap mkdir meta
mdata extract, meta("meta/meta1", replace) 
. cap mkdir meta

. mdata extract, meta("meta/meta1", replace)

File meta/meta1.xlsx saved

. 

Lets inspect the file we have just created

  • General data features

  • Specific data features (features that are not applied with mdata apply)

  • Variables information

  • Characteristics, notes, and value labels

mdata extract¶

  • If we make some changes to the data, those changes are going to be reflected on the metadata file

  • At BPLIM, we provide labels in Portuguese and English

  • Next I show what happens if we add a new language and define labels and value labels

In [5]:
%%stata
* Example with labels in Portuguese
label language pt, new
* Variable labels
label var age "Idade"
label var race "Raça"
* Value labels
label define marlbl_pt 0 "Solteiro" 1 "Casado"
label values married marlbl_pt
label language en
* Extract metadata
mdata extract, meta("meta/meta2", replace)
. * Example with labels in Portuguese
. label language pt, new
(language pt now current language)

. * Variable labels
. label var age "Idade"

. label var race "Raça"

. * Value labels
. label define marlbl_pt 0 "Solteiro" 1 "Casado"

. label values married marlbl_pt

. label language en

. * Extract meta data
. mdata extract, meta("meta/meta2", replace)

File meta/meta2.xlsx saved

. 

mdata extract¶

 

Advantages of using mdata extract:

  • All the metadata is stored in an Excel file, so users can easily inspect it

  • Metadata may be analysed (and changed) by non-Stata users

  • By separating data from metadata, it is possible to use more efficient formats

  • We can apply the stored metadata to new data (mdata apply)

mdata apply¶

  • mdata apply applies metadata stored in the Excel metadata file to data in memory
  • The command assumes that the Excel file has the structure of the file produced by mdata extract
  • The command only applies the metadata after checking its integrity (mdata check)
  • mdata apply is particularly useful when you have incoming (monthly, annual, etc.) data that is structurally similar
  • In this case, data providers may use metadata from previous extractions on the current batch of data
  • We illustrate this feature using a modified version of the nlswork data set for different years

mdata apply¶

In [9]:
%%stata
use data/nlsw85, clear
describe
mdata extract, meta("meta/meta85", replace)
. use data/nlsw85, clear
(NLSW - 1985 extraction)

. describe

Contains data from data/nlsw85.dta
 Observations:         2,085                  NLSW - 1985 extraction
    Variables:             7                  22 Apr 2022 18:26
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
year            byte    %8.0g                 Interview year
birth_yr        byte    %8.0g                 Birth year
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
msp             byte    %23.0g     msplbl     1 if married, spouse present
collgrad        byte    %16.0g     collgradlbl
                                              1 if college graduate
-------------------------------------------------------------------------------
Sorted by: idcode  year

. mdata extract, meta(meta/meta85, replace)

File meta/meta85.xlsx saved

. 

mdata apply¶

In [10]:
%%stata
use data/nlsw87, clear
describe
. use data/nlsw87, clear

. describe

Contains data from data/nlsw87.dta
 Observations:         2,164                  
    Variables:             8                  22 Apr 2022 18:29
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 
year            byte    %8.0g                 
birth_yr        byte    %8.0g                 
age             byte    %8.0g                 
race            byte    %8.0g                 
msp             byte    %8.0g                 
collgrad        byte    %8.0g                 
union           byte    %8.0g                 
-------------------------------------------------------------------------------
Sorted by: 

. 
  • We see that for the year 1987, we have no labels or value labels

  • This may be changed with mdata apply

  • New variable: union

mdata apply¶

In [11]:
%%stata
mdata apply, meta("meta/meta87") do("dos/apply87")
describe
. mdata apply, meta(meta/meta87) do(dos/apply87)
File dos/apply87.do saved

. describe

Contains data from data/nlsw87.dta
 Observations:         2,164                  
    Variables:             8                  22 Apr 2022 18:29
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
year            byte    %8.0g                 Interview year
birth_yr        byte    %8.0g                 Birth year
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
msp             byte    %23.0g     msplbl     1 if married, spouse present
collgrad        byte    %16.0g     collgradlbl
                                              1 if college graduate
union           byte    %8.0g      unionlbl   Union worker
-------------------------------------------------------------------------------
Sorted by: idcode  year
     Note: Dataset has changed since last saved.

. 
  • mdata extract and mdata apply are the core tools of the package mdata

  • The other tools extend the capabilities of mdata, but are a complement to extract and apply

  • Nonetheless, let's go over them quickly to see what functionality that they provide

  • For a more detailed explanation, we refer to the user guide

mdata check¶

  • mdata check verifies the integrity of metadata stored in the Excel metadata file
  • Assumes that the Excel file has the structure of the file produced by mdata extract
  • Search for possible problems in the metadata, dividing them into warnings and inconsistencies, the latter being the most problematic
  • Used by mdata apply, whose execution stops if any inconsistency is found
  • Produces a report with option checkfile

Inconsistencies include duplicated variables in the meta file, missing sheets, and duplicated labels or duplicated values in value labels. Warnings cover problems such as duplicated data features, missing variable labels, truncated variable labels, missing value labels if there is more than one language defined, as well as problems with value labels other than duplicated labels or values. Most of the value label problems were based on labelbook, used with option problems.

mdata cmp¶

  • mdata cmp compares metadata found in two Excel metadata files
  • Assumes that both Excel files have the structure of the file produced by mdata extract and that the files should be identical (with the exception of data features)
  • Differences are labeled as inconsistencies

    • Variables

    • Characteristics

    • Notes

    • Value labels

  • Produces a report if any inconsistency is found

mdata cmp is particularly useful if you have incoming data and meta data on a regular basis and that should have the same structure

Imagine a situation where we receive data every month, we would think that the structure of the data and its meta data should not change that much from month to month. We could check if this is in fact true by extract metadata from the two files and compare it using mdata cmp

mdata combine¶

  • mdata combine combines metadata found in two Excel metadata files, generating a new Excel metadata file
  • Assumes that both Excel files have the structure of the file produced by mdata extract
  • Combines metadata in sheets with the same name found in both files, eliminating duplicated information
  • Metadata that only appears in one file for namesake sheets are flagged in the generated Excel metadata file

Going back to the cenario where we receive data every month, we can extract the metadata for each file without combining the files themselves, which may be quite cumbersome when dealing with large data sets. After extracting the metadata, we may combine several meta data files, check and manipulate its content so that any inconsitency is eliminated. Then, we may apply this metadata to all the files, making it consistent across months.

mdata morph¶

  • mdata morph transforms the Excel metadata file by removing redundant information
  • Assumes that the Excel file has the structure produced by mdata extract
  • Acts only on value labels, merging sheets specified by the user into a new sheet
  • This could be important if the meta data file contains different value label sheets with potentially redundant information. Think about two value labels, educ_emp and educ_man, which stand for employee and manager's level of education, respectively. If the two value labels concern the general education level, one could eliminate these two value label sheets and create a new one - educ for example - that applies to both situations.

  • It is worth noting that once we generate a new value label sheet that is the result of merging other sheets, the sheet variables (see mdata extract) also changes, so as to reflect the changes made to the value label that applies to one or more variables.

mdata uniform¶

  • mdata uniform harmonizes metadata stored in the Excel metadata file
  • Assumes that the Excel file has the structure produced by mdata extract
  • Acts only on value labels, meaning that it only harmonizes sheets that start with vl_

Imagine a case where we get data in different months. Some values that are present in the most recent batch of data might be missing in the old one. So, it's possible that the encoding assigns identical numerical codes for different values of the same variable. If we worked with data for only one month, no problem would arise, but we want to be able to combine data from different months. When we combine the meta data, we get identical values with different labels, which is an inconsistency flagged by mdata extract. mdata uniform is used to harmonize the combined metadata so that no inconsistency persists. It does this by assigning new codes to different labels.

Conclusion¶

  • mdata offers a suite of tools to handle metadata

  • All the metadata is stored in an Excel file, so users can easily inspect it

  • Metadata may be analyzed (and changed) by non-Stata users

  • By separating data from metadata:

    • It is possible to use more efficient formats (like parquet for example) when dealing with large amounts of data

    • Manipulate and combine metadata without loading data into memory (useful for huge data sets)

    • Allows users who cannot see the data (confidential data) to still be able to analyze and manipulate the metadata

    • Use the same metadata for multiple data

    • Portability of metadata

Version¶

  • 16

Dependencies¶

  • gtools package by Mauricio Caceres

  • bpencode by BPLIM

Thank you¶