mdata: Stata package to handle metadata¶
Gustavo Iglésias - Banco de Portugal Microdata Research Laboratory (BPLIM)
extract: extracts metadata from the dataset in memory to an Excel file
apply: applies metadata from an Excel file to the dataset in memory
check: checks for inconsistencies in the Excel metadata file
cmp: compares Excel metadata files
combine: combines Excel metadata files
morph: transforms Excel metadata files to eliminate redundant information
uniform: harmonizes information in Excel metadata files
clear: removes all metadata from the dataset in memory
mdata subcommand [, options]
where subcommand is one of the tools listed above.
mdata extract¶
mdata extract exports metadata from the dataset in memory to an Excel file, which is organized in sheets
Metadata exported to this file includes, but is not limited to:
Data labels, notes and characteristics
Label languages defined
Variables' labels, type and format
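As a point of comparison, the kinds of metadata listed above can each be inspected inside Stata with built-in commands. The sketch below assumes that Stata's shipped example dataset (loaded with sysuse nlsw88) is an acceptable stand-in for the data/nlsw88 file used in these slides; it does not call mdata at all.

```stata
* Assumption: the shipped example dataset stands in for data/nlsw88
sysuse nlsw88, clear

* Data label, storage types, display formats, variable and value label names
describe

* Notes attached to the dataset and to individual variables
notes

* Characteristics of the dataset (_dta) and of every variable
char list

* Label languages currently defined
label language

* Contents of all value labels
label list
```

Each command prints one slice of the metadata; mdata extract gathers this kind of information into a single Excel file instead.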
Let's take as an example the Stata dataset nlsw88
%%stata
use "data/nlsw88", clear
describe
. use "data/nlsw88", clear
(NLSW, 1988 extract)
. describe
Contains data from data/nlsw88.dta
 Observations:         2,246                  NLSW, 1988 extract
    Variables:            17                  22 Apr 2022 16:41
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
married         byte    %8.0g      marlbl     Married
never_married   byte    %16.0g     nev_mar    Never married
grade           byte    %8.0g                 Current grade completed
collgrad        byte    %16.0g     gradlbl    College graduate
south           byte    %9.0g      southlbl   Lives in the south
smsa            byte    %9.0g      smsalbl    Lives in SMSA
c_city          byte    %16.0g     ccitylbl   Lives in a central city
industry        byte    %23.0g     indlbl     Industry
occupation      byte    %22.0g     occlbl     Occupation
union           byte    %8.0g      unionlbl   Union worker
wage            float   %9.0g                 Hourly wage
hours           byte    %8.0g                 Usual hours worked
ttl_exp         float   %9.0g                 Total work experience (years)
tenure          float   %9.0g                 Job tenure (years)
-------------------------------------------------------------------------------
Sorted by: idcode
. 
%%stata
cap mkdir meta
mdata extract, meta("meta/meta1", replace) 
. cap mkdir meta
. mdata extract, meta("meta/meta1", replace)
File meta/meta1.xlsx saved
. 
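Because the metadata file is an ordinary Excel workbook, its sheet layout can be listed with Stata's built-in import excel command. The sketch below assumes that meta/meta1.xlsx was created by the mdata extract call above; the sheet names it prints depend on mdata and are not assumed here.

```stata
* Assumes meta/meta1.xlsx exists (created by the mdata extract call above)
* The describe option only lists sheets and ranges; data in memory is untouched
import excel using "meta/meta1.xlsx", describe

* import excel, describe returns the sheet count and names in r()
local n = r(N_worksheet)
display "Number of sheets: `n'"
forvalues i = 1/`n' {
    display "`r(worksheet_`i')'"
}
```

This is a quick way to see how the file is organized without opening Excel.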
%%stata
* Example with labels in Portuguese
label language pt, new
* Variable labels
label var age "Idade"
label var race "Raça"
* Value labels
label define marlbl_pt 0 "Solteiro" 1 "Casado"
label values married marlbl_pt
label language en
* Extract metadata
mdata extract, meta("meta/meta2", replace)
. * Example with labels in Portuguese
. label language pt, new
(language pt now current language)
. * Variable labels
. label var age "Idade"
. label var race "Raça"
. * Value labels
. label define marlbl_pt 0 "Solteiro" 1 "Casado"
. label values married marlbl_pt
. label language en
. * Extract meta data
. mdata extract, meta("meta/meta2", replace)
File meta/meta2.xlsx saved
. 
Advantages of using mdata extract:
All the metadata is stored in an Excel file, so users can easily inspect it
Metadata may be analysed (and changed) by non-Stata users
By separating data from metadata, it is possible to use more efficient formats
We can apply the stored metadata to new data (mdata apply)
mdata apply¶
mdata apply applies metadata stored in the Excel metadata file to data in memory
The metadata file is expected to have the structure produced by mdata extract (and checked by mdata check)
mdata apply is particularly useful when you have incoming (monthly, annual, etc.) data that is structurally similar
%%stata
use data/nlsw85, clear
describe
mdata extract, meta("meta/meta85", replace)
. use data/nlsw85, clear
(NLSW - 1985 extraction)
. describe
Contains data from data/nlsw85.dta
 Observations:         2,085                  NLSW - 1985 extraction
    Variables:             7                  22 Apr 2022 18:26
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
year            byte    %8.0g                 Interview year
birth_yr        byte    %8.0g                 Birth year
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
msp             byte    %23.0g     msplbl     1 if married, spouse present
collgrad        byte    %16.0g     collgradlbl
                                              1 if college graduate
-------------------------------------------------------------------------------
Sorted by: idcode  year
. mdata extract, meta(meta/meta85, replace)
File meta/meta85.xlsx saved
. 
%%stata
use data/nlsw87, clear
describe
. use data/nlsw87, clear
. describe
Contains data from data/nlsw87.dta
 Observations:         2,164                  
    Variables:             8                  22 Apr 2022 18:29
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 
year            byte    %8.0g                 
birth_yr        byte    %8.0g                 
age             byte    %8.0g                 
race            byte    %8.0g                 
msp             byte    %8.0g                 
collgrad        byte    %8.0g                 
union           byte    %8.0g                 
-------------------------------------------------------------------------------
Sorted by: 
. 
%%stata
mdata apply, meta("meta/meta87") do("dos/apply87")
describe
. mdata apply, meta(meta/meta87) do(dos/apply87)
File dos/apply87.do saved
. describe
Contains data from data/nlsw87.dta
 Observations:         2,164                  
    Variables:             8                  22 Apr 2022 18:29
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
idcode          int     %8.0g                 NLS ID
year            byte    %8.0g                 Interview year
birth_yr        byte    %8.0g                 Birth year
age             byte    %8.0g                 Age in current year
race            byte    %8.0g      racelbl    Race
msp             byte    %23.0g     msplbl     1 if married, spouse present
collgrad        byte    %16.0g     collgradlbl
                                              1 if college graduate
union           byte    %8.0g      unionlbl   Union worker
-------------------------------------------------------------------------------
Sorted by: idcode  year
     Note: Dataset has changed since last saved.
. 
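The extract and apply steps can be combined into a small round trip. The sketch below is illustrative only and has not been checked against the package: it assumes mdata and its dependencies are installed, uses the shipped nlsw88 dataset as a stand-in, and the file names meta/meta_demo and dos/apply_demo are hypothetical. The mdata calls copy the option syntax shown in the cells above.

```stata
* Assumptions: mdata and its dependencies are installed;
* meta/meta_demo and dos/apply_demo are hypothetical file names
sysuse nlsw88, clear
cap mkdir meta
cap mkdir dos

* Store the metadata of the labeled reference data
mdata extract, meta("meta/meta_demo", replace)

* Simulate an unlabeled incoming file: strip variable and value labels
foreach v of varlist _all {
    label variable `v' ""
}
label drop _all

* Reapply the stored metadata, mirroring the call shown above
mdata apply, meta("meta/meta_demo") do("dos/apply_demo")

* The variable labels should be back
local lbl : variable label age
assert "`lbl'" != ""
```

With real incoming data, the stripping step is replaced by loading the new file with use before calling mdata apply.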
mdata check¶
mdata check verifies the integrity of metadata stored in the Excel metadata file
The file is expected to have the structure produced by mdata extract
The check is also run by mdata apply, whose execution stops if any inconsistency is found
mdata cmp¶
mdata cmp compares metadata found in two Excel metadata files
It assumes that both files were created by mdata extract and that the files should be identical (with the exception of data features)
Differences are labeled as inconsistencies
Variables
Characteristics
Notes
Value labels
mdata combine¶
mdata combine combines metadata found in two Excel metadata files, generating a new Excel metadata file
The input files are expected to have the structure produced by mdata extract
mdata morph¶
mdata morph transforms the Excel metadata file by removing redundant information
The file is expected to have the structure produced by mdata extract
mdata uniform¶
mdata uniform harmonizes metadata stored in the Excel metadata file
The file is expected to have the structure produced by mdata extract
mdata offers a suite of tools to handle metadata
All the metadata is stored in an Excel file, so users can easily inspect it
Metadata may be analyzed (and changed) by non-Stata users
By separating data from metadata:
It is possible to use more efficient formats (like parquet for example) when dealing with large amounts of data
Manipulate and combine metadata without loading data into memory (useful for huge data sets)
Allows users who cannot see the data (confidential data) to still be able to analyze and manipulate the metadata
Use the same metadata for multiple datasets
Portability of metadata
gtools package by Mauricio Caceres
bpencode by BPLIM