INTRODUCTION
In this tutorial, we will learn how to load all observations from
- All or a subset of the variables.
- With or without data transformations, such as
- Creating dummy variables.
- Reclassifying string variables to integer categories.
- Creating interaction terms.
ln,exp,lagand more.
from a well-formed dataset. All sections below apply to any dataset that meets our definition of ‘well-formed’ which is explained below.
WHAT FILES DOES THIS APPLY TO?
Our definition of a well-formed dataset includes
- Comma-separated text files (CSV) with headers in the first line of the file.
- Excel files (XLS, XLSX) with headers in the first row of the file.
- GAUSS datasets (DAT) and matrix files (FMT).
- SAS (SAS7BCAT, SAS7BDAT), SPSS (POR, SAV) and Stata (DTA) datasets.
- HDF5 files (H5) if the dataset contains an attribute called
headerswhich contains the variable names.
Regardless of the file type, each file must be organized as a consistent tabular dataset like the example below. Each row of the file must have the same number of columns and each column of the file must have the same number of rows.
Age,Height,Weight 29,61,134 44,74,191 32,70,223
WHAT FILES DOES THIS NOT APPLY TO?
This section does not apply to
- Text files delimited by a character other than a comma.
- CSV files without headers or with empty lines.
- Excel files without headers.
or files which have inconsistent numbers of rows, or columns.
// This dataset is NOT well-formed. // It has: // 1. Comments at the top of the file. // 2. Inconsistent numbers of columns per row. Age,Height,Weight 29,61,134 44,74,191,43,16 32,70,223
HEADERS
The GAUSS function getHeaders will return a string array containing all the variable names from a dataset. It takes only one input, the name of the dataset. The example below reads all of the variable names from the Stata dataset auto2.dta which is located in the GAUSS examples directory.
// Create file name with full path to Stata dataset fname = getGAUSSHome() $+ "examples/auto2.dta"; // Read the variable names from the dataset h = getHeaders(fname); // Print string array containing the dataset headers print h;
will return
make
price
mpg
rep78
headroom
trunk
weight
length
turn
displacement
gear_ratio
foreign
ALL VARIABLES
The GAUSS command loadd can read variables from a dataset. To read all variables from a dataset you only need to pass one input, a string containing the name of the dataset. The example dataset, binary.csv, contains four variables related to college admissions, admit, gre, gpa, and rank.
// Create file name with full path fname = getGAUSSHome() $+ "examples/binary.csv"; // Read all 4 variables from the CSV file X = loadd(fname); // Print the first 5 rows of all columns of 'X' print X[1:5,.];
will return
admit rank gpa rank 0.00 380.00 3.61 3.00 1.00 660.00 3.67 3.00 1.00 800.00 4.00 1.00 1.00 640.00 3.19 4.00 0.00 520.00 2.93 4.00
A SUBSET OF VARIABLES
loadd can accept an optional second argument which is a formula string. The formula string specifies which variables to load and which data transformations to perform. The following operators can be used in a formula string to load a subset of the variables from the dataset.
EXAMPLE: LOAD TWO VARIABLES BY NAME
// Create file name with full path to the SAS dataset fname = getGAUSSHome() $+ "examples/detroit.sas7bdat"; // Load 2 variables by name from the SAS dataset X = loadd(fname, "unemployment + weekly_earn"); // Print the first 5 rows of all columns of 'X' print X[1:5,.];
will return
unemployment weekly_earn
11.0 117.18
7.0 134.02
5.2 141.68
4.3 147.98
3.5 159.85
EXAMPLE: LOAD ALL VARIABLES EXCEPT FOR ONE
cancer.dat is an example GAUSS dataset located in the GAUSS examples directory. It contains five variables, time, histolog, stage, count, and risktime. The example below loads all of these variables, except for stage.
// Create file name with full path fname = getGAUSSHome() $+ "examples/cancer.dat"; // Load all but one variable from the GAUSS dataset X = loadd(fname, ". -stage"); // Print the first 5 rows of all columns of 'X' print X[1:5,.];
will return
time histology count risktime 1.00 1.00 9.00 157.00 1.00 2.00 5.00 77.00 1.00 3.00 1.00 21.00 2.00 1.00 2.00 139.00 2.00 2.00 2.00 68.00
CATEGORICAL VARIABLES
Some data files, such as CSV files, do not contain information specifying the types of the variables in the files. In these cases, it is sometimes necessary to specify how a particular variable should be interpreted.
The following keywords can be used in a formula string to tell GAUSS which variables should be interpreted as categorical or string variables and to create dummy variables if desired.
EXAMPLE: INTEGER VARIABLE TO DUMMY VARIABLES
housing.csv is an example dataset from the GAUSS examples directory. The variable baths, represents the number of bathrooms in the home and contains the following unique values: 1, 2, 3, and 4.
The example code below loads the baths variable unmodified for comparison. In the next step, the code tells loadd to create dummy variables from the integer categories in the baths variable by using the factor keyword in the formula string.
// Create file name with full path fname = getGAUSSHome() $+ "examples/housing.csv"; // Load the original categorical data baths = loadd(fname, "baths"); // Load the categorical variable and create dummy vars dmy = loadd(fname, "factor(baths)");
After the code above, the first 5 rows of baths and dmy will be equal to
baths baths_2 baths_3 baths_4
baths = 2 dmy = 1 0 0
1 0 0 0
2 1 0 0
2 1 0 0
3 0 1 0
© 2025 Aptech Systems, Inc. All rights reserved.
As you can see above, the base case is set to the case when baths equals one.
EXAMPLE: TEXT CATEGORICAL VARIABLE TO DUMMY VARIABLES
The example Excel file, nba_ht_wt.xls, contains seven variables with different information about NBA basketball players. The Pos variable represents the position played by the basketball player. The levels are C, F, and G, which represent center, forward and guard.
The code below uses the cat keyword in the formula string to tell loadd to create a categorical variable from the text data. Then the code wraps the factor keyword around the cat keyword to load the data as a categorical column and convert to dummy variables in one step.
// Create file name with full path fname = getGAUSSHome() $+ "examples/nba_ht_wt.xls"; // Load the string variable turn it // into a categorical variable position = loadd(fname, "cat(Pos)"); // Load the string variable turn it into a // categorical variable and then a dummy variable pos_dummy = loadd(fname, "factor(cat(Pos))");
below is a preview of the first five observations created by the above code:
Pos Pos_F Pos_G
position = C pos_dummy = 0 0
G 0 1
G 0 1
F 1 0
F 1 0
This time, C, is the base case.
COMBINING KEYWORDS AND OPERATORS
Both the cat and factor keywords can be combined with the ., + and - operators. For example, the following statements would be legal.
fname = getGAUSSHome() $+ "examples/housing.csv"; X = loadd(fname, "price + factor(baths) + taxes"); fname = getGAUSSHome() $+ "examples/yarn.xlsx" X = loadd(fname, "cat(amplitude) + cycles");
INTERACTION EFFECTS
The * and : operators are used in formula strings to create interaction effects.
EXAMPLE: INTERACTION TERM
By default when an interaction term is specified in a formula string, the variables that form the interaction are also included. The example below will load 3 variables, Height, Weight and the interaction term of Height*Weight.
// Create file name with full path fname = getGAUSSHome() $+ "examples/nba_ht_wt.xls"; // Load the variables 'Height' and 'Weight' // then create a third variable which is the // interaction between them nba = loadd(fname, "Height*Weight"); // Print the first 5 rows of the 3 specified variables print nba[1:5,.];
The code above will print the following output
Height Weigth Height_Weight
83 260 21580
74 180 13320
77 215 16555
81 260 21060
81 235 19035
The name of the product of Height and Weight is not Height*Weight, because if that variable name was included in another formula string it would indicate an interaction term instead of this variable.
GAUSS replaces invalid characters from variable names with underscores.
EXAMPLE: INTERACTION TERM ALONE
Usually, when we create an interaction term, we will also include the original variables. However, it is sometimes useful to load only the interaction variable. We can specify that we only want the interaction, by using the colon operator, :, in the formula string as shown below.
// Create file name with full path fname = getGAUSSHome() $+ "examples/nba_ht_wt.xls"; // Load only one variable, which is the // interaction between 'Height' and 'Weight' X = loadd(fname, "Height:Weight"); // Print the first 5 rows of the 1 specified variable print X[1:5];
The code above will print the following output
Height_Weigth
21580
13320
16555
21060
19035
DATA TRANSFORMATIONS
GAUSS allows you to transform your variables when loading, by using a procedure in a formula string.
EXAMPLE: NATURAL LOG
// Create file name with full path to Stata dataset fname = getGAUSSHome() $+ "examples/auto2.dta"; // Load 'price' from 'auto2.dta' and perform // natural log transform ln_price = loadd(fname, "ln(price)"); // Print the first 5 rows of 'ln_price' print ln_price[1:5];
The code above will return the following output.
lnprice 8.3185 8.4657 8.2425 8.4797 8.9653
The variable name was updated to represent the transformation with the parentheses replaced with underscores.
EXAMPLE: THE FIRST DIFFERENCE OF THE NATURAL LOG
Now let’s do something slightly more complicated. Suppose you want to compute the first difference of the natural log of the price variable from the auto2.dta dataset. GAUSS allows you to use any procedure in a formula string as long as it takes a column vector as the only input and returns a column vector of the same size as the only output.
So we will first create a procedure to compute the first difference of the natural log. We will call it lnDiff. Then we can use it in our formula string, like this
// Define procedure to compute the first // difference of the natural log of a variable proc (1) = lnDiff(x); local ln_x; // Compute the natural log of the input ln_x = ln(x); // Compute the difference of the natural log // and return the result retp(ln_x - lag(ln_x)); endp; // Create file name with full path to Stata dataset fname = getGAUSSHome() $+ "examples/auto2.dta"; // Load the 'price' variable and call // our 'lnDiff' procedure on it X = loadd(fname, "lnDiff(price)"); // Print the first 5 observations print X[1:5];
The code above will print the following output. Note that the first observation is a missing value since we lose one observation when computing the lag.
lnDiffprice
.
0.1472
-0.2232
0.2372
0.4856
CONCLUSION
In this tutorial, we have learned how to
- Load all or a subset of variables with the
+,-and.operators. - Creating dummy variables with the
factorkeyword. - Reclassifying string variables to integer categories with the
catkeyword. - Creating interaction terms with the
*and:operators. - Performing data transformations by using GAUSS procedures in formula strings.
from a well-formed, tabular dataset.



