Stata Program Define Example Essay

Programming in Stata

Almost as soon as you start writing Stata code, you start looking for ways to write code faster and with fewer errors. One solution is to make one piece of code do more than one thing. While this may make the code a bit more complex and harder to debug, it saves having to write and debug a separate piece of code for each task. This article will teach you how to write this kind of flexible code. We will cover local macros, programs, loops, and a few miscellaneous tools along the way. We'll end by writing our own brand new Stata command (an ado file).

This article includes many examples which are also available as do files (and one ado file). Links to these files are included within the article, but you may want to get them all at once so you are not interrupted as you work. If you are using an SSCC Linux server, the following commands will create a directory called stataprog and save all the files there.

mkdir ~/stataprog
cd ~/stataprog
cp /usr/global/web/sscc/pubs/files/4-15/* .

You can also download all of them from the web by going to this list of files. Note that the files are written assuming they will be run on an SSCC Linux server. In particular, some of them load the auto data set from /software/stata/auto. You will need to change that to the directory where Stata is installed on the computer you are using. For example, on the Winstats you need to change it to "c:\program files\stata9\auto".

Now on to the programming tools.

Local Macros

Local macros are somewhat like variables in programming languages. They are "boxes" where you can store things and pull them out later. This allows you to write code that will do different things depending on the value of the macros at the time it is run.

Macros are easy to define; try typing the following:

local x=1
display `x'

The first line defines a local macro called x and sets it equal to 1. The second displays `x'. It is critical you see that the single quotation marks around the `x' are not the same (how different they look depends on the font). The left quotation mark (`) is found under the tilde (~), usually in the upper left corner of the keyboard. The right quotation mark (') is found under the double quotation mark (") usually in the center-right of the keyboard. You must put the left quote before the macro name and the right quote after, or Stata will complain at you.

While macros can be used like variables, they are not really variables. What really happens is that macros are replaced by the text they contain before Stata interprets the command. So in our example display `x' is exactly the same as typing display 1. All macros are stored as strings, even numbers. In fact we don't even need the equals sign in the macro definition unless we want Stata to do some math first.

local x 1

is the same as

local x=1

and Stata will actually process the first a bit quicker. The following example should show you when to use the equals sign. Note that display "stuff" means to display the stuff in the quotation marks as a string, without evaluating it.

local x 2+2
di `x'
di "`x'"

local x=2+2
di `x'
di "`x'"

In the first case the local macro x contains "2+2", as we could see when we displayed x as a string. In the second case 2+2 was evaluated to be 4 and then stored in x. Here's a test: What will the following code display (note that ^2 means raise to the second power)?

local x -2
di `x'^2

If you guessed 4, you forgot either the precedence of algebraic operators or how Stata uses macros. `x' is replaced by -2 before Stata does anything with it, so it sees -2^2. But the power takes precedence over the minus sign, so this is the same as -(2^2), not (-2)^2. If `x' were a variable like in other programming languages, the minus sign would not be separate from the 2.
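If you do want the square of the macro's value, one option is to wrap the macro in parentheses so that the minus sign travels with the 2. A minimal sketch:

```stata
local x -2
display `x'^2       // Stata sees -2^2, which is -(2^2)
display (`x')^2     // Stata sees (-2)^2
```

The first command displays -4 and the second displays 4.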

The nice thing about macros not being variables is that you can put almost anything in them and use them absolutely anywhere. You can even include macros in macro definitions. Try:

local i=`i'+1

Right now, `i' has nothing in it. So what just happened? An undefined macro will be replaced by nothing. So what Stata saw was

local i=+1

which is perfectly legal. In most cases however, using an undefined macro will lead to syntax errors. If you mistype a macro name, Stata will assume you meant some other, currently undefined macro--this can be a particularly difficult error to debug.

Run that command again:

local i=`i'+1

This time `i' does have a value, so Stata sees

local i=1+1

and `i' is set to 2. You could use a command like this to increment a counter.

Macros are perfectly legal in file names for log files and data sets. For example if you were creating separate data sets by race and sex, you could just define macros for race and sex and then use them in the save command. If you type

local race Black
local sex Women
save `race'`sex'

then Stata creates a data set called BlackWomen.dta. If the save command followed

local race White
local sex Men

it would create WhiteMen.dta.

You could also use macros as a replacement for copy and paste in your text editor: assign a macro to a short piece of code that is repeated in your program and just type the macro instead of the code. But this generally makes your code much harder to read. For example:

tab `macro1'

could be tabulating anything, and could include if or in conditions and options. In order to know what's going on, you have to find the most recent definition of macro1. On the other hand, if used wisely macros can make your code clearer. The key is to use well-named macros to substitute for logical chunks of code. For example, if you had a big list of control variables that you used constantly, you could define the list as a macro called controls. Then instead of

reg income edu race occupation location... (many more control variables)

you could type

reg income edu `controls'

and be done. Or if you repeatedly deal with subsamples of your data, you could define a macro that gave the conditions for that subsample. For example

reg income edu `controls' if race=="black" & sex=="female"

could also be done as

local blackWomen race=="black" & sex=="female"

reg income edu `controls' if `blackWomen'

You could save a bit of typing by including the if within the macro, since the condition only makes sense when it follows an if. But leave it out, to preserve the readability of your code. If the macro includes the if, it's not clear what `blackWomen' does. But if `blackWomen' makes it fairly obvious that the macro just gives the particular conditions that define a black woman in your data set. In both of these cases, the macros (perhaps arguably) make your code clearer by hiding the details of the implementation.
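Here is a small runnable sketch of both ideas using the auto data set that ships with Stata. The variables and the macro name cheapForeign are stand-ins chosen for illustration; they are not taken from macro.do.

```stata
sysuse auto, clear

* a well-named list of control variables
local controls weight length

* a well-named condition that defines a subsample
local cheapForeign foreign==1 & price<10000

regress price mpg `controls'
regress price mpg `controls' if `cheapForeign'
```

Because the if stays outside the macro, the last line still reads like an ordinary Stata command.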

To see the above examples in action, run macro.do.

Programs

A program allows you to define a chunk of code in one place and run it repeatedly. You can also pass in parameters which will be stored as macros, then use those macros in various ways within the program.

A Stata program is just some Stata code with the line program define name at the beginning and the line end at the end. It is a tradition that the first program you write in a new language simply display the message "Hello World" (who starts these traditions?) so let's do that. Type:

program define hello
1. di "Hello World"
2. end

Note that Stata provides the line numbers for you; you will not put them in when you write a program in a do file (see the example code at the end). To run your program, type hello. Okay, now you've paid your dues to tradition. More importantly you now understand the mechanics of writing a program. Now any time you want to say "Hello World" all you have to do is type hello. What a time-saver! But even supposing you had a reason to say "Hello World" once, to say it more than once in exactly the same way seems a bit redundant. You need to add some flexibility to your program, and that's where the macros come in.

When running a program, anything typed after the program name will be interpreted as arguments. Arguments for programs work much like mathematical functions: the program does whatever it does depending on its arguments. Within the program, arguments are referenced by number: `1' is the first thing after the program name, `2' the second, etc. (spaces define where one argument ends and the next begins).

Change your program a bit:

program define hello
1. di "Hello `1'"
2. end

But you never got that far, did you? Instead you got an error message saying that hello is already defined.

You can't have two hello programs at once. You need to get rid of the original by typing:

program drop hello

This can be a minor nuisance if you're running Windows Stata or GUI Stata on Linux, where you can't really be sure what has been going on before your do file is executed. If you try to define a program that's already been defined, your .do file will crash with the message you just saw. If you try to drop a program that hasn't been defined (for example if you tried program drop hello twice) your do file will crash again, this time with a message saying that hello was not found.

The solution lies in the capture command. When a command is preceded by capture, any errors it generates are ignored (they are captured by capture). So capture program drop hello will get rid of a program called hello if it exists, and do nothing if it doesn't.
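In a do file, that pattern might look like the following sketch, which can be run any number of times in the same session without crashing:

```stata
* drop any earlier definition; the error is ignored if there is none
capture program drop hello

program define hello
    display "Hello World"
end

hello
```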

Now back to your modified hello program. You see that it now calls upon the local macro `1'. Try running it by typing hello and see what it does. Since we haven't defined `1', it is ignored. More precisely, the `1' is replaced by its value: absolutely nothing. Now try typing

hello Russell

The program responds Hello Russell.

hello Russell Dimond

just displays Hello Russell because Dimond is stored in `2', and our program doesn't do anything with `2'. On the other hand, the macro `0' contains all the arguments, which gives us some additional flexibility. Let's change our hello program one more time:

program drop hello
program define hello
1. di "Hello `0'"
2. end

Now try:

hello
hello Russell
hello Russell Dimond, how are you today?

If you are an excellent typist you may have missed an important lesson, but the rest of us got it: never try to input a program interactively. One mistake and you have to drop the program and start all over again. Always define programs in .do files.
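As an aside, numbered macros like `1' and `2' can be hard to read in longer programs. Stata's args command, which this article does not otherwise use, copies the positional arguments into named local macros. A minimal sketch, with first and last as illustrative names:

```stata
capture program drop hello
program define hello
    * copy `1' and `2' into the named macros first and last
    args first last
    display "Hello `first' `last'"
end

hello Russell Dimond
```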

Finally, if you have ever used a general-purpose programming language (FORTRAN, C/C++, Java, whatever) or if you work with someone who has, be prepared for a bit of confusion about nomenclature. The logical equivalent of a Stata .do file in these languages is a program, while the logical equivalent of a Stata program is (depending on the language) a subroutine, function, procedure, or method (at any rate just a part of a program). You'll probably hear people refer to do files as programs all the time (I do it), and don't be confused if someone starts calling a Stata program a subroutine.

The file program.do will run all these examples.

Loops

Most Stata commands are really loops. Stata carries out the command for the first observation, then the second, and so forth. Take advantage of this looping structure whenever you can, because it is quite fast. But it's not hard to imagine other loops you might want: for example, you might want to execute the same command for five different variables. Stata allows you to do these too--you just have to write them yourself.

foreach

The foreach command allows you to create loops that loop over a list of things.

foreach macroname in/of [list type] list {
code involving `macroname'...
}

macroname is a name we choose to represent the elements in our list. As always, make the name informative. The in/of construct means you will use either in or of (not both), depending on the type of list. You'll use in for generic lists, of for all others. The list type is optional. If it is omitted Stata will interpret what follows as a generic list. Finally there will be the list to be acted on, and then a left curly bracket. Note the placement of the brackets: the first one must be part of the same statement as the foreach (it cannot go on the next line like in C/C++ or Java) and the last one must be its own statement (it cannot go at the end of the last command inside the loop). Everything inside the curly brackets will be executed once for each item in the list, and macroname is a local macro that will contain each item in the list in turn. Let's look at an example:

foreach color in red blue green {
1. di "`color'"
2. }

will give the output

red
blue
green

Note that, as with programs, Stata gives you line numbers when you type a foreach loop interactively but you will not need to type them in do files. Using in with no list type indicates a generic list. Stata makes no attempt to interpret what follows other than break it up into elements. Thus our loop runs three times, once for each element. The first time, `color' is set to the word red, the second time to blue, and the third time to green.

It's very common for your list of items to be stored in a macro that was constructed earlier in the program. You can use such a macro directly in a foreach command:

local colors red blue green
foreach color in `colors' {
1. di "`color'"
2. }

However, this is so common that Stata wrote special code to handle this case more efficiently.

local colors red blue green
foreach color of local colors {
1. di "`color'"
2. }

Note that in changed to of because local is officially a list type, if a rather odd one. Also note that colors is not in quotes in the foreach command. If it were in quotes, the standard macro processor would expand it out to red blue green. Instead, we let the local list type look up what the macro means, which it does very quickly.

Normally list types tell Stata what types of things are in your list. The available types are varlist, newlist, and numlist.

For the next example, we'll use the auto data set that comes installed with Stata. Load it by typing

sysuse auto

(The sysuse command loads a file from whatever directory Stata is installed in--it's only useful for examples.)

The varlist construction specifies that what follows is an official list of variables. That's not quite as important as it sounds, because you can also put variable names in generic lists. But compare the following:

foreach var in price mpg rep78 {
1. di "`var'"
2. sum `var'
3. }

foreach var of varlist price-rep78 {
1. di "`var'"
2. sum `var'
3. }

foreach var in price-rep78 {
1. di "`var'"
2. sum `var'
3. }

In the first case, foreach interpreted the list as three words, each of which the sum command later recognized as variable names. In the second case, foreach was forewarned to expect a variable list, and thus interpreted price-rep78 as a list of three variables. However, in the third case foreach had no such warning and interpreted price-rep78 as a single word. As a result the loop was actually executed just once. It was the sum command that later interpreted price-rep78 as a variable list containing three variables.

newlist is for lists of new variables; variables which do not yet exist but will be created inside the loop. For example:

foreach var of newlist x1 x2 x3 x4 x5 {
1. gen `var'=0
2. }

newlist checks to make sure the list only contains valid variable names, but does not actually create the variables--gen does that.

numlist is for lists of numbers. Compare this with the previous:

foreach i of numlist 1/5 {
1. gen y`i'=0
2. }

Note how the `i' macro acts like a subscript to the y variable. This is a very common construction: population`year', income`wave', etc.

forvalues

On the other hand, looping over a list of evenly-spaced numbers is the specialty of forvalues, and it will do it faster than foreach. Also, since foreach has to construct the whole list of numbers before it can start, it can only handle relatively small lists. forvalues has no such limit. It's quicker to type too:

forvalues i=1/5 {
1. gen z`i'=0
2. }

forvalues isn't limited to counting upwards by one--type help forvalues for details on other constructions.

Use loop.do to run all these examples.

Stored Results

This section may not be a programming topic, but it is a tool we'll use in our final example. And it's good to know anyway.

Many Stata commands store values in an internal array you can access once you know it's there. Estimation commands create an array called e( ), and you can see what's in it by typing ereturn list. Almost all other commands that return results put them in an array called r( ), and you can see what's in r( ) by typing return list. The manuals also describe what each command returns. The only trick is that every command that uses the e( ) or r( ) arrays overwrites the previous values. So if you want to do anything with the results of a command, you must do it before you issue another command that returns values. One option is to save the results in a variable or local macro for later use. Try the following:

reg price weight foreign mpg
ereturn list
sum weight
return list

If you want to demean weight (subtract the mean from all observations), all you have to do is type

replace weight = weight - r(mean)

Try it and then do

sum weight

again to see the results. Note that there are issues with numerical precision, but you've accomplished your purpose. Keep in mind that you have also replaced the old values of the r( ) array with a new set of values referring to the second time you ran summarize. Good thing you were done with the old results.
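If you need a result after other commands have run, copy it out of r( ) first. The following sketch stores the mean in a local macro before a second summarize overwrites r(mean); the names wmean and weight_c are illustrative and do not come from results.do.

```stata
sysuse auto, clear

summarize weight
local wmean = r(mean)      // copy the mean before r() is overwritten

summarize price            // r(mean) now refers to price

* the saved value is still available
generate weight_c = weight - `wmean'
summarize weight_c
```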

To see this in practice, take a look at results.do.

A Program to Demean Data

Let's put together everything you've learned by writing a program that demeans data. This is a simple enough task that a program isn't really needed, but we'll go a step further and make it both flexible and error-resistant. In other words, we'll put a lot more effort into it than it's worth (except as a learning experience, of course).

We'll start with the simplest possible version (which is generally a good idea when programming). It will take one argument, a variable name, and demean that variable.

program define demean
1. sum `1',meanonly
2. replace `1'=`1'-r(mean)
3. end

Try it out and see how it does (just reload the auto data set if you start running out of variables with non-zero means).

That's fine as far as it goes. But suppose you wanted to demean 20 different variables? It's time to add a foreach loop. Recall that local macro `0' (zero) contains all the arguments passed in to the program. We could have our foreach loop work with this as a variable list, or even a generic list. But local was created for exactly this kind of situation and will run a bit faster. So the next version is:

program drop demean

program define demean
1. foreach var of local 0 {
2. sum `var',meanonly
3. replace `var'=`var'-r(mean)
4. }
5. end

There's just one problem with your demean program. To see it type demean make. The make variable is a string. It has no mean, and so your program crashes. Now, you may be thinking that anyone who tries to demean a string deserves what's coming to them, but let's fix it anyway, just so you can learn how. You may not be able to demean a string, but you can give a better error message, and then proceed to demean any other variables that were requested and are valid.

If as a Way to Control Program Flow

You're used to using if at the end of commands. That meant "execute the preceding command for a given observation only if this condition is true for that observation." What you're going to do now is very different. You're going to say "only execute the following commands for ANY observation if this condition is true." The condition itself is also different: it is a scalar. It is evaluated just once, not once for each observation. If the condition includes a variable, the value of that variable for the first observation will be used. It is also possible to combine if with else, so you can make arbitrarily complex sets of conditions. The syntax looks like this (this is a fairly complex example so you can see how all the pieces work--we'll do something simpler in our program):

if condition1 {
commands to execute if condition1 is true...
}
else if condition2 {
commands to execute if condition1 is false and condition2 is true...
}
else {
commands to execute if both condition1 and condition2 are false...
}

Note how the brackets have to be placed just like with foreach.

The problem with your program is that as soon as Stata sees you try to subtract something from a string variable, it crashes with a type mismatch error before it even looks at any observations. So your job is to detect strings before you try to demean them, and only subtract things that can be subtracted. You can do this using the confirm command. It's a bit like assert in that you use it to check on things you believe to be true, but it's designed for programmers. Thus it allows you to check things like that a file actually exists, or in this case, that a variable is numeric and thus has a mean. The syntax is

confirm numeric variable var

where var is the variable you're checking. It will do nothing if the variable is numeric, and cause an error if it is not. But you don't want it to crash the program, so put capture in front of it.

But how will you know the result if you use capture? The capture command stores the return code of the command it ran in _rc. A return code of zero means the command was successful. Any other value means something went wrong (different errors give different return codes). So all you have to do is check the value of _rc with an if statement. If _rc is zero, you know the variable is numeric and you can demean it. If not, you give an error message but the program continues to run and processes the rest of the variables.
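To watch _rc change, you can try capture confirm on one string variable and one numeric variable from the auto data set. This is only a quick check, not part of the demean program itself:

```stata
sysuse auto, clear

capture confirm numeric variable make
display _rc       // nonzero, because make is a string variable

capture confirm numeric variable price
display _rc       // 0, because price is numeric
```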

program drop demean

program define demean
1. foreach var of local 0 {
2. capture confirm numeric variable `var'
3. if _rc==0 {
4. sum `var',meanonly
5. replace `var'=`var'-r(mean)
6. }
7. else di "`var' is not a numeric variable and cannot be demeaned."
8. }
9. end

The file demean.do contains and demonstrates all the various versions of the demean program. You'll also notice some comments and a great deal of indenting to make the logical structure easy to see. Both practices are highly recommended.
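For reference, here is a sketch of how the last version might be laid out in a do file, without the interactive line numbers and with indenting and comments added. It is written for illustration and is not a copy of demean.do.

```stata
capture program drop demean

program define demean
    // loop over every argument passed to the program
    foreach var of local 0 {
        // check quietly whether this argument is a numeric variable
        capture confirm numeric variable `var'
        if _rc==0 {
            summarize `var', meanonly
            replace `var' = `var' - r(mean)
        }
        else {
            display "`var' is not a numeric variable and cannot be demeaned."
        }
    }
end

sysuse auto, clear
demean make price
```

The call at the end should print the message for make and then go on to demean price.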

ado (Automatic Do) files

You now have a nice little program that could be useful in a variety of settings. But you have to run the code that defines it before you can use it. What if you could make it act like any other Stata command and run as soon as you type it? You can, by making it an ado (automatic do) file.

An ado file is just like a do file that defines a program, but the filename ends with .ado and it is stored in one of several ado directories. When you type a command, Stata checks the ado directories to see if there is an ado file with that name. If there is, Stata automatically runs the ado file that defines the program and then executes it. Thus from the user's perspective, using an ado file is just like using a built-in Stata command. In fact many Stata commands are actually implemented as ado files.

In order to create an ado file, you need to isolate the demean program in a separate file and save it as demean.ado in your personal ado directory. You can identify your personal ado directory by typing sysdir. On the SSCC's Linux servers, it is ~/ado/personal (recall that ~ means your home directory). On the Winstats it is w:\ado\personal.

Once that's done, demean.ado will almost be like an official Stata command. Not quite though: note that we made no provision for standard Stata syntax like by: or if. Doing so isn't actually as hard as you might think, but still beyond the scope of this article.

You've now learned a powerful set of tools that can save you a great deal of time and trouble. At first you may need to consciously look for opportunities to use them. But they will soon become second nature, and writing code without them will seem unbearably tedious. Consider that progress.

Last Revised: 9/11/2007

4 Programming Stata

This section is a gentle introduction to programming Stata. I discuss macros and loops, and show how to write your own (simple) programs. This is a large subject and all I can hope to do here is provide a few tips that hopefully will spark your interest in further study. However, the material covered will help you use Stata more effectively.

Stata 9 introduced a new and extremely powerful matrix programming language called Mata. This extends the programmer's tools well beyond the macro substitution tools discussed here, but Mata is a subject that deserves separate treatment. Your efforts here will not be wasted, however, because Mata is complementary to, not a complete substitute for, classic Stata programming.

To learn more about programming Stata I recommend Kit Baum's An Introduction to Stata Programming, now in its second edition. You may also find useful Chapter 18 in the User's Guide, referring to the Programming volume and/or the online help as needed. Nick Cox's regular columns in the Stata Journal are a wonderful resource for learning about Stata. Other resources were listed in Section 1 of this tutorial.

4.1 Macros

A macro is simply a name associated with some text. Macros can be local or global in scope.

4.1.1 Storing Text in Local Macros

Local macros have names of up to 31 characters and are known only in the current context (the console, a do file, or a program).

You define a local macro using local name [=] text and you evaluate it using `name'. (Note the use of a backtick or left quote.)

The first variant, without an equal sign, is used to store arbitrary text of up to ~64k characters (up to a million in Stata SE). The text is often enclosed in quotes, but it doesn't have to be.

Example: Control Variables in Regression.

You need to run a bunch of regression equations that include a standard set of control variables. You could, of course, type these names in each equation, or you could cut and paste the names, but these alternatives are tedious and error prone. The smart way is to define a local macro that holds the list of control variables and then evaluate that macro in each regression command, which is exactly equivalent to typing the full list of names every time.

If there's only one regression to run you haven't saved anything, but if you have to run several models with different outcomes or treatments, the macro saves work and ensures consistency.

This approach also has the advantage that if later you realize that you should have used log-income rather than income as a control, all you need to do is change the macro definition at the top of your do file to list the logged variable in place of income, and all subsequent models will be run with income properly logged (assuming these variables exist).

Warning: Evaluating a macro that doesn't exist is not an error; it just returns an empty string. So be careful to spell macro names correctly. If you misspell the macro name in a regression command, Stata will read the command with nothing in its place, because the misspelled macro does not exist. The same would happen if you typed only the first part of the name, because macro names cannot be abbreviated the way variable names can. Either way, the regression will run without any controls. But you always check your output, right?
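The pitfall is easy to reproduce. The sketch below uses the auto data set and stand-in variables rather than the survey variables discussed above, and contrl is a deliberate misspelling of controls:

```stata
sysuse auto, clear

local controls weight length foreign

* intended model: the macro expands to the three controls
regress price mpg `controls'

* misspelled macro: it expands to nothing, and Stata does not complain
regress price mpg `contrl'
```

The second regression runs with mpg as its only regressor, which is why it pays to check the output.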

Example: Managing Dummy Variables

Suppose you are working with a demographic survey where age has been grouped in five-year groups and ends up being represented by seven dummies, six of which will be used in your regressions. Define a macro named age that lists those six dummies, and then in your regression models use `age' in place of the list, which is not only shorter and more readable, but also closer to what you intend, which is to regress on "age", which happens to be a bunch of dummies. This also makes it easier to change the representation of age; if you later decide to use linear and quadratic terms instead of the six dummies, all you do is redefine the macro so that it lists the variable age and its square, and rerun your models. In that definition the first occurrence of age is the name of the macro and the second is the name of a variable. Putting the list in quotes makes such a definition easier to read. Stata never gets confused.

Note on nested macros. If a macro includes macro evaluations, these are resolved at the time the macro is created, not when it is evaluated. For example, if the definition of a second macro includes an evaluation of the macro age, Stata substitutes the current value of age when that definition is processed. Changing the contents of age at a later time does not change the contents of the second macro.

There is, however, a way to achieve that particular effect. The trick is to escape the macro evaluation character when you define the macro, typing a backslash in front of the left quote. Now Stata does not evaluate the inner macro (but eats the escape character), so the contents of the new macro include the unevaluated reference `age'. When the new macro is evaluated, Stata sees that it includes the macro age and substitutes its current contents.

In one case substitution occurs when the macro is defined, in the other when it is evaluated.
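A small sketch of the two behaviors follows. The macro names early and late and the contents age_group, agesq, and income are made up for illustration.

```stata
local age age_group

local early `age' income      // `age' is substituted now
local late  \`age' income     // the backslash defers the substitution

local age age agesq           // change age afterwards

display "`early'"             // still uses the value age had at definition time
display "`late'"              // picks up the current contents of age
```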

4.1.2 Storing Results in Local Macros

The second type of macro definition, with an equal sign, is used to store results. It instructs Stata to treat the text on the right hand side as an expression, evaluate it, and store a text representation of the result under the given name.

Suppose you have just run a regression and want to store the resulting R-squared, for comparison with a later regression. You know that regress stores R-squared in e(r2), so you think local rsqf e(r2) would do the trick.

But it doesn't. Your macro stored the formula e(r2), as you can see by typing di "`rsqf'". What you needed to store was the value. The solution is to type local rsqv = e(r2), with an equal sign. This causes Stata to evaluate the expression and store the result.

To see the difference try this

. sysuse auto, clear
(1978 Automobile Data)

. quietly regress mpg weight

. local rsqf e(r2)

. local rsqv = e(r2)

. di `rsqf' // this has the current R-squared
.65153125

. di `rsqv' // as does this
.65153125

. quietly regress mpg weight foreign

. di `rsqf' // the formula has the new R-squared
.66270291

. di `rsqv' // this guy has the old one
.65153125

Another way to force evaluation is to enclose e(r2) in single quotes when you define the macro. This is called a macro expression, and is also useful when you want to display results. It allows us to place `e(r2)' inside a quoted string passed to display instead of listing the string and the expression e(r2) separately. (What do you think would happen if you type ?)

An alternative way to store results for later use is to use scalars (type help scalar to learn more). This has the advantage that Stata stores the result in binary form without loss of precision. A macro stores a text representation that is good only for about 8 digits. The downside is that scalars are in the global namespace, so there is a potential for name conflicts, particularly in programs (unless you use temporary names, which we discuss later).
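The options discussed in this section can be compared side by side. In this sketch rsq and rsq_s are illustrative names, and the `=exp' form on the second display line is Stata's general syntax for evaluating an expression inside macro quotes:

```stata
sysuse auto, clear
quietly regress mpg weight

local rsq = e(r2)          // text representation of the value
scalar rsq_s = e(r2)       // binary copy of the value

display "R-squared = `e(r2)'"
display "R-squared = `=e(r2)'"
display "R-squared = " rsq_s
display "R-squared = `rsq'"
```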

You can use an equal sign when you are storing text, but this is not necessary, and is not a good idea if you are using an old version of Stata. The difference is subtle. Suppose we had defined a macro holding a list of variable names with an equal sign, putting the list in quotes on the right-hand side. This would have worked fine, but the quotes cause the right-hand side to be evaluated, in this case as a string, and strings used to be limited to 244 characters (or 80 in Stata/IC before 9.1), whereas macro text can be much longer. Type help limits to be reminded of the limits in your version.

4.1.3 Keyboard Mapping with Global Macros

Global macros have names of up to 32 characters and, as the name indicates, have global scope.

You define a global macro using global name [=] text and evaluate it using $name. (You may need to use ${name} to clarify where the name ends.)

I suggest you avoid global macros because of the potential for name conflicts. A useful application, however, is to map the function keys on your keyboard. If you work on a shared network folder with a long name, try defining a global macro named F5 that holds the full name of the folder. Then when you hit F5 Stata will substitute the full name. And your do files can refer to files in that folder by writing ${F5} in front of the file name. (We need the braces to indicate that the macro is called F5, not F5 followed by the rest of the file name.)

Obviously you don't want to type this macro each time you use Stata. Solution? Enter it in your profile.do file, a set of commands that is executed each time you run Stata. Your profile is best stored in Stata's start-up directory. Type help profile to learn more.
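As a sketch, a line like the first one below could go in profile.do. The folder path and the file name analysis.do are placeholders, not real locations:

```stata
* hypothetical folder; replace with your own shared directory
global F5 "/path/to/shared/project/"

* the braces mark where the macro name ends
display "${F5}analysis.do"
```

In a real do file you would use the same ${F5} prefix in commands that take a file name, such as use or do.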

4.1.4 More on Macros

Macros can also be used to obtain and store information about the system or the variables in your dataset using extended macro functions. For example you can retrieve variable and value labels, a feature that can come in handy in programming.

There are also commands to manage your collection of macros, including macro list and macro drop. Type help macro to learn more.

4.2 Looping

Loops are used to do repetitive tasks. Stata has commands that allow looping over sequences of numbers and various types of lists, including lists of variables.

Before we start, however, don't forget that Stata does a lot of looping all by itself. If you want to compute the log of income, you can do that in Stata with a single line:
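As a sketch, with a small made-up dataset so the line can be run on its own (the variable names income and logincome are assumptions):

```stata
clear
set obs 5
gen income = 1000 * _n        // made-up incomes: 1000, 2000, ..., 5000
gen logincome = log(income)   // one line, applied to every observation
list
```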

This loops implicitly over all observations, computing the log of each income, in what is sometimes called a vectorized operation. You could code the loop yourself, but you shouldn't because (i) you don't need to, and (ii) your code will be a lot slower than Stata's built-in loop.

4.2.1 Looping Over Sequences of Numbers

The basic looping command takes the form

forvalues number = sequence {
    ... body of loop using `number' ...
}

Here forvalues is a keyword, number is the name of a local macro that will be set to each number in the sequence, and sequence is a range of values which can have the form

  • min/max to indicate a sequence of numbers from min to max in steps of one, for example 1/3 yields 1, 2 and 3, or
  • first(step)last which yields a sequence from first to last in steps of size step. For example 15(5)50 yields 15, 20, 25, 30, 35, 40, 45 and 50.

(There are two other ways of specifying the second type of sequence, but I find the one listed here the clearest; see help forvalues for the alternatives.)

The opening left brace must be the last thing on the first line (other than comments), and the loop must be closed by a matching right brace on a line all by itself. The loop is executed once for each value in the sequence with your local macro number (or whatever you called it) holding the value.

Creating Dummy Variables

Here's my favorite way of creating dummy variables to represent age groups. Stata 11 introduced factor variables and Stata 13 improved the labeling of tables of estimates, drastically reducing the need to "roll your own" dummies, but the code remains instructive.
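A sketch of the loop, consistent with the description that follows. The macro names bot and top, the variable name pattern, and the made-up age data are assumptions added so the block runs on its own.

```stata
clear
set obs 36
gen age = 14 + _n                     // made-up ages 15 to 50
forvalues bot = 20(5)45 {
    local top = `bot' + 4             // upper bound of the age group
    gen age`bot'to`top' = age >= `bot' & age <= `top'
}
describe age20to24-age45to49
```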

This will create dummy variables age20to24 to age45to49. The way the loop works is that the local macro bot will take values between 20 and 45 in steps of 5 (hence 20, 25, 30, 35, 40, and 45), the lower bounds of the age groups.

Inside the loop we create a local macro top to represent the upper bounds of the age groups, which equals the lower bound plus 4. The first time through the loop bot is 20, so top is 24. We use an equal sign to store the result of adding 4 to bot.

The next line is a simple generate statement. The first time through the loop the line will say gen age20to24 = age >= 20 & age <= 24, as you can see by doing the macro substitution yourself. This will create the first dummy, and Stata will then go back to the top to create the next one.

4.2.2 Looping Over Elements in a List

The second looping command is foreach and comes in six flavors, dealing with different types of lists. I will start with the generic list:

foreach item in list-of-items {
    ... body of loop using `item' ...
}

Here foreach is a keyword, item is a local macro name of your own choosing, in is another keyword, and what comes after is a list of blank-separated words. Try this example
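A sketch of the example, using animal as the macro name (an assumption suggested by the explanation that follows):

```stata
foreach animal in cats and dogs {
    display "`animal'"
}
```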

This loop will print "cats", "and", and "dogs", as the local macro animal is set to each of the words in the list. Stata doesn't know "and" is not an animal, but even if it did, it wouldn't care because the list is generic.

If you wanted to loop over an irregular sequence of numbers (for example, you needed to do something with the Coale-Demeny regional model life tables for levels 2, 6 and 12) you could write
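For instance, along these lines, where level is an assumed macro name and the display line stands in for whatever you would actually do with each life table:

```stata
foreach level in 2 6 12 {
    display "processing level `level'"   // placeholder for the real work
}
```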

That's it. This is probably all you need to know about looping.

4.2.3 Looping Over Specialized Lists

Stata has five other variants of foreach which loop over specific types of lists, which I now describe briefly.

Lists of Variables

Perhaps the most useful variant is

foreach varname of varlist list-of-variables {
    ... body of loop using `varname' ...
}

Here foreach, of and varlist are keywords, and must be typed exactly as they are. The list-of-variables is just that, a list of existing variable names typed using standard Stata conventions, so you can abbreviate names (at your own peril), use var* to refer to all variables that start with "var", or type var1-var3 to refer to variables var1 to var3.

The advantage of this loop over the generic equivalent is that Stata checks that each name in the list is indeed an existing variable name, and lets you abbreviate or expand the names.
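As an illustration that is not part of the original text, the sketch below loops over a range of variables in the auto dataset shipped with Stata and prints each mean. The range price-weight expands to the six variables stored from price through weight.

```stata
sysuse auto, clear
foreach var of varlist price-weight {
    quietly summarize `var'
    display "`var'" _col(12) %9.2f r(mean)
}
```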

If you need to loop over new as opposed to existing variables use foreach varname of newlist list-of-new-variables. The newlist keyword replaces varlist and tells Stata to check that all the list elements are legal names of variables that don't exist already.

Words in Macros

Two other variants loop over the words in a local or global macro; they use the keyword local or global followed by a macro name (in lieu of a list). For example here's a way to list the control variables from the section on local macros:
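A sketch, assuming the macro is called controls. The variable names in it are placeholders, since the earlier definition is not repeated here.

```stata
local controls age agesq education income   // hypothetical list of controls
foreach control of local controls {
    display "`control'"
}
```

Note that the macro name after of local is typed without the surrounding quotes.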

Presumably you would do something more interesting than just list the variable names. Because we are looping over variables in the dataset we could have achieved the same purpose using foreach with a varlist; here we save the checking.

Lists of Numbers

Stata also has a foreach variant that specializes in lists of numbers (or numlists in Stataspeak) that can't be handled with forvalues.

Suppose a survey had a baseline in 1980 and follow ups in 1985 and 1995. (They actually planned a survey in 1990 but it was not funded.) To loop over these you could use
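A sketch of that loop, with year as an assumed macro name:

```stata
foreach year of numlist 1980 1985 1995 {
    display "`year'"
}
```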

Of course you would do something more interesting than just print the years. The numlist could be specified as 1 2 3, or 1/5 (meaning 1 2 3 4 5), or 1(2)7 (count from 1 to 7 in steps of 2 to get 1 3 5 7); type help numlist for more examples.

The advantage of this command over the generic foreach is that Stata will check that each of the elements of the list of numbers is indeed a number.

4.2.4 Looping for a While

In common with many programming languages, Stata also has a while loop, which has the following structure

while condition {
    ... do something ...
}

where condition is an expression. The loop executes as long as the condition is true (nonzero). Usually something happens inside the loop to make the condition false, otherwise the code would run forever.

A typical use of while is in iterative estimation procedures, where you may loop while the difference in successive estimates exceeds a predefined tolerance. Usually an iteration count is used to detect lack of convergence.

The command continue [, break] allows breaking out of any loop, including while, forvalues and foreach. The command stops the current iteration and continues with the next, unless break is specified in which case it exits the loop.
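To make the pattern concrete, here is a sketch that uses Newton's method for the square root of 2. The problem is chosen purely for illustration and does not come from the text, and the scalar names, the tolerance and the limit of 50 iterations are all assumptions.

```stata
scalar root = 1        // starting value
scalar change = 1      // difference between successive estimates
local iter = 0         // iteration count
while abs(change) > 1e-8 & `iter' < 50 {
    scalar next = (root + 2/root)/2
    scalar change = next - root
    scalar root = next
    local ++iter
}
if abs(change) > 1e-8 display as error "no convergence after `iter' iterations"
else display "root = " root " after `iter' iterations"
```

The loop stops either when the change falls below the tolerance or when the iteration count runs out, so it cannot run forever.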

4.2.5 Conditional Execution

Stata also has an if programming command, not to be confused with the if qualifier that can be used to restrict any command to a subset of the data, as in summarize mpg if foreign. The if command has the following structure

if expression {
    ... commands to be executed if expression is true ...
}
else {
    ... optional block to be executed if expression is false ...
}

Here if and the optional else are keywords; type help exp for an explanation of expressions. The opening brace must be the last thing on a line (other than comments) and the closing brace must be on a new line by itself.

If the if or else parts consist of a single command they can go on the same line without braces, as in if expression command. But if expression { command } is not legal. You could use the braces by spreading the code into three lines, and this often improves readability of the code.

So here we have a silly loop where we break out after five of the possible ten iterations:
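A sketch of such a loop, with iter as an assumed macro name:

```stata
forvalues iter = 1/10 {
    display "`iter'"
    if `iter' >= 5 continue, break
}
```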

And with that, we break out of looping.

4.3 Writing Commands

We now turn to the fun task of writing your own Stata commands. Follow along as we develop a couple of simple programs, one to sign your output, and another to evaluate the Coale-McNeil model nuptiality schedule, so we can create a plot like the figure below.

4.3.1 Programs With No Arguments

Let us develop a command that helps label your output with your name. (Usually you would want a timestamp, but that is already available at the top of your log file. You always log your output, right?) The easiest way to develop a command is to start with a do file. Fire up Stata's do-file editor (Ctrl-8) and type:
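A sketch of the do file, consistent with the explanation that follows. The command name sign and the signature text are assumptions; put your own name in the display line.

```stata
capture program drop sign
program define sign
    version 9.1
    display as text "Your Name"     // placeholder signature
    display "{txt}{hline 62}"
end
```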

That's it. If you now type sign Stata will display the signature using the text style (usually black on your screen).

The program drop statement is needed in case we make changes and need to rerun the do file, because you can't define an existing program. The capture is needed the very first time, when there is nothing to drop.

The version statement says this command was developed for version 9.1 of Stata, and helps future versions of Stata run it correctly even if the syntax has changed in the interim.

The last line uses a bit of SMCL, pronounced "smickle" and short for Stata Markup and Control Language, which is the name of Stata's output processor. SMCL uses plain text combined with commands enclosed in braces. For example {txt} sets display mode to text, and {hline 62} draws a horizontal rule exactly 62 characters wide. To learn more about SMCL type help smcl.

4.3.2 A Program with an Argument

To make useful programs you will often need to pass information to them, in the form of "arguments" you type after the command. Let's write a command that echoes what you say
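A sketch of a first version, assuming the command is called echo as in the discussion below:

```stata
capture program drop echo
program define echo
    version 9.1
    display as text "`0'"    // `0' holds everything typed after the command name
end
```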

Try typing echo followed by a few words to see what happens.

When you call a command Stata stores the arguments in a local macro called 0. We use a display command with `0' to evaluate the macro. The result is text, so we enclose it in quotes. (Suppose you typed echo Hi, so the local macro 0 has Hi; the command would read display Hi and Stata will complain, saying 'Hi not found'. We want the command to read display "Hi", which is why we code display "`0'".)

If we don't specify anything, the local macro 0 will be an empty string, the command will read display "" and Stata will print a blank line.

4.3.3 Compound Quotes

Before we go out to celebrate we need to fix a small problem with our new command. Try typing echo The hopefully "final" run. Stata will complain. Why? Because after macro substitution the all-important display command will read

display "The hopefully "final" run"

The problem is that the quote before final closes the initial quote, so Stata sees this as "The hopefully " followed by final" run", which looks to Stata like an invalid name. Obviously we need some way to distinguish the inner and outer quotes.

Incidentally you could see exactly where things went south by typing set trace on and running the command. You can see in (often painful) detail all the steps Stata goes through, including all macro substitutions. Don't forget to type set trace off when you are done. Type help trace to learn more.

The solution to our problem? Stata's compound double quotes: `" to open and "' to close, as in `"The hopefully "final" run"'. Because the opening and closing symbols are different, these quotes can be nested. Compound quotes

  • can be used anywhere a double quote is used.
  • must be used if the text being quoted includes double quotes.

So our program must display `"`0'"'. Here's the final version.
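A sketch of that version, meant to be saved as an ado file rather than rerun from a do file:

```stata
program define echo
    version 9.1
    display as text `"`0'"'    // compound quotes allow embedded double quotes
end
```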

You will notice that I got rid of the capture program drop line. This is because we are now ready to save the program as an ado file. Type sysdir to find out where your personal ado directory is, and then save the file there with the name echo.ado. The command will now be available any time you use Stata.

(As a footnote, you would want to make sure that there is no official Stata command called echo. To do this I typed which echo. Stata replied "command echo not found as either built-in or ado-file". Of course there is no guarantee that they will not write one; Stata reserves all English words.)

4.3.4 Positional Arguments

In addition to storing all arguments together in local macro 0, Stata parses the arguments (using white space as a delimiter) and stores all the words it finds in local macros 1, 2, 3, etc.

Typically you would do something with `1' and then move on to the next one. The command mac shift comes in handy then, because it shifts all the macros down by one, so the contents of 2 are now in 1, and 3 is in 2, and so on. This way you always work with what's in 1 and shift down. When the list is exhausted 1 is empty and you are done.

So here is the canonical program that lists its arguments
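A sketch of such a program, reusing the name echo (an assumption) and compound quotes so that quoted arguments survive:

```stata
capture program drop echo
program define echo
    version 9.1
    while `"`1'"' != "" {
        display `"`1'"'
        mac shift            // move argument 2 into 1, 3 into 2, and so on
    }
end
```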

Don't forget the mac shift, otherwise your program may run forever. (Or until you hit the break key.)

Try echo one two three testing. Now try echo one "two and three" four. Notice how one can group words into a single argument by using quotes.

This method is useful, and sometimes one can give the arguments more meaningful names using args, but we will move on to the next level, which is a lot more powerful and robust.

(By the way one can pass arguments not just to commands, but to do files as well. Type help do to learn more.)

4.3.5 Using Stata Syntax

If your command uses standard Stata syntax, which means the arguments are a list of variables, possibly a weight, maybe an if or in clause, and perhaps a bunch of options, you can take advantage of Stata's own parser, which conveniently stores all these elements in local macros ready for you to use.

A Command Prototype

Let us write a command that computes the probability of marrying by a certain age in a Coale-McNeil model with a given mean, standard deviation, and proportion marrying. The syntax of our proposed command is

So we require an existing variable with age in exact years, and a mandatory option specifying a new variable to be generated with the proportions married. There are also options to specify the mean, the standard deviation, and the proportion ever married in the schedule, all with defaults. Here's a first cut at the command
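A sketch of the prototype and a first cut, under stated assumptions: the command name pnupt, the option names mean, stdev and pem, and the defaults 25, 5 and 1 are reconstructions for illustration, while generate() is the required option named in the text.

```stata
* prototype: pnupt age, generate(married) [ mean(25) stdev(5) pem(1) ]
capture program drop pnupt
program define pnupt
    version 9.1
    syntax varname, Generate(name) [Mean(real 25) Stdev(real 5) Pem(real 1)]
    // nothing is computed yet; syntax only parses and checks the call
end
```

The capital letters in the syntax statement mark the minimum abbreviation of each option, so gen(), m(), s() and p() are all accepted.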

The first thing to note is that the command looks remarkably like our prototype. That's how easy this is.

Variable Lists

The first element in our syntax is an example of a list of variables or varlist. You can specify minima and maxima, for example a program requiring exactly two variables would say varlist(min=2 max=2). When you have only one variable, as we do, you can type varname, which is short for varlist(min=1 max=1).

Stata will then make sure that your program is called with exactly one name of an existing variable, which will be stored in a local macro called varlist. (The macro is always called varlist, even if you have only one variable and used varname in your syntax statement.) Try pnupt nonesuch and Stata will complain, saying "variable nonesuch not found".

(If you have done programming before, and you spent 75% of your time writing checks for input errors and only 25% focusing on the task at hand, you will really appreciate the syntax command. It does a lot of error checking for you.)

Options and Defaults

Optional syntax elements are enclosed in square brackets [ and ]. In our command the generate option is required but the other three are optional. Try these commands to generate a little test dataset with an age variable ranging from 15 to 50
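One way to build such a dataset (a sketch; any method giving ages 15 through 50 would do):

```stata
drop _all
set obs 36
gen age = 14 + _n     // 15, 16, ..., 50
```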

Now try pnupt age. This time Stata is happy with age but notes 'option generate() required'. Did I say syntax saves a lot of work? Options that take arguments need to specify the type of argument (integer, real, string, name) and, optionally, a default value. Our generate takes a name, and is required, so there is no default. Try pnupt age, gen(2). Stata will complain that 2 is not a name.

If all is well, the contents of the option are stored in a local macro with the same name as the option, here generate.

Checking Arguments

Now we need to do just a bit of work to check that the name is a valid variable name, which we do with confirm:
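Inside the program the check would be a single line such as confirm new variable `generate'. The sketch below shows what that command does outside a program, using a one-observation dataset made up for the purpose.

```stata
drop _all
set obs 1
gen age = 15
capture confirm new variable age        // age exists, so the check fails
display _rc                             // 110
capture confirm new variable married    // legal and unused, so the check passes
display _rc                             // 0
```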

Stata then checks that you could in fact generate this variable, and if not issues error 110. Try pnupt age, gen(age) and Stata will say 'age already defined'.

It should be clear by now that Stata will check that if you specify a mean, standard deviation or proportion ever married, abbreviated as m, s and p, they will be real numbers, which will be stored in local macros called mean, stdev and pem. If an option is omitted the local macro will contain the default.

You could do more checks on the input. Let's do a quick check that all three parameters are non-negative and the proportion is no more than one.

You could be nicer to your users and have separate checks for each parameter, but this will do for now.

Temporary Variables

We are now ready to do some calculations. We take advantage of the relation between the Coale-McNeil model and the gamma distribution, as explained in Rodríguez and Trussell (1980). Here's a working version of the program
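A sketch of a working version, with several assumptions flagged. The names pnupt, mean, stdev and pem and the defaults are reconstructions. The constants 0.604, 1.896 and 0.805 are the standardized-schedule values as I understand them from Rodríguez and Trussell (1980) and should be checked against that paper. The parameter check is slightly stricter than "non-negative" for the standard deviation, because we divide by it.

```stata
capture program drop pnupt
program define pnupt
    version 9.1
    syntax varname, Generate(name) [Mean(real 25) Stdev(real 5) Pem(real 1)]
    confirm new variable `generate'
    if `mean' < 0 | `stdev' <= 0 | `pem' < 0 | `pem' > 1 {
        display as error "invalid parameters"
        exit 198
    }
    tempvar z g
    gen `z' = (`varlist' - `mean')/`stdev'                 // standardized age
    gen `g' = gammap(0.604, exp(-1.896 * (`z' + 0.805)))   // cumulative gamma
    gen `generate' = `pem' * (1 - `g')                     // proportion married
end
```

With the test dataset in memory, pnupt age, gen(married) should create proportions that rise with age and never exceed the proportion ever marrying.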

We could have written the formula for the probability in one line but only by sacrificing readability. Instead we first standardize age, by subtracting the mean and dividing by the standard deviation. What can we call this variable? You might be tempted to call it z, but what if the user of your program has a variable called z? Later we evaluate the gamma function. What can we call the result?

The solution is the tempvar command, which asks Stata to make up unique temporary variable names, in this case two to be stored in local macros z and g. Because these macros are local, there is no risk of name conflicts. Another feature of temporary variables is that they disappear automatically when your program ends, so Stata does the housekeeping for you.

The line gen `z' = (`varlist' - `mean')/`stdev' probably looks a bit strange at first. Remember that all quantities of interest are now stored in local macros and we need to evaluate them to get anywhere, hence the profusion of backticks: `z' gets the name of our temporary variable, `varlist' gets the name of the age variable specified by the user, `mean' gets the value of the mean, and `stdev' gets the value of the standard deviation. After macro substitution this line will read something like gen __000000 = (age - 25)/5, which probably makes a lot more sense.

If/In

You might consider allowing the user to specify if and in conditions for your command. These would need to be added to the syntax, where they would be stored in local macros, which can then be used in the calculations, in this case passed along to generate.
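To show the mechanics without repeating the whole nuptiality program, here is a sketch built around a deliberately trivial command. The name mylog and its generate() option are hypothetical, and the auto dataset shipped with Stata supplies the test data.

```stata
capture program drop mylog
program define mylog                     // hypothetical command name
    version 9.1
    syntax varname [if] [in], Generate(name)
    confirm new variable `generate'
    gen `generate' = log(`varlist') `if' `in'   // conditions passed to generate
end

sysuse auto, clear
mylog price if foreign, generate(lprice)
count if missing(lprice)                 // the domestic cars are left missing
```

The macros if and in include the keywords themselves, so they can be appended to a command as is and simply vanish when the user omits them.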

For a more detailed discussion of this subject type help syntax and select if and then in. The entry in help mark is also relevant.

4.3.6 Creating New Variables

Sometimes all your command will do is create a new variable. This, in fact, is what our little pnupt command does. Wouldn't it be nice if we could use an egen type of command like this:

Well, we can! As it happens, egen is user-extendable. To implement a function called pnupt you have to create a program (ado file) called _gpnupt, in other words add the prefix _g. The documentation on egen extensions is a bit sparse, but once you know this basic fact all you need to do is look at the source of an egen command and copy it. (I looked at _gmean.)

So here's the egen version of our Coale-McNeil command.
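A sketch of what such an extension could look like, under the same assumptions as before (the name pnupt, the option names and defaults, and constants that should be checked against Rodríguez and Trussell, 1980). The syntax line follows the pattern used by official egen functions.

```stata
* save as _gpnupt.ado in your personal ado directory so egen can find it
program define _gpnupt
    version 9.1
    syntax newvarname =/exp [, Mean(real 25) Stdev(real 5) Pem(real 1)]
    tempvar z g
    gen `z' = (`exp' - `mean')/`stdev'
    gen `g' = gammap(0.604, exp(-1.896 * (`z' + 0.805)))
    gen `typlist' `varlist' = `pem' * (1 - `g')
end
```

Once the file is in place, a call such as egen married = pnupt(age), mean(25) stdev(5) pem(1) should work like any built-in egen function; the parameter values shown are just the assumed defaults.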

There are very few differences between this program and the previous one. Instead of an input variable egen accepts an expression, which is passed to our function and stored in a local macro called exp. The output variable is specified as a varlist, in this case a newvarname. That's why z now works with exp, and gen creates varlist. The mysterious typlist is there because egen lets you specify the type of the output variable (float by default) and that gets passed to our function, which passes it along to gen.

4.3.7 A Coale-McNeil Fit

We are ready to reveal how the initial plot was produced. The data are available in a Stata file in the demography section of my website, which has counts of ever married and single women by age. We compute the observed proportion married, compute fitted values based on the estimates in Rodríguez and Trussell (1980), and plot the results. It's all done in a handful of lines

The actual estimation can be implemented using Stata's maximum likelihood procedures, but that's a story for another day.

4.4 Other Topics

For lack of time and space I haven't discussed returning values from your program; type help return to learn more. For related subjects on estimation commands which can post estimation results see help ereturn and help _estimates. An essential reference on estimation is Maximum Likelihood Estimation with Stata, Fourth Edition, by Gould, Pitblado and Poi (2010).

Other subjects of interest are matrices (start with help matrix), and how to make commands "byable" (type help byable). For serious output you need to learn more about SMCL; start with help smcl. For work on graphics you may want to study class programming (help class) and learn about sersets (help serset). To provide a graphical user interface to your command try help dialog programming. It is possible to read and write text and binary files (see help file), but I think these tasks are better handled with Mata. You can even write Stata extensions in C, but the need for those has also diminished with the availability of Mata.

Reference

Rodríguez G. and Trussell T.J. (1980). Maximum Likelihood Estimation of the Parameters of Coale's Model Nuptiality Schedule from Survey Data. World Fertility Survey Technical Bulletins, 7.
