Process | Project TIER | Teaching Integrity in Empirical Research

The Process guidelines of the TIER Protocol outline a workflow for your entire research project that will help you keep organized and enhance your own understanding of the data processing and analysis you do.

This workflow emphasizes that constructing replication documentation for a project should be an integral part of your work throughout the entire research process, not a discrete task that you postpone until the end. You should begin constructing your documentation before you even start working with your data, and add to it incrementally as your research progresses.

By the time you have finished the data processing and analysis for your project and are ready to write the final paper, only a few additions and checks should be required to produce comprehensive replication documentation that meets the Specifications of the TIER Protocol.

The three major phases of your research

You should consider your research in terms of three major phases: pre-data, data work, and wrap-up.

Pre-data overview

In the pre-data phase there are two tasks to complete before you begin working with your data:

Construct a hierarchy of empty folders

You will use these folders to save your work while you conduct your research, and when you are finished you will store your complete replication documentation in them.

Create three blank documents

You will record information in these documents as your work progresses, and at the end they will be part of your replication documentation.

Data Work overview

The work you do with your data can be broken down into the following steps:

Documenting your original data

Each time you obtain a new file containing data you will use for your project, you should preserve a copy to include in your replication documentation, and record some information about it in one of the blank documents you created.

Writing command files that process your data

In these command files, you write code that transforms the data files with which you begin your project into the data files you use to generate the results for your paper.

Constructing a Data Appendix

This document serves as a codebook for your analysis data files.

Writing command files that generate your results

In these command files, you write code that executes the procedures or analyses that generate the results you report in your paper.

Wrap-up overview

After you finish the analysis and write your paper, a few steps remain to complete the replication documentation:

Writing a Read Me file

As described above, you create a blank Read Me file at the very beginning of the project. As described later in this document, you will record some information in the Read Me file at various points in your work. Other information that should be in the Read Me file can only be entered after you have completed your paper.

Proofreading and testing your command files

Proofread your command files to be sure that they are concise and that you have provided sufficient comments, and test them to be sure that they run without errors and successfully replicate your study.

Making a final check of all your replication documentation

Check that your replication documentation contains all the components described in the specifications of the TIER Protocol, and that they are stored in the hierarchy of folders and sub-folders called for by the Protocol.

Pre-data

Before you start working with your data

Construct a hierarchy of empty folders

Begin by making a hierarchy of empty folders and sub-folders to store your work in.

You will use these folders to save data files, command files and other documents as you assemble and construct them throughout the course of your work on the project.

When you have finished, you will store the replication documentation for the completed project in these folders.

Choose a safe place to save your work

Decide where you will keep the various files you will be working with over the course of your project—such as a dedicated folder on the hard-disk of your computer, a server maintained by your institution, or a web-based platform like Dropbox, GitHub or the Open Science Framework (OSF).

Choose a place that is secure and stable, and easily accessible to everyone who will be working on the project.

Decide on a reliable system for backing up your work.

Create the hierarchy of folders and sub-folders specified by the TIER Protocol.

The hierarchy of folders and sub-folders is described in detail in the specifications of the TIER Protocol, and illustrated below.

[Figure: An illustration of the TIER folder hierarchy]

If you prefer, you may simply download a pre-made set of folders with the appropriate hierarchy.

Save the folders you created in the safe place you chose to save your work.

Throughout your work on the project, you will be adding new documents to these folders and editing ones that you have already stored there.
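If you would rather create the folders with a script than by hand, the following is a minimal sketch in Python. It assumes the folder names used elsewhere in this guide (Original Data with a Metadata sub-folder, Command Files, Analysis Data, and Documents); the project name "my_project" and the function name are hypothetical, and the specifications of the TIER Protocol remain the authority on the exact layout.

```python
from pathlib import Path

# Folder names taken from this guide; check them against the
# TIER Protocol specifications before relying on this list.
TIER_FOLDERS = [
    "Original Data/Metadata",
    "Command Files",
    "Analysis Data",
    "Documents",
]


def create_tier_folders(project_root):
    """Create the empty folder hierarchy under project_root and return the paths."""
    root = Path(project_root)
    created = []
    for relative in TIER_FOLDERS:
        folder = root / relative
        # parents=True also creates "Original Data" on the way to "Metadata";
        # exist_ok=True makes the script safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    # "my_project" is a hypothetical project name.
    for folder in create_tier_folders("my_project"):
        print(folder)
```

Because existing folders are left alone, running the script a second time does not disturb any work already saved in them.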

Create three blank documents

Using whatever word-processing or text-editing software you choose, create three blank documents that will eventually become part of your replication documentation.

You will add information to these documents throughout the process of conducting your research.

The three documents to create are:

A Read Me file
A Metadata Guide
A Data Appendix

Type an appropriate title at the beginning of each document

These hypothetical examples illustrate the kinds of titles you might choose:

"Read Me File for Econometrics Project by A. Smith and B. Jones."
"Metadata Guide for J. Student Senior Thesis."
"Data Appendix for 'The Economics and Politics of Popular Stringed Instruments,' by U. K. Laylee."

Other than their titles, leave these documents blank.

Give these documents appropriate names, and save them in the appropriate folders.

Give your Read Me file the name ReadMe.EXT, and save it in your Documents folder.
Give your Metadata Guide the name MetadataGuide.EXT, and save it in your Metadata folder (which is a sub-folder of your Original Data folder).
Give your Data Appendix the name DataAppendix.EXT, and save it in your Documents folder.

In the above file names, .EXT represents the extension used on the names of documents created with your word-processing or text-editing software. (For example, if you are using Microsoft Word, .EXT would be replaced by .docx.)
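For plain-text documents, the three files can also be created with a few lines of code. This is only a sketch, assuming Python and a ".txt" extension; the placeholder titles (PROJECT, AUTHORS) and the function name are hypothetical and should be replaced with your own.

```python
from pathlib import Path

# (folder, file name without extension, title) for each of the three documents.
# The titles are placeholders; edit them to describe your own project.
DOCUMENTS = [
    ("Documents", "ReadMe", "Read Me File for PROJECT by AUTHORS"),
    ("Original Data/Metadata", "MetadataGuide", "Metadata Guide for PROJECT"),
    ("Documents", "DataAppendix", "Data Appendix for PROJECT"),
]


def create_blank_documents(project_root, extension=".txt"):
    """Create each document containing only its title; never overwrite a file."""
    root = Path(project_root)
    created = []
    for folder, stem, title in DOCUMENTS:
        path = root / folder / (stem + extension)
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.write_text(title + "\n", encoding="utf-8")
            created.append(path)
    return created
```

The function returns only the files it actually created, so calling it on a project that already has these documents changes nothing.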

You should not begin working with your data until you have completed all the "Pre-data" steps described above.

Data work

Working With Your Data

Documenting Your Original Data

Any document from which you extract statistical data to use in your project is called an original data file.

In some cases, all of the data for a project may come from a single original data file; in others, the data are extracted from two or more original data files.

Each time you obtain an original data file, there are several things you should do immediately:

Save a copy of the original data file in your Original Data folder

You may give the file a new name when you save it in the Original Data folder, but other than that the copy you save should be identical to the original version of the data file. The contents and format should not be modified in any way.
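One way to make sure the saved copy really is identical to the file you obtained is to copy it with a script and compare checksums. The sketch below assumes Python; the function names are hypothetical, and the Original Data folder is assumed to exist already.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_original(source, original_data_folder, new_name=None):
    """Copy source into the Original Data folder and confirm the copy is identical."""
    source = Path(source)
    destination = Path(original_data_folder) / (new_name or source.name)
    if destination.exists():
        raise FileExistsError(f"{destination} already exists; refusing to overwrite")
    shutil.copy2(source, destination)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(destination):
        raise OSError(f"copy of {source} does not match the original")
    return destination
```

Renaming is allowed through new_name, but the contents are copied byte for byte, which matches the rule that only the name may change.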

Enter information about the original data file in your Metadata Guide

The Metadata Guide is one of the three documents you created before you started working with your data.

Each time you obtain a new original data file, add a section about that file to the Metadata Guide.

Begin the section with a header that identifies the original data file it pertains to (e.g., “Metadata for penn_tables_1986_2010.txt”).

Review the specifications for the Metadata Guide given in the TIER Protocol, then enter the relevant information about the original data file in your Metadata Guide.

If you enter the required information right away each time you obtain a new original data file, your Metadata Guide will be complete well before you finish your project.
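If your Metadata Guide is a plain-text file, a small helper can start the new section for you each time you add an original data file. This is a sketch only, assuming Python; the function name is hypothetical, and the content of the section still has to be written by hand according to the specifications of the TIER Protocol.

```python
from pathlib import Path


def add_metadata_section(guide_path, data_file_name):
    """Append an empty section header for data_file_name unless one already exists."""
    guide = Path(guide_path)
    header = f"Metadata for {data_file_name}"
    existing = guide.read_text(encoding="utf-8") if guide.exists() else ""
    if header in existing.splitlines():
        return False  # a section for this file is already present
    with guide.open("a", encoding="utf-8") as handle:
        handle.write(f"\n{header}\n\n")
    return True
```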

Create a version of your original data file in a format that your software can read

When you need to create a new version of a data file to make it possible for your software to read it, the new version is called an importable data file.

When you create an importable data file, you should make only the minimal changes necessary to make it possible for your software to read the data. You should not modify the data in the file in any way. For example, you should not create new variables, delete variables or cases, or reshape the data. Although the format in which the data are saved will be different, the data in an importable data file should be identical to the data in the original data file.

This restriction is important. To ensure that your work can be completely replicated, you need to write command files, in the syntax of whatever statistical software you are using, that execute all the processing and analysis of your data required to generate the results you report in your paper: from the point at which you first open your original data files, through all the cleaning and processing necessary to prepare them for analysis, to the procedures that finally generate the results. After you have completed the project, an interested reader could replicate your study simply by running your command files. Changes you make to the data when you create an importable version of an original data file cannot be executed by commands written in a command file; they therefore cannot be automatically reproduced.

When you create an importable version of an original data file, keep both the original and the importable version in the Original Data folder.

Give the importable version of the file a name that reminds you it is the importable version of the original data file from which it was created. For example, if the original data file is called gdp_growth.sav, give the importable version a name like i_gdp_growth.dta. (The “i_” prefix is a reminder that the file is “importable”; the change in the extension reflects the change in the format in which the data file is saved.)
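The naming convention is simple enough to express as a one-line rule. The sketch below, in Python, only builds the name; it does not create the importable file, and the function name is hypothetical.

```python
from pathlib import Path


def importable_name(original_name, new_extension):
    """Return the "i_"-prefixed name for the importable version of a file."""
    stem = Path(original_name).stem  # file name without its extension
    if not new_extension.startswith("."):
        new_extension = "." + new_extension
    return f"i_{stem}{new_extension}"


print(importable_name("gdp_growth.sav", ".dta"))  # prints i_gdp_growth.dta
```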

Write an explanation in your Read Me file

For each importable data file you create, write an explanation in your Read Me file describing the steps you took to create the importable version from the original data file.

This may be the first time you enter any information (other than the title) in your Read Me file.

As described in the TIER Protocol Specifications, these explanations of the modifications you made to your importable data files will constitute section 2 of your Read Me file. At this point, you do not need to worry about what information the completed Read Me file will include and how it will be organized. The essential thing is just that, each time you create an importable version of a data file, you should make a note in the Read Me file that gives the names of both the original and importable files, and explains precisely (in complete, grammatically correct sentences) the steps you took to create the importable version from the original.

If all of your original data files are in formats that your software is able to read, so that it is not necessary to create importable versions of any of them, you may simply omit section 2 from the Read Me file. (In that case, you should call the last section of the Read Me file section 2 instead of section 3.)

Processing Your Data

The processing phase of a project consists of all the steps involved in transforming your original data files (or the importable versions) into the fully cleaned and processed analysis data files that you use to generate your results.

All of the commands necessary for processing your data must be written in a command file, or in several command files that can be run sequentially. When you have finished writing these command files, executing them will automatically conduct all the procedures necessary to transform your original (or importable) data files into your analysis data files.

Writing, experimenting with and editing these command files until they successfully carry out the necessary steps of processing is the focal point of the work you do with your data.

Transform your data files into analysis data files

In one or more command files, write code that transforms your importable data files into your analysis data files.

Exactly what steps of processing are required varies a great deal, but examples of some common procedures include:

Having your software open your importable data files.
Cleaning the data to resolve any errors or discrepancies.
Removing variables or cases that you do not need.
Combining data from different importable data files.
Transposing a data table so that columns become rows and rows become columns.
Generating new variables.
Saving intermediate and analysis data files.
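To make these steps concrete, here is a minimal sketch of a processing command file, assuming Python and an importable data file saved as CSV. The variable names (country, year, gdp, population, notes) and the function name are hypothetical; your own processing steps will depend on your data.

```python
import csv
from pathlib import Path

# Variables kept in the analysis data file, in column order.
FIELDNAMES = ["country", "year", "gdp", "population", "gdp_per_capita"]


def process(importable_path, analysis_path):
    """Read an importable CSV file, clean it, and save an analysis data file."""
    # Open the importable data file.
    with open(importable_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    processed = []
    for row in rows:
        # Remove cases that are not needed: here, rows with missing GDP.
        if row["gdp"].strip() == "":
            continue
        gdp = float(row["gdp"])
        population = float(row["population"])
        processed.append({
            # Clean the data: strip stray whitespace from country names.
            "country": row["country"].strip(),
            "year": int(row["year"]),
            "gdp": gdp,
            "population": population,
            # Generate a new variable.
            "gdp_per_capita": gdp / population,
            # The unneeded "notes" variable is removed by not copying it.
        })

    # Save the analysis data file.
    Path(analysis_path).parent.mkdir(parents=True, exist_ok=True)
    with open(analysis_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(processed)
    return processed
```

Every change to the data happens inside the command file, so rerunning it on the same importable file reproduces the same analysis file.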

Decide how to organize the code that processes your data into one or more command files

It is always possible to put all the necessary commands in a single file.

But in many cases separating different parts of the processing phase into different command files can help you keep track of what you are doing.

The best way of dividing your data processing among command files will depend on the particulars of your project, but the following scheme often works well.

For every importable data file you have, write one command file that reads the data it contains, cleans the data as necessary to prepare them for merging with the data from the other importable data files, and then saves them in a new file, in the native format of the software you are using.

Then write one additional command file that merges all these natively formatted files, processes them as necessary to construct the analysis data files, and then saves the analysis data files in your software’s native format.

Depending on the number of importable data files you have and how the data in them are organized, other schemes for dividing the processing phase among your command files may be more convenient; you should use whatever scheme you find works best for your project.

Whatever scheme you choose, you will explain in your Read Me file (in section 3) the order in which the command files need to be run to replicate your project.
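If your command files are Python scripts, the run order can also be captured in a short master script, which complements the written explanation in the Read Me file rather than replacing it. This is a sketch under that assumption; the file names in PROCESSING_ORDER and the function name are hypothetical.

```python
import runpy

# Hypothetical file names: one command file per importable data file,
# followed by one that merges the results into the analysis data files.
PROCESSING_ORDER = [
    "Command Files/clean_gdp.py",
    "Command Files/clean_population.py",
    "Command Files/merge_and_save.py",
]


def run_in_order(command_files):
    """Run each command file in sequence; an error in one stops the rest."""
    for path in command_files:
        print(f"Running {path}")
        # run_path executes the script as if it were run directly.
        runpy.run_path(path, run_name="__main__")
```

Calling run_in_order(PROCESSING_ORDER) from the project folder would then execute the whole processing phase in the documented order.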

Save your command files and your analysis data files in the appropriate folders

Save the command files that process your data and create your analysis data file(s) in your Command Files folder. Save the analysis data file(s) in your Analysis Data folder.

Constructing your Data Appendix

The Data Appendix provides information about all the variables in your analysis data files, such as their names, definitions, coding, and summary statistics.

The Data Appendix serves as a codebook and users’ guide for your analysis data files.

The Data Appendix is one of the three documents you created before you began working with your data.

You should construct your Data Appendix as soon as you have finished writing the command files that create your analysis data files.

When you construct your Data Appendix, you are likely to learn things about your data that you should know before you begin your analysis. So you should not begin the analysis until after you have constructed your Data Appendix.

Review the TIER Protocol Specifications for the information that should be included in the Data Appendix

The Specifications of the TIER Protocol describe the information that should be included in the Data Appendix.

To summarize the specifications briefly: The Data Appendix should provide information about every variable in your analysis data files, including definitions and coding (for all variables), summary statistics and histograms (for quantitative variables), and relative frequency tables and charts (for categorical variables).

Generate the descriptive statistics, tables and figures for the Data Appendix

Write a command file that generates all the descriptive statistics, tables and figures needed for the Data Appendix. These should be created using the data in your analysis data files.

Give this command file the name DataAppendix.CMD, where CMD represents the extension your statistical software uses for the names of command files.

Save DataAppendix.CMD in your Command Files folder.
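As an illustration of what such a command file might compute, the sketch below uses only Python's standard library to produce summary statistics for a quantitative variable and a relative frequency table for a categorical one. The function names are hypothetical, missing values are assumed to be coded as None, and the histograms and charts called for by the specifications would need a plotting tool that is not shown here.

```python
import statistics
from collections import Counter


def summarize_quantitative(values):
    """Summary statistics for one quantitative variable, ignoring missing values.

    Requires at least two non-missing observations (for the standard deviation).
    """
    observed = [v for v in values if v is not None]
    return {
        "n": len(observed),
        "missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "sd": statistics.stdev(observed),  # sample standard deviation
        "min": min(observed),
        "max": max(observed),
    }


def relative_frequencies(values):
    """Relative frequency table for one categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}
```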

Finish composing the Data Appendix, inserting the descriptive statistics, tables and figures in the appropriate places

When you have finished, save the Data Appendix in your Documents folder.

Analyzing your data

In this phase of your work you perform the procedures on your analysis data files that generate the figures, tables and other statistical results you report in your paper.

The results consist of all the findings you report in your paper that are based on computations performed on your analysis data files. They may be presented in various forms, including tables, figures, and numerical values reported in the text of the paper.

As in the processing phase, composing command files that execute all the necessary procedures is central.

In one or more command files, write code that generates all the results you report in your paper

These command files should contain commands that open up your analysis data files, and then use those data to generate the output upon which your results are based.

Every command that generates any of your results should be preceded by a comment that states which result the command generates. A few hypothetical examples illustrate what these comments might look like:

* The following command generates the first column of Table 6.

* The following command generates the second column of Table 6.

* The following command generates Figure 4.

/* The following command generates the correlation of 0.31 between the variables INC (individual annual income, reported in the natural log of current US dollars) and SATIS (individual subjective self-report of overall satisfaction with life, on a scale of 0—least satisfied—to 10—most satisfied). This correlation is reported on page 27 of the paper. */

The command files for your analysis phase should not contain any commands that generate new variables or process your data in any way. All the procedures required to prepare your data for analysis should be executed by the command files you wrote for the processing phase.
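In a Python command file, the same commenting convention might look like the sketch below. It assumes an analysis data file saved as CSV with the variables INC and SATIS from the example above; the function names and the assignment of the mean to Table 6 are hypothetical. Note that the code only reads the analysis data and computes results; it creates no new variables.

```python
import csv
import math


def load_columns(analysis_path, names):
    """Open an analysis data file (CSV) and return the named columns as floats."""
    with open(analysis_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return {name: [float(row[name]) for row in rows] for name in names}


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def generate_results(analysis_path):
    """Generate the results reported in the paper from the analysis data file."""
    data = load_columns(analysis_path, ["INC", "SATIS"])

    # The following command generates the correlation between INC and SATIS
    # reported on page 27 of the paper.
    correlation = pearson_correlation(data["INC"], data["SATIS"])

    # The following command generates the mean of SATIS reported in the
    # first column of Table 6 (a hypothetical table entry).
    mean_satis = sum(data["SATIS"]) / len(data["SATIS"])

    return {"corr_INC_SATIS": correlation, "mean_SATIS": mean_satis}
```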

It is often convenient to write all the commands for the analysis phase in a single command file. However, if the nature of your project or the structure of your data is such that you think it would make sense to divide the code that generates the results into two or more command files, you should feel free to do so. No matter how you organize your analysis command files, your Read Me file will include an explanation of how to use them to reproduce your results.

Save the command files you write for the analysis phase in the Command Files folder.

Wrap-up

Wrapping things up on your project

If you follow the research process described here, you construct your replication documentation incrementally throughout the course of your work on the project. By the time you have completed your final paper, your replication documentation should also be nearly complete.

Finishing Your Read Me File

The Read Me file is one of the three documents you created before you began working with your data.

Review the specifications for the information that should be included in the Read Me file given in the Specifications of the TIER Protocol.

You should already have recorded one part of the required information, namely notes explaining any modifications you made to the original data files when you made importable versions of them.

To finish your Read Me file, you should add the other items specified by the TIER Protocol:

An overview of all the files included in the replication documentation, and the structure of the folders in which they are stored.
Step-by-step instructions for using the replication documentation to replicate the study.

Proofreading and Testing Your Command Files

In the course of your project, you constructed command files that processed your data, produced the descriptive statistics and figures for your Data Appendix, and executed the analyses and procedures that generated the results you reported in your paper.

But before you consider your command files to be complete and ready to store in your final replication documentation, you should edit them for accuracy and clarity, and test them to be sure they reproduce the results of your project as intended.

Editing Your Command Files

Edit all your command files to be sure they are accurate, concise, and free of detritus.

Remove any commands that turned out not to be necessary for your project.
If you realize in hindsight that the code you wrote to execute any of the procedures could be rewritten in a simpler or more streamlined way, then revise the code accordingly.
Be sure the comments in your command files are extensive and clear enough to allow someone else to understand what is accomplished in each step of data processing and analysis.

Testing Your Command Files

Test your command files to be sure that they all run without error and that they successfully reproduce the results you reported in your paper.

Try following the instructions for replicating your project that you wrote in the Read Me file to be sure that all your command files run without a hitch and produce the intended output.
If you encounter any errors or crashes, diagnose and fix the problem, and then start the test over.
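Part of this test can be automated by comparing the numbers your command files regenerate with the numbers printed in your paper. The following is a sketch only, assuming Python and results stored as name-to-number dictionaries; the function name and the default tolerance (suited to values rounded to two decimal places) are assumptions.

```python
import math


def compare_results(reported, regenerated, tolerance=0.005):
    """Return a list of discrepancies between reported and regenerated results."""
    problems = []
    for name, expected in reported.items():
        if name not in regenerated:
            problems.append(f"{name}: not regenerated")
        elif not math.isclose(expected, regenerated[name], abs_tol=tolerance):
            problems.append(
                f"{name}: reported {expected}, regenerated {regenerated[name]}"
            )
    return problems
```

An empty list means every reported value was reproduced within the tolerance; anything else points to the result that needs to be diagnosed before the test is started over.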

Making a Final Check of All Your Replication Documentation

Before you consider your replication documentation to be complete and final, check to be sure that it satisfies all the requirements of the TIER Documentation Protocol.

First, review the specifications of the TIER Documentation Protocol.

Then check that:

All the required files are included in your replication documentation, and that they are stored in the correct folders.
The content and format of every file meet the specifications of the TIER Protocol.

Finally, delete any extraneous files that are not called for by the TIER Protocol. Your replication documentation should contain only the files specified by the TIER Protocol, or that you intentionally chose to include for a particular purpose.
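A script can help with the mechanical part of this check by listing required files that are missing and files stored outside the expected folders. The sketch below assumes Python; the function name is hypothetical, and the lists of required files and expected folders must be supplied by you from the specifications of the TIER Protocol. It does not check the content or format of any file.

```python
from pathlib import Path


def check_documentation(project_root, required_files, expected_folders):
    """Return (missing, unexpected): required files that were not found, and
    files stored outside the expected top-level folders."""
    root = Path(project_root)
    missing = [name for name in required_files if not (root / name).is_file()]
    unexpected = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root)
            # A file directly in the root has only one path component.
            if len(relative.parts) == 1 or relative.parts[0] not in expected_folders:
                unexpected.append(relative.as_posix())
    return missing, unexpected
```

Files reported as unexpected are candidates for deletion, but review each one first, since you may have chosen to include it for a particular purpose.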

After you check that your replication documentation contains everything that it should, and that it does not contain anything extraneous, you may consider it complete.
