Input Data Files

The data files you initially obtain or construct to use in a project are called Input Data Files.

You should not make any changes to the contents or format of your Input Data Files: the copies you save in the Input Data Folder should be identical to the files you first obtained or constructed, before you modified them in any way.
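One way to check that the files in your Input Data Folder are still identical to the ones you first obtained is to record a checksum for each file when you save it and compare against that record later. The sketch below is only an illustration of that idea: the function names are hypothetical, and recording checksums is an assumption added here, not a step this page prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(folder):
    """Map the name of each file directly inside `folder` to its digest."""
    return {
        p.name: sha256_of(p)
        for p in sorted(Path(folder).iterdir())
        if p.is_file()
    }


def changed_files(folder, recorded):
    """Return names from `recorded` that are now missing or have changed."""
    current = record_checksums(folder)
    return sorted(
        name for name, digest in recorded.items()
        if current.get(name) != digest
    )
```

You might call `record_checksums("InputData")` right after saving the files, keep the result with your notes, and later confirm that `changed_files("InputData", recorded)` returns an empty list.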

Where Do Input Data Files Come From?

Existing datasets

In some cases, you obtain Input Data Files from existing datasets.

    Existing datasets can be obtained from a wide variety of sources, such as government or international agencies, non-profit research organizations, academic institutions, individual scholars, and for-profit businesses.

    In most cases, you obtain existing datasets by downloading them from a website maintained by the owner or distributor of the data.

    If the project is a lab exercise or homework assignment, your instructor might make the Input Data Files available to you simply by posting them on a course management platform or some other site where you can access them.

Datasets you generate yourself

In other cases, you generate the data yourself.

    There are many ways you can collect or generate the Input Data Files for a project. A few common examples include:

    Conducting a survey

    If you use a web-based tool such as Qualtrics or Google Forms, the survey responses may be automatically stored in a spreadsheet.

    If you have respondents complete a paper survey, you will need to enter the data into a spreadsheet.

    Running an experiment

    If you use experimental software such as z-Tree or Psychtoolbox, results of the experiment may be stored automatically in a data file or spreadsheet.

    If results of the experiment are recorded on paper, you will need to enter the data into a spreadsheet.

    Web scraping

    You can also create Input Data Files by writing scripts that collect data from the Internet and save it in a spreadsheet.
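As a rough illustration of such a script, the sketch below downloads a web page, pulls the cell text out of its HTML table rows, and saves the rows as a CSV file that any spreadsheet program can open. It uses only the Python standard library. The class and function names are hypothetical, and it assumes the page contains a simple table with explicit closing tags; real pages often need more careful handling.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def save_rows(rows, path):
    """Write the rows to a CSV file at `path`."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def scrape_to_csv(url, path):
    """Download the page at `url` and save its table rows to `path`."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    save_rows(table_rows(html), path)
```

The file written by `scrape_to_csv` would then be the Input Data File, saved unmodified, and the script itself is part of the record of how the file was constructed.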

Documenting Your Input Data Files

It is important to keep a record of information about the sources and contents of your Input Data Files.

Details about the information you need to assemble can be found on the pages for the Data Sources Guide and Codebooks.

Restricted Access Data

In some cases, sharing your Input Data Files publicly may not be allowed because of concerns about individual privacy or intellectual property rights.

    For the work you do privately on your computer, data-sharing restrictions may have little impact.

    If you (and any collaborators and/or an advisor for the project) have authorization to use the data, then it may be permissible to store your Input Data Files in your InputData/ folder, and follow the standard recommendations of the TIER Protocol as you proceed with your research.

    Note: It is very important to pay close attention to any conditions of use you agreed to when you were given access to the data, and to follow them faithfully. For example, in some cases a user or team of users is given permission to store a restricted dataset on their own computer, but only under the condition that the computer is not connected to the Internet.

    For the reproduction documentation you post publicly when you have completed your report, you must take care not to include restricted data.

    If an Input Data File may not be shared publicly, remove it from the InputData/ folder before you post your documentation.

    When you omit an Input Data File from your documentation because of data-sharing restrictions, you should provide information explaining how an interested user can get access to it.

    This information should be provided in the Data Sources Guide, and it should include:

    • The source from which the file is available (e.g., give the URL of a website, or the name of the data owner or distributor).
    • What steps must be taken to obtain authorization to use the data.
    • Enough additional details to enable a user with access to the data source to find or extract a data file identical to the Input Data File you used for the project (but had to remove from the public reproduction documentation).

    Even if you are not permitted to post any of your Input Data Files, you should still include an (empty) InputData/ folder in your documentation.

    Once the user has been granted access and obtained copies of the restricted data files, they will store copies of them in the InputData/ folder. When they have done that, the files in the user's InputData/ folder should be identical to those you would have stored there if you were allowed to share the data publicly.
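The steps described above for preparing public documentation can be sketched in a short script. The version below is a minimal illustration under stated assumptions: it builds a separate public copy of the project instead of deleting files from your working copy, it treats any folder named InputData as the Input Data Folder, and the function name and the list of restricted file names are hypothetical.

```python
import shutil
from pathlib import Path


def make_public_copy(project_dir, public_dir, restricted_names,
                     input_folder_name="InputData"):
    """Copy `project_dir` to `public_dir` without restricted Input Data Files.

    `public_dir` must not exist yet and must lie outside `project_dir`.
    Returns the sorted names of the files that were left out.
    """
    restricted = set(restricted_names)
    omitted = []

    def ignore(folder, names):
        # Filter only folders named like the Input Data Folder;
        # everything else is copied unchanged.
        if Path(folder).name != input_folder_name:
            return []
        skip = [name for name in names if name in restricted]
        omitted.extend(skip)
        return skip

    # copytree still creates the InputData/ folder in the copy,
    # even when every file inside it is skipped.
    shutil.copytree(project_dir, public_dir, ignore=ignore)
    return sorted(omitted)
```

The returned list of omitted names can serve as a checklist of the files whose source and access procedure need to be explained for users.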

Naming Input Data Files

Your default practice should be to leave the name of each Input Data File unchanged from the name it had when you first obtained it. This is in keeping with the principle that your Input Data Files should be identical to the files you originally obtained for the project.

In some cases, however, Input Data Files will originally have long, complicated names that do not describe or label the file in any useful way. In those cases, you should change the name of the Input Data File to something shorter, simpler, and descriptive.

    For example, if you download an extract from the World Bank's World Development Indicators database, the name of the file might be something like 65ed4f53-5b19-432f-b221-6b1978e4a315_Data.csv. That name doesn't give any clue about the contents of the file, and it would be a hassle to write that entire name every time you refer to the file. It would be advisable to change the name to something like WDI_inequality.csv.

But remember: If you decide to change the name of an Input Data File, you must leave the filename extension (e.g., .csv, .txt, .xlsx, .dta, .sav, .Rdata) unchanged.
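If you rename files with a script instead of by hand, a small helper can make it hard to change the extension by accident. The following is a minimal sketch; the function name is hypothetical, and it treats only the final dot-separated part of the name as the extension.

```python
from pathlib import Path


def rename_keep_extension(path, new_stem):
    """Rename the file at `path` to `new_stem` plus its original extension."""
    path = Path(path)
    # path.suffix is the last extension only, e.g. ".csv".
    target = path.with_name(new_stem + path.suffix)
    if target.exists():
        # Refuse to overwrite another file that already has the new name.
        raise FileExistsError(f"{target} already exists")
    path.rename(target)
    return target
```

For instance, calling `rename_keep_extension` on a file named 65ed4f53-5b19-432f-b221-6b1978e4a315_Data.csv with the new stem `"WDI_inequality"` would produce WDI_inequality.csv in the same folder.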