Digital Archive Database Design


The individual files described above are linked to one another in a relational database, as shown schematically in Figure 1 (below).  This design gives researchers maximum flexibility in choosing the unit of analysis appropriate to a particular study.  For example, one analysis may use the manuscript as the unit of analysis and compare how different manuscripts fare based on attributes of the manuscript, such as submission year or topic, as well as attributes of the authors, such as gender, race, or type of institution.  Another analysis may use authors as the unit of analysis and compare how many manuscripts authors with different characteristics submit (or have accepted), or what types of manuscripts they submit.  Finally, the unit of analysis may be the reviewer, and reviewers may be compared on how likely they are to accept (or reject) manuscripts based on their own characteristics or the characteristics of the manuscript or its author(s).

This type of flexibility would be difficult, if not impossible, if researchers were given a single rectangular file with individuals (or manuscripts, authors, or reviews) as the unit of analysis.  To begin with, a single file would be extremely large because of the number of variables needed to accommodate the rare individuals who submitted many manuscripts (one person submitted 44 manuscripts, some of which are original submissions and others resubmissions) or prepared many reviews (another person prepared 193 reviews).  In a public use file, where names are eliminated, each possible submission or review is associated with 16 different variables describing the manuscript, the author, and the reviewer.  Allowing for all of this information for the maximum number of submissions and reviews would require (44 + 193) × 16 = 3,792 variables for each person, even though most of these variables would be empty, since most individuals submitted far fewer manuscripts or reviews.  Since the person file contains 9,867 individuals who have submitted one or more manuscripts and/or reviews, researchers would need to set software workspace requirements to 37,415 KB.  Beyond the difficulty of working with such an extremely large file, analyses based on a single-file solution would require users to perform a variety of aggregation procedures and to use lag or lead functions to combine information across individuals.
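The arithmetic behind these figures can be reproduced in a few lines of Python. This is an illustrative sketch only: the counts are taken from the text, while the sizing rule (one byte per cell, 1 KB = 1,000 bytes) is our assumption, chosen because it reproduces the quoted 37,415 KB.

```python
# Counts quoted in the text above.
MAX_SUBMISSIONS = 44   # most manuscripts submitted by any one person
MAX_REVIEWS = 193      # most reviews prepared by any one person
VARS_PER_SLOT = 16     # variables per possible submission or review
N_PERSONS = 9_867      # individuals in the person file

# A single wide file needs one block of variables per possible
# submission or review, for every person, whether used or not.
slots_per_person = MAX_SUBMISSIONS + MAX_REVIEWS
wide_variables = slots_per_person * VARS_PER_SLOT
wide_cells = N_PERSONS * wide_variables

# Assumption (not stated in the source): one byte per cell, 1 KB = 1,000 bytes.
approx_kb = wide_cells // 1000

print(wide_variables)  # 3792
print(wide_cells)      # 37415664
print(approx_kb)       # 37415
```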

The relational database design allows users to work with smaller files and relatively simple one-to-one or one-to-many matches to create files with the appropriate unit of analysis.  Specific examples of these tasks are provided in the section below titled “Creation of Analysis Files”.  At this point, however, five general points are worth noting.

  1. As described above, data from the Journal Builder is used to populate the manuscript, review, and author files.
  2. Manuscript and version numbers are the keys to matching information across these three file types.
  3. The curated file name serves as the key to link to PDF versions of specific manuscript versions or specific reviews.
  4. Data from the ASA Grad Guide, the permissions survey, and ASA membership data is used to populate the person file.
  5. The unique person identification number provides the key to match individual-level data from the person file directly to authors and reviews, and then indirectly to a specific manuscript.
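The key relationships listed above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the archive's actual schema: every table name, column name, and data value below is hypothetical and chosen only to show how manuscript/version keys and the person identification number support different units of analysis.

```python
import sqlite3

# All table names, column names, and rows are hypothetical illustrations.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, institution_type TEXT);
CREATE TABLE manuscript (ms_no INTEGER, version INTEGER, decision TEXT,
                         PRIMARY KEY (ms_no, version));
CREATE TABLE author (ms_no INTEGER, version INTEGER, person_id INTEGER);
CREATE TABLE review (ms_no INTEGER, version INTEGER, person_id INTEGER,
                     recommendation TEXT);

INSERT INTO person VALUES (1, 'doctoral'), (2, 'baccalaureate'), (3, 'doctoral');
INSERT INTO manuscript VALUES (101, 1, 'revise'), (101, 2, 'accept'),
                              (102, 1, 'reject');
INSERT INTO author VALUES (101, 1, 1), (101, 1, 2), (101, 2, 1), (101, 2, 2),
                          (102, 1, 2);
INSERT INTO review VALUES (101, 1, 3, 'revise'), (102, 1, 1, 'reject'),
                          (102, 1, 3, 'reject');
""")

# Unit of analysis: manuscript version.
# Manuscript and version numbers match the manuscript and author files.
author_counts = con.execute("""
    SELECT m.ms_no, m.version, m.decision, COUNT(a.person_id)
    FROM manuscript AS m
    JOIN author AS a ON a.ms_no = m.ms_no AND a.version = m.version
    GROUP BY m.ms_no, m.version, m.decision
    ORDER BY m.ms_no, m.version
""").fetchall()

# Unit of analysis: reviewer.
# The person ID matches person-level attributes directly to reviews.
reviewer_counts = con.execute("""
    SELECT p.person_id, p.institution_type, COUNT(*)
    FROM review AS r
    JOIN person AS p ON p.person_id = r.person_id
    GROUP BY p.person_id, p.institution_type
    ORDER BY p.person_id
""").fetchall()

# Unit of analysis: author.
# The person ID reaches manuscripts indirectly, through the author file.
manuscripts_per_author = con.execute("""
    SELECT p.person_id, COUNT(DISTINCT a.ms_no)
    FROM person AS p
    JOIN author AS a ON a.person_id = p.person_id
    GROUP BY p.person_id
    ORDER BY p.person_id
""").fetchall()

print(author_counts)
print(reviewer_counts)
print(manuscripts_per_author)
```

Each query starts from small files and a one-to-many match, so changing the unit of analysis only means changing the `GROUP BY` columns rather than reshaping one very wide file.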

Figure 1. Archive Schematic Database Diagram