Merged Data Sets

The following documents and presentations highlight our current progress toward merging data sets. More documents related to merged data sets are also available.

Describing the Processes of Ontology Generation

Describing the Process of DEMO Generation

Previous and Competing Data Resources

Addressing Problems of Missing Data

Feature Projects in Historical Data (Opium and Silver)

Background

This section of the Dataverse website, still under construction, lets users link and merge datasets drawn from the Pitt Archive. It centers on an interface for exploring multiple datasets and selecting fields or whole datasets, which are then assembled into new composite (or “federated”) datasets. Users can then analyze the federated datasets to seek out relationships within them.
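
As a rough illustration of selecting fields and assembling a federated dataset, the sketch below treats each dataset as a list of records. It is not the Dataverse interface itself: the function names (select_fields, federate), the field names, and the sample rows are all hypothetical and chosen only for the example.

```python
def select_fields(dataset, fields):
    """Return copies of the rows restricted to the chosen fields."""
    return [{f: row.get(f) for f in fields} for row in dataset]


def federate(selections):
    """Combine selected rows from several datasets into one list,
    tagging each row with the name of the dataset it came from."""
    combined = []
    for source, rows in selections.items():
        for row in rows:
            combined.append({"source": source, **row})
    return combined


# Hypothetical sample data (not drawn from the Archive).
trade = [{"year": 1850, "port": "Canton", "chests": 120, "notes": "n/a"}]
prices = [{"year": 1850, "silver_price": 3.5}]

federated = federate({
    "trade": select_fields(trade, ["year", "chests"]),
    "prices": select_fields(prices, ["year", "silver_price"]),
})
```

Keeping the source name on every row is one possible design choice: it lets later analysis trace each value back to the dataset it was drawn from.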

While the ultimate objective of the World-Historical Dataverse project is the creation and display of a worldwide historical dataset, such a large-scale dataset will take years to create, even in its most preliminary form. The initial task in the project, the collection of datasets of world-historical relevance and their placement in the Archive, does not in itself yield the interconnections in data that are required for a world-historical dataset.

For these reasons, the creation of merged datasets is the principal interim task of the Dataverse project. Merging datasets is itself complicated by inconsistencies among datasets in the definition and coding of evidence. For instance, variables for “time period” might present data as annual, monthly, or decennial totals, and might include data for certain years but not for others. Merging therefore requires a range of criteria. The most exacting criterion is that the time frame and other variables be fully consistent across all datasets. Other criteria and assumptions will be developed to enable the merging of datasets whose temporal and other variables are not fully consistent.
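
The difference between an exacting and a more permissive merging criterion can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function names (to_annual, merge_by_year), the strict flag, and the sample figures are hypothetical, and the project's actual criteria are still to be developed. It first brings monthly data to annual totals, then merges two series either on the years they share or on every year that appears in either.

```python
from collections import defaultdict


def to_annual(records):
    """Sum (year, month, value) records into {year: total}.

    Assumption: a year with only some months recorded still gets a
    total, which will understate the true annual figure. A fuller
    version would check month coverage first.
    """
    totals = defaultdict(float)
    for year, _month, value in records:
        totals[year] += value
    return dict(totals)


def merge_by_year(a, b, strict=True):
    """Merge two {year: value} mappings into {year: (a_value, b_value)}.

    strict=True  keeps only years present in both datasets.
    strict=False keeps every year and marks gaps with None.
    """
    years = set(a) & set(b) if strict else set(a) | set(b)
    return {y: (a.get(y), b.get(y)) for y in sorted(years)}


# Hypothetical sample data.
monthly_exports = [(1850, 1, 10.0), (1850, 2, 12.0), (1851, 1, 9.0)]
annual_prices = {1850: 3.5, 1852: 4.0}

annual_exports = to_annual(monthly_exports)
strict_merge = merge_by_year(annual_exports, annual_prices, strict=True)
relaxed_merge = merge_by_year(annual_exports, annual_prices, strict=False)
```

Under the strict criterion only 1850 survives, since it is the one year both series cover. The relaxed merge keeps 1851 and 1852 as well, leaving the missing values explicit so that later analysis can decide how to treat them.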