Accessing Open-Source Data from Data.gov

From E-Learning Faculty Modules


Module Summary

In a democratic society with a functioning government, data is generated as part of the regular work of service provision, administration, and oversight. Some of this data is de-identified and released to the tax-paying public and others as part of government transparency. The U.S. government (particularly its executive branch) releases such open-source data through its data.gov portal, which has been in existence since May 2009. The datasets may be downloaded along with related metadata for use in online learning about data structures, data visualizations, statistical analysis, government responsibilities, and other topics. Currently, some 155,976 datasets are hosted on the site.

Takeaways

Learners will...

  • learn about data.gov as a repository for open-source data from the U.S. government
  • consider some ways to use open-source data in online learning
  • explore common data formats and available metadata in data.gov (and consider ways to properly cite the data)
  • review the steps to downloading data from data.gov
  • consider what may be knowable from the downloadable data and some potential uses of the downloadable data

Module Pretest

1. What is data.gov? What are the various types of data on data.gov?

2. What are some ways that the data in data.gov may be used for online learning? What are the terms of use?

3. What are common data formats in data.gov? What types of metadata are there in data.gov? How should the respective datasets be cited?

4. What are the steps to downloading the data from data.gov?

5. What are simple ways to evaluate what is knowable from the available data? How can you assess the potential uses of the downloadable data?

Main Contents

This section contains content linked to the questions in the module pretest.


1. What is data.gov? What are the various types of data on data.gov?

“data.gov” is the name of a U.S. government website through which the Technology Transformation Service of the U.S. General Services Administration shares datasets for public consumption. The site was launched in late May 2009 at the behest of the first-ever U.S. Federal Chief Information Officer, Vivek Kundra. The idea was to enhance confidence in government by increasing the transparency of its functions. According to an article about this site: “The U.S. Open Directive of December 8, 2009, requires that all agencies post at least three high-value data sets online and register them on data.gov within 45 days” (“data.gov,” April 13, 2017). The underlying data is used to power various apps. The slogan for the site is “the home of the U.S. Government’s open data,” and the site offers “data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.”

What are the data topics? As listed on the site, they include the following: agriculture, climate, consumer, ecosystems, education, energy, finance, health, local government, manufacturing, maritime, ocean, public safety, and science & research.


2. What are some ways that the data in data.gov may be used for online learning? What are the terms of use?

In terms of licensing, the site’s policies stipulate that all the data on the site is “offered free and without restriction. Data and content created by government employees within the scope of their employment are not subject to domestic copyright protection under 17 U.S.C. § 105” (“Privacy and Website Policies,” 2017). The site also hosts some data from non-federal sources, such as universities and states. In those cases, the “Access and Use” section on the respective dataset pages will address licensing.


3. What are common data formats in data.gov? What types of metadata are there in data.gov? How should the respective datasets be cited?

The data formats are varied and include the following: Excel (xls, xlsx); HTML; csv; JSON; RDF; WAF; IMG (image files); WMS; EsriREST; PDF; gml; txt; KMZ; LAS; TAR (archive format); ArcGrid; GridFloat; and others. Some are in Access database format (.accdb). The contents range across structured, semi-structured, and unstructured data. There are also zipped folders of various data contents.
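As a rough illustration of the structured, semi-structured, and unstructured distinction, the following Python sketch sorts file names into broad categories by extension. The mapping and the `categorize` helper are simplifying assumptions made for this example, not a data.gov classification; many formats (geospatial ones especially) do not fit neatly into these three groups and fall under "other" here.

```python
from pathlib import PurePosixPath

# Assumed, simplified mapping from file extension to a broad data category.
CATEGORY_BY_EXTENSION = {
    ".csv": "structured",
    ".xls": "structured",
    ".xlsx": "structured",
    ".accdb": "structured",
    ".json": "semi-structured",
    ".rdf": "semi-structured",
    ".gml": "semi-structured",
    ".html": "semi-structured",
    ".txt": "unstructured",
    ".pdf": "unstructured",
}


def categorize(file_name: str) -> str:
    """Return the assumed category for a file name, or "other" if unknown."""
    extension = PurePosixPath(file_name).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(extension, "other")


# Made-up file names, for illustration only.
for name in ["inspections.csv", "Report.PDF", "tiles.kmz"]:
    print(name, "->", categorize(name))
```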

The metadata about the respective datasets is generally available in one of two places. There may be a downloadable metadata file on the landing page where the dataset is hosted, or the metadata may be within the dataset file itself. A downloadable metadata file may come in any of various file formats.

For source citations, it appears that users of the data will have to collect the information themselves: name of the dataset, date of release, years of the data, organization, related URL, and so on. The order and formatting of the dataset citation will vary depending on the citation style used.
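The fields listed above can be gathered into a small record and formatted consistently. The following is a minimal Python sketch, assuming a simple style modeled on this module's own reference list (title, year, organization, URL). The `DatasetCitation` class, its field names, and the sample values are hypothetical; they are not a data.gov standard or an official citation style.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """Citation details collected by hand from a dataset's landing page."""
    title: str
    organization: str
    release_year: int
    url: str
    data_years: str = ""  # optional span of years that the data covers

    def as_reference(self) -> str:
        """Return 'Title (data years). (Release year). Organization. URL.'"""
        title = self.title
        if self.data_years:
            title = f"{title} ({self.data_years})"
        return f"{title}. ({self.release_year}). {self.organization}. {self.url}."


# Hypothetical example values, for illustration only.
citation = DatasetCitation(
    title="Example Crop Yields",
    organization="Example Agency",
    release_year=2017,
    url="https://example.org/dataset/example-crop-yields",
    data_years="2010-2015",
)
print(citation.as_reference())
```

A different bibliographic method would only require changing `as_reference`; the collected fields stay the same.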


4. What are the steps to downloading the data from data.gov?

There are multiple ways to search through the data.gov site.

On the main page, https://www.data.gov/, searchers may browse by topic. Again, the topics include (in alphabetical order): agriculture, climate, consumer, ecosystems, education, energy, finance, health, local government, manufacturing, maritime, ocean, public safety, and science & research. At the top of each topic section is a “Highlights” area that features a selected dataset. By scrolling down, a user may read short snippets about other datasets, with the dates of release shown as part of the metadata. Clicking on the title of a snippet links to the landing page for that dataset. Some links connect the user to an external data portal, where more navigation and searching may be required.

To go right to the datasets, a researcher or instructor may go to the dataset catalog and do a text-based search of the datasets. The datasets may be further filtered by location with the “Filter by location” feature. The direct link to the catalog is as follows: https://catalog.data.gov/dataset.
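The text-based catalog search can also be expressed as a URL. The Python sketch below assumes that the catalog accepts the search text in a `q` query parameter appended to the catalog link above, which is how CKAN-based catalogs commonly behave. That parameter and the helper names `build_search_url` and `download_file` are assumptions for illustration and should be checked against the live site.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

CATALOG_URL = "https://catalog.data.gov/dataset"  # catalog link given above


def build_search_url(search_text: str) -> str:
    """Build a catalog search URL (the `q` parameter is an assumption)."""
    return f"{CATALOG_URL}?{urlencode({'q': search_text})}"


def download_file(file_url: str, local_path: str) -> str:
    """Save the file at file_url to local_path and return the saved path."""
    saved_path, _headers = urlretrieve(file_url, local_path)
    return saved_path


print(build_search_url("public safety"))
```

Once a dataset's landing page lists a direct file link, that link can be passed to `download_file` together with a local file name. The function is defined here but not called, since it needs network access and a real file URL.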


5. What are simple ways to evaluate what is knowable from the available data? How can you assess the potential uses of the downloadable data?

The breadth of the data types on this site makes it difficult to summarize approaches comprehensively. (The writer is not familiar with all the data types possible here, only some.) Basically, it helps to download the dataset, open it up, and see what’s there. In a structured dataset, a perusal of the column headers and the data in the columns is usually sufficient to get a sense of what the dataset covers and what may be studied through simple mathematical, statistical, and frequency count methods. In more complex sets with larger amounts of textual data, it is possible to run that text data through software programs to analyze various aspects of the words. If URLs are included in a dataset, there can be studies of related websites. If map data is included, it is possible to place datapoints and other information on maps.
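For a structured dataset in CSV form, perusing the column headers and running frequency counts might look like the following minimal sketch. It uses only the Python standard library; the sample data and the `summarize_columns` helper are made up for illustration and assume a well-formed file in which every row has the same columns as the header.

```python
import csv
import io
from collections import Counter

# Small made-up sample standing in for a downloaded CSV file.
SAMPLE_CSV = """state,category,incidents
KS,fire,12
KS,flood,3
MO,fire,7
MO,fire,5
"""


def summarize_columns(csv_file):
    """Return the column headers and a per-column Counter of cell values."""
    reader = csv.DictReader(csv_file)
    headers = list(reader.fieldnames)
    counts = {name: Counter() for name in headers}
    for row in reader:
        for name in headers:
            counts[name][row[name]] += 1
    return headers, counts


headers, counts = summarize_columns(io.StringIO(SAMPLE_CSV))
print(headers)
print(counts["category"].most_common())
```

For a real download, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by a file opened with `open(path, newline="")`.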

How the data may be used and how assignments may be created around these datasets will depend on the teaching and learning context and the learning domain. A simple assignment may be to explore the dataset and make three factual observations from the empirical data. Another simple assignment may be to have the learners work in teams to exploit various aspects of the data, and then to have the teams share their insights in a “jigsaw” type of assignment. Learners may be asked to extract basic descriptive statistics from a column in a dataset. Or they may be asked to create accurate data visualizations from the downloaded data. Or they may be asked to accurately place data on a geographical map for an informative data visualization.
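The descriptive statistics assignment could be approached along the lines of the sketch below, which uses Python's standard `csv` and `statistics` modules. The sample data, the column name `rainfall_mm`, and the `describe_column` helper are hypothetical; skipping blank cells is one possible way to handle missing values, not the only one.

```python
import csv
import io
import statistics

# Small made-up sample; station C has a missing value.
SAMPLE_CSV = """station,rainfall_mm
A,10.0
B,14.0
C,
D,12.0
"""


def describe_column(csv_file, column):
    """Return basic descriptive statistics for one numeric column.

    Blank cells are skipped; non-numeric cells raise ValueError.
    """
    values = [
        float(row[column])
        for row in csv.DictReader(csv_file)
        if row[column].strip()
    ]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


stats = describe_column(io.StringIO(SAMPLE_CSV), "rainfall_mm")
for name, value in stats.items():
    print(f"{name}: {value}")
```

Learners can compare such figures against what they see when they scan the column by eye, which helps with debriefing what the numbers mean.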

Instructors may add more complexity by augmenting the learning with other related datasets or data sources, in the context of more complex analytical assignments.

Examples

(Forthcoming.)

How To

In the “Main Contents” section above, in #4, there are directions for how to download datasets. In #5, there are some light suggestions on how to think about assignments.

Possible Pitfalls

The data.gov data portal offers access to a rich range of government-created data. This resource offers plenty of learning opportunities. The datasets are carefully de-identified. While some datasets are more informative than others, there does seem to be a wide range of selections. According to the site's privacy policy, no private data is collected on users of the site. The terms of use are light and not onerous.

The risks come from the following:

  • the setup of the assignments
  • the quality of instructor oversight in helping debrief what the respective datasets mean
  • the amount of instructor guidance and engagement with the type of work done by the learners

All the risk factors are within the control of the instructor.

Module Post-Test

1. What is data.gov? What are the various types of data on data.gov?

2. What are some ways that the data in data.gov may be used for online learning? What are the terms of use?

3. What are common data formats in data.gov? What types of metadata are there in data.gov? How should the respective datasets be cited?

4. What are the steps to downloading the data from data.gov?

5. What are simple ways to evaluate what is knowable from the available data? How can you assess the potential uses of the downloadable data?

References

Applications. (2017). Data.gov. https://www.data.gov/applications.

data.gov. (2017, Apr. 13). Wikipedia. Retrieved Apr. 30, 2017, from https://en.wikipedia.org/wiki/Data.gov.

The home of the U.S. Government’s open data. (2017). Data.gov. https://www.data.gov/.

Privacy and Website Policies. (2017). Data.gov. https://www.data.gov/privacy-policy.

Extra Resources

Data.gov. https://www.data.gov/