The Challenge: How should agencies be directed to supply more data for data.gov, and which data? How can the government work with third parties to make data.gov even more useful? What Government-wide approaches to data and metadata should we undertake to ensure data transparency?

What We’ve Heard from You: In the Discussion Phase you shared with us very thoughtful comments on data.gov as well as general comments on the importance of publishing government data in raw form. We welcome your recommendation or collection of recommendations. Feel free to remix content from other drafts and collaborate with other authors to create an ideal draft.

Drafting Directions: Review the comments from the Discussion blog as well as comments made by government employees and review the submissions in From the Inbox. Incorporating earlier input, you may write your own draft, or combine and edit those of others to create a new one.

Writing policy requires translating good ideas into clear, specific directions for practical implementation. Hence a good recommendation will be no more than 4 sentences and a set of recommendations will be no more than 1 page. To be of maximum use, a recommendation should address:

- Who is being directed to do something? (e.g. All agencies must)
- What is the institution being directed to do?
- Why is it important that they do so?
- How will success be measured?

Note that per the terms of use, your drafts are expected to be (among other things) civil and on-topic. We are depending upon you, the community, to help maintain the quality of this process by reporting drafts that appear to violate these terms. Once reported a sufficient number of times, drafts will be submitted for moderator review. They will then be republished in their original place, republished as "off-topic" drafts, or archived off-line if they cannot be republished.

In order to make the abundance of data that's already available online more visible, and thus useful, a Government Universe map page should be created. At the highest level this universe would display six galaxies: the Executive, Congressional, Judicial, States, Business Sectors, and Public Sector galaxies. Each galaxy would have its major components circling around it as stars.

The Executive Galaxy would have the Presidency at its center with a star for each cabinet position surrounding it.

The Congressional Galaxy would be a binary system with two big stars at its center, the House of Representatives being one star and the Senate being the other. The major committees of these two bodies of Congress would surround their respective stars.

The Judicial Galaxy would have the Supreme Court at its center. The stars surrounding it could be either the Appeals Courts or specific areas of law.

The States Galaxy would have the word States at its center and have 50 stars circling it (with maybe a couple of more for territories and the like).

The Business Sectors Galaxy would have the words Business Sectors at its center and stars with labels like healthcare, manufacturing, construction, banking, etc. circling it.

The Public Sector Galaxy would have the words Public Sector at its center and be surrounded by stars from all sorts of different groups: churches, environmental organizations, the NRA, etc.

That would be the highest level view. Off to the side there would be boxes you could click on that would let you drill down for a closer look. For example, one box might say "Pending legislation in the Senate". Clicking on it might give you a dropdown box listing bills by subject area. When you picked a bill you'd be taken back to the map, and lines would appear connecting the committee the bill was being considered in to all of the stars in the other galaxies that would be affected by it. The depth to which you could drill down would be limited only by the budget and imagination of the map's creators.

For people who are visually oriented this would be a much faster way to get to the piece of data they're looking for. It could also be a very useful tool in school classrooms to show how things are connected.

All agencies shall proactively make all public data available. In cases where data is not made available, the privacy or security exception will be recorded and the quantity and origin of such exceptions will be made public.

Data will be made available in a timely manner. To determine reasonable timeliness, each agency will submit to the CIO a list of the data streams it produces and the frequency with which they are updated. The CIO will approve the minimum timeline for releasing each data stream and will publish a list of all data streams in a searchable data catalog.
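
As an illustration of the searchable catalog described above, here is a minimal sketch in Python. The `DataStream` fields, the helper names, and the two sample entries are assumptions made for this example; the draft itself does not prescribe a schema.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataStream:
    agency: str                   # producing agency (sample names are hypothetical)
    title: str
    update_frequency: timedelta   # how often the agency updates the stream
    max_release_delay: timedelta  # approved deadline for publishing each update


def search_catalog(catalog, keyword):
    """Return streams whose title or agency contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [s for s in catalog
            if needle in s.title.lower() or needle in s.agency.lower()]


def is_timely(stream, actual_delay):
    """True if an update was published within the approved release window."""
    return actual_delay <= stream.max_release_delay


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    DataStream("Example Agency A", "Daily air quality readings",
               timedelta(days=1), timedelta(days=2)),
    DataStream("Example Agency B", "Quarterly grant awards",
               timedelta(days=91), timedelta(days=30)),
]
```

With a catalog like this, the annual agency score could count how many updates satisfied `is_timely`.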

Data must be machine-readable. Each record in the data should include well-defined identifiers that persist across revisions. Metadata standards for use across agencies will be defined and published in a centralized federal repository.
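
To show why persistent identifiers matter, the following sketch compares two revisions of a data set by record identifier. The function name and the `id` field are assumptions for this example, not part of any existing standard.

```python
def diff_revisions(old, new, key="id"):
    """Compare two revisions of a data set, matching records by a persistent id.

    Each revision is a list of dicts; `key` names the identifier field
    (an assumed field name for this sketch).
    """
    old_by_id = {record[key]: record for record in old}
    new_by_id = {record[key]: record for record in new}
    shared = old_by_id.keys() & new_by_id.keys()
    return {
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        "removed": sorted(old_by_id.keys() - new_by_id.keys()),
        "changed": sorted(k for k in shared if old_by_id[k] != new_by_id[k]),
    }
```

Without stable identifiers, a consumer cannot tell a corrected record from a deleted record followed by a new one.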

Data must be released in multiple formats, including at least one non-proprietary format.

Data must be granular, searchable, and from the source.

Individuals must be able to access and export data anonymously.

Data must not be copyrighted or otherwise limited by licensing.

Data must be of reliably high quality. The office of the CIO will be responsible for overseeing randomized, independent audits of 1% of released data. Data sets will be available for public rating. Data flagged as unreliable will be publicly marked as such and reviewed within a reasonable timeframe.
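
One way the randomized 1% audit could be drawn is sketched below. The function name and the `one_in` and `seed` parameters are assumptions for illustration; publishing the seed would let third parties confirm that the sample was not hand-picked.

```python
import math
import random


def select_audit_sample(record_ids, one_in=100, seed=None):
    """Pick roughly one record in every `one_in` for an independent audit.

    At least one record is chosen from any non-empty data set. Passing a
    seed makes the selection reproducible.
    """
    ids = sorted(record_ids)  # fixed order, so the same seed gives the same sample
    if not ids:
        return []
    sample_size = max(1, math.ceil(len(ids) / one_in))
    rng = random.Random(seed)
    return sorted(rng.sample(ids, sample_size))
```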

Data will proactively be shared with individuals via RSS, email, and other methods based on requested keyword(s)/tags. Agencies will enable public "communities of practice" around data of common interest. Individuals must be able to suggest new data streams and receive answers to questions in a timely manner. A robust, plain language FAQ should be provided.
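
The keyword-based notification described above could work roughly as follows. This is a sketch only: the function name and the shape of the `subscriptions` mapping are assumptions, and actual delivery (RSS, email) is left out.

```python
def match_subscribers(dataset_tags, subscriptions):
    """Return the subscribers who asked for at least one of the data set's tags.

    `subscriptions` maps a subscriber name to the keywords that person
    requested. Matching ignores letter case.
    """
    tags = {tag.lower() for tag in dataset_tags}
    return sorted(
        subscriber
        for subscriber, keywords in subscriptions.items()
        if tags & {keyword.lower() for keyword in keywords}
    )
```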

The CIO will score each agency based on these criteria annually, the results of which will be published. Within agencies, the timely and proactive release of data will be built into performance reviews and rewarded.

21st Century Right to Know Recommendations: Government-Wide Data Standards

Effective government transparency is contingent upon the tools and technologies that provide access: the type of technologies deployed by the government to disseminate information; and the extent to which the general public can access and incorporate pertinent and accurate government information into their everyday decision making processes.

The Chief Technology Officer should lead efforts to standardize technology policy for public access across the government and within agencies. Specifically the position should be tasked with the following goals:

- Data standards for sharing information, interagency, reporting, open programming
- Open programming/interface policy
- Transparency and dissemination
- Control privacy and identity theft
- Security of system
- Procurement (to have coherency)
- Contract review
- Technical capacity
- Interagency coordination (and intergovernmental)
- Data quality

The CIO should work more directly on developing and promoting cross-agency interactive and public-facing applications and services for citizens and businesses as originally conceived in the E-Government Act.

Data.gov has the potential to be an extremely powerful tool for providing technical data to non-governmental users. However, government information comes in many forms in addition to XML and CSV. Other types of information, such as scientific reports or policy documents, will require a different type of information portal.

The Office of Science and Technology Policy (OSTP) should take the lead in redesigning science.gov to be a comprehensive source for the government's scientific reports and documents.

Currently, science.gov serves as a curated list of bookmarks to other agency websites, sorted by broad topic categories. This is undoubtedly a useful service, but the site should be re-envisioned as one-stop shopping for government scientific information. The site could help citizens identify agency subject-matter experts, scientific activities and research programs by topic and geographic area, so that they do not need to know in advance which agency has the information they seek.

Machine Readable Data/Metadata

21st Century Right to Know Recommendations: Government-Wide Data Standards

Calls for "common data formats" are telling but often lack concrete implications. Specific requirements for both data and metadata formats are vital to achieving information that consumers or machines can link together.

The CTO should promote a common data and metadata format to be used across all public data production. The format should be part of the requirements specified for data-producing federal programs, so that data consumers can trust APIs and bulk files to be consistent over time and across agencies.

To make data integration easier, the schema must allow for defining machine readable metadata for the three major types of public data:

1) Public reference data should be declared in official namespaces with identifiers of common categories (like US States) and enumerating their members with their official identifiers (State codes), names, descriptions, hierarchical relations and other useful properties such as geo-coordinates. Common categories should be defined in a centralized federal namespace. Individual agencies should define their unique categories in the same format in their own namespaces.
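
A reference-data category of the kind described in item 1 might be modeled as in the sketch below. The class names, the `urn:example:...` namespaces, and the county entry are hypothetical; only the state postal codes are real.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Member:
    code: str                     # official identifier, e.g. a state code
    name: str
    parent: Optional[str] = None  # code of the parent member, if hierarchical


@dataclass
class Category:
    namespace: str                # hypothetical namespace of the defining body
    name: str
    members: Dict[str, Member] = field(default_factory=dict)

    def add(self, code, name, parent=None):
        self.members[code] = Member(code, name, parent)

    def qualified_id(self, code):
        """Build a globally unique identifier for a member of this category."""
        if code not in self.members:
            raise KeyError(f"{code!r} is not a member of {self.name}")
        return f"{self.namespace}:{self.name}:{code}"

    def children_of(self, parent_code):
        """List the codes of members whose parent is `parent_code`."""
        return sorted(m.code for m in self.members.values()
                      if m.parent == parent_code)


# A common category in an assumed central namespace.
states = Category("urn:example:federal", "us-state")
states.add("CA", "California")
states.add("TX", "Texas")

# An agency-specific category in its own assumed namespace.
counties = Category("urn:example:agency-x", "county")
counties.add("C001", "Example County", parent="CA")
```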

2) Public records provide data for recurring observations with the same properties, such as cases, incidents or survey answers. Some examples are building permits, marriages, reported car accidents, economic transactions, and Senate bills. The machine readable metadata should define which columns make up the primary key that identifies individual rows. These specifications should make it easy for a data consumer to integrate public records with other data that use the same official categories.
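
The sketch below shows how a data consumer might use declared primary-key metadata for public records. The metadata layout, the column names, and the function name are assumptions for illustration.

```python
# Hypothetical metadata published alongside a building-permit file.
PERMIT_METADATA = {"primary_key": ["state", "permit_number"]}


def index_by_key(rows, primary_key):
    """Index rows by the declared primary key, rejecting duplicate keys.

    `rows` is a list of dicts and `primary_key` is the list of column
    names that together identify one row.
    """
    index = {}
    for row in rows:
        key = tuple(row[column] for column in primary_key)
        if key in index:
            raise ValueError(f"duplicate primary key: {key}")
        index[key] = row
    return index
```

Once rows are indexed this way, two releases of the same file, or two files that share a key, can be joined without guessing which columns identify a record.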

3) Public statistics are derived numeric measures such as counts, percentages, rates and indexes that are comparable across one or multiple categories. Some examples are the Consumer Price Index, number of deaths, and average income. The definition of numeric measures should be provided in official namespaces. The mathematical formulas used to derive numbers should be presented in a machine readable way as part of the metadata. Processes that include human estimation should be described in footnotes that are tied to the individual estimates.
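
One possible way to present a formula in machine readable form is a small expression tree stored as JSON, as sketched below. The measure name, the JSON layout, and the operator names are assumptions for this example, not an existing federal format.

```python
import json
import operator

# Operators allowed in formulas (an assumed, deliberately small vocabulary).
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}

# Hypothetical metadata for a derived measure: deaths per 100,000 population.
MEASURE_JSON = """
{"measure": "deaths_per_100k",
 "formula": ["div", ["mul", "deaths", 100000], "population"]}
"""
measure = json.loads(MEASURE_JSON)


def evaluate(formula, inputs):
    """Evaluate a formula tree against named input values.

    A formula is a number, the name of an input, or a list of the form
    [operator, left, right].
    """
    if isinstance(formula, (int, float)):
        return formula
    if isinstance(formula, str):
        return inputs[formula]
    op, left, right = formula
    return OPS[op](evaluate(left, inputs), evaluate(right, inputs))
```

A consumer could then recompute the published figure from the underlying counts by calling `evaluate(measure["formula"], {"deaths": ..., "population": ...})`.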

Datasets should have identifiers and versions. The metadata about a dataset should provide a machine readable inventory of what categories and measures appear in what tables. There should be permalinks that always point at the latest release. Previous versions should be available in an archive.
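
The identifier, version, permalink, and archive requirements could be served by a small registry like the sketch below. The class, its method names, and the dotted numeric version scheme are assumptions for illustration.

```python
class DatasetRegistry:
    """Track releases of each data set so that 'latest' always resolves."""

    def __init__(self):
        self._releases = {}  # dataset_id -> {version tuple: (version, location)}

    def publish(self, dataset_id, version, location):
        # Parse "1.10" into (1, 10) so that versions sort numerically.
        key = tuple(int(part) for part in version.split("."))
        self._releases.setdefault(dataset_id, {})[key] = (version, location)

    def latest(self, dataset_id):
        """Resolve the permalink: return the newest (version, location)."""
        releases = self._releases[dataset_id]
        return releases[max(releases)]

    def archive(self, dataset_id):
        """Return all earlier releases, oldest first."""
        releases = self._releases[dataset_id]
        newest = max(releases)
        return [releases[key] for key in sorted(releases) if key != newest]
```

Comparing versions as tuples of integers keeps release 1.10 ahead of release 1.9, which plain string comparison would get wrong.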