Data Sources

This project uses data from various sources that are openly licensed or in the public domain. The sources and their details are listed below:

arXiv

Description: arXiv is a free distribution service and an open-access archive for scholarly articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. All arXiv articles are available under various open licenses or are in the public domain.

API documentation link:

API information:

  • No API key required
  • Query limit: 3-second delay between requests
  • Data format: OAI-PMH XML format with structured metadata fields
  • Metadata includes comprehensive licensing information for each paper
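
As a rough illustration of the OAI-PMH XML format noted above, the sketch below reads the identifier and license from each record in a response. The sample record is hand-written, and the `arXiv` metadata namespace URI, the element names, and the helper name `extract_licenses` are assumptions for illustration rather than details taken from this project's fetch script.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for illustration; verify against a live response.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "arxiv": "http://arxiv.org/OAI/arXiv/",
}

# Hypothetical, hand-written sample record (not real arXiv data).
SAMPLE_XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.00000</id>
          <license>http://creativecommons.org/licenses/by/4.0/</license>
        </arXiv>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def extract_licenses(xml_text):
    """Map each record id to its license URL ("" when no license is given)."""
    root = ET.fromstring(xml_text)
    licenses = {}
    for record in root.iterfind(".//arxiv:arXiv", NS):
        paper_id = record.findtext("arxiv:id", default="", namespaces=NS)
        license_url = record.findtext("arxiv:license", default="", namespaces=NS)
        licenses[paper_id] = license_url
    return licenses


print(extract_licenses(SAMPLE_XML))
```

When fetching several pages of records, a `time.sleep(3)` between requests would respect the delay listed above.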

CC Legal Tools

Description: data/cc-legal-tools.csv contains metadata for all of the Creative Commons legal tools. The file can be updated with the ./dev/update_legal_tools_data.sh command.

API documentation link:

API information:

  • No API key required
  • No query limits

Additional files:

  • dev/update_legal_tools_data.sh: Fetch script to update the CC Legal Tools metadata CSV file
  • data/cc-legal-tools.csv: CC Legal Tools metadata CSV

Europeana

Description: The Europeana Search API provides access to digital cultural heritage metadata records aggregated from museums, libraries, and archives across Europe. This project uses the API to fetch aggregated counts of cultural heritage records by data provider, rights statement, and theme.

Official API Documentation:

API information:

  • API key required
  • Minimum 0.003 seconds between queries
  • Query parameters allow:
    • Full-text searching (query)
    • Retrieving metadata facets (profile=facets)
    • Filtering by data provider, rights statement, and theme
  • Data available through JSON format
  • Offset-based pagination
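
The query parameters above could be combined roughly as follows. This is a minimal sketch: the endpoint URL, the `wskey`, `facet`, `rows`, and `start` parameter names, the `RIGHTS` facet name, and the 1-based offset are assumptions to confirm against the official documentation, and both helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the official API documentation.
EUROPEANA_SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def build_facet_query(api_key, facet, query="*", start=1, rows=0):
    """Build a search URL that requests facet counts for one facet field."""
    params = {
        "wskey": api_key,      # API key (required)
        "query": query,        # full-text search term
        "profile": "facets",   # ask for metadata facets
        "facet": facet,        # e.g. rights statement or data provider
        "rows": rows,          # rows=0 when only the counts are needed
        "start": start,        # offset of the first result (assumed 1-based)
    }
    return f"{EUROPEANA_SEARCH_URL}?{urlencode(params)}"


def page_starts(total, rows):
    """List the start offsets needed to page through `total` results."""
    return list(range(1, total + 1, rows))


print(build_facet_query("YOUR_API_KEY", "RIGHTS"))
print(page_starts(250, 100))
```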

GCS (Google Custom Search) JSON API

Description: The Custom Search JSON API lets users run detailed, user-defined queries against a programmable search engine and retrieve the related result data.

Admin links:

API documentation links:

API information:

  • API key required
  • Query limit: 100 queries per day
  • Data available through JSON format

Notes:

  • Because of the data collection quota, the data from Google Custom Search covers only the 50+ most significant general categories of CC licenses. In addition, the first column of the collected data is sorted by license order of precedence, which reflects intermediate data analysis work.
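
To show how the daily query limit shapes data collection, here is a small sketch that builds a request URL, reads the estimated hit count from a response, and splits a list of queries into batches of at most 100 per day. The endpoint and the `key`, `cx`, and `q` parameters follow the public Custom Search JSON API. The helper names and the batching approach are illustrative assumptions, not this project's implementation.

```python
from urllib.parse import urlencode

GCS_URL = "https://www.googleapis.com/customsearch/v1"
DAILY_QUOTA = 100  # queries per day, from the limit listed above


def build_search_url(api_key, cx, query):
    """Build a Custom Search request URL for one query."""
    return f"{GCS_URL}?{urlencode({'key': api_key, 'cx': cx, 'q': query})}"


def total_results(response):
    """Read the estimated result count, which the API returns as a string."""
    return int(response["searchInformation"]["totalResults"])


def plan_daily_batches(queries, quota=DAILY_QUOTA):
    """Split queries into consecutive batches that each fit one day's quota."""
    return [queries[i:i + quota] for i in range(0, len(queries), quota)]


batches = plan_daily_batches([f"query-{n}" for n in range(250)])
print([len(batch) for batch in batches])
```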

GitHub

Description: A development platform for hosting and managing code.

API documentation link:

API information:

  • API key not required but recommended by GitHub
  • Query limit: 60 requests per hour if unauthenticated, 5,000 requests per hour if authenticated
  • Data available through JSON format
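
The difference between the two limits comes down to whether a token is sent. The sketch below builds request headers with an optional token and works out how long to pause once the quota is used up, based on GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The helper names are hypothetical, and the header dictionary is assumed to use exactly this capitalization.

```python
import time


def build_headers(token=None):
    """Return request headers; a token raises the hourly limit."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def seconds_until_reset(response_headers, now=None):
    """Return how long to wait before the next request (0.0 if quota remains).

    X-RateLimit-Reset is a UTC epoch timestamp in seconds.
    """
    remaining = int(response_headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = int(response_headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)


exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(exhausted, now=1700000000))
```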

Openverse

Description: Openverse is a search engine for openly licensed media, including images and audio. It provides access to over 700 million works from more than 20 sources, all of which are under Creative Commons licenses or in the public domain. The API allows querying for media by source, license type, and other parameters. Because anonymous Openverse API access returns at most ~240 results per source-license combination, the openverse_fetch.py script currently provides only approximate counts. It does not paginate or break counts down by license_version.

API documentation link:

API information:

  • No API key required for basic access
  • Query limit: Rate-limited to prevent abuse (anonymous access provides ~240 results per source-license combination)
  • Data available through JSON format
  • Supports filtering by source, license, media type (images, audio)
  • Media types: images, audio
  • Supported licenses: by, by-nc, by-nc-nd, by-nc-sa, by-nd, by-sa, cc0, nc-sampling+, pdm, sampling+
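
One way to enumerate the source-license combinations mentioned above is sketched here. The endpoint URL, the `source` and `license` query parameter names, and the example source `flickr` are assumptions for illustration, and `build_queries` is a hypothetical helper rather than the code in openverse_fetch.py.

```python
from itertools import product
from urllib.parse import urlencode

# Assumed endpoint for image searches; confirm against the API documentation.
OPENVERSE_IMAGES_URL = "https://api.openverse.org/v1/images/"

# License codes from the list above.
LICENSES = [
    "by", "by-nc", "by-nc-nd", "by-nc-sa", "by-nd",
    "by-sa", "cc0", "nc-sampling+", "pdm", "sampling+",
]


def build_queries(sources, licenses=LICENSES):
    """Yield (source, license, url) for every source-license combination."""
    for source, license_code in product(sources, licenses):
        query = urlencode({"source": source, "license": license_code})
        yield source, license_code, f"{OPENVERSE_IMAGES_URL}?{query}"


for source, license_code, url in build_queries(["flickr"]):
    print(source, license_code, url)
```

Note that `urlencode` percent-encodes the `+` in `sampling+` and `nc-sampling+`, which would otherwise be read as a space in a query string.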

Smithsonian

Description: The Smithsonian Institution Open Access API offers a metrics API for stats about CC0 objects/media.

API documentation link:

API information:

  • API key required
  • Query limit: 1,000 requests per hour
  • Data available through JSON format
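
An hourly limit of 1,000 requests works out to one request every 3.6 seconds if calls are spaced evenly. The sketch below shows one possible way to enforce that spacing. The class name `MinIntervalThrottle` is hypothetical, and even spacing is an assumed strategy, not something the API requires.

```python
import time


class MinIntervalThrottle:
    """Space calls evenly so they never exceed `max_per_hour`."""

    def __init__(self, max_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.interval = 3600.0 / max_per_hour  # seconds between calls
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least `interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = MinIntervalThrottle(max_per_hour=1000)
print(throttle.interval)
```

Calling `throttle.wait()` before each request keeps a long-running fetch under the limit without tracking a per-hour counter.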

Wikipedia

Description: The Wikipedia API allows users to query statistics on pages, categories, and revisions from a public API endpoint. This project uses two URLs: WIKIPEDIA_BASE_URL and WIKIPEDIA_MATRIX_URL. WIKIPEDIA_BASE_URL provides access to articles, categories, and metadata from English Wikipedia. It runs on the MediaWiki Action API, and this instance serves English Wikipedia data only. WIKIPEDIA_MATRIX_URL provides access to information about all Wikimedia projects, including the different language editions of Wikipedia. It runs on the Meta-Wiki API.

API documentation link:

API information:

  • No API key required
  • Query limit: Rate-limited only to prevent abuse
  • Data available through XML or JSON format
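
To make the roles of the two URLs concrete, the sketch below builds one request for each: site statistics from English Wikipedia and the site matrix from Meta-Wiki. The URL values assigned to the two constants are assumptions, since this document does not state them, and the helper names are hypothetical. The `action=query&meta=siteinfo&siprop=statistics` and `action=sitematrix` parameters are standard MediaWiki API actions.

```python
from urllib.parse import urlencode

# Assumed values; the project defines these constants, but their exact
# values are not shown in this document.
WIKIPEDIA_BASE_URL = "https://en.wikipedia.org/w/api.php"
WIKIPEDIA_MATRIX_URL = "https://meta.wikimedia.org/w/api.php"


def site_statistics_url(base_url=WIKIPEDIA_BASE_URL):
    """Build a request for site-wide statistics (articles, pages, edits)."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"{base_url}?{urlencode(params)}"


def site_matrix_url(matrix_url=WIKIPEDIA_MATRIX_URL):
    """Build a request listing all Wikimedia projects and language editions."""
    return f"{matrix_url}?{urlencode({'action': 'sitematrix', 'format': 'json'})}"


print(site_statistics_url())
print(site_matrix_url())
```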