COVID-19 dataset clearinghouse
Data cleaning proposal
Instructions for posting a request for a data set to be cleaned
Ideally, a submission should consist of a single plain text file that clearly describes your request, in particular what the “cleaned” data set should contain. The file should specify the format in which the data should be saved (e.g. csv, npy, mat, json). It should also contain a link to a webpage where the raw data can easily be accessed or downloaded, with specific instructions for locating the data set on that webpage.
We do not yet have a platform for these requests, so for now please post them at the above blog post or email them to tao@math.ucla.edu.
Data sets
- COVID-19 data sets on Kaggle
- SafeGraph aggregated foot traffic data. Access requires executing a non-commercial agreement.
- Coronavirus Disease (COVID-19) – Statistics and Research, Our World in Data, by Max Roser, Hannah Ritchie and Esteban Ortiz-Ospina
- Novel Coronavirus 2019 time series data on cases, sourced and cleaned from this upstream repository from the Johns Hopkins University Center for Systems Science and Engineering
- COVID Tracking Data (CSV), from the COVID Tracking Project (US data only)
- 2019-nCoV Data Processing Pipelines and datasets
Data cleaning requests
From Chris Strohmeier (UCLA), Mar 25
The biorxiv_medrxiv folder at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge contains another folder titled biorxiv_medrxiv, which in turn contains hundreds of json files. Each file corresponds to a research article that is at least tangentially related to COVID-19.
We are requesting:
- A tf-idf matrix for the subset of the above collection consisting of full-text articles (some files appear to contain only abstracts).
- The rows should correspond to the most commonly used words (e.g. the 5000 most common).
- The columns should correspond to each individual json file.
- The cleaned data should be stored as an npy or mat file (or both).
- Finally, there should be a csv or text document (or both) explaining the meaning of the individual rows and columns of the matrix: which word each row corresponds to, and which file each column corresponds to.
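One possible reading of this request is sketched below in Python, to make the intended output concrete. It is an illustration under stated assumptions, not an agreed specification. It assumes that each json file stores its full text as a list under a "body_text" key, with one {"text": ...} entry per paragraph, and that a file with an empty body is abstract-only. The function names, the tokenization (lowercase runs of letters), and the output file names are all hypothetical. The weighting used is raw term count times ln(N/df), which is only one of several common tf-idf variants; under it, a word that occurs in every article gets weight zero.

```python
import csv
import json
import math
import re
from collections import Counter
from pathlib import Path

TOKEN_RE = re.compile(r"[a-z]+")  # assumed tokenization: lowercase runs of letters


def full_text(article):
    """Join the body paragraphs of one parsed article; '' if there is no body."""
    # Assumption: the body is a list of {"text": ...} dicts under "body_text".
    return " ".join(p.get("text", "") for p in article.get("body_text", []))


def load_full_text_articles(folder):
    """Return {file name: body text} for the json files with a non-empty body."""
    docs = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            text = full_text(json.load(f))
        if text.strip():  # skip files that appear to be abstract-only
            docs[path.name] = text
    return docs


def tfidf_matrix(docs, vocab_size=5000):
    """Return (words, names, matrix); matrix[i][j] is the weight of words[i] in names[j]."""
    names = list(docs)
    counts = [Counter(TOKEN_RE.findall(docs[n].lower())) for n in names]
    total = Counter()
    for c in counts:
        total.update(c)
    # Keep the vocab_size most frequent words, breaking ties alphabetically.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, _ in ranked[:vocab_size]]
    n_docs = len(names)
    matrix = []
    for w in words:
        df = sum(1 for c in counts if w in c)  # number of articles containing w
        idf = math.log(n_docs / df)
        matrix.append([c[w] * idf for c in counts])
    return words, names, matrix


def save_outputs(words, names, matrix, stem="tfidf"):
    """Write the matrix as .npy plus two csv files labelling rows and columns."""
    import numpy as np  # third-party; only needed for the .npy output

    np.save(f"{stem}.npy", np.array(matrix, dtype=float))
    with open(f"{stem}_rows.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(enumerate(words))  # row index, word
    with open(f"{stem}_columns.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(enumerate(names))  # column index, file name
```

Assuming the Kaggle download has been unpacked into the working directory, a call might look like `save_outputs(*tfidf_matrix(load_full_text_articles("biorxiv_medrxiv/biorxiv_medrxiv")))`. A mat file could be produced from the same matrix with a tool such as `scipy.io.savemat`, if that format is preferred.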