COVID-19 dataset clearinghouse
Revision as of 09:52, 27 March 2020
This is a repository for public data sets relating to the COVID-19 pandemic. It was also initially envisioned as a clearinghouse for matching requests for data cleaning of such datasets with volunteers willing to perform this cleaning. However, the existing clearinghouse at Data against COVID (https://www.data-against-covid.org/) is already up and running for this purpose, so we are redirecting such requests to that site in order not to fragment the pools of requests and volunteers.
For discussion of this project, see this blog post: https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests
Data sets
Epidemiology
- Novel Corona Virus 2019 Dataset - Day level information on covid-19 affected cases, Kaggle
- Coronavirus Disease (COVID-19) – Statistics and Research, Our World in Data, by Max Roser, Hannah Ritchie and Esteban Ortiz-Ospina
- Novel Coronavirus (COVID-19) Cases, Johns Hopkins University Center for Systems Science and Engineering
- Novel Coronavirus 2019 time series data on cases, sourced and cleaned from the above data set
- 2019-nCoV Data Processing Pipelines and datasets
- Country and state names are normalized using ISO 3166-1 codes.
- Location for summaries and analysis of data related to n-CoV 2019, first reported in Wuhan, China, Outbreak and Pandemic Preparedness team at the Institute for Health Metrics and Evaluation, University of Washington
- Daily data on the geographic distribution of COVID-19 cases worldwide, European Centre for Disease Prevention and Control
- Google sheets from DXY.cn
- Contains some patient information (age, gender, etc.)
North America
- COVID Tracking Data, from the COVID tracking project
- A daily updated repository with CSV representations of data from the Covid Tracking API.
- COVID-19 in US and Canada
- COVID tracking project
- Covid-19 coronavirus cases in New York State
- Coronavirus Case Data for Every U.S. County, New York Times
Other regional data
- India COVID-19 tracker
- Data Science for COVID-19 in South Korea
- COVID-19 Italia - Monitoraggio situazione
Genomics and homology
- GISAID data (Global Initiative on Sharing All Influenza Data)
- Registration is required.
- Nextstrain build for novel coronavirus (nCoV), based on GISAID data
- Coronavirus Genome Sequence, Kaggle
- Repository of Coronavirus Genomes, Kaggle
- Wuhan coronavirus 2019-nCoV protease homology model, National Institutes of Health
Literature
- LitCovid - a curated literature hub for tracking up-to-date scientific information about the 2019 novel Coronavirus
- COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv
Medical imagery
- COVID-19 Detection X-Ray Dataset, Kaggle
- COVID-19: casistica radiologica Italiana, Società Italiana di Radiologia Medica e Interventistica
Other data
- Aggregated foot traffic data, Safegraph
- Needs non-commercial agreement to execute.
- Sample visualization of Safegraph data
- COVID Care Map
- Open geospatial work to support health systems' capacity (providers, supplies, ventilators, beds, meds) to effectively care for rapidly growing COVID19 patient needs
- Open map data on US health system capacity to care for COVID-19 patients
- Covid-19 Twitter chatter dataset for scientific use, Panacea Lab, Georgia State University
Data scrapers and aggregators
- Corona Data Scraper
- Covid19-WebScrape-Plus
- COVID-19, Seektable
Visualizations and summaries
- COVID-19 Coronavirus Pandemic, Worldometer
- Tracking coronavirus: Map, data and timeline, BNO News
- Coronavirus COVID-19 Global Cases, JHU CSSE
- Infection2020
- covy.app
- COVID-19 Global Pandemic Real-Time report, dxy.cn (English version)
- Coronavirus tracked: the latest figures as the pandemic spreads, Financial Times
- COVID-19 - official Indian government site
- COVID-19 - Analysis, Visualization & Comparisons, Kaggle
Other lists
- COVID-19 data sets, Kaggle
- Reddit thread collecting coronavirus datasets
- Review of COVID-19 APIs, Wendell Santos
- Data against COVID-19
Data or data cleaning requests
As mentioned at the top of this page, future requests for data or data cleaning should be directed to Data against COVID (https://www.data-against-covid.org/). Below are the legacy requests made to this project before the redirect.
From Chris Strohmeier (UCLA), Mar 25
The biorxiv_medrxiv folder at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge contains another folder titled biorxiv_medrxiv, which in turn contains hundreds of JSON files. Each file corresponds to a research article that is at least tangentially related to COVID-19.
We are requesting:
- A tf-idf matrix for the subset of the above collection that contains full-text articles (some files appear to have only abstracts).
- The rows should correspond to the (e.g. 5000) most commonly used words.
- The columns should correspond to the individual JSON files.
- The clean data should be stored as an npy or mat file (or both).
- Finally, there should be a csv or text document (or both) explaining the meaning of the individual rows and columns of the matrix (which word does each row correspond to? Which file does each column correspond to?).
Contact: c.strohmeier@math.ucla.edu
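The request above could be approached along the following lines. This is only a minimal sketch using the Python standard library, not the method the requester has in mind. It assumes that each JSON file stores its full text as a list of paragraphs under a "body_text" key (each with a "text" field) and that abstract-only files have an empty or missing "body_text". The function names, the tokenizer, and the unsmoothed tf-idf variant (term frequency divided by document length, times log(N / document frequency)) are all illustrative choices.

```python
import csv
import json
import math
import re
from collections import Counter
from pathlib import Path

# Assumption: a "word" is a run of ASCII letters, lower-cased.
TOKEN_RE = re.compile(r"[a-z]+")


def load_full_text(folder):
    """Return {file name: body text} for JSON files that have a full text.

    Assumes a "body_text" list of {"text": ...} paragraphs per file;
    files where it is missing or empty (abstract only) are skipped.
    """
    docs = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            article = json.load(handle)
        paragraphs = [p.get("text", "") for p in article.get("body_text", [])]
        text = " ".join(paragraphs).strip()
        if text:
            docs[path.name] = text
    return docs


def tfidf_matrix(docs, vocab_size=5000):
    """Return (words, file names, matrix) with rows = words, columns = files."""
    names = list(docs)
    counts = [Counter(TOKEN_RE.findall(docs[name].lower())) for name in names]
    lengths = [sum(c.values()) for c in counts]

    # Rows: the vocab_size most frequent words over the whole collection,
    # ties broken alphabetically so the output is deterministic.
    totals = Counter()
    for c in counts:
        totals.update(c)
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in ranked[:vocab_size]]

    matrix = []
    for word in words:
        doc_freq = sum(1 for c in counts if word in c)
        idf = math.log(len(names) / doc_freq)
        matrix.append([
            (c[word] / n if n else 0.0) * idf
            for c, n in zip(counts, lengths)
        ])
    return words, names, matrix


def write_labels(words, names, rows_path, columns_path):
    """Write CSV files mapping row index to word and column index to file."""
    for path, header, labels in (
        (rows_path, "word", words),
        (columns_path, "file", names),
    ):
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["index", header])
            writer.writerows(enumerate(labels))
```

A typical call would be `words, names, matrix = tfidf_matrix(load_full_text("biorxiv_medrxiv"))`, followed by `write_labels(words, names, "rows.csv", "columns.csv")`. If NumPy and SciPy are available, `numpy.save("tfidf.npy", numpy.array(matrix))` and `scipy.io.savemat("tfidf.mat", {"tfidf": numpy.array(matrix)})` would store the matrix in the requested npy and mat formats; the file names and the "tfidf" key are placeholders.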
From Juan José Piñero de Armas (U. Católica de Murcia), Mar 27
We request per-person (individual-level) information in order to perform survival analyses, regressions with random effects, etc. Some data exist, for instance, at
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/data
https://www.kaggle.com/kimjihoo/coronavirusdataset
https://www.kaggle.com/imdevskp/covid-19-analysis-visualization-comparisons/data
https://www.sirm.org/category/senza-categoria/covid-19/
but we need much more detail (date when each person was diagnosed, date of infection for the same person, discharge date, date of death, gender, age, treatments, temperatures...) not just summaries or country-aggregated data.
Contact: jjpinero@ucam.edu
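To make the level of detail in this request concrete, here is one hypothetical shape such a per-person record could take, together with the (duration, event) pair that survival analysis typically consumes. Every field and function name below is an assumption made for illustration and does not come from any of the datasets listed. Treating discharged patients as right-censored is a simplifying modelling choice rather than a recommendation.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class PatientRecord:
    """One row per person; all field names are hypothetical."""
    patient_id: str
    diagnosis_date: date
    infection_date: Optional[date] = None
    discharge_date: Optional[date] = None
    death_date: Optional[date] = None
    gender: Optional[str] = None
    age: Optional[int] = None
    treatments: List[str] = field(default_factory=list)
    temperatures_celsius: List[float] = field(default_factory=list)


def survival_observation(record: PatientRecord, study_end: date) -> Tuple[int, bool]:
    """Return (days since diagnosis, death observed) for one patient.

    Death is the event of interest. Patients who were discharged, or who
    are still under observation at study_end, are treated as censored.
    """
    if record.death_date is not None:
        return (record.death_date - record.diagnosis_date).days, True
    end = record.discharge_date or study_end
    return (end - record.diagnosis_date).days, False
```

Summaries and country-level aggregates cannot be turned back into rows of this kind, which is why the request asks for individual dates rather than daily totals.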
Miscellaneous links
- United Against COVID-19, which also crowdsources scientific and coding efforts to study the COVID-19 pandemic