Open-source healthcare datasets and other data resources, collected in one location.
CheXpert Imaging
CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients from Stanford Hospital.
Genomic Data Commons (GDC) datasets Genetics/Cancer
The GDC Portal is a platform from the National Cancer Institute (NCI) with cancer-related genomic data for 80,000+ cases.
CARING Research radiologist annotated COVID-19 X-rays from BIMCV dataset Imaging
Annotations by expert radiologists at CARING (Centre for Advanced Research in Imaging, Neuroscience and Genomics) on COVID-19 positive X-rays. The corresponding X-rays were released by the Medical Imaging Data Bank of the Valencia region (BIMCV).
BIMCV-COVID19 Imaging Datasets Imaging
The BIMCV-COVID19+ dataset is a large dataset with chest X-ray images and computed tomography (CT) imaging of COVID-19 patients, along with their radiographic findings, pathologies, polymerase chain reaction (PCR) and immunoglobulin antibody test results, and radiographic reports from the Medical Imaging Data Bank of the Valencia region (BIMCV). These iterations of the database include 7,377 CR, 9,463 DX and 6,687 CT studies.
RICORD: The RSNA International COVID-19 Open Annotated Radiology Database Imaging
This database is the first multi-institutional, multi-national, expert-annotated COVID-19 imaging dataset. Annotated by three radiologists with majority vote adjudication by board-certified radiologists, RICORD consists of 240 thoracic CT scans and 1,000 chest radiographs contributed from four international sites.
VinBigData Chest X-ray abnormalities detection Imaging
Provided on Kaggle by the Vingroup Big Data Institute (VinBigData), which aims to promote fundamental research and investigate novel and highly applicable technologies. The dataset consists of 18,000 images annotated by experienced radiologists.
EchoNet-Dynamic Cardiology
The EchoNet-Dynamic database includes 10,030 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes.
PGxCorpus: a Manually Annotated Corpus for Pharmacogenomics Pharmacology/Genetics
941 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs and phenotypes) and the relationships between them.
CENTAUR LABS General
A list of open-source healthcare datasets classified into 40+ specialties, with direct links to the datasets and more information.
DATA WORLD - HEALTHCARE General
More than 100 healthcare-related datasets from around the world, classified and annotated.
Determinants of COVID-19 mortality in the United States dataset (BrainX) General
Dataset created for the purpose of continuing research into COVID-19. Because it includes information from all 50 states and the District of Columbia, many US statistics can also be compared.
Drug-Induced Liver Injury (DILI) Dataset Pharmacy
The DILIrank dataset is an updated version of the LTKB Benchmark dataset. DILIrank consists of 1,036 FDA-approved drugs that are divided into four classes according to their potential for causing drug-induced liver injury (DILI).
SUSTech-SYSU dataset Ophthalmology
Dataset for automatically segmenting and classifying corneal ulcers with 712 ocular staining images and the associated segmentation labels for flaky corneal ulcers.
Harvard Dataverse General
4,000+ healthcare datasets made available by Harvard University. Searchable and diverse.
PanNuke Dataset Pathology
Semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources.
ACR COVID-19 Imaging Dataset Radiology/Imaging
A dataset of images, mainly chest X-rays, from COVID-19 patients.
C3.ai COVID-19 Data Lake General/Infectious disease
Multiple COVID-19 data sources in a unified data model, ready for analysis in one place.
COVID-19 Open Research Dataset Challenge (CORD-19) General/Infectious disease
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.
Novel Corona Virus 2019 Dataset General/Infectious disease
From World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China.
This dataset has daily-level information on the number of affected cases, deaths and recoveries from the 2019 novel coronavirus. Please note that this is time series data, so the number of cases on any given day is the cumulative number.
The data is available from 22 Jan 2020 onward.
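Because the counts are cumulative, daily new figures have to be derived by differencing consecutive days. A minimal sketch, assuming the cumulative counts for one region are already loaded into a list ordered by date; the function name and the sample numbers are hypothetical and are not taken from the dataset:

```python
def daily_from_cumulative(cumulative):
    """Convert cumulative counts (ordered by date) into per-day increments."""
    daily = []
    previous = 0  # no data before the first day, so the first value is kept as-is
    for total in cumulative:
        daily.append(total - previous)
        previous = total
    return daily


# Hypothetical cumulative case counts for four consecutive days.
cumulative_cases = [555, 653, 941, 1438]
print(daily_from_cumulative(cumulative_cases))  # [555, 98, 288, 497]
```

A negative increment would indicate that the source revised an earlier cumulative figure downward, which is worth checking before any analysis.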
The RSNA 2019 Brain CT Hemorrhage Dataset Imaging
Largest collection of intracranial hemorrhage CT scans: 874,035 images with expert annotations. Reference: Construction of a Machine Learning Dataset through Collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge
PHYSIONET (MIMIC/eICU Collaborative) General/Critical Care
https://physionet.org/about/database/
MIMIC - IV Dataset (https://physionet.org/content/mimiciv/0.4/)
One of the best sources of datasets in healthcare. Primarily from ICU patients.
Includes:
- Clinical datasets such as MIMIC, eICU Collaborative and Pediatric ICU datasets.
- Waveform datasets with ECG, EEG and arterial blood pressure waveforms.
- ECG datasets with various pathophysiologic changes and drug interactions.
- Fetal datasets including sounds and ECG.
- Gait and Balance datasets including gait dynamics for patients with various neurodegenerative disorders.
- Neuro and Myoelectric datasets with EEG, EMG and evoked potential waveforms.
- Image datasets with Chest X-rays and MRI images.
- Computed tomography images for intracranial hemorrhage detection and segmentation.
- Miscellaneous datasets covering text, language, posture and more.
ADNI Database Neurology/Imaging
Imaging (MRI), clinical, genomic, and biomarker data from Alzheimer's disease patients, for the purposes of scientific investigation, teaching, or planning clinical research studies.
http://adni.loni.usc.edu/data-samples/access-data/
RIM-ONE Ophthalmology
RIM-ONE is a database for evaluating optic disc and cup segmentation, provided by the Medical Image Analysis group.
AmsterdamUMCdb Critical Care
Contains data related to 23,376 intensive care unit and high dependency unit admissions of adult patients at Amsterdam University Medical Center from 2003 to 2016.
https://github.com/AmsterdamUMC/AmsterdamUMCdb/wiki
FDA Adverse Event Reporting System (FAERS) Pharmacy
The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports and product quality complaints resulting in adverse events that were submitted to the FDA.
Malaria Dataset Microbiology
A repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity. The dataset contains a total of 27,558 cell images with equal instances of parasitized and uninfected cells.
MESSIDOR: Methods to Evaluate Segmentation and Indexing Techniques in the field of Retinal Ophthalmology Ophthalmology
A dataset of 1,200 annotated retinal images.
http://www.adcis.net/en/third-party/messidor/
RIGA Dataset: Retinal fundus images for glaucoma analysis Ophthalmology
A de-identified dataset of retinal fundus images for glaucoma analysis (RIGA) derived from three sources, with 750 original images and 4,500 manually marked images.
https://deepblue.lib.umich.edu/data/concern/data_sets/3b591905z
High-Resolution Fundus (HRF) Image Database Ophthalmology
The public database contains 15 images of healthy patients, 15 images of patients with diabetic retinopathy and 15 images of glaucomatous patients.
https://www5.cs.fau.de/research/data/fundus-images/
DR HAGIS: Diabetic Retinopathy, Hypertension, Age-related macular degeneration and Glaucoma ImageS Ophthalmology
39 images for development of vessel extraction algorithms suitable for retinal screening programmes.
https://personalpages.manchester.ac.uk/staff/niall.p.mcloughlin/
NLST Datasets: National Cancer Institute Cancer
Datasets from the National Cancer Institute covering over 54,000 patients. They include data on participant characteristics, screening exam results, diagnostic procedures, lung cancer, and mortality. Images from over 75,000 CT screening exams are available. Over 1,200 pathology images from a subset of NLST lung cancer patients (~500 of over 2,000 patients) may be viewed.
https://biometry.nci.nih.gov/cdas/datasets/nlst/
NSRR Datasets: National Sleep Research Resource Pulmonary
Polysomnography datasets from NSRR for sleep studies. A large collection of de-identified physiologic signals well suited to ML development.
https://sleepdata.org/datasets
The HAM10000 dataset Dermatology
A large collection of 10,000 multi-source dermatoscopic images of common pigmented skin lesions.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T
Associated publication:
The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions
Philipp Tschandl, Cliff Rosendahl & Harald Kittler
UCI Machine Learning Repository General
This open-source repository has more than 400 datasets, including healthcare (100+) and non-healthcare ones, in a searchable and categorized format.
http://archive.ics.uci.edu/ml/datasets.php
Centers for Medicare and Medicaid Services (CMS) datasets with ResDAC link General
CMS provides US Medicare and Medicaid datasets.
ResDAC (the Research Data Assistance Center) provides free support to users of CMS datasets.
Centers for Disease Control and Prevention (CDC) Datasets General
Datasets from the Centers for Disease Control and Prevention. Useful for the incidence and prevalence of various disorders and for mortality data from across the US.
Healthcare Cost and Utilization Project (HCUP) datasets General
The Agency for Healthcare Research and Quality's HCUP datasets, used to identify, track, and analyze US national trends in health care utilization, access, charges, quality, and outcomes.
https://hcup-us.ahrq.gov/databases.jsp
NHS datasets General
The UK government's National Health Service datasets. NHS Choices datasets are useful for NLP and sentiment analysis for both GPs and hospitals.
OASIS Brain MRI dataset Neuroimaging
Brain MRI datasets from the Open Access Series of Imaging Studies (OASIS).
http://www.oasis-brains.org/#data
OpenNEURO Neuro
A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data with over 200 datasets.
National Cancer Institute (NCI)-SEER datasets Cancer
Cancer epidemiology data available through NCI's Surveillance, Epidemiology, and End Results (SEER) Program.
https://seer.cancer.gov/data-software/
BROAD Institute's Cancer program datasets Cancer and Genomics
Cancer and genomics datasets.
http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi
MURA Imaging
A dataset of 14,000+ anonymized, radiologist-labeled musculoskeletal X-rays from 12,000+ patients, from the Stanford ML Group.
https://stanfordmlgroup.github.io/competitions/mura/
Read their article: https://arxiv.org/abs/1712.06957
fastMRI Imaging
An anonymized dataset of 1,500+ knee MRIs from NYU.
NLTK: Natural Language Toolkit General
A one-stop resource for learning natural language processing and more.
Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
DAIR.AI General
An excellent resource by Elvis Saravia for trends and updates in AI, especially NLP.
Data science article collection General
An excellent collection of articles on data science.
eICU Collaborative Dataset Critical Care
eICU Collaborative dataset of more than 130,000 patients across 300 hospitals. This de-identified dataset, available for collaborative research, includes vitals, clinical notes, APACHE scores, diagnoses, treatment information and more.
https://www.nature.com/articles/sdata2018178
Google Dataset Search General
Google's powerful search engine to assist with dataset search.
https://toolbox.google.com/datasetsearch
NIH CXR14 dataset Imaging
Over 100,000 anonymized chest X-ray images and their corresponding data from more than 30,000 patients, including many with advanced lung disease.
https://nihcc.app.box.com/v/ChestXray-NIHCC
NIH Deep Lesion Imaging
An NIH release of a dataset containing 32,000 CT scan images with annotated lesions, belonging to 4,400 unique patients.
https://www.nih.gov/news-events/news-releases/nih-clinical-center-releases-dataset-32000-ct-images
Blue Button 2.0 General
A CMS initiative to democratize research and development using beneficiary data. A dataset covering more than 70 million patients is available. Learn more through the link below:
https://www.youtube.com/watch?v=v5b8T6EELp8
National Institutes of Health General
The link below is for NIH's strategic plan for data science in healthcare. A must-read for anyone using data in healthcare for research and innovation.
https://datascience.nih.gov/sites/default/files/NIH_Strategic_Plan_for_Data_Science_Final_508.pdf
NIH Clinical Center Imaging
Largest open-source chest X-ray dataset, available through NIH's Clinical Center. See the link in the article to access the data. Also available through GITHUB and KAGGLE.
GITHUB General
Thanks to Andrew L. Beam and many other contributors on the GITHUB page. Visit via the link below or through the BRAINX COMMUNITY on LinkedIn.
https://github.com/beamandrew/medical-data
https://www.linkedin.com/groups/13599549
KAGGLE General
Kaggle is a good source for de-identified datasets in healthcare. Visit the page using the link below and explore.
https://www.kaggle.com/datasets
DataMed General
A biomedical data search engine that searches for datasets across registries.
Mendeley General
A place to store, share or find data. A platform for biomedical research.
https://www.mendeley.com/datasets
Nature General
Detailed data repositories for biomedical research.