General/
LLM
EquityMedQA dataset for evaluating harms and biases in LLMs
A collection of seven newly released datasets comprising both manually curated and LLM-generated questions enriched for adversarial queries. Both the human assessment framework and the dataset design process are grounded in an iterative participatory approach and a review of possible biases in Med-PaLM 2 answers to adversarial queries.
Related publication: A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models. Stephen R. Pfohl, Heather Cole-Lewis, Ivor Horn, Karan Singhal, et al. arXiv:2403.12025v1 [cs.CY]