General/
LLM
Primock57 dataset
Dataset of 57 mock medical primary care consultations: audio, consultation notes, human utterance-level transcripts.
Imaging
2,633 three-dimensional images collected across multiple anatomies of interest, multiple modalities, and multiple sources representative of real-world clinical applications. 10 datasets, including CT scans of the abdomen and lung and MRI scans of the brain and prostate.
Related publications:
LLM
Llama2-MedTuned-Instructions is an instruction-based dataset developed for training language models in biomedical NLP tasks. It consists of approximately 200,000 samples, each tailored to guide models in performing specific tasks such as Named Entity Recognition (NER), Relation Extraction (RE), and Medical Natural Language Inference (NLI). This dataset represents a fusion of various existing data sources, reformatted to facilitate instruction-based learning.
Related publication: Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing. Omid Rohanian, Mohammadmahdi Nouriborji, David A. Clifton. arXiv:2401.00579 [cs.CL]
Related models: https://huggingface.co/nlpie/Llama2-MedTuned-7b ; https://huggingface.co/nlpie/Llama2-MedTuned-13b
Neurology
The largest smartwatch-based dataset including Parkinson’s, other movement disorders and healthy controls (n>400). Over 5000 clinical assessment steps from 504 participants, including PD, DD, and healthy controls (HC).
Related publication: Varghese, J., Brenner, A., Fujarski, M. et al. Machine Learning in the Parkinson’s disease smartwatch (PADS) dataset. npj Parkinsons Dis. 10, 9 (2024).
General
A multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions.
Github: https://github.com/mauro-nievoff/MultiCaRe_Dataset/tree/main
Imaging/
LLM/
Pulmonary
INSPECT contains data from 19,438 patients, including CT images, sections of radiology reports, and structured electronic health record (EHR) data (including demographics, diagnoses, procedures, and vitals). Using this dataset, the authors at Stanford University developed and released a benchmark for evaluating several baseline modeling approaches on a variety of important pulmonary embolism (PE) related tasks. INSPECT is the largest multimodal dataset for enabling reproducible research on strategies for integrating 3D medical imaging and EHR data.
General/
LLM
MedAlign is a clinician-generated dataset for instruction following with electronic medical records. The MedAlign dataset contains:
Related publication: MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. Scott L. Fleming, Alejandro Lozano, Nigam H. Shah, et al. https://arxiv.org/abs/2308.14089
General/
LLM
EHRSHOT contains de-identified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. The release also includes CLMBR-T-base, a 141M-parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients, and 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation.
Related publication: EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason A. Fries, Nigam H. Shah. arXiv:2307.02028
Imaging
The Scottish Medical Imaging Archive is a collection of population-based, routinely collected medical radiology images. This archive provides access to “analytics-ready” extracts for images between January 1, 2010, and August 31, 2018, which can be used for health care research and the development or validation of artificial intelligence algorithms. An archive of 57.3 million radiology studies linked to their medical records from the whole Scottish population. Modalities: Computerised Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), Structured Reports (SRs).
General/
LLM
MeQSum corpus of 1,000 summarized consumer health questions. In particular, authors show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%.
Related publication: Asma Ben Abacha and Dina Demner-Fushman. 2019. On the Summarization of Consumer Health Questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2228–2234, Florence, Italy. Association for Computational Linguistics.
Cancer/
Pathology
This is a set of 100,000 non-overlapping image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and normal tissue. All images are 224×224 pixels (px) at 0.5 microns per pixel (MPP). Tissue classes are: Adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), colorectal adenocarcinoma epithelium (TUM). These images were manually extracted from N=86 H&E stained human cancer tissue slides from formalin-fixed paraffin-embedded (FFPE) samples from the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany). Tissue samples contained CRC primary tumor slides and tumor tissue from CRC liver metastases; normal tissue classes were augmented with non-tumorous regions from gastrectomy specimens to increase variability.
General
Dataset for estimation of muscle dysmorphia in individuals from Barranquilla, Colombia
General
ELSI-Brazil (The Brazilian Longitudinal Study of Aging) aims to investigate the social and biological determinants of the aging process and its consequences to individuals and society. It is a nationally representative longitudinal study of community-dwelling adults aged 50 years or older, residing in 70 municipalities located across the five great geographic regions of Brazil. The baseline data collection was carried out in 2015-16 with 9,412 participants. The second wave was conducted in 2019-21 with 9,949 participants, including the sample replacement.
Related publications:
Endocrinology/
Imaging
TDID (Thyroid Digital Image Database) is a freely accessible database of ultrasound images of thyroid nodules from the National University of Colombia. Currently, this database has a group of B-mode ultrasound images, which include a complete annotation and diagnostic description of the suspicious images of thyroid lesions, made by expert radiologists. From March 2014 to date, information from 389 patients has been collected.
General
The Global Health Data Exchange (GHDx) is a catalog of global health and demographic data. The goal of the GHDx is to help people locate data by cataloging information about data including the topics covered, by providing links to data providers or explaining how to acquire the data, and in cases where we have permission, providing the data directly for download. Use the GHDx to research population census data, surveys, registries, indicators and estimates, administrative health data, and financial data related to health.
General
DATASUS provides information that can serve to support objective analyses of the health situation, evidence-based decision making and the development of health action programs. Measuring the population’s health status is a tradition in public health. It began with the systematic recording of mortality and survival data (Vital Statistics – Mortality and Live Births). With advances in the control of infectious diseases (Epidemiological and Morbidity information) and with a better understanding of the concept of health and its population determinants, the analysis of the health situation began to incorporate other dimensions of the health status.
Data on morbidity, disability, access to services, quality of care, living conditions and environmental factors have become metrics used in the construction of Health Indicators, which translate into relevant information for the quantification and evaluation of health information.
This section also contains information on the population’s Health Care, registrations (Care Network), hospital and outpatient networks, registration of health establishments, as well as information on financial resources and Demographic and Socioeconomic information.
General
The All of Us Research Program collects data from a wide variety of sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables like Fitbit. Most All of Us participants contribute biosamples such as blood and/or saliva. DNA from these samples is extracted and sent to genome centers for genomic analysis, including whole genome sequencing (WGS) and genome-wide genotyping. The All of Us Data and Research Center leverages the OMOP CDM to empower researchers by using existing, standardized vocabularies and a harmonized data representation. These factors enable connection to other ontologies, datasets, and tools that use the same codes or data model.
740,000+ participants, 400,000+ electronic health records, 520,000+ biosamples.
General/
Genetics/
Imaging
UK Biobank has collected and continues to collect extensive environmental, lifestyle, and genetic data on half a million participants. It includes data for:
Pulmonary
The Respiratory Sound Database was originally compiled to support the scientific challenge organized at the Int. Conf. on Biomedical Health Informatics – ICBHI 2017. The Respiratory Sound Database contains audio samples, collected independently by two research teams in two different countries, over several years. The database consists of a total of 5.5 hours of recordings containing 6898 respiratory cycles, of which 1864 contain crackles, 886 contain wheezes, and 506 contain both crackles and wheezes, in 920 annotated audio samples from 126 subjects.
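As a quick check on the counts quoted above, the number of respiratory cycles with neither sound can be derived by subtraction. This is only a sketch: it assumes the three annotated groups are mutually exclusive (crackles only, wheezes only, both), which the description does not state explicitly.

```python
# Counts as quoted in the description above.
total_cycles = 6898
crackles_only = 1864   # assumption: excludes cycles that also contain wheezes
wheezes_only = 886     # assumption: excludes cycles that also contain crackles
both = 506

# Under the disjointness assumption, the remainder contains neither sound.
neither = total_cycles - (crackles_only + wheezes_only + both)
share_abnormal = (total_cycles - neither) / total_cycles

print(neither)                   # 3642
print(round(share_abnormal, 3))  # 0.472
```

If the groups overlap instead, the remainder would be larger, so the figure should be treated as a lower bound in that case.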
Imaging
A manually constructed dataset where clinicians asked naturally occurring questions about radiology images and provided reference answers. Manual categorization of images and questions provides insight into clinically relevant tasks and the natural language to phrase them. The dataset contains 104 head axial single-slice CTs or MRIs, 107 chest x-rays, and 104 abdominal axial CTs. The final VQA-RAD dataset contains 3,515 total visual questions. Of these, 1,515 (43.1%) are free-form.
Related publication: Lau, J., Gayen, S., Ben Abacha, A. et al. A dataset of clinically generated visual questions and answers about radiology images. Sci Data 5, 180251 (2018).
Imaging
This dataset contains over 9,000 head CT scans, each labeled as normal or abnormal. Each scan contains a reconstructed image (stored in our institution’s PACS and saved as DICOMs) and a corresponding sinogram (simulated via GE’s CatSim software and saved as numpy arrays). The reconstructed images are 512×512 pixels with a variable number of axial slices per scan. The sinograms are 984×888 pixels with a variable number of axial slices per scan. The full dataset is 1.3T.
LLM
A diverse medical task dataset comprising 52,000 instruction-response pairs, and MedInstruct-test, a set of clinician-crafted novel medical tasks, to facilitate the building and evaluation of future domain-specific instruction-following models.
Related publication: AlpaCare: Instruction-tuned Large Language Models for Medical Application
General/
Surgery
MedShapeNet contains over 100,000 medical shapes, including bones, organs, vessels, muscles, etc., as well as surgical instruments.
Related publication: MedShapeNet – A Large-scale Dataset of 3D Medical Shapes for Computer Vision
Pathology
OpenPath, a large dataset of 208,414 pathology images paired with natural language descriptions.
General
This is the dataset used in the Med-HALT research paper. Med-HALT provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. Med-HALT includes two categories of tests, reasoning and memory-based hallucination tests, designed to assess LLMs’ problem-solving and information retrieval abilities. The paper focuses on the challenges posed by hallucinations in large language models (LLMs), particularly in the medical domain. The authors propose a new benchmark and dataset, Med-HALT (Medical Domain Hallucination Test), designed specifically to evaluate hallucinations.
Imaging
EMBED contains 364,000 screening and diagnostic mammographic exams for 110,000 patients from four hospitals over an 8-year period. The EMBED AWS Open Data release represents 20% of the dataset divided into two equal cohorts at the patient level. This release of the dataset includes 2D and C-view images.
Cardiology
The MIMIC-IV-ECHO module contains more than 500,000 echocardiograms across 7,243 studies from 4,579 distinct patients. This subset contains echocardiograms for patients who appear in the MIMIC-IV Clinical Database and were admitted between 2017 and 2019.
Related publication: Gow, B., Pollard, T., Greenbaum, N., Moody, B., Johnson, A., Herbst, E., Waks, J. W., Eslami, P., Chaudhari, A., Carbonati, T., Berkowitz, S., Mark, R., & Horng, S. (2023). MIMIC-IV-ECHO: Echocardiogram Matched Subset (version 0.1). PhysioNet. https://doi.org/10.13026/ef48-v217.
Genetics
The Genome Aggregation Database (gnomAD), originally launched in 2014 as the Exome Aggregation Consortium (ExAC), is the result of a coalition of investigators willing to share aggregate exome and genome sequencing data from a variety of large-scale sequencing projects, and make summary data available for the wider scientific community.
General
The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish language and CIE10 is the coding terminology (it is the Spanish version of ICD10-CM and ICD10-PCS). The CodiEsp corpus has been randomly sampled into three subsets: the train, the development, and the test set. The train set contains 500 clinical cases, and the development and test set 250 clinical cases each.
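The random three-way sampling described above can be sketched as follows. This is an illustration only: the case identifiers are made up, and the corpus ships with its own fixed split, which should be used for any comparable evaluation.

```python
import random


def split_cases(case_ids, n_train=500, n_dev=250, n_test=250, seed=0):
    """Randomly partition case IDs into train/dev/test subsets of fixed sizes."""
    if len(case_ids) != n_train + n_dev + n_test:
        raise ValueError("subset sizes must add up to the number of cases")
    shuffled = list(case_ids)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Hypothetical identifiers, not the corpus's real file names.
case_ids = [f"case_{i:04d}" for i in range(1000)]
train, dev, test = split_cases(case_ids)
print(len(train), len(dev), len(test))  # 500 250 250
```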
Gastroenterology
3 key datasets for endoscopy:
1. Kvasir-dataset-v2 contains 8,000 images, 8 classes, 1,000 images for upper and lower endoscopy in each class:
Gastroenterology
It contains 21 videos with a total of 5,525 frames annotated and verified by medical doctors (experienced endoscopists). The videos are divided into four classes of predefined bowel-preparation qualities.
Gastroenterology
This dataset comprises 18,481 images extracted from 523 small bowel capsule endoscopy videos. It contains 12,320 images annotated with 23,033 disease lesions, combined with 6,161 normal mucosa images. The annotations are provided in YOLO format.
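For readers unfamiliar with YOLO-style labels, the sketch below converts one label line to pixel coordinates. It assumes the common convention of `class x_center y_center width height` with values normalized to the image size; whether this dataset follows exactly that variant is an assumption, and the label line and image size are made up.

```python
def yolo_to_pixel_box(line, image_width, image_height):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    class_id, xc, yc, w, h = line.split()
    # Scale the normalized center and size back to pixel units.
    xc, w = float(xc) * image_width, float(w) * image_width
    yc, h = float(yc) * image_height, float(h) * image_height
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


# Made-up label line and image size, for illustration only.
box = yolo_to_pixel_box("3 0.5 0.5 0.25 0.5", image_width=512, image_height=512)
print(box)  # (3, 192.0, 128.0, 320.0, 384.0)
```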
General
More than 80 open source healthcare datasets available through the AWS Open Data Sponsorship Program.
General
3 datasets:
Neurology
Data to predict the course of Parkinson’s disease (PD) using protein abundance data. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients. Each patient contributed several samples over the course of multiple years while they also took assessments of PD severity. This is a time-series code dataset with Kaggle’s time-series API.
Neurology
The data series include three datasets, collected under distinct circumstances:
tdcsfog: data series collected in the lab, as subjects completed a FOG-provoking protocol.
defog: data series collected in the subject’s home, as subjects completed a FOG-provoking protocol.
daily: one week of continuous 24/7 recordings from sixty-five subjects. Forty-five subjects exhibit FOG symptoms and also have series in the defog dataset, while the other twenty subjects do not exhibit FOG symptoms and do not have series elsewhere in the data.
Cardiology
General/
Genetics
The National Institutes of Health’s All of Us Research Program is building one of the largest biomedical data resources of its kind.
600,000+ participants
350,000+ EHR records
450,000+ biomedical specimen data
Cancer/
Imaging
3 metastatic cancer datasets available through AWS API.
Dermatology
The Dermofit Image Library is a collection of 1,300 focal high quality skin lesion images collected under standardised conditions with internal colour standards. The lesions span across ten different classes including melanomas, seborrhoeic keratosis and basal cell carcinomas. Each image has a gold standard diagnosis based on expert opinion (including dermatologists and dermatopathologists). Images consist of a snapshot of the lesion surrounded by some normal skin. The Dermofit Image Library is available under an academic licence. There is a one-off £75 licence fee associated with this product.
Imaging
A dataset of more than 100,000 chest X-ray scans that were retrospectively collected from two major hospitals in Vietnam. Out of this raw data, 18,000 images were manually annotated by a total of 17 experienced radiologists with 22 local labels of rectangles surrounding abnormalities and 6 global labels of suspected diseases.
Cardiology/
Pediatrics
The EchoNet-Peds database includes 7,643 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes. The database includes patients ranging from 0-18 years (43% female) with a wide range of sizes.
Imaging
All BraTS multimodal scans are available as NIfTI files (.nii.gz), which were acquired with different clinical protocols and various scanners from multiple (n=19) institutions. The overall survival (OS) data, defined in days, are included in a comma-separated value (.csv) file with correspondences to the pseudo-identifiers of the imaging data. The .csv file also includes the age of patients, as well as the resection status.
Pathology
The data in this challenge contains whole-slide images (WSI) of hematoxylin and eosin (H&E) stained lymph node sections. Depending on the particular data set (see below), ground truth is provided:
All ground truth annotations were carefully prepared under supervision of expert pathologists. WSI are provided as TIFF images. Lesion-level annotations are provided as XML files. For training, 100 patients will be provided and another 100 patients for testing. The test data set contains 500 slides. In total there are 1000 slides, with 5 slides per patient.
Imaging
The dataset contains 7,471 chest X-ray images in .png file format and 3,955 patient radiology text reports available in .XML format. Each image has been paired with four captions (Impressions, Findings, Comparison and Indication) that provide clear descriptions of the salient entities and events.
Original data source: https://openi.nlm.nih.gov/
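A minimal sketch of pulling the four report sections out of an XML report is given below. The sample fragment is made up, and the tag name `AbstractText` and attribute `Label` are assumptions about the report schema rather than a confirmed specification; check a real file before relying on them.

```python
import xml.etree.ElementTree as ET

# Made-up report fragment; tag and attribute names are assumptions, not a confirmed schema.
SAMPLE_REPORT = """
<report>
  <AbstractText Label="COMPARISON">None.</AbstractText>
  <AbstractText Label="INDICATION">Chest pain.</AbstractText>
  <AbstractText Label="FINDINGS">The lungs are clear.</AbstractText>
  <AbstractText Label="IMPRESSION">No acute disease.</AbstractText>
</report>
"""


def extract_sections(xml_text):
    """Return a dict mapping each lower-cased section label to its text."""
    root = ET.fromstring(xml_text.strip())
    sections = {}
    for node in root.iter("AbstractText"):
        label = node.get("Label", "").lower()
        # Empty elements have text None, so fall back to an empty string.
        sections[label] = (node.text or "").strip()
    return sections


sections = extract_sections(SAMPLE_REPORT)
print(sections["impression"])  # No acute disease.
```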
General/
Imaging
Open-i provides access to over 3.7 million images from about 1.2 million PubMed Central® articles; 7,470 chest x-rays with 3,955 radiology reports; 67,517 images from NLM History of Medicine collection; and 2,064 orthopedic illustrations.
Imaging
A synthetic dataset of brain images simulated across 42 different MR protocols and based on 500 different reference brains from the Human Connectome Project (HCP) (Van Essen et al., 2012), leading to 21,000 simulated brain images.
Imaging
An open-source data collection consisting of a total of 955 T1-weighted MRIs (Magnetic Resonance Imaging) with manually segmented diverse lesions and metadata.
Related publication: Liew, Sook-Lei. The Anatomical Tracings of Lesions after Stroke (ATLAS) Dataset – Release 2.0, 2021. Inter-university Consortium for Political and Social Research [distributor], 2022-08-08. https://doi.org/10.3886/ICPSR36684.v5
Cancer/
Imaging
The dataset is a single-institutional, retrospective collection of 922 biopsy-confirmed invasive breast cancer patients, over a decade, having the following data components:
Related publication: Saha, A., Harowicz, M.R., Grimm, L.J., Kim, C.E., Ghate, S.V., Walsh, R. and Mazurowski, M.A., 2018. A machine learning approach to radiogenomics of breast cancer.
General
The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The survey examines a nationally representative sample of about 5,000 persons each year. Findings from this survey will be used to determine the prevalence of major diseases and risk factors for diseases.
General
This is an original dataset of stringency of public health policy measures that were adopted in response to COVID-19 worldwide by governments at national and sub-national levels. The data set covers governments’ policy responses between January 24, 2020 and December 31, 2020.
Cardiology/
General/
Pathology
Multiple datasets available:
silent-cchs-ecg: Diagnosing ‘silent’ heart attack (48,000 ECG waveforms)
brca-psj-path: Identifying high-risk breast cancer (175,000 biopsy slides)
arrest-ntuh-ecg: Subtyping cardiac arrest (24,106 ECG waveforms)
fracture-aimi-xray: Predicting fractures (64,000 chest x-rays)
covid-psj-xray: Emergency triage of Covid-19 patients (7,500 chest x-rays)
General/
Pulmonary
A dataset consisting of 53,449 audio samples (over 552 hours in total) crowd-sourced from 36,116 participants through our COVID-19 Sounds app. It also provides participants’ self-reported COVID-19 testing status with 2,106 samples tested positive.
Related publication: COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening
Imaging
This dataset contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Additionally, there is an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs.
Related publication: Jain, S., Agrawal, A., Saporta, A., Truong, S. Q., Nguyen Duong, D., Bui, T., Chambon, P., Lungren, M., Ng, A., Langlotz, C., & Rajpurkar, P. (2021). RadGraph: Extracting Clinical Entities and Relations from Radiology Reports (version 1.0.0). PhysioNet. https://doi.org/10.13026/hm87-5p47.
General
200+ datasets of various types with links and papers. Includes search options for datatypes, language and more.
Dermatology
The PH² database includes the manual segmentation, the clinical diagnosis, and the identification of several dermoscopic structures, performed by expert dermatologists, in a set of 200 dermoscopic images.
General
Vision-based Fallen Person (VFP290K) dataset consists of 294,713 frames of fallen persons extracted from 178 videos, including 131 scenes in 49 locations. It demonstrated the effectiveness of the features through extensive experiments analyzing the performance shift based on object detection models.
Related publication: VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection
Critical Care
HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand adult patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland (ICU), an interdisciplinary 60-bed unit admitting >6,500 patients per year. The dataset contains de-identified demographic information and a total of 681 routinely collected physiological variables, diagnostic test results and treatment parameters from almost 34 thousand admissions during the period from January 2008 to June 2016. Data is stored with a uniquely high time resolution of one entry every two minutes.
Related publication: Faltys, M., Zimmermann, M., Lyu, X., Hüser, M., Hyland, S., Rätsch, G., & Merz, T. (2021). HiRID, a high time-resolution ICU dataset (version 1.1.1). PhysioNet. https://doi.org/10.13026/nkwc-js72.
Critical Care
eICU Collaborative Research Database, a multi-center intensive care unit (ICU) database with high granularity data for over 200,000 admissions to ICUs monitored by eICU Programs across the United States.
Related publication: The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG and Badawi O. Scientific Data (2018). DOI: http://dx.doi.org/10.1038/sdata.2018.178.
Critical Care
The Medical Information Mart for Intensive Care (MIMIC)-IV database provides critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC).
Related publication: Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2020). MIMIC-IV (version 0.4). PhysioNet. https://doi.org/10.13026/a3wn-hq05.
General/
Neurology/
Ophthalmology
A dataset of paired Electroencephalography (EEG) and video-infrared eye tracking (ET) recordings from 356 subjects for more than 47 hours in total. A benchmark consisting of 3 evaluation tasks with increasing difficulty is introduced alongside the dataset.
Related publication: EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction
Anesthesiology/
General/
Neurology
Q-Pain, a dataset for assessing bias in medical QA in the context of pain management. 55 medical question-answer pairs across five different types of pain management: each question includes a detailed patient-specific medical scenario (“vignette”) designed to enable the substitution of multiple different racial and gender “profiles” and to evaluate whether bias is present when answering whether or not to prescribe medication.
Related publication: Logé, C., Ross, E., Dadey, D. Y. A., Jain, S., Saporta, A., Ng, A., & Rajpurkar, P. (2021). Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management (version 1.0.0). PhysioNet. https://doi.org/10.13026/2tdv-hj07.
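The profile-substitution idea behind the vignettes can be sketched as simple template expansion. The template wording, placeholder names, and profile values below are hypothetical and are not taken from the dataset itself.

```python
from itertools import product

# Hypothetical template and profile values; the dataset's own placeholders may differ.
VIGNETTE = "Patient D is a {age}-year-old {race} {gender} presenting with acute pain."
RACES = ["Asian", "Black", "Hispanic", "White"]
GENDERS = ["man", "woman"]


def expand_vignette(template, races, genders, age=45):
    """Yield one prompt per (race, gender) profile from a single vignette template."""
    for race, gender in product(races, genders):
        yield (race, gender), template.format(age=age, race=race, gender=gender)


prompts = dict(expand_vignette(VIGNETTE, RACES, GENDERS))
print(len(prompts))  # 8
print(prompts[("Black", "woman")])
```

Because every prompt differs only in the substituted profile, any systematic difference in a model's answers across prompts can be attributed to the profile rather than to the clinical scenario.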
Imaging
Dataset contributes significantly to the research community by providing 1) 1,256 combinations of relation annotations between 29 CXR anatomical locations (objects with bounding box coordinates) and their attributes, structured as a scene graph per image, 2) over 670,000 localized comparison relations (for improved, worsened, or no change) between the anatomical locations across sequential exams, as well as 3) a manually annotated gold standard scene graph dataset from 500 unique patients.
Related publication: Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. https://doi.org/10.13026/wv01-y230.
General
TDC includes 66 AI-ready datasets spread across 22 learning tasks and spanning the discovery and development of safe and effective medicines. TDC also provides an ecosystem of tools and community resources, including 33 data functions and diverse types of data splits, 23 strategies for systematic model evaluation, 17 molecule generation oracles, and 29 public leaderboards.
Related publication: Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development
Imaging
The RAD-ChestCT dataset is an imaging dataset developed by Duke MD/PhD student Rachel Draelos during her Computer Science PhD supervised by Lawrence Carin. The full dataset includes 35,747 chest CT scans from 19,661 adult patients. This Zenodo repository contains an initial release of 3,630 chest CT scans, approximately 10% of the dataset.
Related publication: Draelos et al., “Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes,” Medical Image Analysis 2021. DOI: 10.1016/j.media.2020.101857
Dermatology
A dataset consisting of 70 melanoma and 100 naevus images from the digital image archive of the Department of Dermatology of the University Medical Center Groningen (UMCG), used for the development and testing of the MED-NODE system for skin cancer detection from macroscopic images. The file contains 170 images (70 melanoma and 100 nevi cases).
Related publications: I. Giotis, N. Molders, S. Land, M. Biehl, M.F. Jonkman and N. Petkov: “MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images”, Expert Systems with Applications, 42 (2015), 6578-6585
General
BIGBIO is a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages, with programmatic access. BIGBIO enables reproducible data-centric machine learning workflows by focusing on programmatic access to datasets and their metadata in a uniform format.
Related Publication: BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing
General
A community-driven platform for reporting biomedical AI systems.
Dermatology
The dataset consists of 2,298 samples of six different types of skin lesions. Each sample consists of a clinical image and up to 22 clinical features including the patient’s age, skin lesion location, Fitzpatrick skin type, and skin lesion diameter. In total, there are 1,373 patients, 1,641 skin lesions, and 2,298 images present in the dataset. All BCC, SCC, and MEL are biopsy-proven. The remaining ones may have clinical diagnosis according to a consensus of a group of dermatologists. In total, approximately 58% of the samples in this dataset are biopsy-proven.
Related publication: PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones
General/
Microbiology
The database includes data from more than 705,000 patients, collected in more than 60 countries and 1,500 centres worldwide. Patient data are available from acute hospital admissions with COVID-19 and outpatient follow-ups. The data include signs and symptoms, pre-existing comorbidities, vital signs, chronic and acute treatments, complications, dates of hospitalization and discharge, mortality, viral strains, vaccination status, and other data.
Related publication: ISARIC-COVID-19 dataset: A Prospective, Standardized, Global Dataset of Patients Hospitalized with COVID-19
Dermatology
2201 images with diagnoses based on biopsy or clinical impression. 174 disease classes for the model training.
General
Biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of 600 PubMed abstracts.
Related publication: Ling Luo, et al. BioRED: a rich biomedical relation extraction dataset, Briefings in Bioinformatics, 2022
Imaging
The RadImageNet database includes 1.35 million annotated CT, MRI, and ultrasound images of musculoskeletal, neurologic, oncologic, gastrointestinal, endocrine, and pulmonary pathology. The RadImageNet database contains medical images of 3 modalities, 11 anatomies, and 165 pathologic labels.
Imaging
BRAX dataset provides 40,967 images, 24,959 imaging studies for 19,351 patients presenting to the Hospital Israelita Albert Einstein. All images have been verified by trained radiologists and de-identified to protect patient privacy. Fourteen labels were derived from free-text radiology reports written in Brazilian Portuguese using Natural Language Processing.
Related publication: BRAX, a Brazilian labeled chest X-ray dataset
Imaging
The MONAI framework is the open-source foundation being created by Project MONAI. MONAI is a freely available, community-supported, PyTorch-based framework for deep learning in healthcare imaging. Project MONAI also includes MONAI Label, an intelligent open source image labeling and learning tool that helps researchers and clinicians collaborate, create annotated datasets, and build AI models in a standardized MONAI paradigm.
Imaging
This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format.
630 patients, 3,301 studies, 820,000+ images.
General/
Imaging/
Pathology/
Surgery
A platform for end-to-end development of machine learning solutions in biomedical imaging. Grand Challenge was developed in 2010 to make it easy for challenge organizers to set up a website for a particular challenge and to bring all information on challenges in biomedical image analysis together in one place. This system has been operational since 2017 and has been used by over 300 challenges and 70,000 users, with more than 1,000 algorithms.
Dermatology
A database for evaluating computerized image-based prediction of the 7-point skin lesion malignancy checklist. The dataset includes over 2000 clinical and dermoscopy color images, along with corresponding structured metadata tailored for training and evaluating computer aided diagnosis (CAD) systems.
Imaging/
Neurology
Cardiology
The EchoNet-LVH dataset includes 12,000 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac chamber size and wall thickness.
Related publication: High-Throughput Precision Phenotyping of Left Ventricular Hypertrophy with Cardiovascular Deep Learning
Imaging
The database includes 154 conventional chest radiographs with a lung nodule (100 malignant and 54 benign nodules) and 93 radiographs without a nodule. The database also includes additional information such as patient age, gender, diagnosis (malignant or benign), X and Y coordinates of the nodule, and a simple diagram of the nodule location. Lung nodule images were classified into five groups according to their degree of subtlety.
Anesthesiology
Data were collected from nine healthy volunteers during a study of propofol-induced unconsciousness. For all subjects, approximately 3 hours of data were recorded under a target-controlled infusion protocol. Data include continuous electrocardiogram (ECG); interventions included in the study for patient safety, such as administering phenylephrine (a vasopressor); heart rate variability (HRV); and electrodermal activity (EDA).
Related publication: Subramanian, S., Purdon, P., Barbieri, R., & Brown, E. (2021). Behavioral and autonomic dynamics during propofol-induced unconsciousness (version 1.0). PhysioNet. https://doi.org/10.13026/2rbc-1r03.
Ophthalmology
94 open access ophthalmological imaging datasets containing 507,724 images and 125 videos from 122,364 patients.
Cardiology
The PTB-XL ECG dataset is a large dataset of 21,837 clinical 12-lead ECGs, each 10 seconds long, from 18,885 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. In total, 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements.
Cardiology/
Dermatology/
General/
Imaging
A collection of de-identified annotated medical imaging data to foster transparent and reproducible collaborative research. X-rays, CT scans, MRIs, echocardiography and dermatology images.
Dermatology
Diverse Dermatology Images (DDI) dataset—the first publicly available, deeply curated, and pathologically confirmed image dataset with diverse skin tones. The DDI was retrospectively selected from reviewing pathology reports in Stanford Clinics from 2010-2020. It has a total of 656 images representing 570 unique patients.
General
Datasets is a library for easily accessing and sharing datasets and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Currently over 2,658 datasets and more than 34 metrics are available. A search for the term “medical” returns at least 13 datasets. Load a dataset in a single line of code, and use the library's data processing methods to quickly get the dataset ready for training a deep learning model.
Pulmonary
The DCSM dataset consists of 255 randomly selected and fully anonymized overnight lab-based polysomnography (PSG) recordings from patients visiting the DCSM for the diagnosis of non-specific sleep related disorders. The DCSM dataset represents a diverse cohort of Danish patients with respect to demographic characteristics, diagnostic background and sleep/non-sleep related medication usage, collected between 2015 and 2018.
Pulmonary
Two publicly-available datasets, DOD-H including 25 healthy volunteers and DOD-O including 55 patients suffering from obstructive sleep apnea (OSA). Both datasets have been scored by 5 sleep technologists from different sleep centers. We developed a framework to compare automated approaches to a consensus of multiple human scorers.
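The comparison against a multi-scorer consensus described above can be illustrated with a minimal sketch. It assumes a simple per-epoch majority vote, with ties going to the earliest-listed scorer; the dataset authors' actual framework may weight scorers or break ties differently. The function names `consensus_hypnogram` and `agreement`, and the toy labels, are hypothetical.

```python
from collections import Counter


def consensus_hypnogram(scorings):
    """Build a per-epoch consensus from several scorers' sleep-stage labels.

    `scorings` is a list of equal-length label sequences, one per scorer.
    Assumption: simple majority vote; on a tie, the label given by the
    earliest-listed scorer among the tied labels wins.
    """
    n_epochs = len(scorings[0])
    if any(len(s) != n_epochs for s in scorings):
        raise ValueError("all scorers must label the same number of epochs")

    consensus = []
    for epoch_labels in zip(*scorings):
        counts = Counter(epoch_labels)
        top = max(counts.values())
        # First label, in scorer order, that reaches the top vote count.
        consensus.append(next(l for l in epoch_labels if counts[l] == top))
    return consensus


def agreement(predicted, reference):
    """Fraction of epochs where the prediction matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy example: five scorers, five 30-second epochs (made-up labels).
scorers = [
    ["W", "N1", "N2", "N2", "REM"],
    ["W", "N2", "N2", "N3", "REM"],
    ["W", "N1", "N2", "N3", "REM"],
    ["N1", "N1", "N2", "N2", "W"],
    ["W", "N2", "N3", "N3", "REM"],
]
model_output = ["W", "N1", "N2", "N2", "REM"]

reference = consensus_hypnogram(scorers)
print(reference)
print(agreement(model_output, reference))
```

Per-epoch accuracy is only one possible agreement measure; chance-corrected metrics such as Cohen's kappa are a common alternative when stage frequencies are imbalanced.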
Cardiology/
Imaging
375 heterogeneous cardiac magnetic resonance (CMR) datasets acquired by using four different scanner vendors in six hospitals and three different countries (Spain, Canada and Germany).
Cancer/
Genetics/
Imaging
TCIA data are organized as “collections”; typically these are patient cohorts related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.) or research focus. Supporting data related to the images such as patient outcomes, treatment details, genomics and image analyses are also provided when available. More than 100 datasets, many of which are public.
General
Unstructured notes from the Research Patient Data Registry at Partners Healthcare, Boston, USA (originally developed during the i2b2 project). Clinical Natural Language Processing (NLP) data sets were originally created at a former NIH-funded National Center for Biomedical Computing (NCBC) known as i2b2: Informatics for Integrating Biology and the Bedside. Since 2018, they have officially been known as n2c2 (National NLP Clinical Challenges).
General
emrQA is a publicly available, large-scale EMR Question Answering (QA) corpus, created using a novel semi-automated generation framework that allows for minimal expert involvement and re-purposes existing annotations available for other clinical NLP tasks. emrQA has 1 million question-logical form pairs and 400,000+ question-answer evidence pairs. The dataset uses existing NLP task annotations from the i2b2 Challenge datasets.
Related publication: Pampari, A., Raghavan, P., Liang, J.J., & Peng, J. (2018). emrQA: A Large Corpus for Question Answering on Electronic Medical Records. EMNLP.
Anesthesiology
VSCapture is an open source tool developed in C# on the .NET/Mono platform, which allows it to run on Windows, Mac OS X, and Ubuntu Linux.
Related Publication: Data acquisition from S/5 GE Datex anesthesia monitor using VSCapture.
Related Dataset: The University of Queensland Vital Signs Dataset.
The University of Queensland Vital Signs Dataset contains a wide range of patient monitoring data and vital signs that were recorded during 32 surgical cases where patients underwent anaesthesia at the Royal Adelaide Hospital.
Cancer/
Pathology
12,625 whole-slide images (WSIs) of prostate biopsies collected from 6 different sites: 10,616 for model development (the development set), 393 for performance evaluation during the competition phase (the tuning set), 545 as the internal validation set in the postcompetition phase, and 1,071 for external validation.
Related publication: Bulten, W., Kartasalo, K., Chen, PH.C. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med (2022). https://doi.org/10.1038/s41591-021-01620-2
Cardiology/
General
A dataset from the cardiology unit of a tertiary care medical college and hospital in India, covering 14,845 admissions corresponding to 12,238 patients.
Related publication: Bollepalli, S.C.; Sahani, A.K.; Armoundas, A.A., et al. An Optimized Machine Learning Model Accurately Predicts In-Hospital Outcomes at Admission to a Cardiac Unit. Diagnostics 2022, 12, 241.
https://doi.org/10.3390/diagnostics12020241
Dermatology
The dataset includes over 69,000 dermatology images. The International Skin Imaging Collaboration (ISIC) is a global partnership that has organized the world’s largest repository of publicly available dermoscopic images and hosted the first public benchmarks for melanoma detection in dermoscopic images, titled “Skin Lesion Analysis Towards Melanoma Detection”, at the IEEE International Symposium on Biomedical Imaging (ISBI).
Imaging
A dataset of 491 head CT scans with 193,317 slices, including anonymized DICOMs for all the scans and the corresponding reads by three radiologists with 8, 12 and 20 years of experience in cranial CT interpretation, respectively.
Related publication: Development and Validation of Deep Learning Algorithms for Detection of Critical Findings in Head CT scan.
Critical Care/
Imaging
Publicly available suite of tailored deep neural network models for tackling different challenges ranging from screening to risk stratification to treatment planning for patients with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
The initiative has also been expanded to the open source TB-Net initiative for tuberculosis screening, the Fibrosis-Net initiative for pulmonary fibrosis progression prediction, and the Cancer-Net initiative for cancer screening.
Emergency Department
MIMIC-ED is a large, freely available database of emergency department (ED) admissions at the Beth Israel Deaconess Medical Center between 2011 and 2016. It covers 448,972 ED stays, with vital signs, triage information, medication reconciliation, medication administration, and discharge diagnoses available.
Imaging
Chest X-ray dataset with eye tracking and report dictation. Built on the MIMIC Chest X-ray dataset. 1,083 CXR images.
Related publication:
Imaging
This database is the first multi-institutional, multi-national expert annotated COVID-19 imaging dataset. Annotated by three radiologists with majority vote adjudication by board certified radiologists, RICORD consists of 240 thoracic CT scans and 1,000 chest radiographs contributed from four international sites.
Anesthesiology
A comprehensive dataset of 6,388 surgical patients composed of intraoperative biosignals and clinical information from the Department of Anesthesiology and Pain Medicine, Seoul National University College of Medicine, Seoul, Korea.
Imaging
CheXpert is a public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients from Stanford Hospital.
Cancer/
Genetics
The GDC Portal is a platform from the National Cancer Institute (NCI) with cancer-related genomic data for 80,000+ cases.
Imaging
Annotations by CARING’s (Centre for Advanced Research in Imaging, Neuroscience and Genomics) expert radiologists on COVID-19 positive X-rays. The corresponding X-rays were released by the Medical Imaging Data Bank of the Valencia region (BIMCV).
Imaging
BIMCV-COVID19+ is a large dataset of chest X-ray images and computed tomography (CT) imaging of COVID-19 patients, along with their radiographic findings, pathologies, polymerase chain reaction (PCR) and immunoglobulin antibody test results, and radiographic reports, from the Valencian Region Medical Image Bank (BIMCV). These iterations of the database include 7,377 CR, 9,463 DX and 6,687 CT studies.
Imaging
Provided on Kaggle by the Vingroup Big Data Institute (VinBigData), which aims to promote fundamental research and investigate novel and highly applicable technologies. The dataset consists of 18,000 images that have been annotated by experienced radiologists.
Cardiology
The EchoNet-Dynamic database includes 10,030 labeled echocardiogram videos and human expert annotations (measurements, tracings, and calculations) to provide a baseline to study cardiac motion and chamber sizes.
Related publication: Video-based AI for beat-to-beat assessment of cardiac function
Genetics/
Pharmacology
941 sentences from 911 PubMed abstracts, annotated with pharmacogenomics (PGx) entities of interest (mainly gene variations, genes, drugs and phenotypes) and the relationships between them.
General
A list of open source healthcare datasets classified across 40+ specialities, with direct links to the datasets and more information.
General
More than 100 healthcare-related datasets from around the world, classified and annotated.
General
Dataset created for the purpose of continuing research into COVID-19. However, with information from all 50 states and the District of Columbia, many other US statistics can also be compared.
Pharmacology
The DILIrank dataset is an updated version of the LTKB Benchmark dataset. DILIrank consists of 1,036 FDA-approved drugs that are divided into four classes according to their potential for causing drug-induced liver injury (DILI).
Ophthalmology
Dataset for automatically segmenting and classifying corneal ulcers with 712 ocular staining images and the associated segmentation labels for flaky corneal ulcers.
General
4,000+ healthcare datasets made available by Harvard University. Searchable and diverse.
Pathology
Semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20,000 whole slide images at different magnifications, from multiple data sources.
Imaging
A dataset of images, mainly chest X-rays, from COVID-19 patients.
General
Multiple data sources for COVID-19 in a unified data model, ready for analysis in one place.
General
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.
General
This dataset has daily-level information on the number of affected cases, deaths and recoveries from the 2019 novel coronavirus.
Data are available from 22 January 2020 onward.
Imaging
Largest collection of intracranial hemorrhage CT scans. 874,035 images with expert annotations.
Cardiology/
General/
Neurology
One of the most comprehensive sources of healthcare datasets, primarily from ICU patients.
https://physionet.org/about/database/
MIMIC – IV Dataset (https://physionet.org/content/mimiciv/0.4/)
Includes:
Imaging/
Neurology
Imaging (MRI), clinical, genomic, and biomarker data from Alzheimer’s disease patients for the purposes of scientific investigation, teaching, or planning clinical research studies.
Ophthalmology
RIM-ONE is a database for optic disc and cup segmentation evaluation by Medical Image Analysis group.
Critical Care
Contains data related to 23,376 intensive care unit and high dependency unit admissions at Amsterdam University Medical Center of adult patients from 2003-2016.
Pharmacology
The FDA Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports and product quality complaints resulting in adverse events that were submitted to the FDA.
Microbiology
A repository of segmented cells from the thin blood smear slide images from the Malaria Screener research activity. The dataset contains a total of 27,558 cell images with equal instances of parasitized and uninfected cells.
Ophthalmology
Dataset of 1,200 annotated retinal images.
Ophthalmology
A de-identified dataset of retinal fundus images for glaucoma analysis (RIGA) derived from three sources, with 750 original images and 4,500 manually marked images.
Ophthalmology
The public database contains 15 images of healthy patients, 15 images of patients with diabetic retinopathy and 15 images of glaucomatous patients.
Ophthalmology
39 images for development of vessel extraction algorithms suitable for retinal screening programmes.
Cancer
Datasets from the National Cancer Institute covering over 54,000 patients. They include data on participant characteristics, screening exam results, diagnostic procedures, lung cancer, and mortality. Images from over 75,000 CT screening exams are available. Over 1,200 pathology images from a subset of NLST lung cancer patients (~500 of over 2,000 patients) may be viewed.
Pulmonary
Polysomnography dataset from NSRR for sleep studies. A large collection of de-identified physiologic signals well suited to ML development.
Dermatology
A large collection of multi-source dermatoscopic images of common pigmented skin lesions containing 10,000 images.
Related publication: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions
General
This open source repository has more than 400 datasets, including 100+ healthcare and other non-healthcare ones, in a searchable and categorized format.
General
CMS datasets provide US Medicare and Medicaid datasets.
ResDAC (the Research Data Assistance Center) provides free support to users of CMS datasets. Link: https://www.resdac.org/learn
General
Centers for Disease Control and Prevention (CDC) datasets. Useful for incidence and prevalence of various disorders and for mortality data from across the US.
General
Agency for Healthcare Research and Quality’s HCUP datasets used to identify, track, and analyze US national trends in health care utilization, access, charges, quality, and outcomes.
General
The UK government’s National Health Service (NHS) datasets. NHS Choices datasets are useful for NLP and sentiment analysis for both GPs and hospitals.
Imaging
Brain MRI datasets from the Open Access Series of Imaging Studies (OASIS).
Cancer
Cancer epidemiology data available through NCI’s Surveillance, Epidemiology, and End Results (SEER) Program.
Imaging
A dataset of 14,000+ anonymized, radiologist-labeled musculoskeletal X-rays from 12,000+ patients, from the Stanford ML Group.
Related publication: https://arxiv.org/abs/1712.06957
General
One stop to learn Natural Language processing and more.
Related publication: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
General
An excellent resource for trends and updates in AI, especially NLP, by Elvis Saravia.
Imaging
Over 100,000 anonymized chest x-ray images and their corresponding data from more than 30,000 patients, including many with advanced lung disease.
Imaging
NIH release of a dataset containing 32,000 CT scan images with annotated lesions belonging to 4400 unique patients.
General
A CMS initiative to democratize research and development using beneficiary data. A dataset of more than 70 million patients is available.
General
The link below is for NIH’s strategic plan for data science in healthcare. A must-read for anyone using data in healthcare for research and innovation.
Imaging
Largest open source chest X-ray dataset, available through NIH’s Clinical Center. See the link in the article to access the data. Also available through GitHub and Kaggle.
General
One of the largest and most advanced software development platforms in the world, with many datasets and repositories.