Everything about data, including open-source and healthcare datasets and more, in one location.

 

The HAM10000 dataset

A large collection of multi-source dermatoscopic images of common pigmented skin lesions, containing about 10,000 images.

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T

Associated publication:

The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

Philipp Tschandl, Cliff Rosendahl & Harald Kittler

 

UCI Machine Learning Repository

This open-source repository has more than 400 datasets, including healthcare (100+) and non-healthcare ones, in a searchable and categorized format.

http://archive.ics.uci.edu/ml/datasets.php

Centers for Medicare & Medicaid Services (CMS) datasets, with ResDAC link

CMS publishes datasets on the US Medicare and Medicaid programs.

https://data.cms.gov

ResDAC (the Research Data Assistance Center) provides free support to users of CMS datasets.

https://www.resdac.org/learn

Centers for Disease Control and Prevention (CDC) datasets

The CDC's datasets. Useful for incidence, prevalence, and mortality data on various disorders from across the US.

https://data.cdc.gov
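
As a rough illustration of the measures mentioned above, the sketch below computes prevalence, incidence, and mortality per 100,000 people. All counts are made up for the example and do not come from any CDC dataset, and the helper name `per_100k` is hypothetical.

```python
def per_100k(count: int, population: int) -> float:
    """Express a raw count as a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical figures for one region over one year (not real data).
population = 250_000      # people in the region
existing_cases = 5_000    # people living with the condition at the start of the year
new_cases = 750           # people newly diagnosed during the year
deaths = 125              # deaths attributed to the condition during the year

# Prevalence: existing cases relative to the whole population.
prevalence = per_100k(existing_cases, population)

# Incidence: new cases relative to the people still at risk,
# here approximated as everyone who did not already have the condition.
incidence = per_100k(new_cases, population - existing_cases)

# Mortality: deaths relative to the whole population.
mortality = per_100k(deaths, population)

print(f"Prevalence: {prevalence:.1f} per 100,000")
print(f"Incidence:  {incidence:.1f} per 100,000")
print(f"Mortality:  {mortality:.1f} per 100,000")
```

Published rates are often age-adjusted or use person-time denominators, so this simple version is only a starting point.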

 

Healthcare Cost and Utilization Project (HCUP) datasets

The Agency for Healthcare Research and Quality's HCUP datasets are used to identify, track, and analyze US national trends in health care utilization, access, charges, quality, and outcomes.

https://hcup-us.ahrq.gov/databases.jsp

 

NHS datasets

The UK government's National Health Service (NHS) datasets. The NHS Choices datasets are useful for NLP and sentiment analysis, for both GPs and hospitals.

https://data.gov.uk/dataset/73740ffe-cecb-4cba-afb9-51ea996187a1/nhs-england-nhs-choices-hospitals-patient-comments-and-ratings
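
To give a feel for what sentiment analysis on patient comments involves, here is a minimal lexicon-based sketch in plain Python. The word lists and the sample comments are invented for illustration and are not taken from the NHS data; a real project would more likely use an established lexicon or a trained model.

```python
import re

# Hypothetical mini-lexicons, chosen only for this example.
POSITIVE = {"excellent", "friendly", "helpful", "clean", "caring"}
NEGATIVE = {"rude", "dirty", "slow", "poor", "unhelpful"}


def sentiment_score(comment: str) -> int:
    """Return the number of positive words minus the number of negative words."""
    words = re.findall(r"[a-z']+", comment.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(comment: str) -> str:
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(comment)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up comments standing in for patient feedback.
comments = [
    "The staff were friendly and helpful.",
    "Reception was rude and the waiting room was dirty.",
    "I attended the clinic on Tuesday.",
]

labels = [label(comment) for comment in comments]
for comment, sentiment in zip(comments, labels):
    print(f"{sentiment:>8}: {comment}")
```

Word counting like this ignores negation ("not helpful") and context, which is one reason real comment data is a good testbed for stronger NLP methods.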

OASIS Brain MRI dataset

Brain MRI datasets from the Open Access Series of Imaging Studies (OASIS).

http://www.oasis-brains.org/#data

 

OpenNeuro

A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data, with over 200 datasets.

https://openneuro.org

 

National Cancer Institute (NCI) SEER datasets

Cancer epidemiology data available through NCI's Surveillance, Epidemiology, and End Results (SEER) Program.

https://seer.cancer.gov/data-software/

 

Broad Institute's Cancer Program datasets

Cancer and genomics datasets.

http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi

 

MIMIC-CXR

The largest publicly available dataset of de-identified chest X-rays: 370,000+ images with 14 labels.

https://physionet.org/physiobank/database/mimiccxr/

 

MURA

A dataset of 14,000+ anonymized, radiologist-labeled musculoskeletal X-ray studies from 12,000+ patients, released by the Stanford ML Group.

https://stanfordmlgroup.github.io/competitions/mura/

Read their article: https://arxiv.org/abs/1712.06957

 

fastMRI

An anonymized dataset of 1,500+ knee MRIs from NYU.

https://fastmri.med.nyu.edu

NLTK: Natural Language Toolkit

A one-stop resource for learning natural language processing and more.

Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

https://www.nltk.org

DAIR.AI

An excellent resource by Elvis Saravia for trends and updates in AI, especially NLP.

https://medium.com/dair-ai

Data science article collection

An excellent collection of articles on data science.

https://www.datasciencecentral.com/profiles/blogs/30-seminal-articles-every-data-scientist-should-read

eICU Collaborative Dataset

The eICU collaborative dataset covers more than 130,000 patients across 300 hospitals. The de-identified dataset, available for collaborative research, includes vitals, clinical notes, APACHE scores, diagnoses, treatment information, and more.

https://www.nature.com/articles/sdata2018178

Google Dataset Search

Google's powerful search engine for finding datasets.

https://toolbox.google.com/datasetsearch

 

NIH DeepLesion

An NIH release of a dataset containing 32,000 CT scan images with annotated lesions, belonging to 4,400 unique patients.

https://www.nih.gov/news-events/news-releases/nih-clinical-center-releases-dataset-32000-ct-images

 

Blue Button 2.0

A CMS initiative to democratize research and development using beneficiary data. A dataset covering more than 70 million patients is available. Learn more through the links below:

https://bluebutton.cms.gov

https://www.youtube.com/watch?v=v5b8T6EELp8

National Institutes of Health

The link below is for the NIH's strategic plan for data science in healthcare. A must-read for anyone using data in healthcare for research and innovation.

https://datascience.nih.gov/sites/default/files/NIH_Strategic_Plan_for_Data_Science_Final_508.pdf

NIH Clinical Center

One of the largest open-source chest X-ray datasets, available through the NIH Clinical Center. See the link in the article to access the data. Also available through GitHub and Kaggle.

https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community

 

GitHub

Thanks to Andrew L. Beam and the many other contributors to this GitHub page. Visit via the link below or through the BrainX Community on LinkedIn.

https://github.com/beamandrew/medical-data

https://www.linkedin.com/groups/13599549

Kaggle

Kaggle is a good source of de-identified healthcare datasets. Visit the page using the link below and explore.

https://www.kaggle.com/datasets

 

MIMIC/PhysioNet

An excellent dataset for text-based and waveform-based projects that has been used in research worldwide.

https://mimic.physionet.org

DataMed

A biomedical data search engine that searches for datasets across registries.

https://datamed.org/index.php

 

Mendeley

A place to store, share, or find data. A platform for biomedical research.

https://www.mendeley.com/datasets

 

Nature

A detailed list of data repositories for biomedical research.

https://www.nature.com/sdata/policies/repositories