Whenever you hear the term AI, think about the data behind it.
In this post, I am sharing a collection of open source data sets that you can use to train machine learning models for a variety of tasks.
A data set is a collection of data. ML projects need a training data set: the examples a model learns from.
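To make the idea of a training data set concrete, here is a minimal sketch of holding back part of a data set for testing and training on the rest. The function name `train_test_split`, the 20% test fraction, and the toy records are assumptions made for illustration; they do not come from any of the data sets listed in this post.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle records and split them into (training set, test set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records: (feature, label) pairs invented for this example.
examples = [(i, i % 2) for i in range(10)]
train_set, test_set = train_test_split(examples)
print(len(train_set), "training examples,", len(test_set), "test examples")
```

The model only ever sees the training part; the held-back part is used to check how well it does on data it has not seen.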
1. X-ray Images
- https://ceb.nlm.nih.gov/repositories/tuberculosis-chest-x-ray-image-data-sets/
- https://www.kaggle.com/nih-chest-xrays
- http://academictorrents.com/details/557481faacd824c83fbf57dcf7b6da9383b3235a
- https://nihcc.app.box.com/v/ChestXray-NIHCC
2. US Government
- Data.gov
- NOAA – ncdc.noaa.gov/cdo-web (meteorological and environmental data)
- US Census Data – census.gov/data.html (demographics)
- Bureau of Labor Statistics – bls.gov/data (employment, unemployment, product categories)
3. UK Government and International Sources
- UK Data Service – www.ukdataservice.ac.uk (census data)
- World Bank – datacatalog.worldbank.org (census, demographics, geographic, health, income, GDP)
- IMF – imf.org/en/Data (economic, currency, finance, commodities)
- OpenData.go.ke
- Data.world
Find ideas for fun applications using these data sets:
- Kaggle.com/datasets (variety)
- snap.stanford.edu/data/web-Amazon.html (35 Million product reviews)
- GroupLens – grouplens.org/datasets/movielens (20M movie ratings)
- Yelp – yelp.com/dataset
- IMDB – ai.stanford.edu/~amaas/data/sentiment/ (50K movie reviews)
- Twitter Sentiments – help.sentiment140.com/for-students (1.6M tweets)
- Airbnb – insideairbnb.com/get-the-data.html
- UCI ML Datasets – mlr.cs.umass.edu/ml
- Enron email dataset – cs.cmu.edu/~enron/ (500K emails)
- SpamBase – archive.ics.uci.edu/ml/datasets/Spambase (emails)
- Jeopardy questions – reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/ (200K questions and answers)
- Gutenberg ebooks – Gutenberg.org/wiki/Gutenberg:Offline_Catalogs (large collection of ebooks)
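Several of the data sets above are distributed as plain comma-separated files. As a rough sketch of preparing such a file for training, the snippet below separates feature columns from a label column. It assumes the SpamBase layout, where each row holds numeric features followed by a final 0/1 class label (the real file has 57 features per row). The function name `parse_labeled_rows` and the two shortened sample rows are invented for illustration and are not taken from the actual file.

```python
import csv
import io


def parse_labeled_rows(text):
    """Split comma-separated rows into feature vectors and integer labels.

    Assumes the label is the last column of every row.
    """
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        *values, label = row
        features.append([float(v) for v in values])
        labels.append(int(label))
    return features, labels


# Two made-up rows with three features each, standing in for real data.
sample = "0.0,0.64,0.32,1\n0.21,0.0,0.5,0\n"
features, labels = parse_labeled_rows(sample)
print(features, labels)
```

With a downloaded file, the same function could be called on the file's contents after reading it as text.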
Image data sets for training computer vision models:
- ImageNet – image-net.org (14M images)
- Google – ai.googleblog.com/2016/09/introducing-open-images-dataset.html (9M images URLs with labels)
- Microsoft COCO – cocodataset.org (330K images, mostly labelled)
- Stanford Dogs – vision.stanford.edu/aditya86/ImageNetDogs (120 dog breeds, 20K images)
Please comment below if you are predicting something with these data sets.