Open Source Data Sets for Training Machine Learning Models

Whenever you hear the term AI, you must think about the data behind it.

In this post, I am sharing a collection of open source data sets that you can use to train a Machine Learning model for various tasks.

A data set is a collection of data. In ML projects, we need a training data set: the examples the model learns from.
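To make the idea of a training data set concrete, here is a minimal sketch of how a collection of records is often divided into a training part and a held-out test part. The function name `split_train_test`, the 80/20 ratio, and the toy records are my own assumptions for illustration; none of the data sets listed in this post prescribes them.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and return (train, test) lists.

    Hypothetical helper for illustration; the 80/20 default is an assumption.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data set: (feature, label) pairs standing in for real records.
toy_data = [(x, x % 2) for x in range(10)]
train_set, test_set = split_train_test(toy_data)
print(len(train_set), "training records,", len(test_set), "test records")
```

The model is fitted on the training part only, and the test part is kept aside to check how well it handles data it has not seen.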

1. X-ray Images

  • https://ceb.nlm.nih.gov/repositories/tuberculosis-chest-x-ray-image-data-sets/
  • https://www.kaggle.com/nih-chest-xrays
  • http://academictorrents.com/details/557481faacd824c83fbf57dcf7b6da9383b3235a
  • https://nihcc.app.box.com/v/ChestXray-NIHCC

2. US Government

  • Data.gov
  • NOAA – ncdc.noaa.gov/cdo-web (meteorological, environmental data)
  • US Census Data – census.gov/data.html (demographics)
  • BLS – bls.gov/data (employment/unemployment, product categories)

3. UK Government and International

  • UK Data Service – www.ukdataservice.ac.uk (census data)
  • WorldBank – datacatalog.worldbank.org (census, demographics, geographic, health, income, GDP)
  • IMF – imf.org/en/Data (economic, currency, finance, commodities)
  • OpenData.go.ke
  • Data.world

Find fun application ideas using these data sets:

  • Kaggle.com/datasets (variety)
  • snap.stanford.edu/data/web-Amazon.html (35 Million product reviews)
  • GroupLens – grouplens.org/datasets/movielens (20M movie ratings)
  • Yelp.com/dataset
  • IMDB – ai.stanford.edu/~amaas/data/sentiment/ (50K movie reviews, 25K for training and 25K for testing)
  • Twitter Sentiments – help.sentiment140.com/for-students (1.6M tweets)
  • Airbnb – insideairbnb.com/get-the-data.html
  • UCI ML Datasets – mlr.cs.umass.edu/ml
  • Enron email dataset – cs.cmu.edu/~enron/ (500K emails)
  • SpamBase – archive.ics.uci.edu/ml/datasets/Spambase (emails)
  • Jeopardy questions – reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/ (200K questions and answers)
  • Gutenberg ebooks – Gutenberg.org/wiki/Gutenberg:Offline_Catalogs (large collection of ebooks)

Image data sets for training computer vision models:

  1. ImageNet – image-net.org (14M images)
  2. Google Open Images – ai.googleblog.com/2016/09/introducing-open-images-dataset.html (9M image URLs with labels)
  3. Microsoft COCO – cocodataset.org (330K images, mostly labelled)
  4. Stanford Dogs – vision.stanford.edu/aditya86/ImageNetDogs (120 dog breeds, 20K images)

Please comment below if you are predicting something with any of these data sets.
