Let us begin by looking for machine learning datasets that are problem-specific and, ideally, already cleaned and pre-processed.
Finding task-specific datasets like MS-COCO for every variety of problem is a strenuous task, so we need to be intelligent about how we use the datasets we have. Wikipedia, for example, is arguably the best corpus available for NLP tasks. In this article, we discuss some of the major sources of machine learning datasets and how to proceed with them. A word of caution: read the terms and conditions that each of these datasets imposes, and follow them accordingly; this is in everyone's best interest.
1. Google Dataset Search
Google, the search engine giant, has helped ML practitioners by doing what it does best: helping us find things, in this case datasets. The search engine does a fabulous job of surfacing datasets related to your keywords from various sources, including government websites, Kaggle, and other open-source repositories.
2. .gov Datasets:
With the United States, China and many more countries becoming AI superpowers, data is being democratised. The rules and regulations related to these datasets are usually stringent as they are actual data collected from various sectors of a nation. Thus, cautious use is recommended. We list some of the countries that are openly sharing their datasets.
- Indian Government Dataset
- Australian Government Dataset
- EU Open Data Portal
- New Zealand’s Government Dataset
- Singapore Government Dataset
3. Kaggle
Kaggle is known for hosting machine learning and deep learning challenges. Its relevance here is that it provides datasets and, at the same time, a community of learners and ML practitioners whose work can help us make progress. Each challenge has a specific dataset, usually already cleaned, so we can skip much of the bland work of cleaning and focus on refining the algorithm instead. The datasets are easily downloadable. Under the resources section of each challenge, there are prerequisites and links to learning material, which help whenever we are stuck with either the algorithm or the implementation. Kaggle is a fantastic website for beginners venturing into applications of machine learning and deep learning, and a detailed resource pool for intermediate practitioners.
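As a rough sketch of the typical workflow once a Kaggle dataset has been downloaded (the real download would use the `kaggle` CLI with credentials configured; here a toy zip built on the fly stands in for it):

```python
import csv
import tempfile
import zipfile
from pathlib import Path

# A real download would be e.g. `kaggle datasets download -d <owner>/<dataset>`;
# here a toy zip stands in for it, purely for illustration.
workdir = Path(tempfile.mkdtemp())
zip_path = workdir / "toy-dataset.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("train.csv", "id,label\n1,cat\n2,dog\n")

# Typical post-download steps: extract the archive, then load the CSV.
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(workdir / "data")

with open(workdir / "data" / "train.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows), rows[0]["label"])  # 2 cat
```

In practice you would swap the toy zip for the archive Kaggle serves you and load the extracted CSVs with pandas, but the extract-then-read shape of the workflow is the same.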
4. Amazon Datasets (Registry of Open Data on AWS)
Amazon has made some of the datasets hosted on its servers publicly accessible. When using AWS resources to calibrate and tweak models, using these locally available datasets can speed up data loading considerably. The registry contains several datasets classified by field of application, such as satellite imagery, ecological resources, etc.
5. UCI Machine Learning Repository
The UCI Machine Learning Repository provides easy-to-use, cleaned datasets. These have long been the go-to datasets in academia.
A useful feature of the website is that it lists the papers that have used each dataset, which research scientists and others in academia will find handy. Many of the datasets cannot be used for commercial purposes; for details, check the website of each dataset.
6. Reddit
The subreddit can be used as a secondary guide when all other options lead nowhere. People there discuss the datasets available and how to use existing datasets for new tasks, and you can pick up many insights on the tweaking needed to make a dataset work in a different setting. Overall, this should be your last resort for finding datasets.
Datasets for other applications
Let’s now focus on datasets specific to the major domains that have seen accelerated progress over the last two decades: computer vision, NLP, and data analytics. Having domain-specific datasets available enhances the robustness of a model, making more realistic and accurate results possible.
Computer Vision Datasets
There are several computer vision datasets available, and the choice depends on the level of competence we are working at. The datasets pre-loaded in Keras and scikit-learn are sufficient for learning, experimenting and implementing new models. Their downside is that the chance of overfitting the model is high due to their low complexity. Intermediate ML practitioners and organisations solving specific problems can therefore refer to the following sources:
COCO dataset: COCO, or Common Objects in COntext, is a large-scale object detection, segmentation, and captioning dataset. It contains almost 330k images, of which more than 200k are labelled. The images are segmented, since image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images.
Here is a link to the source COCO.
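COCO ships its annotations as JSON files with `images`, `annotations`, and `categories` sections, and boxes stored as `[x, y, width, height]`. A minimal sketch of reading that structure with the standard library, using a tiny synthetic annotation file in place of a real one (real COCO annotation files run to hundreds of megabytes):

```python
import json

# Toy stand-in for an instances_*.json annotation file.
raw = json.dumps({
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 3, "bbox": [100, 50, 80, 60]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
})

data = json.loads(raw)  # with a real file you would use json.load(open(path))
names = {c["id"]: c["name"] for c in data["categories"]}
for ann in data["annotations"]:
    x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
    print(names[ann["category_id"]], x, y, w, h)
```

For serious work, the official `pycocotools` package wraps this structure and adds mask utilities; the sketch above only shows the raw layout.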
Imagenet dataset: ImageNet is a large database of over 14 million images, designed by academics for computer vision research. It was the first of its kind in terms of scale. Images are organised and labelled in a hierarchy. ImageNet contains more than 20,000 categories, with a typical category, such as “balloon” or “strawberry”, consisting of several hundred images.
Here is a link to the source Imagenet
CIFAR-10: The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train machine learning and computer vision algorithms, and one of the most widely used datasets in machine learning research. Its 10 classes represent aeroplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks, with 6,000 images per class. Since the images are low-resolution (32×32), the dataset allows researchers to quickly try different algorithms and see what works.
Here is a link to the source CIFAR-10
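The downloadable "python version" of CIFAR-10 stores each batch as a pickled dict whose rows are flat 3,072-value vectors (red, green, then blue channel planes). A minimal sketch of loading and reshaping one, assuming NumPy is installed and using a fake one-image batch in place of a real `data_batch` file:

```python
import os
import pickle
import tempfile

import numpy as np

# Fake one-image batch mimicking the CIFAR-10 python format
# (a real data_batch file holds 10,000 rows of 3,072 uint8 values each).
path = os.path.join(tempfile.mkdtemp(), "data_batch_1")
with open(path, "wb") as f:
    pickle.dump({b"data": np.zeros((1, 3072), dtype=np.uint8), b"labels": [6]}, f)

with open(path, "rb") as f:
    batch = pickle.load(f, encoding="bytes")  # bytes keys, as in the real files

# Each row stores the red, green, then blue channel planes back to back,
# so reshape to (3, 32, 32) and move channels last for display.
img = batch[b"data"][0].reshape(3, 32, 32).transpose(1, 2, 0)
print(img.shape)  # (32, 32, 3)
```

If you use Keras instead, `keras.datasets.cifar10.load_data()` downloads the dataset and does this unpacking for you.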
Open Images (V6): Open Images is a dataset of nearly 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. It contains a total of 16 million bounding boxes for 600 object classes on 1.9 million images, making it the largest existing dataset with object location annotations. The boxes have largely been drawn manually by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.3 per image on average). Open Images also offers visual relationship annotations indicating pairs of objects in particular relations (e.g. “woman playing guitar”, “beer on table”), object properties (e.g. “table is wooden”), and human actions (e.g. “woman is jumping”). In total, it has 3.3M annotations from 1,466 distinct relationship triplets.
Here is a link to the source Open Images
Computer vision online: A variety of resources and datasets are available on the website. It lists most of the open-source datasets and redirects the user to the dataset’s webpage. The datasets available can be used for classification, detection, segmentation, image captioning and many more challenging tasks.
Here is a link to the source Computer vision online
YACVID: This website lists almost all the available datasets and makes it easy to find relevant ones by letting you search with the tags associated with each dataset. We highly recommend trying this website out.
Here is a link to the source YACVID
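Before moving on: the pre-loaded scikit-learn datasets mentioned at the start of this section really are one-liners to use, which is why they suit learning and quick experiments. A minimal sketch, assuming scikit-learn is installed (the digits dataset ships inside the package, so no download happens):

```python
from sklearn.datasets import load_digits  # bundled with scikit-learn, no download

digits = load_digits()  # 8x8 grayscale digit images, flattened to 64 features
print(digits.data.shape, digits.target.shape)  # (1797, 64) (1797,)
```

Keras offers the same convenience for CIFAR-10 and MNIST via `keras.datasets`, though those loaders fetch the data over the network on first use.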
Natural Language Processing
NLP is growing at a phenomenal pace, and language modelling has recently had its ImageNet moment: people can now build applications on top of state-of-the-art conversational NLP agents. NLP deals with sentiment analysis, audio processing, translation, and many more challenging tasks, and many scenarios call for datasets catered to the specific task. It is therefore necessary to have a sizeable list of datasets:
Stanford Question Answering Dataset (SQuAD): Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowd workers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible but also determine when no answer is supported by the paragraph and abstain from answering.
Here is a link to the source SQuAD
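SQuAD is distributed as a JSON file of articles, paragraphs, and question-answer pairs; in v2.0 each question carries an `is_impossible` flag, and answerable ones give the answer text plus its character offset into the context. A minimal sketch of walking that structure, using a tiny synthetic example in place of the real file:

```python
# Toy stand-in for a SQuAD 2.0 file (a real file is loaded with json.load
# and the train set holds roughly 150k questions).
squad = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "SQuAD was created at Stanford.",
            "qas": [
                {"id": "q1", "question": "Where was SQuAD created?",
                 "is_impossible": False,
                 "answers": [{"text": "Stanford", "answer_start": 21}]},
                {"id": "q2", "question": "When was COCO created?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

answerable, unanswerable = 0, 0
for article in squad["data"]:
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            if qa["is_impossible"]:
                unanswerable += 1
            else:
                answerable += 1
                span = qa["answers"][0]
                start = span["answer_start"]
                # answer_start indexes into the context string
                assert para["context"][start:start + len(span["text"])] == span["text"]
print(answerable, unanswerable)  # 1 1
```

A system scored on SQuAD2.0 must produce the span for the first kind of question and abstain on the second, which is exactly the split this loop computes.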
Yelp Reviews: This dataset is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp’s data and share their discoveries. In the dataset, you’ll find information about businesses across 11 metropolitan areas in four countries.
Here is a link to the source Yelp Reviews
The Blog Authorship Corpus: The Blog Authorship Corpus is a collection of posts from 19,320 bloggers. These blogs were gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words – or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labelled for gender and age but for many, industry and/or sign is marked as unknown.)
All bloggers included in the corpus fall into one of three age groups:
- 8,240 “10s” blogs (ages 13-17)
- 8,086 “20s” blogs (ages 23-27)
- 2,994 “30s” blogs (ages 33-47)
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped, with two exceptions: individual posts within a single blog file are separated by the date of the following post, and links within a post are denoted by the label urllink.
Here is a link to the source The Blog Authorship Corpus
Appen: The datasets on this website are cleaned, and it offers a vast catalogue to choose from. The appealing, easy-to-use interface makes it a highly recommended choice.
Here is a link to the website.
Apart from these, the majority of the datasets in the domain are listed in the following GitHub repository.
Statistics and Data Science
Data science covers a range of tasks, including building recommendation engines, predicting parameters from data such as time series, and doing exploratory and analytical research. Small organisations and individual practitioners don’t have what the big giants have, namely the data, so open datasets such as these are a huge boon: they let us build models that reflect real data rather than simulated data.
http://rs.io/100-interesting-data-sets-for-statistics/: There are various datasets available for specific tasks, and it’s a wonderful resource point.
http://deeplearning.net/datasets/: These are benchmark datasets and can be used for comparing the results of the model built with the benchmark results.
This is a fairly extensive list of dataset sources for machine learning, analytics, and other applications. We wish you the best of luck while implementing your models, and we hope you come up with models that can match the benchmark results.
If you are interested in learning Machine Learning concepts and pursuing a career in the domain, upskill with Great Learning’s PG Program in Artificial Intelligence and Machine Learning.