If you are getting into Data Science and looking for datasets to use for your own learning, Google has you covered. In this article I will tell you about the Google tools you can use to find datasets and sharpen your Data Science skills.
Google has a dataset search engine named Dataset Search. It is a free tool for searching over 25 million publicly available datasets. Dataset Search lets you find datasets stored across the Web through a simple keyword search.
The aim of Dataset Search is to create a data-sharing ecosystem that encourages data publishers to follow best practices for data storage and publication.
The Dataset Search tool includes filters to narrow your searches by publish date, format and usage rights. The search results include fields such as when the dataset was updated, who the provider of the dataset is, the available download formats, the period and area the dataset covers, and a description of the dataset.
Google’s Dataset Search does not curate the 25 million datasets or provide direct access to them. Google relies on the dataset publishers to use the open standards of schema.org to describe their datasets’ metadata. Google then indexes that metadata and makes it searchable across publishers.
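To make the idea of publisher-supplied metadata more concrete, here is a minimal sketch of how a schema.org Dataset description embedded in a web page could be read back out. It assumes the publisher used JSON-LD (one of the formats schema.org markup can be written in) inside a script tag. The sample page, the dataset name and the helper names JsonLdExtractor and find_datasets are all made up for illustration, and this is not a description of how Google's own indexing works.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                # Skip blocks that are not valid JSON.
                pass


def find_datasets(html):
    """Return every JSON-LD object in the page whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [
        block for block in parser.blocks
        if isinstance(block, dict) and block.get("@type") == "Dataset"
    ]


# Hypothetical page; the dataset below does not exist.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example rainfall measurements",
 "description": "Hypothetical daily rainfall totals, used only for illustration."}
</script>
</head><body>Landing page for the dataset.</body></html>
"""

datasets = find_datasets(SAMPLE_PAGE)
print(datasets[0]["name"])
```

The point of the sketch is that the search engine only needs the description, not the data itself: the name, description and similar fields travel with the publisher's page.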
Dataset publishers are required to host the datasets themselves. Because of that, you will find that half of the datasets in the search results come from for-profit aggregators. Other dataset publishers include government agencies and research institutions.
According to Google, most of the datasets are related to “geosciences, biology, and agriculture.”
You can publish your own datasets simply by using the open standards of schema.org to describe them. The number of publicly available datasets is likely to continue growing as more publishers conform to the standard.
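As a rough illustration of what describing your own dataset with schema.org might look like, the sketch below builds a Dataset description as JSON-LD and wraps it in a script tag that you could place on the page hosting the dataset. The property names (name, description, url, license, distribution, encodingFormat, contentUrl) come from the schema.org vocabulary. The helper functions, the example.com URLs and the dataset itself are assumptions made up for this example, and a real listing may need more fields than shown here.

```python
import json


def build_dataset_jsonld(name, description, url, license_url,
                         download_url, encoding_format):
    """Return a schema.org Dataset description as a JSON-LD string."""
    metadata = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        # Each distribution entry describes one downloadable file.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }
    return json.dumps(metadata, indent=2)


def wrap_in_script_tag(jsonld):
    """Wrap the JSON-LD so it can be embedded in the HTML of a landing page."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


# Hypothetical dataset and placeholder URLs, for illustration only.
example_jsonld = build_dataset_jsonld(
    name="Example rainfall measurements",
    description="Hypothetical daily rainfall totals, used only for illustration.",
    url="https://example.com/datasets/rainfall",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.com/datasets/rainfall.csv",
    encoding_format="CSV",
)

print(wrap_in_script_tag(example_jsonld))
```

You still host the data file yourself. The markup only tells search engines what the dataset is, where to download it and under which license.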
Another great source for sharpening your skills is Kaggle. Kaggle is not like Google’s Dataset Search: it is an online community of data scientists and machine learning practitioners who share their work, and there you will find the code and data you need to do your data science work. Kaggle currently offers over 19,000 public datasets and 200,000 public notebooks to conquer any analysis in no time.
Kaggle also offers competitions to solve a specific task, with some prize money. If you are successful in a competition, this is a great way to earn some money for a project you are working on.
I use Kaggle from time to time, and when I wrote the article about Wine Regions of the World, I used an open dataset available from Kaggle.
There you have it. There are many other websites that offer open datasets, but as I primarily work with the Google Cloud Platform, my bias is towards the Google offerings for data science.