This sample notebook demonstrates working with Google BigQuery datasets.
OpenAQ is an open-source project to surface live, real-time air quality data from around the world. Their “mission is to enable previously impossible science, impact policy and empower the public to fight air pollution.” The data includes air quality measurements from 5490 locations in 47 countries.
Scientists, researchers, developers, and citizens can use this data to understand the current air quality near them. The dataset includes only the most recent measurement available for each location (no historical data).
Dataset Source: openaq.org
Category: Science
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — https://openaq.org/#/about?_k=s3aspo — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Update Frequency: Hourly
This kernel shows how easy it can be to run a SQL query against a BigQuery table and get a pandas dataframe as the result. If you're interested in digging deeper, check out these references:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
It's helpful to inspect the schema and a sample of the data we're working with.
%%bigquery --project <your project id> df
SELECT * FROM
`bigquery-public-data.openaq.global_air_quality`
WHERE timestamp >= '2019-01-01'
LIMIT 10
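The same query can also be run without the cell magic. The following is a minimal sketch, assuming the google-cloud-bigquery client library is installed and the session is already authenticated. The helper names sample_query and run_query are illustrative and not part of this notebook, and the project ID stays a placeholder you need to fill in.

```python
def sample_query(since="2019-01-01", limit=10):
    """Build the SQL text used in the cell above (hypothetical helper)."""
    return (
        "SELECT * FROM `bigquery-public-data.openaq.global_air_quality` "
        f"WHERE timestamp >= '{since}' "
        f"LIMIT {int(limit)}"
    )


def run_query(sql, project_id):
    """Run sql in BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily so the query text can be built and inspected
    # without the client library or credentials.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    return client.query(sql).to_dataframe()


# Usage (needs authentication and a project to bill the query to):
# df = run_query(sample_query(), "<your project id>")
```

Keeping the query text in a plain function makes it easy to print and check before any bytes are billed.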
The OpenAQ dataset is updated hourly to show a nearly live look at government-reported air quality around the world. With this dataset, you can answer questions like:
Where are the European hotspots for poor air quality right now (using concentrations of PM10: particulate matter with a diameter of 10 micrometers or less)?
from IPython.display import IFrame
IFrame('your url', width=700, height=500)
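The query behind a European view is not shown in this notebook. One way it might look is sketched below, assuming the country column holds two-letter ISO codes. The list of countries is an illustrative subset rather than a complete definition of Europe, and european_pm10 is a hypothetical helper name. The country list is passed as a query parameter instead of being pasted into the SQL text.

```python
# Illustrative subset of European ISO country codes (an assumption,
# not a complete list).
EUROPEAN_COUNTRIES = ["AT", "BE", "CZ", "DE", "ES", "FR", "GB", "IT", "NL", "PL"]

EUROPE_PM10_SQL = """
SELECT location, city, country, value, latitude, longitude
FROM `bigquery-public-data.openaq.global_air_quality`
WHERE pollutant = 'pm10'
  AND country IN UNNEST(@countries)
ORDER BY value DESC
"""


def european_pm10(project_id, countries=EUROPEAN_COUNTRIES):
    """Return PM10 readings for the given countries as a pandas DataFrame."""
    # Imported lazily; requires google-cloud-bigquery and valid credentials.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("countries", "STRING", list(countries))
        ]
    )
    return client.query(EUROPE_PM10_SQL, job_config=job_config).to_dataframe()


# Usage: df_europe = european_pm10("<your project id>")
```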
Where are the global hotspots for poor air quality right now (using concentrations of PM10: particulate matter with a diameter of 10 micrometers or less)?
%%bigquery --project <your project id>
#standardSQL
SELECT
location, city, country, value,
CONCAT(CAST(latitude AS STRING), ', ', CAST(longitude AS STRING)) AS latlong
FROM
`bigquery-public-data.openaq.global_air_quality`
WHERE
pollutant = "pm10" AND timestamp >= '2019-01-01'
ORDER BY
value DESC
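The latlong column produced by this query is a single string of the form "latitude, longitude". If the coordinates are needed as numbers again, for example to place markers on a map, they can be split back out. A small sketch, using made-up rows rather than real measurements; parse_latlong is a hypothetical helper:

```python
def parse_latlong(latlong):
    """Split a 'latitude, longitude' string into a (float, float) tuple."""
    lat_text, lon_text = latlong.split(",")
    # float() ignores the space that follows the comma.
    return float(lat_text), float(lon_text)


# Made-up rows in the shape returned by the query (not real measurements).
rows = [
    {"city": "City A", "value": 120.0, "latlong": "28.6, 77.2"},
    {"city": "City B", "value": 95.5, "latlong": "-33.4, -70.6"},
]

# One (city, latitude, longitude) tuple per row, ready for plotting.
points = [(row["city"], *parse_latlong(row["latlong"])) for row in rows]
```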
Generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
(Definition from the pandas documentation.)
df.describe()
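To make the output of describe easier to read, the sketch below reproduces roughly the same figures for one numeric column using only the Python standard library. It assumes the default behaviour for numeric data (sample standard deviation, linearly interpolated quartiles); the readings are made up and describe_column is a hypothetical helper, not a pandas function.

```python
import math
import statistics


def describe_column(values):
    """Summarize a numeric column roughly the way describe() reports it."""
    # Drop NaN values first, then sort so min and max are the end points.
    clean = sorted(v for v in values if not math.isnan(v))
    q1, median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "std": statistics.stdev(clean),  # sample standard deviation
        "min": clean[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": clean[-1],
    }


# Made-up PM10 readings, including one missing value.
summary = describe_column([10.0, 20.0, float("nan"), 30.0, 40.0, 50.0])
```

The count is 5 rather than 6 because the NaN entry is excluded before anything else is computed.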
The head function returns the first n rows of the object based on position. It is useful for quickly checking whether your object holds the right type of data.
df.head(10)
from IPython.display import IFrame
IFrame('your url', width=700, height=500)