Air Quality Data

This sample notebook demonstrates working with Google BigQuery datasets.

Dataset Details: bigquery-public-data:openaq

Description

OpenAQ is an open-source project to surface live, real-time air quality data from around the world. Their “mission is to enable previously impossible science, impact policy and empower the public to fight air pollution.” The data includes air quality measurements from 5490 locations in 47 countries.

Scientists, researchers, developers, and citizens can use this data to understand the quality of air near them currently. The dataset only includes the most current measurement available for the location (no historical data).

Dataset Source: openaq.org

Category: Science

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — https://openaq.org/#/about?_k=s3aspo — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Update Frequency: Hourly

This kernel shows how easy it can be to run a SQL query against a BigQuery table and get a pandas dataframe as the result. If you're interested in digging deeper, check out these references:

First we have to define the use of bigquery

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated

Understand the data structure

It's helpful to inspect schema and a sample of the data we're working with

Understanding the EPA Data

%%bigquery --project torbjornzetterlund-1023 df
SELECT * FROM
  `bigquery-public-data.openaq.global_air_quality`
  where timestamp >= '2019-01-01'
  LIMIT 10

OpenAQ dataset

The OpenAQ dataset is updated hourly to show a nearly live look at government-reported air quality around the world. With this dataset, you can answer questions like:

Where are the european hotspots for poor air quality right now (using concentrations of PM10: Particulate Matter with a size of 10 micrometers or less)?

from IPython.display import IFrame
IFrame('https://datastudio.google.com/embed/reporting/1LFVmdo4GWsyWI9k55FFWdkkRo3HQwJzM/page/J4uE', width=700, height=500)

Global Hotspots

Where are the global hotspots for poor air quality right now (using concentrations of PM10: Particulate Matter with a size of 10 micrometers or less)?

%%bigquery --project torbjornzetterlund-1023
#standardSQL
SELECT 
 location, city, country, value, 
 CONCAT(CAST(latitude AS STRING), ', ', CAST(longitude AS STRING)) AS latlong  
FROM 
  `bigquery-public-data.openaq.global_air_quality` 
WHERE 
  pollutant = "pm10" AND timestamp >= '2019-01-01'
ORDER BY 
  value DESC
location city country value latlong
0 Gartnerløkka Kristiansand NO 1516.6917 58.14703, 7.98662
1 ES1610A Illes Balears ES 971.0000 39.57008, 2.65583
2 Арвайхээр Арвайхээр хот MN 834.0000 46.258915, 102.78922
3 Vindhyachal STPS, Singrauli - MPPCB Singrauli IN 726.0000 24.10897, 82.64558
4 Cerrillos Cerrillos CL 664.0000 -33.492847, -70.71944
... ... ... ... ... ...
3556 Groningen-Europaweg Groningen NL -999.0000 53.2178, 6.5789
3557 Kollumerwaard-Hooge Zuidwal Kollumerwaard NL -999.0000 53.3304, 6.27681
3558 Environment Park Townsville AU -1111.0000 -19.2629, 146.831
3559 Cannon Hill South East Queensland AU -1111.0000 -27.4654, 153.087
3560 CAL-JOAR - Caldas - E U Joaquín Aristizabal Caldas CO -9999.0000 6.0930777, -75.637764

3561 rows × 5 columns

Panda DataFrame

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Panda Definition

df.describe()
value latitude longitude
count 10.000000 10.000000 10.000000
mean 100.920027 49.360041 20.569345
std 246.712867 0.079072 1.031945
min 0.416660 49.293564 19.960083
25% 8.940828 49.293564 19.960083
50% 21.772400 49.333886 19.994537
75% 40.811650 49.406284 20.724357
max 801.230000 49.483597 22.439453

pandas.DataFrame.head

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

df.head(10)
location city country pollutant value timestamp unit source_name latitude longitude averaged_over_in_hours
0 Zakopane, ul. Sienkiewicza Zakopane PL co 801.23000 2020-01-15 13:00:00+00:00 µg/m³ GIOS 49.293564 19.960083 None
1 Zakopane, ul. Sienkiewicza Zakopane PL no2 26.27990 2020-01-15 13:00:00+00:00 µg/m³ GIOS 49.293564 19.960083 None
2 Zakopane, ul. Sienkiewicza Zakopane PL o3 56.78670 2020-01-15 13:00:00+00:00 µg/m³ GIOS 49.293564 19.960083 None
3 Zakopane, ul. Sienkiewicza Zakopane PL pm10 41.50630 2020-01-15 12:00:00+00:00 µg/m³ GIOS 49.293564 19.960083 None
4 Zakopane, ul. Sienkiewicza Zakopane PL so2 10.82920 2020-01-15 13:00:00+00:00 µg/m³ GIOS 49.293564 19.960083 None
5 Polanczyk-mobilna Polańczyk PL pm10 8.31137 2020-01-15 13:00:00+00:00 µg/m³ GIOS 49.374207 22.439453 None
6 Polanczyk-mobilna Polańczyk PL pm25 7.84754 2020-01-15 13:00:00+00:00 µg/m³ GIOS 49.374207 22.439453 None
7 Krynica, ul. Bulwary Dietla Krynica-Zdrój PL pm10 17.26490 2020-01-15 13:00:00+00:00 µg/m³ GIOS 49.416977 20.956146 None
8 Nowy Targ, Plac Słowackiego Nowy Targ PL bc 0.41666 2019-12-31 22:00:00+00:00 µg/m³ GIOS 49.483597 20.028992 None
9 Nowy Targ, Plac Słowackiego Nowy Targ PL pm10 38.72770 2020-01-15 12:00:00+00:00 µg/m³ GIOS 49.483597 20.028992 None
from IPython.display import IFrame
IFrame('https://datastudio.google.com/embed/reporting/1BsclZNxZ8Uvc-7gtUxBlQHN9uXJNOhWE/page/J4uE', width=700, height=500)