Air Quality Data

This sample notebook demonstrates working with Google BigQuery datasets.

Dataset Details: bigquery-public-data:openaq

Description

OpenAQ is an open-source project to surface live, real-time air quality data from around the world. Their “mission is to enable previously impossible science, impact policy and empower the public to fight air pollution.” The data includes air quality measurements from 5490 locations in 47 countries.

Scientists, researchers, developers, and citizens can use this data to understand the quality of air near them currently. The dataset only includes the most current measurement available for the location (no historical data).

Dataset Source: openaq.org

Category: Science

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — https://openaq.org/#/about?_k=s3aspo — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Update Frequency: Hourly

This kernel shows how easy it can be to run a SQL query against a BigQuery table and get a pandas dataframe as the result. If you're interested in digging deeper, check out these references:

First we have to define the use of bigquery

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated

Understand the data structure

It's helpful to inspect schema and a sample of the data we're working with

Understanding the EPA Data

%%bigquery --project torbjornzetterlund-1023 df
SELECT * FROM
  `bigquery-public-data.openaq.global_air_quality`
  where timestamp >= '2019-01-01'
  LIMIT 10

OpenAQ dataset

The OpenAQ dataset is updated hourly to show a nearly live look at government-reported air quality around the world. With this dataset, you can answer questions like:

Where are the european hotspots for poor air quality right now?
How does one city compare to others?

Where are the european hotspots for poor air quality right now (using concentrations of PM10: Particulate Matter with a size of 10 micrometers or less)?

from IPython.display import IFrame
IFrame('https://datastudio.google.com/embed/reporting/1LFVmdo4GWsyWI9k55FFWdkkRo3HQwJzM/page/J4uE', width=700, height=500)

Global Hotspots

Where are the global hotspots for poor air quality right now?
How does one city compare to others?

Where are the global hotspots for poor air quality right now (using concentrations of PM10: Particulate Matter with a size of 10 micrometers or less)?

%%bigquery --project torbjornzetterlund-1023
#standardSQL
SELECT 
 location, city, country, value, 
 CONCAT(CAST(latitude AS STRING), ', ', CAST(longitude AS STRING)) AS latlong  
FROM 
  `bigquery-public-data.openaq.global_air_quality` 
WHERE 
  pollutant = "pm10" AND timestamp >= '2019-01-01'
ORDER BY 
  value DESC

	location	city	country	value	latlong
0	Gartnerløkka	Kristiansand	NO	1516.6917	58.14703, 7.98662
1	ES1610A	Illes Balears	ES	971.0000	39.57008, 2.65583
2	Арвайхээр	Арвайхээр хот	MN	834.0000	46.258915, 102.78922
3	Vindhyachal STPS, Singrauli - MPPCB	Singrauli	IN	726.0000	24.10897, 82.64558
4	Cerrillos	Cerrillos	CL	664.0000	-33.492847, -70.71944
...	...	...	...	...	...
3556	Groningen-Europaweg	Groningen	NL	-999.0000	53.2178, 6.5789
3557	Kollumerwaard-Hooge Zuidwal	Kollumerwaard	NL	-999.0000	53.3304, 6.27681
3558	Environment Park	Townsville	AU	-1111.0000	-19.2629, 146.831
3559	Cannon Hill	South East Queensland	AU	-1111.0000	-27.4654, 153.087
3560	CAL-JOAR - Caldas - E U Joaquín Aristizabal	Caldas	CO	-9999.0000	6.0930777, -75.637764

3561 rows × 5 columns

Panda DataFrame

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Panda Definition

df.describe()

	value	latitude	longitude
count	10.000000	10.000000	10.000000
mean	100.920027	49.360041	20.569345
std	246.712867	0.079072	1.031945
min	0.416660	49.293564	19.960083
25%	8.940828	49.293564	19.960083
50%	21.772400	49.333886	19.994537
75%	40.811650	49.406284	20.724357
max	801.230000	49.483597	22.439453

pandas.DataFrame.head

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

df.head(10)

	location	city	country	pollutant	value	timestamp	unit	source_name	latitude	longitude	averaged_over_in_hours
0	Zakopane, ul. Sienkiewicza	Zakopane	PL	co	801.23000	2020-01-15 13:00:00+00:00	µg/m³	GIOS	49.293564	19.960083	None
1	Zakopane, ul. Sienkiewicza	Zakopane	PL	no2	26.27990	2020-01-15 13:00:00+00:00	µg/m³	GIOS	49.293564	19.960083	None
2	Zakopane, ul. Sienkiewicza	Zakopane	PL	o3	56.78670	2020-01-15 13:00:00+00:00	µg/m³	GIOS	49.293564	19.960083	None
3	Zakopane, ul. Sienkiewicza	Zakopane	PL	pm10	41.50630	2020-01-15 12:00:00+00:00	µg/m³	GIOS	49.293564	19.960083	None
4	Zakopane, ul. Sienkiewicza	Zakopane	PL	so2	10.82920	2020-01-15 13:00:00+00:00	µg/m³	GIOS	49.293564	19.960083	None
5	Polanczyk-mobilna	Polańczyk	PL	pm10	8.31137	2020-01-15 13:00:00+00:00	µg/m³	GIOS	49.374207	22.439453	None
6	Polanczyk-mobilna	Polańczyk	PL	pm25	7.84754	2020-01-15 13:00:00+00:00	µg/m³	GIOS	49.374207	22.439453	None
7	Krynica, ul. Bulwary Dietla	Krynica-Zdrój	PL	pm10	17.26490	2020-01-15 13:00:00+00:00	µg/m³	GIOS	49.416977	20.956146	None
8	Nowy Targ, Plac Słowackiego	Nowy Targ	PL	bc	0.41666	2019-12-31 22:00:00+00:00	µg/m³	GIOS	49.483597	20.028992	None
9	Nowy Targ, Plac Słowackiego	Nowy Targ	PL	pm10	38.72770	2020-01-15 12:00:00+00:00	µg/m³	GIOS	49.483597	20.028992	None

from IPython.display import IFrame
IFrame('https://datastudio.google.com/embed/reporting/1BsclZNxZ8Uvc-7gtUxBlQHN9uXJNOhWE/page/J4uE', width=700, height=500)