Beach Wijk aan Zee

Data Project: Splitting addresses with Google Cloud Dataprep

This article is part of the article I wrote about how I became an environmental data collector, and the issue I run into during the project and how I finally resolved them. After I collected the data with the wearable, portable indoor-outdoor air quality monitor Atmotube PRO.

The Dataset

I uploaded the Atmotube .csv file to Google Drive and open the file with Google Sheet, while inspecting the data I needed to create a new column that I named coordinates and then I concatenated the latitude and longitude columns and stored the value on the coordinates.

The reason for this enhancement was to be able to use a Map in Data Studio, of course, I could do the concatenation in Data Studio. I do like my datasets to be “clean” before I use them with Data Studio.

Google Sheet

As the Atmotube only gives me latitude and longitude, I need to get the full address from the latitude and longitude, I wanted to get the address from the latitude and longitude.

To get the address I did it in Google Sheet by creating a script that used Geocoding to get me the address from latitude and longitude.

Geocode Address

Here is my code for the App script to get the address from latitude and longitude.

// Geocode Addresses
// Bias the geocoding results in favor of these geographic regions.
// The regions are specified as ccTLD codes.
// 
// See: https://en.wikipedia.org/wiki/Country_code_top-level_domain
//
// Used:
// https://mbrownnyc.wordpress.com/misc/iso-3166-cctld-csv/
// http://www.convertcsv.com/csv-to-json.htm
// to generate the functions for menu item handling.
// Forward Geocoding -- convert address to GPS position.

function addressToPosition() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var cells = sheet.getActiveRange();
  
  var popup = SpreadsheetApp.getUi();
  
  // Must have selected at least 3 columns (Address, Lat, Lng).
  // Must have selected at least 1 row.
  
  var columnCount = cells.getNumColumns();
  var rowCount = cells.getNumRows();

  if (columnCount < 3) {
    popup.alert("Select at least 3 columns: Address in the leftmost column(s); the geocoded Latitude, Longitude will go into the last 2 columns.");
    return;
  }
  
  var addressRow;  
  var addressColumn;
  
  var latColumn = columnCount - 1; // Latitude  goes into the next-to-last column.
  var lngColumn = columnCount;     // Longitude goes into the last column.
  
  var geocoder = Maps.newGeocoder().setRegion(getGeocodingRegion());
  var location;

  var addresses = sheet.getRange(cells.getRow(), cells.getColumn(), rowCount, columnCount - 2).getValues();
  
  // For each row of selected data...
  for (addressRow = 1; addressRow <= rowCount; ++addressRow) {
    var address = addresses[addressRow - 1].join(' ');

    // Replace problem characters.
    address = address.replace(/'/g, "%27");

    Logger.log(address);
    
    // Geocode the address and plug the lat, lng pair into the 
    // last 2 elements of the current range row.
    location = geocoder.geocode(address);
   
    // Only change cells if geocoder seems to have gotten a 
    // valid response.
    if (location.status == 'OK') {
      lat = location["results"][0]["geometry"]["location"]["lat"];
      lng = location["results"][0]["geometry"]["location"]["lng"];
      
      cells.getCell(addressRow, latColumn).setValue(lat);
      cells.getCell(addressRow, lngColumn).setValue(lng);
    } else {
      Logger.log(location.status);
    }
  }
};

// Reverse Geocode -- GPS position to nearest address.
function positionToAddress() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var cells = sheet.getActiveRange();

  var popup = SpreadsheetApp.getUi();
  
  // Must have selected at least 3 columns (Address, Lat, Lng).
  // Must have selected at least 1 row.

  var columnCount = cells.getNumColumns();

  if (columnCount < 3) {
    popup.alert("Select at least 3 columns: Latitude, Longitude in the first 2 columns; the reverse-geocoded Address will go into the last column.");
    return;
  }

  var latColumn     = 1;
  var lngColumn     = 2;

  var addressRow;
  var addressColumn = columnCount;

  var geocoder = Maps.newGeocoder().setRegion(getGeocodingRegion());
  var location;
  
  for (addressRow = 1; addressRow <= cells.getNumRows(); ++addressRow) {
    var lat = cells.getCell(addressRow, latColumn).getValue();
    var lng = cells.getCell(addressRow, lngColumn).getValue();
    
    // Geocode the lat, lng pair to an address.
    location = geocoder.reverseGeocode(lat, lng);
   
    // Only change cells if geocoder seems to have gotten a 
    // valid response.
    Logger.log(location.status);
    if (location.status == 'OK') {
      var address = location["results"][0]["formatted_address"];

      cells.getCell(addressRow, addressColumn).setValue(address);
    }
  }
};

function generateMenu() {
  {
    name: "Geocode Selected Cells (Address to Latitude, Longitude)",
    functionName: "addressToPosition"
  },
  {
    name: "Geocode Selected Cells (Latitude, Longitude to Address)",
    functionName: "positionToAddress"
  }];
  
  return entries;
}

function updateMenu() {
  SpreadsheetApp.getActiveSpreadsheet().updateMenu('Geocode', generateMenu())
}

/**
 * Adds a custom menu to the active spreadsheet, containing a single menu item
 * for invoking the readRows() function specified above.
 * The onOpen() function, when defined, is automatically invoked whenever the
 * spreadsheet is opened.
 *
 * For more information on using the Spreadsheet API, see
 * https://developers.google.com/apps-script/service_spreadsheet
 */
function onOpen() {
  SpreadsheetApp.getActiveSpreadsheet().addMenu('Geocode', generateMenu());
  SpreadsheetApp.getUi()
     .createMenu();
};

Just one problem

The problem I faced was that the address was stored in one column, like this in the Google Sheet.

Hebridenlaan xx, xxxx LV Amsterdam, Netherlands

The user requirements were to split the cells into 4 columns address, city, postal code and country, I could do that in Google Sheet using the Split text to columns. There was just a problem, I could split on comma ‘,’ using the Split text to columns.

google sheet atmotube
google sheet atmotube

The problem was the following

Amsterdam, Hoekenes, xxx GC Amsterdam, Netherlands
Van Suchtelen van de Haarestraat xxx, xxxx GR Amsterdam, Netherlands
Am Hansakai xx, xxxxx Rostock, Germany

I had addresses that did not have a street number, instead, show a location in a city. I needed to split into columns counting the comma ‘,’ from the end to the beginning. As postal code and city would be in one column after a split column on comma I needed to do another split to separate the postal code and city.

Google Cloud Dataprep

I turned to Google Cloud Dataprep as Google Cloud Platform is my cloud platform of choice. The Google Cloud Dataprep is a data preparation solution that cleans and prepares data for BigQuery.

Creating a flow

The first thing I had to do was to create a flow and add a new dataset to the flow. I connected the Google Sheet as my dataset, it’s a very handy option in Cloud Dataprep to use your datasets.

dataprep-atmotube-flow
dataprep-atmotube-flow

Adding new steps to Recipe

To clean my data and prepare it in detail, I created a recipe in Dataprep to step by step prepare my data for BigQuery.

cloud dataprep split on delimiter
cloud dataprep split on delimiter
cloud dataprep recipe
cloud dataprep recipe

To extract x times in a pattern between delimiters comma in my case (, ) from the end. I used the regular expression :

extractpatterns type: custom col: ADDRESS on: /(,[^,]+){2}$/
extractpatterns type: custom col: ADDRESS on: `(,({alphanum-underscore}| )+){2}{end}`

Then I just have to split the result, and remove this pattern in the initial column, to get what I wanted. When we look at it from the Dataprep UI:

google cloud dataprep
google cloud dataprep split on delimiter
google cloud dataprep split on delimiter
google cloud dataprep split on delimiter

After the recipe steps were completed, I started the processing of the Cloud Dataprep work by clicking the Run Job button, I selected to output the data to a BigQuery table that would be created from the columns that I know had in the DataPrep.

Google BigQuery

After the Dataprep run processing was completed, I had a new table in Google BigQuery which I now could use to create a report or do data analysis.

Google BigQuery
Google BigQuery

Google Data Studio

With the transformed dataset stored in Google BigQuery, I now could create a Data Studio report.

data studio dashboard air quality
data studio dashboard air quality

I also set up a schedule in Data Studio to send me the report on a schedule. At the moment the data is not flowing in real-time.

Data Studio Schedule Data Report
Data Studio Schedule Data Report

Conclusion

With the data safely in BigQuery, I could now create a Python Notebook which I use often to do my analysis as well as to show others how I have done a specific task. You can read more about how I collected and analyzed the data with a Python Notebook I became an environmental data collector,

What needs to improve, currently the data has a manual element in which I have to download the data from Atmotube and I have to update the Google Sheet manually. The plan is to get the Atmotube mobile app on my mobile phone to stream the data to Google BigQuery, only then I will have real-time reporting working. That is for another day.


Posted

in

,

by

Comments

One response to “Data Project: Splitting addresses with Google Cloud Dataprep”

  1. […] beginning of the Corona epidemic, I wrote an article about a personal data project I had worked on: The air I breathe how polluted is it?. I became an environmental data collector by purchasing a wearable, portable indoor-outdoor air […]

Leave a Reply

Your email address will not be published. Required fields are marked *