This was a very interesting project that I worked on, the use case for the project was to make the maritime noon report data available in Google BigQuery, sounds simple enough. However there were a lot of requirements, from collecting application data from the Greenpeace fleet in different formats leveraging Cloud Functions and API calls, landing the data in Cloud Storage, normalizing the data for reporting and translating each different ships’ schemas into one common schema in BigQuery thanks to Google Cloud Dataprep by Trifacta, and finally automating the end to end process with Cloud Scheduler. If you want to learn more read on.
What is a maritime noon report, it’s a report that gives a snapshot of what has happened onboard a ship since the previous noon i.e., in the last 24 hours. Noon report is used by almost every shipping company today to understand and monitor what is happening on their ships. These reports are traditionally sent by the captain every day, based on data gathered manually by the crew. The content and format of the noon report is usually created by the shipping companies and can vary much between different shipping companies.
In this how-to, I will explain how we at Greenpeace International standardized our noon report and moved it to the Google Cloud Platform. The solution had many components, I will not go through the details about the Emily application running the Noon report where data is entered daily and the Emily shore server. The focus on this how-to is about the services used in the Google Cloud Platform (GCP) and how the various GCP services were used.
The GCP Architect Services
Let’s have an overlook of the GCP Architecture for the noon report as you can see in the diagram below. I will go into the GCP specific throughout this how-to, so you can replicate parts.
Emily shore server
The purpose of the shore server is to collect daily the noon report data from the Greenpeace ships. The mechanism used to request the data from the ships is with a restFul API with a unique API key for authentication.
Using crontab on the Emily shore server the API call is made every two hours, to keep the bandwidth at the minimum the API request is made with a sync date. The sync date is tracked on the shore server, if there was a timeout or error in the API call, the sync date does not update. And next time the crontab runs it takes the same date.
Syncing every two hours is to compensate for the timezone the ships may be located in when a staff member on board a ship enters in the noon report data. There is a manual sync dashboard in the Emily application were data can be re-synced from the ships. The sync dashboard looks like this.
To create a data pipeline from the shore server to Cloud Storage, a Cloud Function was created that has two purposes to request data from the Shore server though a restFUl API request and to store the result of the API request as a .json unstructured file in Cloud Storage.
Security is a big issue when your create a Cloud Functions, you can read more about Securing HTTP trigger in Google Cloud Functions and Using Secret Manager in a Google Cloud Function
I’m not going into details of the Cloud Function as it is very much self-explanatory, you can find the cloud function code on GitHub.
Cloud Function that uses an http trigger, request that on an restFul API from a remote server and stored the result in Cloud Storage. The CLoud Function uses Google Secret Manager to hide any secrets.
The Cloud Function GitHub REPO comes with an .yaml file so you can setup an automatic deployment build process - you can read more how to do in this article Using Cloud Build as you CI/CO for multiple Cloud Functions
The cloud functions save the file request by the API call in the Google Cloud Storage in a predefined bucket. The file is replaced daily.
Using Cloud Scheduler a frequency is set to trigger the cloud function to run which then replace the current file in Cloud Storage. You can manually run the Cloud Scheduler from the Cloud Scheduler dashboard.
Google Dataprep by Trifacta
Google Dataprep by Trifacta was the key services to make the transition from old tables structure to a single table structure, and to migrate historical data from 3 different table structures to a single table structure and to take unstructured data and transform it into structured data and to store it in a dataset in BigQuery.
SQL vs Normalized
Just a word on the noon report data that Dataprep is going to turn into normalized data that are easier to use when creating reports. The noon report on the ships and shore server is running in a MySQL environment, that means several tables are used that index each other, the API call to the shore server only returns index data.
A table record could look like this, with indexes pointing to a different table that has the value.
Normalised data shows a record that can be understood, like this.
|Rainbow Warrior||At Sea||North sea||Amsterdam|
Dataprep by Trifacta
To achieve making the data structured an Extract Transform Load (ETL) tool or similar was required, there are a few choices in the Google Cloud Platform to achieve these transformations, I choose Cloud Dataprep by Trifacta to do the ETL transformation job, for the reason that Cloud Dataprep is a data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
The great thing with Cloud Dataprep is the fact it’s serverless and works at any scale, there is no infrastructure to deploy or manage. To use Dataprep you do not need to write code. Dataprep is fairly simple to use, you can break it down into a few areas:
A flow is a collection of datasets, recipes, and other objects that you use to generate your results. The flow is owned by you, you can share them, you can also export and import flows. You can also schedule a flow to run at a specific time in a specific time zone.
The data library is a collection of datasets that you have imported, they could be a BigQuery dataset, a CSV or JSON file that is stored in Google Cloud Storage or you imported directly from your computer. In the Cloud Dataprep premium version, you have the ability to connect to more options e.g. Salesforce, MySQL server, Google Sheet to mention a few.
A recipe is a number of steps in which you transform your data, the recipe UI is very intuitive you get a good view of your data, highlighting data you get a suggestion on what step you want to take for that specific data column, you can create many steps to build up a recipe.
The run option is the step when you process your recipe and publish your now structured dataset to BigQuery or Cloud Storage. Dataprep uses Google Cloud Dataflow running your recipe, instead of building your Dataflow by creating the flow you can use Dataprep UI to achieve some great transformation when you run it.
In the next part of this how-to article, I will explain in a bit more detail on how to transform the Noonreport data, there were three steps that I had to go through. The first two steps are only repeated once, the last step is a continuious process that is repeated daily.
- Merging table structure
- Daily get new noon report data
Merging table structure
Noon report format has no standard as to what data is collected, each shipping company has there own. It could also be that each ship has its own table structure depending on the data collected from a ship, and no ship is alike so in our case, we ended up with a noon report from our ships that had a difference in its table structure.
The first task was to create a common table structure, we created a new table structure and deployed them to the MySQL databases on our ships and our shore server. When the new structure was defined, a data flow built with many recipes in Cloud Dataprep to merge not only the data from the table structure to the new table structure.
To merge the table structures in Dataprep, I imported the data from the old tables and the table structure each for each table structure I applied a recipe to do some cleanup of the data, then I created a recipe for merging all the tables together to one structure by using the union step in the recipe. The union step looks like this:
The output of this step was one table structure. To complete this step a Job was run to publish the data in a BigQuery table that was named merged_noonreport. Merging and standardizing the table structure for the noon report was only done once.
Daily get new noon report data
The Cloud Scheduler runs every 2 hours and triggers a Cloud Function that reads data from the shore server and stores the .json response file in a Cloud Storage bucket, the .json file is an unstructured dataset, and in Cloud Dataprep there is a recipe that turns the unstructured data into structured data.
The Dataprep schedule is set to run once a day, and it will run if it detects a new .json file in the Cloud Storage Bucket, the Dataprep will automatically run the job and publish the file to BigQuery using the dataset name raw_noonreport
Creating the structured normalized master log data, I created a Dataprep flow that uses two BigQuery tables as input datasets with the name merged_noonreport and raw_noonreport, that I had created in the previous two transformations flow in Dataprep.
The noon report transformation is the step to create the structure normalized ships_masterlog dataset, one of the main recipe steps used in this transformation was the lookup recipe step which I used to transform the unstructured data into structured data.
In Dataprep I created for each dataset a recipe for each dataset that I was going to use in the main recipe for using the lookup step, I did this to limit the number of columns that came from each dataset that was being used for the lookup step. I kept the dataset for the lookup step to two columns if I had used each datasets full column, the final recipe I would have to in the recipe delete all the lookup columns that were not required.
The lookup datasets were each original MySQL tables that were exported from the Emily MySql shore server database as CSV files and stored in Cloud Storage before being imported to Cloud Dataprep.
P.S. In the Dataprep Premium version, you could import your MySQL tables to Google Sheet, and use the Google Sheet as the lookup table in your Dataprep flow.
BigQuery is the final destination of the transformed data, it is the data warehouse for storing datasets that are used for the transformations with Dataprep. BigQuery also acts as a datamart for structured normalized data and with noon report, the structured normalized dataset is named ships_masterlog.
With the structured normalized dataset ships_masterlog, reporting with Data Studio or import from the dataset to Google Sheets is obtainable, the structured normalized data can be used also in BigQuery to create a machine model of some data elements in the dataset to predict for instance fuel consumption under certain weather condition. That will be something to add to the list for the future.
Below shows just the datasets that were created in BigQuery. The raw_noonreport and merged_noonreport are unstructured data that are used by Dataprep for the structuring normalization of the data.
The datasets are not that large, even running it daily many of the services that are used are covered by the free tier, there is a cost for Dataprep that uses Dataflow to process data into BigQuery and there is a small cost for Cloud Storage storing the files and BigQuery, the total cost is less than 50 Euro a month.
That was the step that was taken to create a maritime noon report table in BigQuery using the Dataprep tool. I hope that you learned from this and can apply some of these solutions for your Data Project.
If you want to be able to control your Google Cloud Platform cost, this is a great article to read –
Disable Billing for Google Cloud Project when billing exceeds the budget limit
Want to get in contact, the comment box below is the best way to interact with me.