I was recently working on a data project that required us to request data from an internal application using a RESTful API and transform the data to be stored in the Google data warehouse solution BigQuery. As I had some time to experiment, I decided to go down the path of using a Google Cloud Function with HTTP trigger and schedule the work with Google Cloud Scheduler as the requirement stipulated that the updates had to be daily.
While working on the Cloud Function I learned I could create the function in many different ways and also using other Google Service in conjunction with the Cloud Function, and I also learned that to take advantage of built-in features of the Google BigQuery coding library the JSON has to be a newline delimited JSON.
What is JSON
JSON is built on two structures: A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array. The array of data looks like this, a string that represents the name/value pairs.
Newline delimited JSON
Newline delimited JSON is a more structured string of data that has been delimited by a newline, it makes it more readable also when using tools an code libraries it is easier to extracts the data.
The issue that I run into was that the JSON data that I received from requesting the data from the Application with the API, was just a long string of data, it was not newline delimited.
I could have a loop through the response data and built up a table structure in my Cloud Function, which is doable if the are changes made to the API in the future, I would have to go into the Cloud Function code and make changes. From a developer perspective that would have been fun to code, I also had other tools in the Google Cloud Platform to try out without coding.
I went another route by creating a Cloud Function that read the Application API and stored the.JSON response as a file in a Google Cloud Storage Bucket (Google term for folder). And I then used Google Cloud Dataprep to transform the.JSON file before publishing the data through Cloud Dataprep into the Google Data Warehouse Solution BigQuery.
When importing data from Cloud Storage – Cloud Dataprep try to automatically detect the data structure. With the newline delimiter JSON data, Cloud Dataprep detects the structure as shown below.
With non newline delimiter JSON data, Cloud Dataprep when detecting the data shows up like this.
If you look at the previous image, at the bottom menu bar you have an option to Detect or not the structure, by unchecking the detect structure. My non newline delimiter JSON data showed up like this.
Which looks much better, I had a scenario when the data was not showing at all when I had the Detect structure checked, so I had to uncheck the option and my data showed up.
That is all, I wanted with this article to show Cloud Dataprep Detect Structure option. Thank you for reading, comments are always welcome below.