
Analyze and visualize BigQuery dataset with Google Cloud Datalab


In this article, I will give you the basic skills you need to use Google Cloud Datalab to analyse a BigQuery dataset. I’m going to cover how you can bring BigQuery datasets into Cloud Datalab and how to visualize that data within the same notebook.

Have you used IPython or Jupyter notebooks before?

Then you are familiar with what a notebook is. For the newcomer to data analysis with notebooks: data scientists work with these collaborative, self-describing, executable notebooks whenever they want to do some kind of data analysis or machine learning task.

Google Cloud Datalab integrates the IPython/Jupyter notebook system nicely with Google’s BigQuery. Google has also integrated standard IPython libraries, such as graphics libraries and scikit-learn, along with Google’s own machine learning toolkit, TensorFlow.

To use it you will need a Google Cloud account. You can get started with the free account, which is sufficient if you are just interested in trying it out.

You may ask, why do I need a Google account when I can use Jupyter, IPython and TensorFlow on my own resources?

The reason is that you can easily access BigQuery-sized data collections directly from the notebook running on your laptop. To get started, you can read the Google Cloud Datalab documentation. It explains how to install Datalab and gives you two choices: you may either install the Datalab package locally on your machine, or you may install it on a VM in the Google cloud.

If you continue reading, I will explain how to install Google Cloud Datalab in the Google cloud and give you an example of how to create your first notebook.

We all love open source, do we not? Google Cloud Datalab is based on Jupyter, and it’s open source.

Why Google Cloud Datalab

You could use other tools to do your data analysis, but with Google Cloud Datalab you can run an experiment, run a query, look at the output, update your documentation, add links, and then run more experiments, share those results and collaborate with others as well.

Google Cloud Datalab is more interactive and dynamic than just writing a SQL query. Collaborating with other data scientists through notebooks is simple and straightforward.

Let’s get started

Let’s jump straight in and start Cloud Datalab. The prerequisite is that you have a Google Cloud Platform (GCP) account and that you have created a project.

Go to your GCP home, and click on the Cloud Shell.

You will find it in the top right-hand corner of your Google Cloud Platform account. A terminal window opens at the bottom of your screen. When you get the prompt, all you have to do is type in:

datalab create cloud-datalab-vm

Give your instance a name of your choice.

cloud-datalab-vm

Specify the machine type that you want to use. The Compute Engine documentation has details about machine types.

--machine-type n1-highmem-8

And specify the zone that your instance should run in as well. The Compute Engine documentation has details on Regions and Zones.

--zone us-central1-a

You can of course pass all of this information in a single command:

datalab create cloud-datalab-vm --machine-type n1-highmem-8 --zone us-central1-a
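If you want to script this step, here is a minimal sketch that assembles the same command from its parts. The helper name `build_datalab_create` is my own for illustration and is not part of the Datalab tooling. The sketch only builds and prints the command, it does not run it.

```python
import shlex


def build_datalab_create(instance, machine_type=None, zone=None):
    """Return the `datalab create` command as a list of arguments.

    Hypothetical helper for illustration; only the instance name and the
    --machine-type and --zone flags shown in this article are used.
    """
    command = ["datalab", "create", instance]
    # Add each optional flag only when a value was supplied.
    if machine_type:
        command += ["--machine-type", machine_type]
    if zone:
        command += ["--zone", zone]
    return command


command = build_datalab_create(
    "cloud-datalab-vm", machine_type="n1-highmem-8", zone="us-central1-a"
)
# Join the arguments into one shell-safe line that can be pasted into Cloud Shell.
print(shlex.join(command))
```

Keeping the command as a list of arguments makes it easy to swap in another machine type or zone without retyping the whole line.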

Next, your machine will be started. Just follow the instructions that you are prompted with.

This will create an SSH tunnel and may prompt you to create an rsa key pair. To manage these keys, see https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys
Waiting for Datalab to be reachable at http://localhost:8081/
This tool needs to create the directory [/home/tzetter/.ssh] before
being able to generate SSH keys.

Next, click the preview button.

One common error that I have seen is that the Cloud Source Repositories API is not enabled. If you get that error, go to APIs & Services in GCP and enable the API.

Enable Cloud Source API


Another type of error is that you cannot connect to the port when you click preview. In that case you need to change the port; you will find the option to change the port under the preview button. The console message gives you a hint about which port Google Cloud Datalab starts on.


If clicking preview is successful, you will see Google Cloud Datalab started. Voila, now you can start creating your analysis notebooks.

Getting started with your first Cloud Datalab notebook

Now that you have Google Cloud Datalab running, the first step is to take a look at the samples and make yourself familiar with the sample notebooks.

A simple Example

If you continue reading, I will show you how to create your first notebook with a simple example that I created for you, using a publicly available dataset from BigQuery called Bay Area Bike Share Trips Data.

I started with a blank notebook by clicking on +Notebook on the menu bar, then I gave it a name. A good practice is to start your blank notebook by adding a markdown box that defines what your notebook is about and what the outcome of your analysis is.

Next I added a code box in which I declared that I wanted to use BigQuery. You can skip this box if you run in the cloud. If you run your notebook locally on your laptop, you need to declare the use of BigQuery.

In my next code box, I added a statement that describes the database table I’m going to use. This gives you the details of your table, which helps when you define your query.

It’s good practice to add a markdown box between the code boxes so you can explain what you are doing. This will help the people you share your notebook with to understand what is going on.

In my example, the next code box defines a SQL statement. You can write a new SQL statement, or if you already have one that you created and saved in BigQuery, you can copy and paste that SQL statement from BigQuery into your notebook.

What you have to add is a line before your SQL query that sets your namespace. In this example I call it “cyclesharing”; that is the namespace in which the result of the SQL query is stored.

To see the result in your notebook, you can use the command:

%%bq execute -q cyclesharing

This is handy because it lets you see what the result of your query is. It is not a required step if you are going directly from a query command to a visualization.
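To make the define-then-execute idea concrete outside of Datalab, here is a rough sketch that uses Python’s built-in sqlite3 module as a local stand-in for BigQuery. It does not touch BigQuery or the Datalab magics at all. The table name, the column names and the rows are all made up for illustration and are not taken from the real bike share dataset.

```python
import sqlite3

# Hypothetical, hand-made rows standing in for the bike share trips table.
# Table and column names are assumptions for illustration only.
SAMPLE_TRIPS = [
    (1, "Station A", 420),
    (2, "Station A", 610),
    (3, "Station B", 300),
    (4, "Station A", 180),
    (5, "Station B", 900),
]

# Step 1: define the query and store it under a name, without running it.
queries = {
    "cyclesharing": """
        SELECT start_station_name, COUNT(*) AS trips
        FROM bikeshare_trips
        GROUP BY start_station_name
        ORDER BY trips DESC
    """
}


def execute(name, connection):
    """Step 2: look up a stored query by its name and return the result rows."""
    return connection.execute(queries[name]).fetchall()


connection = sqlite3.connect(":memory:")  # local stand-in, not BigQuery
connection.execute(
    "CREATE TABLE bikeshare_trips "
    "(trip_id INTEGER, start_station_name TEXT, duration_sec INTEGER)"
)
connection.executemany("INSERT INTO bikeshare_trips VALUES (?, ?, ?)", SAMPLE_TRIPS)

rows = execute("cyclesharing", connection)
for station, trips in rows:
    print(station, trips)
```

The point of the sketch is the separation: the query lives under the name “cyclesharing”, and anything that needs the data later refers to that name instead of repeating the SQL.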

The namespace “cyclesharing” exists in the background and you can use it to create visualizations. Here is a simple column chart that visualizes the “cyclesharing” data.

You will notice that the output here isn’t just command-line output. You can also output charts and graphs, and there are several options available to you.
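If you are curious what a chart does with the query result, the idea can be sketched in a few lines of plain Python. This is a text-only bar chart, not Datalab’s charting, and the rows are made-up numbers for illustration only.

```python
def text_bar_chart(rows, width=20):
    """Render (label, value) pairs as text bars scaled to the largest value.

    Assumes at least one row and positive values.
    """
    largest = max(value for _, value in rows)
    label_width = max(len(label) for label, _ in rows)
    lines = []
    for label, value in rows:
        # The largest value fills the full width; others scale proportionally.
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<{label_width}} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical (station, trips) rows, shaped like the result of a grouped query.
rows = [("Station A", 40), ("Station B", 20), ("Station C", 10)]
chart = text_bar_chart(rows)
print(chart)
```

Each row of the result becomes one bar, which is essentially what the column chart in the notebook does with the “cyclesharing” data.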

That is the end of my walkthrough on creating your first notebook. You can copy this basic notebook from GitHub.

Save Notebook source code

You don’t want to lose your work. You can download your notebook to your local machine, or you can commit your changes to a version control system using a Git code repository. Google Cloud Datalab stores the code in the Google Cloud Platform code repository.

Shutdown your Google Cloud Datalab notebook

You can shut down a notebook that you created by selecting Notebooks from the menu bar, then clicking the shutdown button for the notebook you want to stop. There is no need to keep your notebook active if you are not using it.

Shutdown your Google Cloud Datalab notebook VM

Don’t forget to shut down your notebook VM. You can go to Compute Engine within GCP and view which instances are running.

The Google Cloud Datalab notebooks themselves run on Compute Engine, and you can think of that as rented infrastructure. You should not keep the Compute Engine instance running all the time, as you will be paying for those compute cycles. You can simply stop the Google Cloud Datalab VM instance when you are not using it and fire it back up when you need to do some work.

You can shut down your Google Cloud Datalab VM from the Compute Engine view, or you can shut down your instance from within Google Cloud Datalab.
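A third option is the Cloud Shell itself. The sketch below builds the command lines for stopping the VM and reconnecting to it later. It assumes the `stop` and `connect` subcommands of the `datalab` command-line tool, which I have not shown elsewhere in this article, and the helper name `datalab_command` is my own. It only prints the commands, it does not run them.

```python
import shlex

INSTANCE = "cloud-datalab-vm"  # the instance name used when the VM was created


def datalab_command(action, instance):
    """Build a `datalab <action> <instance>` command line as a string.

    Hypothetical helper; only the two actions assumed in this sketch are allowed.
    """
    allowed = {"stop", "connect"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return shlex.join(["datalab", action, instance])


# Stop the VM when you are done for the day.
print(datalab_command("stop", INSTANCE))
# Reconnect (starting the VM again) when you need to do some work.
print(datalab_command("connect", INSTANCE))
```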

When the Compute Engine instance with your Google Cloud Datalab notebook goes away, what do you think happens to the notebook?

Well, it disappears, so make sure you have made a copy of the notebook or saved it before you shut down your Compute Engine VM instance.


Conclusion

Data is the new gold mine for organizations. Being able to find trends in historical data that are then used to make decisions is valuable, and by adding machine learning to the mix you can use your historical data to predict the future. Who does not want to predict the future?

