In this article, I will walk you through the basics of using Google Cloud Datalab to analyze a simple BigQuery dataset. I’m going to cover how to bring BigQuery datasets into Cloud Datalab and how to visualize that data within the same notebook.
Have you used IPython or Jupyter notebooks before?
Then you are already familiar with what a notebook is. For newcomers to notebook-based data analysis: data scientists work in these collaborative, self-describing, executable notebooks whenever they want to do data analysis or machine learning tasks.
Google Cloud Datalab nicely integrates the IPython/Jupyter notebook system with Google’s BigQuery. Google has also integrated standard IPython libraries, such as graphics libraries and scikit-learn, along with its own machine learning toolkit, TensorFlow.
To use it you will need a Google cloud account. You can get started with the free account, which is sufficient if you are interested in just trying it out.
You may ask, why do I need a Google account when I can use Jupyter, IPython, and TensorFlow on my own resources?
The reason is that you can easily access BigQuery-sized data collections directly from the notebook, even when it runs on your laptop. To get started, read the Google Cloud Datalab documentation. It explains how to install Datalab and gives you two choices: you may either install the Datalab package locally on your machine or install it on a VM in the Google cloud.
If you continue reading, I will explain how to install Google Cloud Datalab in the Google cloud and give you an example of how to create your first notebook.
We all love open source, do we not? Google Cloud Datalab is based on Jupyter, and it’s open source.
Why Google Cloud Datalab
You could use other tools to do your data analysis, but with Google Cloud Datalab you can run an experiment, run a query, look at the output, update your documentation, add links, then run more experiments, share the results, and collaborate with others.
Google Cloud Datalab is more interactive and dynamic than just writing a SQL query. Collaborating with other data scientists through notebooks is simple and straightforward.
Let’s get Started
Let’s jump straight in and start Cloud Datalab. The prerequisite is that you have a Google Cloud Platform (GCP) account and that you have created a project.
Go to your GCP home and click on the Cloud Shell icon in the top right-hand corner of your Google Cloud Platform account.
A terminal window opens at the bottom of your screen. When you get the prompt, all you have to do is type:
datalab create cloud-datalab-vm
Give your instance a name of your choice.
Specify the machine type that you want to use. The Compute Engine documentation has details about machine types.
Also specify the zone that your instance should run in. The Compute Engine documentation has details on Regions and Zones as well.
You can, of course, pass all of this information to the create command directly:
datalab create cloud-datalab-vm --machine-type n1-highmem-8 --zone us-central1-a
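If you script your setup, the same command can be assembled from its parts. The following is a minimal Python sketch, not part of the Datalab tooling: the helper name datalab_create_command is hypothetical, and it only knows the --machine-type and --zone flags shown above. It builds the argument list and prints it without running anything.

```python
import shlex


def datalab_create_command(instance_name, machine_type=None, zone=None):
    """Build the argument list for `datalab create` (hypothetical helper)."""
    args = ["datalab", "create", instance_name]
    # Only add the optional flags that were actually supplied.
    if machine_type:
        args += ["--machine-type", machine_type]
    if zone:
        args += ["--zone", zone]
    return args


cmd = datalab_create_command("cloud-datalab-vm", "n1-highmem-8", "us-central1-a")
# Print the command as one shell-safe string; nothing is executed here.
print(shlex.join(cmd))
```

Keeping the command as a list rather than one string means you could later hand it to subprocess.run in Cloud Shell without worrying about quoting.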
Next, your machine will be started. Just follow the instructions shown at the prompt.
This will create an SSH tunnel and may prompt you to create an rsa key pair. To manage these keys, see https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys
Waiting for Datalab to be reachable at http://localhost:8081/
This tool needs to create the directory [/home/tzetter/.ssh] before being able to generate SSH keys.
Next, click the preview button.
One common error that I have seen is that the Cloud Source Repositories API is not enabled. If you get that error, go to APIs & Services in GCP and enable the API.
Another error is that, when you click preview, you cannot connect to the port. In that case you need to change the port; you will find the option to change it under the preview button. The console message tells you which port Google Cloud Datalab is started on.
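Before changing the port in the preview, it can help to check whether anything is actually listening on the port named in the console message. The sketch below is a generic TCP check in Python, assuming the tunnel listens on localhost in the shell where you run it. The function name port_is_reachable is made up for this illustration, and 8081 is simply the port from the console message quoted earlier.

```python
import socket


def port_is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumed port: the one printed in the "Waiting for Datalab" message.
    print(port_is_reachable("localhost", 8081))
```

If this prints False while the datalab command is still waiting, the tunnel is probably not up yet, or Datalab was started on a different port than the one you are previewing.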
If the preview succeeds, you will see Google Cloud Datalab open. Voilà, now you can start creating your analysis notebooks.
Getting started with your first Cloud Datalab notebooks
Now that you have Google Cloud Datalab running, the first step is to take a look at the sample notebooks and familiarize yourself with them.
A Simple Example
I will now show you how to create your first notebook, using a simple example notebook that I created for you. It uses a publicly available BigQuery dataset called Bay Area Bike Share Trips Data.
I started with a blank notebook by clicking on +Notebook on the menu bar, then I gave it a name. A good practice is to start your blank notebook by adding a markdown box that defines what your notebook is about and what the outcome of your analysis is.
Next, I added a code box in which I declared that I wanted to use BigQuery. You can skip this box if you run in the cloud. If you run your notebook locally on your laptop, declare the use of BigQuery.
In my next code box, I added a statement that describes the database table I am going to use. This gives you the details of your table, which is helpful when defining your query.
It’s good practice to add a markdown box between the code boxes so you can explain what you are doing. This helps the people you share your notebook with understand what is going on.
In my example, the next code box defines a SQL statement. You can write a new SQL statement or, if you already have one that you created and saved in BigQuery, copy and paste it from BigQuery into your notebook.
What you have to add is a line before your SQL query that gives it a namespace. In this example, I call it “cyclesharing”; that is the namespace through which later cells refer to the query and its result.
To see the result in your notebook, use the command:
%%bq execute -q cyclesharing
This is handy for checking the result of your query. It is not a required step if you are going directly from the query to a visualization.
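To make the idea of a named query concrete without a BigQuery connection, here is a small local stand-in in Python. It is only a sketch: it uses the standard-library sqlite3 module instead of BigQuery, the three sample rows are invented, and the column names start_station_name and duration_sec are assumptions about what a trips table might contain. It stores the result of an aggregate query under the name cyclesharing and prints it, which mirrors the query-then-inspect flow described above.

```python
import sqlite3

# Invented sample rows (station name, trip duration in seconds).
# The real dataset lives in BigQuery and is far larger.
SAMPLE_TRIPS = [
    ("Station A", 600),
    ("Station A", 900),
    ("Station B", 300),
]

# An aggregate query of the kind you might keep under the name "cyclesharing".
CYCLESHARING_SQL = """
SELECT start_station_name,
       COUNT(*) AS trips,
       AVG(duration_sec) AS avg_duration_sec
FROM trips
GROUP BY start_station_name
ORDER BY trips DESC
"""


def run_cyclesharing(rows):
    """Load rows into an in-memory table and run the aggregate query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trips (start_station_name TEXT, duration_sec INTEGER)"
        )
        conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
        return conn.execute(CYCLESHARING_SQL).fetchall()
    finally:
        conn.close()


# Keep the result under a name so later steps (such as a chart) can reuse it.
cyclesharing = run_cyclesharing(SAMPLE_TRIPS)
for station, trips, avg_duration in cyclesharing:
    print(station, trips, avg_duration)
```

In the notebook itself, Datalab does the storing and displaying for you; the point here is only that the query is written once, given a name, and then reused.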
The namespace “cyclesharing” exists in the background, and you can use it to create visualizations. Here is a simple column chart that visualizes the “cyclesharing” data.
You will notice that the output here isn’t just command-line output. You can also output charts and graphs, and there are several options available to you.
That concludes my walkthrough of creating your first notebook. You can copy this basic notebook from GitHub.
Save Notebook source code
You don’t want to lose your work. You can download your notebook to your local machine, or you can commit your changes to a version control system using Git. Google Cloud Datalab stores the code in the Google Cloud Platform code repository.
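If you would rather keep a plain file copy next to your Git history, a few lines of Python can do it. This is a minimal sketch under stated assumptions: the function name backup_notebooks and both directory arguments are hypothetical, so pass in wherever your notebooks actually live, and choose a backup folder outside the notebook folder so that backups are not copied again on the next run.

```python
import shutil
import time
from pathlib import Path


def backup_notebooks(source_dir, backup_root):
    """Copy every .ipynb file under source_dir into a timestamped folder."""
    source = Path(source_dir)
    target = Path(backup_root) / time.strftime("backup-%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for notebook in sorted(source.rglob("*.ipynb")):
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" in notebook.parts:
            continue
        destination = target / notebook.relative_to(source)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(notebook, destination)  # copy2 also keeps timestamps
        copied.append(destination)
    return copied
```

Calling backup_notebooks with your notebook folder and a backup folder returns the list of copied files, so you can confirm the copy before stopping anything.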
Shut down your Google Cloud Datalab notebook
You can shut down the notebook that you created by selecting Notebooks from the menu bar and then clicking the Shutdown button for the notebook you want to stop. There is no need to keep your notebook active if you are not using it.
Shut down your Google Cloud Datalab notebook VM
Don’t forget to shut down your notebook VM. You can go to Compute Engine within GCP to see which instances are running.
The Google Cloud Datalab notebooks themselves run on Compute Engine, which you can think of as rented infrastructure. You should not keep the Compute Engine instance running all the time, as you will be paying for those compute cycles. You can simply stop the Google Cloud Datalab VM instance when you are not using it and fire it back up when you need to do some work.
You can shut down your Google Cloud Datalab VM from the Compute Engine view, or you can shut down your instance from within Google Cloud Datalab.
When the Compute Engine instance with your Google Cloud Datalab notebook goes away, what do you think happens to the notebook?
Well, it disappears, so make sure you have made a copy of the notebook or saved it elsewhere before you shut down your Compute Engine VM instance.
Here are some resources that are useful:
- Data set for Bay Area Bike Share Trips Data
- How to work with Google Cloud Datalab
- A gallery of interesting Jupyter Notebooks
Data is the new gold mine for organizations. They can find trends in historical data and use them to make decisions, and by adding machine learning to the mix, they can use that historical data to predict the future. Who does not want to predict the future?
What’s your view? I would like to hear from you. You can comment in the comment box below or contact me through the contact form.