My "Sun Planets" Bundle

My "Sun CAPS and OpenESB Blogs" Bundle

My "MyBlogs" Bundle

My Blog List

Showing posts with label Hadoop. Show all posts
Showing posts with label Hadoop. Show all posts

Sunday, November 17, 2013

Interactive Analysis and Big Data: Setting up Google Dremel a.k.a. Big Query API for our Big Data Exploits

Before you begin

Activate Google BigQuery API

Go to the Google API Console at https://developers.google.com/console and enable the BigQuery API. If you have an existing project with the BigQuery API enabled, you can use it; otherwise, create a new project and enable the API for it.

Exercise Google Dremel by executing some Queries

Let us run some queries on public data exposed by Google, just to get a feel for Big Query using Google Dremel.

Open the Big Query Browser tool at https://bigquery.cloud.google.com/

Google has exposed several public tables that you can run queries against to exercise the Big Query API. A full description of these sample tables is available in the Big Query documentation.

A complete description of Big Query's SQL syntax is available at https://developers.google.com/bigquery/query-reference

In the query text, tables are qualified in the format datasetId.tableId, prefixed with a project identifier when querying another project's data, as in publicdata:samples.natality below.

Run a few queries

To run a query, click the Compose Query button, then enter the query SQL in the text box at the top of the page and click Run Query. The results (or an error message) will be displayed below the query box.
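If you prefer a terminal, the same queries can also be run from the command line. As a sketch, assuming you have installed and authorized the bq command-line tool that ships with the Google Cloud SDK:

bq query "SELECT word, word_count FROM publicdata:samples.shakespeare LIMIT 5"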

Heaviest 10 children

SELECT weight_pounds, state, year, gestation_weeks FROM publicdata:samples.natality
ORDER BY weight_pounds DESC LIMIT 10;

Finding Juliet

SELECT word FROM publicdata:samples.shakespeare WHERE word="Juliet";
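The shakespeare sample table also carries corpus and word_count columns (the longest-works query below uses both), so a small variation shows which works mention Juliet and how often:

SELECT corpus, word_count FROM publicdata:samples.shakespeare WHERE word="Juliet" ORDER BY word_count DESC;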

How many works of Shakespeare are there? 

SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;
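This lists one row per corpus, so the number of works is simply the number of rows returned. To have Big Query compute the count directly, the grouped result can be wrapped in a subquery:

SELECT COUNT(*) FROM (SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus);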


Which are the longest works?

SELECT corpus, SUM(word_count) AS wordcount FROM publicdata:samples.shakespeare GROUP BY corpus ORDER BY wordcount DESC;


Tuesday, November 5, 2013

A Virtualized Sandbox for BigData Exploits

Prerequisites - Before you start

At a minimum you would need:

  1. A 64-bit operating system. I used Ubuntu Server 13.10 (Saucy Salamander) 64-bit running on VMware Player on a Windows 7 Ultimate 64-bit host for this write-up, but any 64-bit Linux, Mac, or Windows operating system should do
  2. At least 4 GB of RAM on your box. The smallest configuration I've tested is a Dell Precision M6400 with an Intel Core 2 Extreme QX9300 processor (2.53 GHz, 1066 MHz FSB, 12 MB L2 cache) and 4 GB of RAM
  3. If, like me, you are using VMware Player on a machine with an Intel Virtualization Technology (VT) capable processor, you need to enable VT in the BIOS. Follow the steps listed at http://www.ehow.com/how_7562877_enable-vt-dell-vmware.html
  4. Throughout this write-up we assume that a user with the id hduser has been set up on your Ubuntu machine. If you're using a different user name, please make the appropriate modifications.

Downloads Required

  1. VMware Player, which allows you to run virtual machines with different operating systems. Download VMware Player from https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0 or find it on the VMware downloads page at https://my.vmware.com/web/vmware/downloads
  2. Ubuntu 13.10 Server (Saucy Salamander), a 64-bit Linux OS, downloaded from http://www.releases.ubuntu.com/13.10/. Since I used a 64-bit Intel machine, I used the 64-bit PC (AMD64) server install image at http://www.releases.ubuntu.com/13.10/ubuntu-13.10-server-amd64.iso

Preparing our Virtualized Sandbox

01. Install VMWare Player
02. Create a new virtual machine
03. Point the installer disc image to the ISO file (Ubuntu) that you just downloaded
04. User name should be hduser
05. Hard disk space 40 GB Hard drive (more is better)
06. Customize hardware:
  • Memory: 2 GB RAM (more is better)
  • Processors: 2 (more is better)
07. Launch your virtual machine (all the instructions after this step will be performed in Ubuntu)
08. Login as hduser
09. After installing Ubuntu Server you'd be greeted with a command prompt. I like a minimal graphical user interface, since I'd like to run Firefox to monitor the progress of my jobs submitted to Hadoop, and I like to cut and paste commands into the Terminal shell. You could optionally install Gnome, KDE, XFCE, or any other desktop of your choice. I prefer the XFCE desktop since it's a lightweight desktop environment: it takes up fewer system resources and installs fewer packages than either Gnome or KDE, though it does not have the same level of graphics as the other two. So here's how I do it. On the Terminal:
sudo apt-get install --no-install-recommends xubuntu-desktop
If you leave out the --no-install-recommends option, Ubuntu installs software like games and other unwanted crapware which I do not want to be burdened with on a server. For a Gnome desktop use ubuntu-desktop, and for a KDE desktop use kubuntu-desktop, instead of the XFCE desktop, which is xubuntu-desktop. Once the desktop files are installed, reboot the system with:
sudo reboot
10. Once you have logged back in as hduser, launch a new Terminal window and install Firefox so you can later monitor the progress of your Hadoop jobs.
sudo apt-get install firefox
11. Install JDK 7 using:
sudo apt-get install openjdk-7-jdk
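You can verify the installation with:
java -version
which should report an OpenJDK Runtime Environment with a 1.7 version string.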
12. Install ssh and rsync
sudo apt-get install ssh
sudo apt-get install rsync
13. Install the SSH Client
sudo apt-get install openssh-client
14. Install the SSH Server
sudo apt-get install openssh-server
15. Configure and run the SSH Server with the following commands:
su - hduser
# generate a passwordless RSA key pair for hduser
ssh-keygen -t rsa -P ""
# authorize the new public key for logins to this machine
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
# verify that you can now ssh to localhost without being asked for a password
ssh localhost
16. Disable IPv6 by opening the /etc/sysctl.conf file with:
sudo vi /etc/sysctl.conf
And adding the following lines at the end of the file:
##################
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
##################
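The new settings are not applied until the file is reloaded. Either reboot, or reload sysctl and check that IPv6 is off (a value of 1 means disabled):
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6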

Conclusion

We've set up a virtualized sandbox environment for our future BigData exploits.