
Sunday, November 17, 2013

Interactive Analysis and Big Data: Setting up Google Dremel a.k.a. Big Query API for our Big Data Exploits

Before you begin

Activate Google BigQuery API

Go to the Google API Console at https://developers.google.com/console and enable the BigQuery API. If you have an existing project with the BigQuery API enabled, you can use it; otherwise, create a new project and enable the API there.

Exercise Google Dremel by executing some Queries

Let us run some queries on public data exposed by Google, just to get a feel for BigQuery using Google Dremel.

Open the Big Query Browser tool at https://bigquery.cloud.google.com/

Google has exposed several public sample tables that you can run queries against to exercise the BigQuery API.

A complete description of BigQuery's SQL syntax is available at https://developers.google.com/bigquery/query-reference

In the query text, tables are qualified in the format projectId:datasetId.tableId, for example publicdata:samples.shakespeare. The projectId prefix (and the colon) can be dropped for tables in your own project, leaving datasetId.tableId.

Run a few queries

To run a query, click the Compose Query button, then enter the query SQL in the textbox at the top of the page and click Run Query. The results (or an error code) will be displayed below the query box.

Heaviest 10 children

SELECT weight_pounds, state, year, gestation_weeks FROM publicdata:samples.natality
ORDER BY weight_pounds DESC LIMIT 10;

Finding Juliet

SELECT word FROM publicdata:samples.shakespeare WHERE word="Juliet";

How many works of Shakespeare are there? 

SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;
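The query above returns one row per distinct work, so the number of works is the row count of the result. To have BigQuery return the count directly, a query along these lines should work (COUNT with a DISTINCT argument is supported in BigQuery's query language, though note that it computes an approximate count by default):

```sql
-- One row, containing the number of distinct works in the corpus.
SELECT COUNT(DISTINCT corpus) FROM publicdata:samples.shakespeare;
```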


Which are the longest works?

SELECT corpus, SUM(word_count) AS wordcount FROM publicdata:samples.shakespeare GROUP BY corpus ORDER BY wordcount DESC;


Wednesday, November 6, 2013

Stream Processing and BigData - Setting up Storm for Stream Processing of BigData

Pre-requisites - Before you start

At a minimum you will need:

  1. The virtualized sandbox for your BigData exploits, set up as detailed in my blog post here: webcornucopia.blogspot.com/2013/11/a-virtualized-sandbox-for-my-bigdata.html
  2. Throughout this write-up we assume that a user with the id hduser has been set up on your Ubuntu machine. If you are using a different user name, please make the appropriate modifications.

Downloads Required

1. Download Storm

Download the latest version of Storm from here: 
http://storm-project.net/downloads.html
As of this writing, the stable version of Storm is 0.8.2.

2. Install the following packages

You will first need to install the following packages: make, pkg-config, libtool, automake, g++, uuid-dev, maven, and git. Here's how to do it:
sudo apt-get install make
sudo apt-get install pkg-config
sudo apt-get install libtool
sudo apt-get install automake
sudo apt-get install g++
sudo apt-get install uuid-dev
sudo apt-get install maven
sudo apt-get install git

3. Install the latest version of Leiningen.

Leiningen is used to automate the builds of certain Storm samples, some of which are written in Clojure; it is still the best choice for automating Clojure projects. As mentioned at https://github.com/technomancy/leiningen/wiki/Packaging, many package managers still include version 1.x, which is rather outdated, so if you install via apt-get you will end up with that older version. You are better off installing the latest version manually, as follows:
3.1. Copy the contents of this shell script into a file called lein
https://raw.github.com/technomancy/leiningen/stable/bin/lein
3.2. Move the lein file to /usr/local/bin
3.3. Make the lein file executable for all users (755 is sufficient; there is no need for the world-writable permission that 777 would grant):
sudo chmod 755 /usr/local/bin/lein
3.4. Upgrade to the latest version of Leiningen, which is 2.3.3 as of this writing:
/usr/local/bin/lein upgrade

Building Storm

1. Expand the storm-0.8.2.zip into a folder of your choice:
unzip storm-0.8.2.zip 
2. From the folder into which you expanded the Storm files, install ZeroMQ by executing:
bin/install_zmq.sh

Note: Issues you could potentially encounter while running the ZeroMQ install, and how to get around them

Error Type #1
If you encounter this error while installing ZeroMQ:
Problem with the SSL CA cert (path? access rights?) while accessing...
You might want to disable SSL verification as follows (and re-enable it with git config --global http.sslVerify true once the install is done):
git config --global http.sslVerify false
And then re-run:
bin/install_zmq.sh
Error Type #2 - Happens mostly on Ubuntu and Mac
If you encounter this error while installing ZeroMQ:
make[1]: *** No rule to make target `classdist_noinst.stamp', needed by
`org/zeromq/ZMQ.class'.  Stop.
This is a known bug discovered by ebroder; see https://github.com/zeromq/jzmq/issues/114.
2.1. To fix it, edit the Makefile.am file at jzmq/src:
vi jzmq/src/Makefile.am 
2.2. Replace classdist_noinst.stamp with classnoinst.stamp
2.3. Re-run ZeroMQ install:
bin/install_zmq.sh
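Alternatively, the manual edit in steps 2.1 and 2.2 can be scripted as a one-line sed invocation. This is a sketch: it assumes install_zmq.sh checked the jzmq sources out into jzmq/ under the current directory (adjust the path if yours differs), and it is guarded so it is a no-op if the file is not there:

```shell
# Apply the ebroder fix non-interactively: rename the broken make
# target classdist_noinst.stamp to classnoinst.stamp in Makefile.am.
# Guarded so the command is a no-op if jzmq has not been checked out yet.
if [ -f jzmq/src/Makefile.am ]; then
  sed -i 's/classdist_noinst\.stamp/classnoinst.stamp/g' jzmq/src/Makefile.am
fi
```

Then re-run bin/install_zmq.sh as in step 2.3.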

Setting up Storm Environment Variables

Once the ZeroMQ install is done, you will want to modify .bashrc (or whichever startup script you use) to set the Storm environment variables, so you don't have to type them in again and again.
vi $HOME/.bashrc
At the end of the file, add the following lines, making sure you specify the correct path. I keep my Storm files in $HOME/runtimes/storm-0.8.2, so my startup script has the following:
export STORM_PATH=$HOME/runtimes/storm-0.8.2
export PATH=$PATH:$STORM_PATH/bin

Getting and Building Storm Samples

The Storm Starter kit at https://github.com/nathanmarz/storm-starter contains some examples to play with. It can be obtained through Git as follows:
git clone http://github.com/nathanmarz/storm-starter

Building the Storm Starter Samples

Build the samples using:
/usr/local/bin/lein deps
/usr/local/bin/lein compile
/usr/local/bin/lein jar
The output on screen at the end of these commands will look something like this:
Compiling storm.starter.clj.word-count
hduser@ubuntu:~/runtimes/storm-starter$ /usr/local/bin/lein jar
Retrieving org/clojure/clojure/1.5.1/clojure-1.5.1.pom from central
Retrieving org/clojure/clojure/1.5.1/clojure-1.5.1.jar from central
Compiling 2 source files to /home/hduser/runtimes/storm-starter/target/classes
Created /home/hduser/runtimes/storm-starter/target/storm-starter-0.0.1-SNAPSHOT.jar
hduser@ubuntu:~/runtimes/storm-starter$

Run Storm Samples

You can refer to this URL for reference:
https://github.com/nathanmarz/storm-starter

1. Executing the ExclamationTopology Sample written in Java:

java -cp $STORM_PATH/lib/*:$STORM_PATH/storm-0.8.2.jar:\
$HOME/runtimes/storm-starter/target/storm-starter-0.0.1-SNAPSHOT.jar \
storm.starter.ExclamationTopology

2. Executing the Word Count Sample written in Clojure:

/usr/local/bin/lein run -m storm.starter.clj.word-count
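To see what the word-count topology is computing, here is a minimal, non-streaming sketch of the same aggregation using standard Unix tools: split sentences into words and count each word's occurrences. (Storm performs the equivalent continuously over an unbounded stream, with the counting partitioned across bolt instances; the two sample sentences below are illustrative, not necessarily what the topology's spout emits.)

```shell
# Split two sample sentences into words, then count each word's
# occurrences -- the same aggregation the word-count topology performs
# continuously over a stream of sentences.
printf 'the cow jumped over the moon\nthe man went to the store\n' \
  | tr ' ' '\n' | sort | uniq -c | sort -rn
```

Here the word "the" appears four times and so tops the list; in the Storm version, each bolt instance holds the running counts for its share of the words.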

Conclusion

We have successfully installed Storm and run the samples. We are now ready to perform further explorations in Stream Processing of BigData.

Tuesday, November 5, 2013

A Virtualized Sandbox for BigData Exploits

Pre-requisites - Before you start

At a minimum you will need:

  1. A 64-bit operating system. I used Ubuntu Server 13.10 (Saucy Salamander) 64-bit running on VMware Player on Windows 7 Ultimate 64-bit for this write-up, but any 64-bit Linux, Mac, or Windows operating system will do.
  2. At least 4 GB of RAM on your box. The least I've tested it with is a Dell Precision M6400 (Intel Core 2 Extreme Edition QX9300 processor clocking at 2.53 GHz, 1066 MHz FSB, 12 MB L2 cache) with 4 GB of RAM.
  3. If, like me, you are using VMware Player on a machine with an Intel Virtualization Technology (VT) enabled processor, you need to enable VT in the BIOS. Follow the steps listed at http://www.ehow.com/how_7562877_enable-vt-dell-vmware.html
  4. Throughout this write-up we assume that a user with the id hduser has been set up on your Ubuntu machine. If you are using a different user name, please make the appropriate modifications.

Downloads Required

  1. VMware Player, which allows you to run virtual machines with different operating systems. Find VMware Player at https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0 on their downloads page at https://my.vmware.com/web/vmware/downloads
  2. Ubuntu 13.10 Server - A 64-bit Linux OS (Saucy Salamander) downloaded from http://www.releases.ubuntu.com/13.10/. Since I used a 64-bit Intel machine, I used the 64-bit PC (AMD64) server install image at http://www.releases.ubuntu.com/13.10/ubuntu-13.10-server-amd64.iso  

Preparing our Virtualized Sandbox

01. Install VMWare Player
02. Create a new virtual machine
03. Point the installer disc image to the ISO file (Ubuntu) that you just downloaded
04. User name should be hduser
05. Hard disk space 40 GB Hard drive (more is better)
06. Customize hardware:
  • Memory: 2 GB RAM (more is better)
  • Processors: 2 (more is better)
07. Launch your virtual machine (all the instructions after this step will be performed in Ubuntu)
08. Login as hduser
09. After installing Ubuntu Server you'd be greeted with a command prompt. I like a minimal graphical user interface, since I'd like to run Firefox to monitor the progress of my jobs submitted to Hadoop, and I like to cut and paste commands in the terminal shell. You could optionally install GNOME, KDE, Xfce, or any other desktop of your choice. I prefer the Xfce desktop since it is a lightweight desktop environment: it takes up fewer system resources and installs fewer packages than either GNOME or KDE, though it does not have the same level of graphics as the other two. So here's how I do it. On the Terminal:
sudo apt-get install --no-install-recommends xubuntu-desktop
If you leave out the "--no-install-recommends" option, Ubuntu installs software like games and other unwanted crapware that I do not want to be burdened with on a server. For a GNOME desktop use ubuntu-desktop, and for a KDE desktop use kubuntu-desktop, in place of the Xfce desktop (xubuntu-desktop). Once the desktop files are installed, you need to reboot the system with:
sudo reboot
10. Once you have logged back in as hduser, launch a new Terminal window and install Firefox so you can later monitor the progress of your Hadoop jobs.
sudo apt-get install firefox
11. Install JDK 7 using:
sudo apt-get install openjdk-7-jdk
12. Install ssh and rsync:
sudo apt-get install ssh
sudo apt-get install rsync
13. Install the SSH client:
sudo apt-get install openssh-client
14. Install the SSH server:
sudo apt-get install openssh-server
15. Configure and test the SSH server with the following commands (these generate a passwordless RSA key pair, authorize it for logins to this machine, and verify that you can ssh to localhost without a password):
su - hduser
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh localhost
16. Disable IPv6 by opening the /etc/sysctl.conf file with:
sudo vi /etc/sysctl.conf
And adding the following lines at the end of the file:
##################
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
##################
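After rebooting (or after applying the new settings immediately with sudo sysctl -p), you can verify that IPv6 is disabled by reading the corresponding proc entry; a value of 1 means disabled. The check below is guarded so it is a no-op on systems without that proc entry:

```shell
# Print 1 if IPv6 has been disabled on all interfaces.
if [ -f /proc/sys/net/ipv6/conf/all/disable_ipv6 ]; then
  cat /proc/sys/net/ipv6/conf/all/disable_ipv6
fi
```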

Conclusion

We've set up a virtualized sandbox environment for our future BigData exploits.