My "Sun Planets" Bundle

My "Sun CAPS and OpenESB Blogs" Bundle

My "MyBlogs" Bundle

My Blog List

Showing posts with label Hadoop. Show all posts
Showing posts with label Hadoop. Show all posts

Sunday, November 17, 2013

Interactive Analysis and Big Data: Setting up Google Dremel a.k.a. Big Query API for our Big Data Exploits

Before you begin

Activate Google BigQuery API

Go to the Google API Console at https://developers.google.com/console and enable the BigQuery API. If you have an existing project with the BigQuery API enabled, you can use it; otherwise, create a new project and enable the API for it.

Exercise Google Dremel by executing some Queries

Let us run some queries on public data exposed by Google, just to get a feel for Big Query using Google Dremel.

Open the Big Query Browser tool at https://bigquery.cloud.google.com/

Google has exposed several public tables that you can run queries against to exercise the Big Query API. A full description of these sample tables is available in the Big Query documentation.

A complete description of Big Query's SQL syntax is available at https://developers.google.com/bigquery/query-reference

In the query text, tables are qualified in the format datasetId.tableId, prefixed with a project identifier when querying another project's data, as in publicdata:samples.natality below.

Run a few queries

To run a query, click the Compose Query button, then enter the query SQL in the text box at the top of the page and click Run Query. The results (or an error message) will be displayed below the query box.
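If you prefer a terminal, the same queries can also be run from the command line. As a sketch, assuming you have installed and authorized the bq command-line tool that ships with the Google Cloud SDK:

bq query "SELECT word, word_count FROM publicdata:samples.shakespeare LIMIT 5"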

Heaviest 10 children

SELECT weight_pounds, state, year, gestation_weeks FROM publicdata:samples.natality
ORDER BY weight_pounds DESC LIMIT 10;

Finding Juliet

SELECT word FROM publicdata:samples.shakespeare WHERE word="Juliet";
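The shakespeare sample table also carries corpus and word_count columns (the longest-works query below uses both), so a small variation shows which works mention Juliet and how often:

SELECT corpus, word_count FROM publicdata:samples.shakespeare WHERE word="Juliet" ORDER BY word_count DESC;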

How many works of Shakespeare are there? 

SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;
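This lists one row per corpus, so the number of works is simply the number of rows returned. To have Big Query compute the count directly, the grouped result can be wrapped in a subquery:

SELECT COUNT(*) FROM (SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus);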


Which are the longest works?

SELECT corpus, SUM(word_count) AS wordcount FROM publicdata:samples.shakespeare GROUP BY corpus ORDER BY wordcount DESC;


Tuesday, November 5, 2013

A Virtualized Sandbox for BigData Exploits

Prerequisites - Before you start

At a minimum you would need:

  1. A 64-bit operating system. I used Ubuntu Server 13.10 (Saucy Salamander) 64-bit running on VMware Player on a Windows 7 Ultimate 64-bit host for this write-up, but any 64-bit Linux, Mac, or Windows operating system should do
  2. At least 4 GB of RAM on your box. The smallest configuration I've tested is a Dell Precision M6400 with an Intel Core 2 Extreme QX9300 processor (2.53 GHz, 1066 MHz FSB, 12 MB L2 cache) and 4 GB of RAM
  3. If, like me, you are using VMware Player on a machine with an Intel Virtualization Technology (VT) capable processor, you need to enable VT in the BIOS. Follow the steps listed at http://www.ehow.com/how_7562877_enable-vt-dell-vmware.html
  4. Throughout this write-up we assume that a user with the id hduser has been set up on your Ubuntu machine. If you're using a different user name, please make the appropriate modifications.

Downloads Required

  1. VMware Player, which allows you to run virtual machines with different operating systems. Download VMware Player from https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0 or find it on the VMware downloads page at https://my.vmware.com/web/vmware/downloads
  2. Ubuntu 13.10 Server (Saucy Salamander), a 64-bit Linux OS, downloaded from http://www.releases.ubuntu.com/13.10/. Since I used a 64-bit Intel machine, I used the 64-bit PC (AMD64) server install image at http://www.releases.ubuntu.com/13.10/ubuntu-13.10-server-amd64.iso

Preparing our Virtualized Sandbox

01. Install VMWare Player
02. Create a new virtual machine
03. Point the installer disc image to the ISO file (Ubuntu) that you just downloaded
04. User name should be hduser
05. Hard disk space 40 GB Hard drive (more is better)
06. Customize hardware:
  • Memory: 2 GB RAM (more is better)
  • Processors: 2 (more is better)
07. Launch your virtual machine (all the instructions after this step will be performed in Ubuntu)
08. Login as hduser
09. After installing Ubuntu Server you'd be greeted with a command prompt. I like a minimal graphical user interface, since I'd like to run Firefox to monitor the progress of my jobs submitted to Hadoop, and I like to cut and paste commands into the Terminal shell. You could optionally install Gnome, KDE, XFCE, or any other desktop of your choice. I prefer the XFCE desktop since it's a lightweight desktop environment: it takes up fewer system resources and installs fewer packages than either Gnome or KDE, though it does not have the same level of graphics as the other two. So here's how I do it. On the Terminal:
sudo apt-get install --no-install-recommends xubuntu-desktop
If you leave out the --no-install-recommends option, Ubuntu installs software like games and other unwanted crapware which I do not want to be burdened with on a server. For a Gnome desktop use ubuntu-desktop, and for a KDE desktop use kubuntu-desktop, instead of the XFCE desktop, which is xubuntu-desktop. Once the desktop files are installed, reboot the system with:
sudo reboot
10. Once you have logged back in as hduser, launch a new Terminal window and install Firefox so you can later monitor the progress of your Hadoop jobs.
sudo apt-get install firefox
11. Install JDK 7 using:
sudo apt-get install openjdk-7-jdk
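You can verify the installation with:
java -version
which should report an OpenJDK Runtime Environment with a 1.7 version string.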
12. Install ssh and rsync
sudo apt-get install ssh
sudo apt-get install rsync
13. Install the SSH Client
sudo apt-get install openssh-client
14. Install the SSH Server
sudo apt-get install openssh-server
15. Configure and run the SSH Server with the following commands:
su - hduser
# generate a passwordless RSA key pair for hduser
ssh-keygen -t rsa -P ""
# authorize the new public key for logins to this machine
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
# verify that you can now ssh to localhost without being asked for a password
ssh localhost
16. Disable IPv6 by opening the /etc/sysctl.conf file with:
sudo vi /etc/sysctl.conf
And adding the following lines at the end of the file:
##################
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
##################
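The new settings are not applied until the file is reloaded. Either reboot, or reload sysctl and check that IPv6 is off (a value of 1 means disabled):
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6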

Conclusion

We've set up a virtualized sandbox environment for our future BigData exploits.