SaurZCode

Sunday, June 29, 2014

Hadoop : Getting Started with Pig

What is Pig?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data analysts to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs, which run on data stored in HDFS (refer to the diagram below).

Through the User Defined Functions (UDF) facility, Pig can invoke code written in languages such as JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

Pig Architecture

How Is Pig Being Used?

Rapid prototyping of algorithms for processing large data sets.

Data Processing for web search platforms.

Ad Hoc queries across large data sets.

Web log processing.

Pig Elements

Pig consists of three elements -

Pig Latin
- High level scripting language
- Schema is optional
- Translated to MapReduce Jobs

Pig Grunt Shell
- Interactive shell for executing pig commands.

PiggyBank
- Shared repository for User defined functions (explained later).

Pig Latin Statements

Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. The exceptions are the LOAD and STORE statements, which read data from and write data to the file system.

Pig Latin statements are generally organized as follows:

A LOAD statement to read data from the file system.

A series of "transformation" statements to process the data.

A DUMP statement to view results or a STORE statement to save the results.

Note that a DUMP or STORE statement is required to generate output.

In this example Pig will validate, but not execute, the LOAD and FOREACH statements.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;

In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
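To make the data flow concrete, here is a rough plain-Java sketch of what the three statements above compute. It is not how Pig executes a script (Pig compiles it into MapReduce jobs); it only mirrors the projection on a small in-memory sample. The class name, the helper methods, and the sample ages and GPAs are invented for illustration. The only Pig detail it relies on is that PigStorage() splits fields on tabs by default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StudentProjection {

    // Rough stand-in for LOAD ... USING PigStorage(): split each line on tabs,
    // the default PigStorage field delimiter.
    static List<String[]> load(List<String> lines) {
        List<String[]> relation = new ArrayList<>();
        for (String line : lines) {
            relation.add(line.split("\t"));
        }
        return relation;
    }

    // Rough stand-in for FOREACH A GENERATE name: keep only the first field.
    static List<String> generateName(List<String[]> relation) {
        List<String> names = new ArrayList<>();
        for (String[] tuple : relation) {
            names.add(tuple[0]);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows; the ages and GPAs are invented.
        List<String> lines = Arrays.asList(
                "John\t18\t4.0",
                "Mary\t19\t3.8",
                "Bill\t20\t3.9",
                "Joe\t18\t3.8");
        List<String[]> a = load(lines);
        List<String> b = generateName(a);
        // Rough stand-in for DUMP B: print each one-field tuple.
        for (String name : b) {
            System.out.println("(" + name + ")");
        }
    }
}
```

With these sample rows the program prints the same four one-field tuples as the DUMP output above.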

Storing Intermediate Results

Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property.

Storing Final Results

Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function).

Note: During the testing/debugging phase of your implementation, you can use DUMP to display results to your terminal screen. However, in a production environment you always want to use the STORE operator to save your results.

Debugging Pig Latin

Pig Latin provides operators that can help you debug your Pig Latin statements:

Use the DUMP operator to display results to your terminal screen.

Use the DESCRIBE operator to review the schema of a relation.

Use the EXPLAIN operator to view the logical, physical, or MapReduce execution plans to compute a relation.

Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

What Are Pig User Defined Functions (UDFs)?

Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig, which makes UDFs a powerful way to perform complex operations on data. The Piggy Bank is a place for Pig users to share their functions (UDFs).

Example:

REGISTER saurzcodeUDF.jar;
A = LOAD 'employee_data' AS (name: chararray, age: int, designation: chararray);
B = FOREACH A GENERATE saurzcodeUDF.UPPER(name);
DUMP B;

This article was just a getting-started article on Pig. I will cover further details about how to write Pig Latin commands for some basic operations like JOIN, FILTER, GROUP and ORDER, and also how to make your own UDFs for processing on a Hadoop cluster.

References :-

http://pig.apache.org

Recommended readings for Hadoop

Free Online Hadoop Trainings

How to become a Hadoop Certified Developer ?

Monday, May 19, 2014

String Interning - What, Why and When?

What is String Interning

String Interning is a method of storing only one copy of each distinct String Value, which must be immutable.

In Java, the String class has a public method intern() that returns a canonical representation for the string object. Java's String class privately maintains a pool of strings, where String literals are automatically interned.

When the intern() method is invoked on a String object, it looks up the string contained by this String object in the pool. If the string is found there, the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.
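The lookup-or-add behavior can be sketched with an ordinary map. This is only an illustration of the semantics described above, not how the JVM implements its string pool; the class name SimplePool and its intern method are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class SimplePool {

    // Maps each distinct string value to the one canonical object kept for it.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the pooled object if an equal string is already present;
    // otherwise adds the argument to the pool and returns it.
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        SimplePool pool = new SimplePool();
        String a = new String("Test");
        String b = new String("Test");
        System.out.println(a == b);              // false: two distinct objects
        System.out.println(pool.intern(a) == a); // true: a becomes the pooled copy
        System.out.println(pool.intern(b) == a); // true: the pooled copy is returned
    }
}
```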

Because interned strings with equal content share a single object in the pool, two interned strings can be compared with the == operator, a reference comparison that is typically faster than the equals() method. The pool of strings in Java is maintained for saving space and for faster comparisons. Normally Java programmers are advised to use equals(), not ==, to compare two strings. This is because the == operator compares memory locations, while the equals() method compares the content stored in two objects.

Why and When to Intern?

Though Java automatically interns all String literals, remember that we only need to intern strings when they are not constants and we want to be able to quickly compare them to other interned strings. The intern() method should be used on strings constructed with new String() in order to compare them with the == operator.

Let's take a look at the following Java program to understand the intern() behavior.

[code language="java"]
public class TestString {

    public static void main(String[] args) {
        String s1 = "Test";              // literal, interned automatically
        String s2 = "Test";              // same pooled object as s1
        String s3 = new String("Test");  // new object outside the pool
        final String s4 = s3.intern();   // pooled object, same as s1
        System.out.println(s1 == s2);
        System.out.println(s2 == s3);
        System.out.println(s3 == s4);
        System.out.println(s1 == s3);
        System.out.println(s1 == s4);
        System.out.println(s1.equals(s2));
        System.out.println(s2.equals(s3));
        System.out.println(s3.equals(s4));
        System.out.println(s1.equals(s4));
        System.out.println(s1.equals(s3));
    }

}

// Output
// true
// false
// false
// false
// true
// true
// true
// true
// true
// true

[/code]
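The same behavior applies to strings assembled at runtime, which is the case where intern() is actually needed. The following is a small sketch; the class name InternDemo and the helper build are made up for this example.

```java
class InternDemo {

    // Builds a string at runtime, so the result is a new object that is
    // not taken from the string pool.
    static String build(String first, String second) {
        return new StringBuilder(first).append(second).toString();
    }

    public static void main(String[] args) {
        String literal = "HadoopPig";             // literal, interned automatically
        String runtime = build("Hadoop", "Pig");  // equal content, different object

        System.out.println(runtime == literal);           // false
        System.out.println(runtime.equals(literal));      // true
        System.out.println(runtime.intern() == literal);  // true
    }
}
```

Only the interned reference can be compared safely with ==. For everything else equals() remains the right choice.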

Friday, May 9, 2014

SOAP Webservices Using Apache CXF : Adding Custom Object as Header in Outgoing Requests

What is CXF?

Apache CXF is an open source services framework. CXF helps you build and develop services using frontend programming APIs, like JAX-WS and JAX-RS. These services can speak a variety of protocols such as SOAP, XML/HTTP, RESTful HTTP, or CORBA and work over a variety of transports such as HTTP, JMS etc.

How Does CXF Work?

Most of the functionality in the Apache CXF runtime, including how service calls are processed, is implemented by interceptors. Every endpoint created by the Apache CXF runtime has potential interceptor chains for processing messages. The interceptors in these chains are responsible for transforming messages between the raw data transported across the wire and the Java objects handled by the endpoint’s implementation code.

Interceptors in CXF

When a CXF client invokes a CXF server, there is an outgoing interceptor chain for the client and an incoming chain for the server. When the server sends the response back to the client, there is an outgoing chain for the server and an incoming one for the client. Additionally, in the case of SOAPFaults, a CXF web service will create a separate outbound error handling chain and the client will create an inbound error handling chain.

The interceptors are organized into phases to ensure that processing happens in the proper order. The phases involved in the interceptor chains are listed in the CXF documentation.
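The idea of phase ordering can be sketched in plain Java without any CXF classes. Everything in this example (SketchPhase, Interceptor, run) is hypothetical and only illustrates that interceptors run in phase order, no matter the order in which they were registered. It does not reproduce CXF's actual chain implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PhaseChainSketch {

    // Hypothetical phases, declared in the order they should run.
    enum SketchPhase { SETUP, MARSHAL, SEND }

    // Hypothetical interceptor: a phase plus a step that edits the message.
    interface Interceptor {
        SketchPhase phase();
        void handleMessage(StringBuilder message);
    }

    // Creates an interceptor that appends a tag to the message.
    static Interceptor interceptor(SketchPhase phase, String tag) {
        return new Interceptor() {
            public SketchPhase phase() { return phase; }
            public void handleMessage(StringBuilder message) { message.append(tag); }
        };
    }

    // Runs the interceptors sorted by phase, regardless of registration order.
    static String run(List<Interceptor> registered) {
        List<Interceptor> chain = new ArrayList<>(registered);
        chain.sort(Comparator.comparing(Interceptor::phase));
        StringBuilder message = new StringBuilder();
        for (Interceptor interceptor : chain) {
            interceptor.handleMessage(message);
        }
        return message.toString();
    }

    public static void main(String[] args) {
        List<Interceptor> registered = new ArrayList<>();
        registered.add(interceptor(SketchPhase.SEND, "[send]"));
        registered.add(interceptor(SketchPhase.SETUP, "[setup]"));
        registered.add(interceptor(SketchPhase.MARSHAL, "[marshal]"));
        System.out.println(run(registered)); // [setup][marshal][send]
    }
}
```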

Adding your custom interceptor involves extending one of the abstract interceptor classes that CXF provides, and specifying the phase in which that interceptor should be invoked.

AbstractPhaseInterceptor class - This abstract class provides implementations for the phase management methods of the PhaseInterceptor interface. The AbstractPhaseInterceptor class also provides a default implementation of the handleFault() method.

Developers need to provide an implementation of the handleMessage() method. They can also provide a different implementation for the handleFault() method. The developer-provided implementations can manipulate the message data using the methods provided by the generic org.apache.cxf.message.Message interface.

For applications that work with SOAP messages, Apache CXF provides an AbstractSoapInterceptor class. Extending this class provides the handleMessage() method and the handleFault() method with access to the message data as an org.apache.cxf.binding.soap.SoapMessage object. SoapMessage objects have methods for retrieving the SOAP headers, the SOAP envelope, and other SOAP metadata from the message.

The following code shows how to add a custom object as a header to an outgoing request:

Spring Configuration

[code language="xml"]
<jaxws:client id="mywebServiceClient"
    serviceClass="com.saurzcode.TestService"
    address="http://saurzcode.com:8088/mockTestService">

    <jaxws:binding>
        <soap:soapBinding version="1.2" mtomEnabled="true" />
    </jaxws:binding>
</jaxws:client>
<cxf:bus>
    <cxf:outInterceptors>
        <bean class="com.saurzcode.ws.caller.SoapHeaderInterceptor" />
    </cxf:outInterceptors>
</cxf:bus>
[/code]

Interceptor :-

[code language="java"]
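import java.util.List;

import javax.xml.bind.JAXBElement;
import javax.xml.bind.JAXBException;

import org.apache.cxf.binding.soap.SoapMessage;
import org.apache.cxf.binding.soap.interceptor.AbstractSoapInterceptor;
import org.apache.cxf.headers.Header;
import org.apache.cxf.interceptor.Fault;
import org.apache.cxf.jaxb.JAXBDataBinding;
import org.apache.cxf.phase.Phase;

// TestHeader and ObjectFactory are assumed to be the JAXB classes generated for the service.
public class SoapHeaderInterceptor extends AbstractSoapInterceptor {

    public SoapHeaderInterceptor() {
        // Invoke this interceptor in the POST_LOGICAL phase of the outbound chain.
        super(Phase.POST_LOGICAL);
    }

    @Override
    public void handleMessage(SoapMessage message) throws Fault {
        List<Header> headers = message.getHeaders();
        TestHeader testHeader = new TestHeader();
        JAXBElement<TestHeader> testHeaders = new ObjectFactory()
                .createTestHeader(testHeader);
        try {
            // Wrap the custom object so that it is marshalled with JAXB.
            Header header = new Header(testHeaders.getName(), testHeader,
                    new JAXBDataBinding(TestHeader.class));
            headers.add(header);
            message.put(Header.HEADER_LIST, headers);
        } catch (JAXBException e) {
            // Fail the call instead of sending the request without the header.
            throw new Fault(e);
        }
    }
}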

[/code]

Monday, April 21, 2014

Free Online Hadoop Trainings

Hadoop and Big Data are becoming the new hot trends of the industry, and are among the most sought-after skills in the market. Various vendors and online training providers now offer good explanations of the core concepts underlying Hadoop, such as MapReduce, HDFS and the other components of the Hadoop ecosystem. I will try to list some of these resources here -

1. Udacity - Cloudera created this nice course named "Intro to Hadoop and MapReduce". It provides a nice explanation of the core concepts and internal workings of Hadoop components, with quizzes around each concept and some good hands-on exercises. They also provide a VM for training purposes, which can be used to run example questions and to solve quizzes and exams for the course.

Goals -

How Hadoop fits into the world (recognize the problems it solves)
Understand the concepts of HDFS and MapReduce (find out how it solves the problems)
Write MapReduce programs (see how we solve the problems)
Practice solving problems on your own

Prerequisites -

Some basic programming knowledge and a good interest in learning :)

2. Introduction to Mapreduce Programming

3. Moving Data in to Hadoop