When Pigs Fly

May 17, 2013

I have always used MapReduce with Java for all my data processing. Recently I started using Pig, and I would like to share some of my thoughts here.

Pig Latin, the language used to express data flows, can run in two modes: local mode, within a single JVM, and MapReduce mode, on a Hadoop cluster. Like most tools in this space, under the hood it compiles Pig scripts into MapReduce jobs. The cool thing about Pig is that it probably takes less than a dozen lines of code to process terabytes of data. However, the key thing I noticed is that Pig is good for analyzing relatively simple unstructured data. When it comes to complex weblogs or unstructured social data, things get hairy. I might not have been exposed to all of Pig's features, but I feel it's a great tool when you have data sets that are not too complex to analyze.
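To give a feel for the "less than a dozen lines" point, here is a rough sketch rather than anything definitive. The comment at the top shows a three-line Pig Latin script that counts hits per user, and the Python below it emulates, in a single process, the map, shuffle and reduce stages that such a script is conceptually turned into. The file name `access_log.tsv`, the field names `user` and `url`, and the helper function names are all made up for illustration.

```python
from collections import defaultdict

# Roughly equivalent Pig Latin (file and field names are hypothetical):
#   logs   = LOAD 'access_log.tsv' AS (user:chararray, url:chararray);
#   byuser = GROUP logs BY user;
#   hits   = FOREACH byuser GENERATE group, COUNT(logs);


def map_phase(lines):
    """Emit a (user, 1) pair for every tab-separated record."""
    for line in lines:
        user, _url = line.rstrip("\n").split("\t", 1)
        yield user, 1


def shuffle(pairs):
    """Collect values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def hits_per_user(lines):
    """Run the three stages in order on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(lines)))


sample = ["alice\t/home", "bob\t/cart", "alice\t/checkout"]
print(hits_per_user(sample))
```

The Python is only a toy that runs on one machine. The point is the contrast: the Pig version says what to group and count, and leaves the map and reduce plumbing to the framework.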

I have been interviewing folks in the Hadoop space for a couple of years now, and what I have realized is that it's very easy for BI professionals to transition into the Hadoop area with Hive and Pig. They feel a lot like SQL and get you into the Hadoop/Big Data area. However, the true analysis of big data comes with experience in programming, an analytical background, statistics, and machine learning.

For anyone who would like to get into the Hadoop space, the quick entry is to start using Pig and Hive. However, if you want to progress any further, please be aware that these two will probably get you only so far.

Raghu Kashyap’s Musings
