
Why you need Pig Latin for Hadoop projects

Pig provides an engine for executing data flows in parallel on Hadoop. Pig Latin, its language, is a high-level language: you can apply all kinds of operators, for example sort, join, and filter, and a developer can create custom functions, much as you would in SQL.
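For instance, here is a minimal sketch of those operators, assuming two hypothetical input files, 'users' and 'orders':

users  = load 'users'  as (user_id:int, name:chararray, age:int);
orders = load 'orders' as (order_id:int, user_id:int, amount:double);
adults = filter users by age >= 18;                  -- filter rows
joined = join adults by user_id, orders by user_id;  -- SQL-style join
sorted = order joined by amount desc;                -- sort the result
dump sorted;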

The flow of Pig in the Hadoop environment is as follows: it makes use of both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce. In addition, the Hadoop ecosystem involves more than a dozen other data technologies.

The steps involved in executing a Pig script are as follows:

By default, Pig reads input files from HDFS, uses HDFS to store intermediate data between MapReduce jobs, and writes its output to HDFS.
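For example, here is a minimal sketch with explicit HDFS paths (the paths and file names are illustrative):

-- Relative paths resolve under the user's HDFS home directory by default.
raw = load '/user/hadoop/input/events' as (line:chararray);
-- Intermediate data between the generated MapReduce jobs is staged on
-- HDFS automatically; only the final result is stored explicitly.
store raw into '/user/hadoop/output/events_copy';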

Pig and Hadoop

Let us look at some important functionality of MapReduce: it is a simple but powerful parallel data-processing paradigm.

Every job in MapReduce consists of three main phases: map, shuffle, and reduce. In the map phase, the application has the opportunity to operate on each record in the input separately. 

Many maps are started at once, so that while the input may be gigabytes or terabytes in size, given enough machines the map phase can usually be completed in under one minute. The flow below shows how the map and reduce jobs work in data processing.
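In Pig terms, record-at-a-time operators such as filter and foreach run in the map phase, while group and join need all the records for a key together, so they trigger the shuffle and do their work in the reduce phase. A small sketch, assuming a hypothetical 'clicks' input:

clicks  = load 'clicks' as (user:chararray, url:chararray);
-- filter works on each record separately: map phase
valid   = filter clicks by url is not null;
-- group needs all records for a key in one place: shuffle plus reduce phase
grouped = group valid by user;
counts  = foreach grouped generate group, COUNT(valid);
dump counts;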

Flow of MapReduce

Below is a sample Pig script that counts the words in a file:

-- Load input from the file named 'mary', and call the single
-- field in the record 'line'.
lines = load 'mary' as (line);
-- TOKENIZE splits the line into a field for each word.
-- flatten will take the collection of records returned by
-- TOKENIZE and produce a separate record for each one, calling the single
-- field in the record 'word'.
words = foreach lines generate flatten(TOKENIZE(line)) as word;
-- Now group them together by each word.
grpd = group words by word;
-- Count them.
cntd = foreach grpd generate group, COUNT(words);
-- Print out the results.
dump cntd;
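To try the script, save it as, say, wordcount.pig and run it in Pig's local mode, which needs no cluster, assuming a file named mary sits in the working directory:

pig -x local wordcount.pig

On a cluster, plain pig wordcount.pig submits the generated MapReduce jobs to Hadoop.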


Why Pig Latin is a parallel data flow language

  • Pig Latin is a data flow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
  • These data flows can be simple linear flows, like the word count example given previously. They can also be complex workflows that include points where multiple inputs are joined and where data is split into multiple streams to be processed by different operators.
  • To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data; the sketch after this list illustrates such a graph.
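A minimal sketch of such a DAG, assuming two hypothetical inputs named 'users' and 'clicks': the two inputs are joined into one stream, and that stream is then split into two output streams.

-- Two inputs merge into a single node of the graph.
users  = load 'users'  as (user_id:int, country:chararray);
clicks = load 'clicks' as (user_id:int, url:chararray);
joined = join users by user_id, clicks by user_id;
flat   = foreach joined generate users::user_id as user_id, country, url;
-- One stream then splits in two, so the flow is a DAG rather than a line.
split flat into domestic if country == 'US', foreign if country != 'US';
store domestic into 'clicks_us';
store foreign into 'clicks_other';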

The top benefits of Pig Latin

  • Pig users can create custom functions (UDFs) to meet their particular processing needs; see the sketch after this list.
  • Easily programmed: complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences.
  • Pig programs accomplish huge tasks, yet they are easy to write and maintain. Because the system automatically optimizes execution of Pig jobs, the user can focus on semantics.
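The usual pattern is to register a jar of Java functions and call them by class name. A minimal sketch, where the jar myudfs.jar and the class myudfs.UPPER are illustrative placeholders rather than a real library:

register myudfs.jar;
names   = load 'users' as (user_id:int, name:chararray);
-- Call the custom function by its fully qualified class name.
shouted = foreach names generate user_id, myudfs.UPPER(name);
dump shouted;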
