May 8th 2017

Overview

In this blog, Pig is used to calculate and find Loan application metrics using loan, application, and applicant mock data. This mock data is prepared with specific criteria in mind. The fields are included in each dataset as per the standard data model used in finance industry in United States (US).

Usually, applicants applying for loan belong to different states. The overall summary of Loan Application Metrics by state will help the decision makers to find out performance (increase or decrease) of the branches by state. Few interesting calculations like average Depts. to Income (DTI), weighted average interest rate, and others by state were used.

About Apache Pig

Apache Pig is a platform used for analyzing large datasets consisting of a high-level language for expressing data analysis programs. It is used for performing Extract, Transform, and Load (ETL) operations. It is an interactive or script-based, execution environment supporting Pig Latin, a language used to express data flows. The Pig Latin language supports loading and processing of input data with a series of operators that transform the input data and produce the desired output.

Pre-requisites

To calculate and find Loan Application metrics by state using Pig, perform the following:

Setup Hortonworks (HDP) cluster with required services (HDFS, YARN, Pig, Hive, Tez, and others)
Create loan, application, and applicant Hive partition tables
Load data into those partition tables
Create loan metrics by state output partition table

Note: Datasets used in this use case, Hive DDL statements, and Pig scripts are available in GitHub. Please find GitHub location in the Reference section at the end of the article.

Use Case

In this use case, Pig is used to calculate and find loan application metrics. Pig scripts are executed in Tez execution engine instead of old MapReduce. The following calculations are included in the Pig scripts and are grouped by states, loan decision type, and loan type.

Average DTI of applicants
Total loan approved amount
Loan approved count
Total requested amount by applicants
Weighted average interest rate
Average applicant annual income
Average approved loan

Loan Application Metrics by State

Input Hive Tables

Partition tables are used as both input and output. The “loan_data” table structure of input table looks similar to the one below:

The “loan_data” table partition by “orgination_year” is derived from “orgination_date” field.

Note: Find the Reference section for Hive DDL statements GitHub location.

Extract, Transform, and Load (ETL) Using Pig

Pig script includes UDF functions and is used to invoke multiple Pig scripts. For simplicity, single Pig script is used to find loan application metrics.

Extract

Pig “Load” command and HCatLoader() are used to extract input data from partitioned Hive table.

The input data is stored in the variables for further processing.

Transform

In this phase, various operations such as filtering, joining datasets, grouping (by state, loan decision type, and loan type), and applying aggregate operations using Pig operators and derived calculations are performed. Filtering needed records will restrict the number of records to process and calculate. Two types of filters applied are:

Filtered only “Approved” records from loan data
Filtered only “loan” records from application data

Weighted average interest rate calculation is performed using the following steps:

Multiply each loan amount by its interest rate to obtain the per loan weight factor
Add the per loan weight factors together to obtain the total per loan weight factor
Add the loan amounts together to obtain the total loan amount
Divide the total per loan weight factor by the total loan amount. Multiply this by 100 to express it as a percentage and to obtain the actual interest rate.

Let us discuss them in detail. Multiply each loan amount by its interest rate to obtain the per loan weight factor.

DTI formula: (depts/income) * 100.

As required fields needed for calculations are in different variables, the loan, application, and applicant data and store are joined into a single variable.

The combined results are grouped by state, loan decision type, and loan type to provide metrics based on these combinations to gain more granularity. The aggregate operators used based on base value, the formula used for weighted rate, and average approved loan are explained in the below code:

Load

The output result is stored into the resulted Hive table using Pig “Store” command and HCatStorer ().

Execute Pig Script

As HCatalog functions are used in this use case, the following argument should be supplied while executing Pig scripts. The output result is stored into the resulted Hive table using Pig “Store” command and HCatStorer ().

The Pig scripts are executed using Tez execution engine, which constructs Direct Acyclic Graph (DAG) execution plan to execute the script in an optimized manner.

Understanding Tez DAG

The Apache Tez project aims at building an application framework that allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN. As execution will be tracked and listed in the Tez UI, all the tasks are executed using “Tez”. The Tez UI provides inventory of all queries executed with search and filter options.

Apache Tez models data processing as a dataflow graph. The vertices in the graph represents processing of data and the edges represents movement of data between the processing. Thus, the user logic that analyses and modifies the data sits in the vertices. Edges determine the consumer of the data, how the data is transferred, and the dependency between the producer and consumer vertices.

Vertex Swimlane: It is a new feature in HDP 2.5. It helps to quickly pinpoint the long-running parts of the query and to optimize them.

Conclusion

Pig has an option to dump the processed results and to validate the results before storing it into Hive table.

Once the results are verified, the final output will be stored into the partitioned Hive table.

The weighted average combines the interest rates into a single interest rate for yielding a combined cost same as the cost of the original separate loans. Let us sort by weighted rate descending to find out the loan in the state that produces more interest.

References

Datasets used in this use case, Hive DDL statements, and Pig scripts are available in the
GitHub location: https://github.com/treselle-systems/loan_application_metrics_by_state_using_pig
Hive vs Pig
https://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-hadoop-ecosystem/79
Hadoop Pig and Pig Latin for Big Data
http://www.dummies.com/programming/big-data/hadoop/hadoop-pig-and-pig-latin-for-big-data/
About Tez
https://tez.apache.org/
Data Processing API in Apache Tez
https://hortonworks.com/blog/expressing-data-processing-in-apache-tez/
Weighted Interest Rate
https://www.edvisors.com/repay-student-loans/federal/consolidation/calculate-weighted-average-interest-rates/

Elevate your TikTok experience with P…
Understanding the Disadvantages of So…
Another Tech Company Is Now Laying Of…
NTT DATA and Beyond ONE forge strateg…
Integration des Wartungssoftware mit …

This post first appeared on Front-end Code Review & Validation Tools | Treselle Systems, please read the originial post: here