
Data Normalization and Filtration Using Drools

Overview

Drools is a rule engine used to implement expert systems with a rule-based approach. Here, it is used to convert both structured and unstructured data into transient data by applying the business logic for normalizing and filtering the data in DRL files.

In this blog, let us discuss normalizing and filtering data using Drools.

Pre-requisite

Download and install the following:

  • Java 8 - http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
  • Drools latest version - https://www.drools.org/download/download.html

Use Case

Oil well drilling datasets from two US states, Arkansas (AR) and Oklahoma (OK), are taken as the input data and processed based on API numbers. The following operations are performed:

  • Filter invalid drill types data
  • Remove null date values
  • Normalize API numbers to have correct digits
  • Format dates to the required format
  • Remove duplicate well information by taking only maximum modified date value

Data Description

The oil well drilling datasets contain raw information about wells and their formation details, drill types, and production dates. The Arkansas dataset has 6040 records and the Oklahoma dataset has 2559 records.

The raw data contains invalid values such as nulls, invalid dates, invalid drill types, and duplicate or invalid well information with modified dates. The raw data from the source is loaded into MS SQL for further filtering and normalization.

Arkansas Dataset

The Arkansas data has the following issues:

  • Null values for date_of_1st_prod
  • Invalid values for initial_production
  • Incorrect digits in well API numbers

Oklahoma Dataset

The Oklahoma data has the following issues:

  • Duplicate well data
  • Invalid date values in test date

  • MS SQL is used to transform the input data into transient data.
  • Java Database Connectivity (JDBC) is used for the interaction between Java and MS SQL, both to read the input and to write the transformed output back into MS SQL.
LOGGER.info("Creating MSSQL connection ................");
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
conn = DriverManager.getConnection(url, userName, passWord);
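
After the connection is created, the input rows can be read and mapped to fact objects before the rules are fired. The snippet below is only a sketch; the table name, column names, and the Arkanas setters are assumptions based on the field names used later in the rules.

// Sketch: read raw Arkansas rows from MS SQL and map them to Arkanas facts.
// Table and column names here are assumed, not the actual schema.
List<Arkanas> facts = new ArrayList<>();
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(
         "SELECT api_number, date_of_1st_prod, initial_production FROM arkansas_wells")) {
    while (rs.next()) {
        Arkanas well = new Arkanas();
        well.setApiNumber(rs.getString("api_number"));
        well.setFirstProdDate(rs.getString("date_of_1st_prod"));
        well.setInitialProcuction(rs.getString("initial_production"));
        facts.add(well);
    }
}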

KIE File System

KIE File System is used to load DRL files and to reduce the dependency on the KIE module configuration file. It allows the business logic to be changed without redeployment and keeps the DRL files separate from the JAR and deployment files.

All the rule files in the given path are loaded into the KIE File System so that the rules can be applied to the inserted facts. Facts are inserted into the KieSession for which the particular rules are loaded.

public void createKnowledgeSession(String rulePath) {
    // Gets the factory class for KIE.
    KieServices ks = KieServices.Factory.get();
    KieFileSystem kfs = ks.newKieFileSystem();
    KieRepository kr = ks.getRepository();

    // Collect all DRL rule files from the given directory.
    File rulePathFile = new File(rulePath);
    File[] ruleFiles = null;
    if (rulePathFile.isDirectory()) {
        ruleFiles = rulePathFile.listFiles();
    }
    if (ruleFiles == null) {
        throw new IllegalArgumentException("Rule path is not a directory: " + rulePath);
    }

    // Write each DRL file into the KIE File System.
    for (File drlFile : ruleFiles) {
        LOGGER.info("File path is " + drlFile.getAbsolutePath());
        kfs.write(ResourceFactory.newFileResource(drlFile));
    }

    // Build the rules and create a session from the default release id.
    KieBuilder kb = ks.newKieBuilder(kfs);
    kb.buildAll();
    KieContainer kContainer = ks.newKieContainer(kr.getDefaultReleaseId());
    this.kSession = kContainer.newKieSession();
}
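
Once the session is created, the mapped fact objects are inserted and the rules are fired. The method below is a minimal usage sketch; the method name and the facts list are assumptions, not part of the original code.

// Sketch: insert the mapped facts into the session created above and fire the rules.
public void runRules(List<Object> facts) {
    for (Object fact : facts) {
        kSession.insert(fact);   // each database row becomes a fact in working memory
    }
    kSession.fireAllRules();     // invalid facts are retracted, valid ones are normalized and flagged
}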

Applying Rules

Multiple rules are applied to both datasets to remove duplicate and invalid data, normalize the values, and filter the records.

Arkansas Dataset – Rules Applied

  1. Removing Invalid Values
  2. Setting Initial Production and Unifying Date Format

Oklahoma Dataset – Rules Applied

  1. Applying Date Filter
  2. Getting Max Modify Date Values
  3. Filtering Max Modify Date Values

Applying Rules in Arkansas Dataset

Rule 1: Removing Invalid Values

This rule is applied to remove invalid production date and initial production values using the retract keyword.

rule "remove invalid initial production and first production values" salience 2
 when
 $arkanas : Arkanas( $first_prod_date : firstProdDate,$initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || 
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || 
$initial_production.contains("See Remarks") )
 then
 retract($arkanas);
 end
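
The StringUtil.isValidString helper used in the condition is not shown in the post; a minimal sketch of what it is assumed to do (reject null or blank strings) is:

public class StringUtil {
    // Assumed helper: a value is "valid" if it is non-null and not blank.
    public static boolean isValidString(String value) {
        return value != null && !value.trim().isEmpty();
    }
}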

Arkansas data filtered after removing invalid production date and initial production values is as follows:

Rule 2: Setting Initial Production and Unifying Date Format

This rule is applied to format the gas_vol values (the initial production value is treated as gas_vol) and to convert the production date from a string to the prescribed date type using the given date format.
Well API numbers are padded with zeros to normalize them to fourteen digits.

rule "set initial production to gas_vol and first_prod_date to correct format" salience 1
 when 
$arkanas : Arkanas( $initial_production : initialProcuction,$first_prod_date : firstProdDate,$api_number : apiNumber )
eval( CommonUtil.isValidDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy") )
 then
 try {
$arkanas.setDateTime(CommonUtil.getDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy"));
 $arkanas.setGasVol(Float.parseFloat($initial_production));
 $arkanas.setFilteredData(true);
 $arkanas.setApiNumber($api_number+"0000");
 } catch (Exception e) {
 e.printStackTrace();
 }
end
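
The CommonUtil.isValidDate and CommonUtil.getDate helpers are likewise not shown; a minimal sketch assuming SimpleDateFormat-based parsing with these signatures could be:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class CommonUtil {
    // Assumed helper: true if the value parses strictly with the given pattern.
    public static boolean isValidDate(String value, String pattern) {
        try {
            getDate(value, pattern);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Assumed helper: parse the value into a java.util.Date using the given pattern.
    public static Date getDate(String value, String pattern) throws ParseException {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern, Locale.ENGLISH);
        sdf.setLenient(false);
        return sdf.parse(value);
    }
}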

API numbers are normalized after applying the normalizing rule:

First production dates are converted into the prescribed format in the DRL file and mapped to the date_time column:

Applying Rules in Oklahoma Dataset

Rule 1: Applying Date Filter

This rule keeps only the records whose test date falls within the last seven years (on or after 2010-01-01) and whose drill type is horizontal; all other records are retracted.

rule "data between given range"
 when
 $oklahoma : Oklahoma( $api_number : apiNumber,$test_date : testDate,$modify_date : 
 modifyDate,$drill_type : drillType)
eval( !(CommonUtil.isDateWithinRange(CommonUtil.get
Date("2010-01-01","yyyy-MM-dd"),new 
Date(),$test_date) && StringUtil.isValidString($drill_type) && ( 
$drill_type.toUpperCase().startsWith("HORIZONTAL") || 
$drill_type.toUpperCase().equalsIgnoreCase("H") ) ) )
 then
 retract($oklahoma);
 end
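
CommonUtil.isDateWithinRange is another assumed helper; a sketch that checks whether the test date lies between the two boundary dates (taking the test date as a java.util.Date) might look like:

// Assumed helper, added to the CommonUtil sketch above:
// true if testDate lies between start and end, inclusive.
public static boolean isDateWithinRange(Date start, Date end, Date testDate) {
    return testDate != null && !testDate.before(start) && !testDate.after(end);
}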

Data with a test date before 2010-01-01 is filtered out to remove invalid test date values, and the test date is mapped to date_time as mentioned in the data description section:

Rule 2: Getting Max Modify Date Values

This rule is applied to get the maximum modify date values by grouping on API number into a temporary storage object called MaxValue, using accumulate.

Accumulate gets the max modify date by grouping the data by well API numbers.

rule "selecting ok data with max modify date"
 when
 $oklahoma : Oklahoma($api_number : getApiNumber()) and not MaxValue($api_number == apinumber)
accumulate(ok : Oklahoma(getApiNumber()==$api_number),$maxDateValue : 
max(ok.getModifyDate().getTime()))
 then
 insert(new MaxValue($api_number,$maxDateValue));
 end
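
MaxValue is a temporary fact that the post does not list; a minimal sketch with the field names used in the rules (everything else is an assumption) is:

public class MaxValue {
    // Temporary fact: maximum modify-date timestamp per well API number.
    private String apinumber;
    private long max_modify_date;

    public MaxValue(String apinumber, long max_modify_date) {
        this.apinumber = apinumber;
        this.max_modify_date = max_modify_date;
    }

    public String getApinumber() { return apinumber; }
    public long getMax_modify_date() { return max_modify_date; }
}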

Rule 3: Filtering Max Modify Date Values

This rule is applied to filter on the max modify date values by storing the unique values so that data is not replicated.

rule "Filters max modify date values"
 when
 $oklahoma : Oklahoma($api_number : getApiNumber(),$max_date : getModifyDate().getTime()) 
and not UniqVal($api_number == apinumber,$max_date == max_modify_date)
$maxValue : MaxValue($api_number==apinumber && $max_date==max_modify_date)
 then
 insert(new UniqVal($api_number,$max_date));
 $oklahoma.setFilteredData(true);
 end
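
UniqVal is a temporary fact analogous to MaxValue above. After all rules have fired, the surviving facts flagged with filteredData can be pulled out of the session for writing back to MS SQL; one possible sketch, assuming an isFilteredData getter on the Oklahoma fact, is:

// Sketch: collect the Oklahoma facts that the rules flagged as filtered data.
Collection<?> survivors = kSession.getObjects(new ObjectFilter() {
    @Override
    public boolean accept(Object object) {
        return object instanceof Oklahoma && ((Oklahoma) object).isFilteredData();
    }
});
LOGGER.info("Facts surviving the rules: " + survivors.size());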

After removing the duplicate well information by grouping the API number with max modify date, the number of records is reduced to fifty:

Salience

Salience is used to set the order in which rules are applied, since certain rules need to be executed after others have run.

For example, in the Arkansas dataset, the rule "removing invalid initial production and first production values" (salience 2) had to be executed before the rule that formats the date and gas_vol (salience 1), since rules with higher salience fire first.

rule "remove invalid initial production and first production values" salience 2
 when
 $arkanas : Arkanas( $first_prod_date : firstProdDate, $initial_production : initialProcuction )
eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || 
"X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || 
$initial_production.contains("See Remarks") )
 then
 retract($arkanas);
 end

Conclusion

Business rules are separated from the business code by applying the business logic for normalization and filtration in the DRL file, making it easy to change the business logic without redeployment.

In this blog, the test date was used to get the last seven years of data. A business analyst can change this date range in the future without changing code or redeploying.

References

  • Drools Expert: https://training-course-material.com/training/Drools_Expert_-_mvel_-_LHS_-_advanced
  • JBoss Community Team: https://docs.jboss.org/drools/release/6.5.0.Final/drools-docs/html_single/
  • GitHub Location: https://github.com/treselle-systems/data_normalization_and_filtration_using_drools

