Overview
Drools, a Rule Engine, is used to implement an expert system using a rule-based approach. It is used to convert both structured and unstructured data into transient data by applying business logic for normalizing and filtering data in DRL file.
In this blog, let us discuss about normalizing and filtering data using Drools.
Pre-requisite
Download and install the following:
- Java 8 - http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
- Drools latest version - https://www.drools.org/download/download.html
Use Case
Oil Well Drilling datasets from two different states–Arkansas (AR) and Oklahoma (OK) of USA are taken as the input data for processing based on API numbers.
- Filter invalid drill types data
- Remove null date values
- Normalize API numbers to have correct digits
- Format dates to the required format
- Remove duplicate well information by taking only maximum modified date value
Data Description
The Oil Well Drilling datasets contain raw information about wells & its formation details, drill types, and production dates. Arkansas dataset has 6040 records and Oklahoma dataset has 2559 records.
The raw data contains invalid values such as null, invalid date, invalid drill type, and duplicate well & invalid well information with modified dates. These raw data from source is transformed to MS SQL for further filtering and normalization.
Arkansas Dataset
Null values for date_of_1st_prod
Invalid values for initial_production
Incorrect Digits in Well API Numbers
Oklahoma Dataset
Duplicate well data
Invalid date values in test date
- MS SQL is used to transform the input data into transient data.
- Java Database Connectivity (JDBC) is used for interaction between Java and MS SQL in order to get input and to write output into MS SQL after transforming the data.
LOGGER.info("Creating MSSQL connection ................"); Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver"); conn = DriverManager.getConnection(url,userName,passWord);
KIE File System
KIE File System is used to load DRL files and to reduce the dependency from KIE module configuration file. It allows us to change the business logic without redeployment and to keep the DRL files separately from Jar and deployment files.
All the rule files in the given path are loaded into MS SQL to apply the rules on facts inserted. Facts are inserted to ksessions for which particular rules are loaded.
public void createKnowledgeSession(String rulePath) { //Gets the factory class for KIE. KieServices ks = KieServices.Factory.get(); KieFileSystem kfs = ks.newKieFileSystem(); KieRepository kr = ks.getRepository(); File rulePathFile = new File(rulePath); File ruleFiles[] = null; if(rulePathFile.isDirectory()) { ruleFiles = rulePathFile.listFiles(); } for(File drlFile : ruleFiles) { LOGGER.info("File path is "+drlFile.getAbsolutePath()); kfs.write(ResourceFactory.newFileResource(drlFile)); } KieBuilder kb = ks.newKieBuilder(kfs); kb.buildAll(); KieContainer kContainer = ks.newKieContainer(kr.getDefaultReleaseId()); this.kSession = kContainer.newKieSession(); }
Applying Rules
Multiple rules were applied to both the datasets to process data, remove duplicate and invalid data, normalize data, and filter data.
Arkansas Dataset – Rules Applied
- Removing Invalid Values
- Setting Initial Production and Unifying Date Format
Oklahoma Dataset – Rules applied
- Applying Date Filter
- Getting Max Modify Date Values
- Filtering Max Modify Date Values
Applying Rules in Arkansas Dataset
Rule 1: Removing Invalid Values
This rule is applied to remove invalid production date and Initial Production values using retract key words.
rule "remove invalid initial production and first production values" salience 2 when $arkanas : Arkanas( $first_prod_date : firstProdDate,$initial_production : initialProcuction ) eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || "X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || $initial_production.contains("See Remarks") ) then retract($arkanas); end
Arkansas data filtered after removing invalid production date and initial production values is as follows:
Rule 2: Setting Initial Production and Unifying Date Format
This rule is applied to format gas_vol values (initial production value is considered as gas_vol) and to format the production date to prescribed date type from string by giving the date format.
Well API numbers are normalized with zeros for formatting the API numbers to fourteen digits.
rule "set initial production to gas_vol and first_prod_date to correct format" salience 1 when $arkanas : Arkanas( $initial_production : initialProcuction,$first_prod_date : firstProdDate,$api_number : apiNumber ) eval( CommonUtil.isValidDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy") ) then try { $arkanas.setDateTime(CommonUtil.getDate($first_prod_date,"EEE MMM dd HH:mm:ss z yyyy")); $arkanas.setGasVol(Float.parseFloat($initial_production)); $arkanas.setFilteredData(true); $arkanas.setApiNumber($api_number+"0000"); } catch (Exception e) { e.printStackTrace(); } end
API numbers are normalized after applying the normalizing rule:
First production dates are converted into the prescribed format in DRL file and mapped with date_time column as shown in the below diagram:
Applying Rules in Oklahoma Dataset
Rule 1: Applying Date Filter
This rule is applied to filter the values based on their test date for last 7 years and to filter horizontal drill type.
rule "data between given range" when $oklahoma : Oklahoma( $api_number : apiNumber,$test_date : testDate,$modify_date : modifyDate,$drill_type : drillType) eval( !(CommonUtil.isDateWithinRange(CommonUtil.get Date("2010-01-01","yyyy-MM-dd"),new Date(),$test_date) && StringUtil.isValidString($drill_type) && ( $drill_type.toUpperCase().startsWith("HORIZONTAL") || $drill_type.toUpperCase().equalsIgnoreCase("H") ) ) ) then retract($oklahoma); end
Data with test date before 2010-01-01 is filtered to remove invalid test date values and is mapped with date_time as mentioned in the data description section:
Rule 2: Getting Max Modify Date Values
This rule is applied to get the Max Modify date values by grouping API number in temp storage object called MaxValue by using accumulate.
Accumulate gets the max modify date by grouping the data by well API numbers.
rule "selecting ok data with max modify date" when $oklahoma : Oklahoma($api_number : getApiNumber()) and not MaxValue($api_number == apinumber) accumulate(ok : Oklahoma(getApiNumber()==$api_number),$maxDateValue : max(ok.getModifyDate().getTime())) then insert(new MaxValue($api_number,$maxDateValue)); end
Rule 3: Filtering Max Modify Date Values
This rule is applied to filter max modify date values by storing the unique values such that data is not replicated.
rule "Filters max modify date values" when $oklahoma : Oklahoma($api_number : getApiNumber(),$max_date : getModifyDate().getTime()) and not UniqVal($api_number == apinumber,$max_date == max_modify_date) $maxValue : MaxValue($api_number==apinumber && $max_date==max_modify_date) then insert(new UniqVal($api_number,$max_date)); $oklahoma.setFilteredData(true); end
After removing the duplicate well information by grouping the API number with max modify date, the number of records is reduced to fifty:
Salience
Salience is used to set orders for applying rules as certain rules need to be executed after execution of certain rules.
For example, in Arkansas dataset, rules for “removing invalid initial production and first production values” had to be executed first to format date and gas_vol.
rule "remove invalid initial production and first production values" salience 2 when $arkanas : Arkanas( $first_prod_date : firstProdDate, $initial_production : initialProcuction ) eval( !StringUtil.isValidString($first_prod_date) || !StringUtil.isValidString($initial_production) || "X".equalsIgnoreCase($initial_production) || $initial_production.contains(",") || $initial_production.contains("See Remarks") ) then retract($arkanas); end
Conclusion
Business rules are separated from business code by applying the business logic for normalization and filtration in the DRL file. Thus, easily changing business logic without redeployments.
In this blog, the test date was used for getting last 7 years data. Business analyst can change this date range in future without changing code or performing redeployment.
References
- Drools Expert: https://training-course-material.com/training/Drools_Expert_-_mvel_-_LHS_-_advanced
- JBoss Community Team: https://docs.jboss.org/drools/release/6.5.0.Final/drools-docs/html_single/
- GitHub Location: https://github.com/treselle-systems/data_normalization_and_filtration_using_drools
This post first appeared on Front-end Code Review & Validation Tools | Treselle Systems, please read the originial post: here