Hadoop Big Data Java – Manufacturing scenario
Purpose: The purpose of this document is to explain how to apply the power of the Hadoop Big Data platform in a Manufacturing scenario.
Challenge: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Data growth challenges and opportunities are considered to be three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). In the Manufacturing space there are a number of scenarios where we can speak about Big Data. From an Engineering Modeling perspective we can simulate every aspect of the manufacturing process and gain business insight when doing Demand Forecasting, Supply Chain Planning, Capacity Planning, Resource Scheduling, Inventory Optimization, OEE Optimization, etc.
Solution: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. The Apache Hadoop platform consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and other components.
HDInsight is Microsoft's Hadoop-based service that brings a 100% Apache Hadoop-based solution to the cloud. HDInsight gives you the ability to gain the full value of Big Data with a modern, cloud-based data platform that manages data of any type, whether structured or unstructured, and of any size. With HDInsight you can seamlessly store and process data of all types through Microsoft's modern data platform that provides simplicity, ease of management, and an open Enterprise-ready Hadoop service all running in the cloud. You can analyze your Hadoop data directly in Excel using new capabilities like Power Pivot and Power View.
Scenario
In this scenario (OEE Optimization) I want to develop a Hadoop MapReduce program to analyze Equipment Run Log file(s) and gain business insight in order to optimize OEE.
A sample Equipment Run Log (file) in structured form may look like this:
Time | Machine   | Event   | Message
8:55 | Machine1  | [TRACE] | 8:55 Machine1 [TRACE] exit code is 546789093
9:00 | Machine1  | [TRACE] | 9:00 Machine1 [TRACE] exit code is 775367878
9:01 | Machine2  | [DEBUG] | 9:01 Machine2 [DEBUG] exit code is 5546774
9:03 | Machine3  | [TRACE] | 9:03 Machine3 [TRACE] exit code is 455674443
9:03 | Machine1  | [INFO]  | 9:03 Machine1 [INFO] exit code is 99682642
9:06 | Machine1  | [TRACE] | 9:06 Machine1 [TRACE] exit code is 56425462
9:07 | Machine6  | [DEBUG] | 9:07 Machine6 [DEBUG] exit code is 3664526
9:10 | Machine29 | [TRACE] | 9:10 Machine29 [TRACE] exit code is 6426342
9:10 | Machine12 | [TRACE] | 9:10 Machine12 [TRACE] exit code is 4629422
9:10 | Machine2  | [DEBUG] | 9:10 Machine2 [DEBUG] exit code is 7628764324
9:10 | Machine6  | [TRACE] | 9:10 Machine6 [TRACE] exit code is 76428436284
9:15 | Machine1  | [TRACE] | 9:15 Machine1 [TRACE] exit code is 24257443623
9:25 | Machine10 | [DEBUG] | 9:25 Machine10 [DEBUG] exit code is 24586
9:28 | Machine9  | [FATAL] | 9:28 Machine9 [FATAL] exit code is 2745722
However, the data we collect from equipment may be unstructured, semi-structured, or a combination of semi-/unstructured data and structured data.
So the same sample Equipment Run Log (file) may also look like this:
8:55 Machine1 [TRACE] exit code is 546789093
9:00 Machine1 [TRACE] exit code is 775367878
9:01 Machine2 [DEBUG] exit code is 5546774
This is a diagnostics message: XYZ
Machine downtime - start
Machine downtime - end
9:03 Machine3 [TRACE] exit code is 455674443
9:03 Machine1 [INFO] exit code is 99682642
This is a diagnostics message: XYZ
This is a diagnostics message: XYZ
9:06 Machine1 [TRACE] exit code is 56425462
9:07 Machine6 [DEBUG] exit code is 3664526
9:10 Machine29 [TRACE] exit code is 6426342
9:10 Machine12 [TRACE] exit code is 4629422
9:10 Machine2 [DEBUG] exit code is 7628764324
This is a diagnostics message: XYZ
9:10 Machine6 [TRACE] exit code is 76428436284
9:15 Machine1 [TRACE] exit code is 24257443623
9:25 Machine10 [DEBUG] exit code is 24586
This is a diagnostics message: XYZ
This is a diagnostics message: XYZ
This is a diagnostics message: XYZ
9:28 Machine9 [FATAL] exit code is 2745722
In order to analyze unstructured data in Equipment Run Log file(s), we will apply the Hadoop MapReduce algorithm. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program comprises a Map() procedure that performs filtering and sorting (such as sorting messages by type into queues, one queue for each type) and a Reduce() procedure that performs a summary operation (such as counting the number of messages in each queue, yielding type frequencies). The MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the parts of the system, providing for redundancy and fault tolerance, and managing the overall process.
Please see how the MapReduce algorithm works in the schema below.
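To make this concrete, here is a minimal sketch of how a few of the sample log lines above would flow through Map and Reduce (key-value pairs shown informally):

Map("8:55 Machine1 [TRACE] exit code is 546789093") emits (TRACE, 1)
Map("9:01 Machine2 [DEBUG] exit code is 5546774") emits (DEBUG, 1)
Map("9:28 Machine9 [FATAL] exit code is 2745722") emits (FATAL, 1)
The shuffle phase then groups the pairs by key, e.g. (TRACE, [1, 1, 1, 1, 1, 1, 1, 1])
Reduce sums each group, yielding (TRACE, 8), (DEBUG, 4), (INFO, 1), (FATAL, 1)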
Walkthrough
Now let's review the process step-by-step!
Let's install Microsoft HDInsight Developer Preview first
Once Microsoft HDInsight Developer Preview is installed, you can access your local Hadoop cluster dashboard at http://localhost:8085/ (the exact URL may vary)
You can now navigate to the local cluster to see what you can do with it
Please note that there are samples which you can deploy and try out
Now we can go ahead and create a Java program implementing a MapReduce job which will analyze Equipment Run Log file(s) and extract meaningful information in order to get business insight into the types of messages we have there. For the sake of simplicity, we will use Notepad
Source code (Java)
//Standard Java imports
import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

//Hadoop imports
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

/**
 * Tutorial1: counts Equipment Run Log messages by log level.
 */
public class Tutorial1
{
    //The Mapper: emits (logLevel, 1) for every log level token found in a line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
    {
        //Log levels to search for
        private static final Pattern pattern = Pattern.compile("(TRACE)|(DEBUG)|(INFO)|(WARN)|(ERROR)|(FATAL)");
        private static final IntWritable accumulator = new IntWritable(1);
        private Text logLevel = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> collector, Reporter reporter)
            throws IOException
        {
            //split on space, '[', and ']'
            final String[] tokens = value.toString().split("[ \\[\\]]");
            if(tokens != null)
            {
                //now find the log level token
                for(final String token : tokens)
                {
                    final Matcher matcher = pattern.matcher(token);
                    //log level found
                    if(matcher.matches())
                    {
                        logLevel.set(token);
                        //Create the key-value pairs
                        collector.collect(logLevel, accumulator);
                    }
                }
            }
        }
    }

    //The Reducer: sums the counts for each log level
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> collector,
            Reporter reporter) throws IOException
        {
            int count = 0;
            //code to aggregate the occurrences
            while(values.hasNext())
            {
                count += values.next().get();
            }
            System.out.println(key + "\t" + count);
            collector.collect(key, new IntWritable(count));
        }
    }

    //The Java main method to execute the MapReduce job
    public static void main(String[] args) throws Exception
    {
        //Code to create a new job specifying the MapReduce class
        final JobConf conf = new JobConf(Tutorial1.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        //Combiner is commented out - to be used in a bonus activity
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        //File input path passed as a command line argument
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        //File output path passed as a command line argument
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        //statement to execute the job
        JobClient.runJob(conf);
    }
}
Then, using the Hadoop command prompt utility, we'll take a number of steps to generate a JAR file
OEEAnalysis
Please find the detailed steps explained here: http://gettingstarted.hadooponazure.com/hw/mapReduce.html
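As a rough sketch, the compile-and-package steps typically look like this (the Hadoop core JAR path and version are assumptions that depend on your installation):

mkdir Tutorial1_classes
REM compile against the Hadoop core library (jar location/version is an assumption)
javac -classpath c:\Hadoop\hadoop-core-1.1.0-SNAPSHOT.jar -d Tutorial1_classes Tutorial1.java
REM package the compiled classes into the OEEAnalysis JAR
jar -cvf OEEAnalysis.jar -C Tutorial1_classes .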
Once the JAR file has been generated, we can submit the MapReduce job for execution on the Windows Azure HDInsight Portal
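If you prefer the Hadoop command prompt to the portal, a job submission sketch would look roughly like this (the input and output paths are hypothetical):

hadoop jar OEEAnalysis.jar Tutorial1 /example/data/equipmentrunlog.txt /example/data/oeeanalysisoutput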
As the result of this, we classify all messages by type and get counts per type
The MapReduce program executed successfully, which we can review in the job history
Job History
And now we can get the output file from the Hadoop File System as shown below
Command prompt
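A minimal sketch of the commands (the output directory name is an assumption matching the hypothetical job submission above):

hadoop fs -ls /example/data/oeeanalysisoutput
hadoop fs -cat /example/data/oeeanalysisoutput/part-00000
REM or copy the result to the local file system
hadoop fs -copyToLocal /example/data/oeeanalysisoutput/part-00000 c:\oeeanalysis\result.txt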
The result will look like this:
Result
DEBUG | 4
FATAL | 1
INFO  | 1
TRACE | 8
Now we can do a quick analysis using the power of Microsoft Excel and visualize the results in a pie chart
Pie chart
We have now executed the MapReduce job locally and obtained the result. Similarly, you can leverage the Windows Azure HDInsight service, which can potentially provide much more computational power to process the terabytes or petabytes of Equipment Run Log file(s) data we may have collected
This is how you can create a MapReduce job in a Windows Azure HDInsight cluster by supplying the JAR file
In this walkthrough we reviewed how to install and set up an HDInsight cluster locally and in the Cloud, and how to do OEE Optimization utilizing Big Data collected from Equipment on the Shop Floor in the form of unstructured logs to gain valuable business insight. Please note that we could also mash this data up with transactional data in Microsoft Dynamics AX 2012 for better business insights.
Please find more info about OEE here: http://en.wikipedia.org/wiki/Overall_equipment_effectiveness
Summary: This document describes how to implement a Hadoop MapReduce job using Java in order to do OEE Optimization for a Manufacturing organization. The Hadoop platform provides a cheaper (scales to PBs or more), faster (parallel data processing), and better (suited to particular types of Big Data problems) way to work with unstructured, semi-structured, or combined semi-/unstructured and structured data, and to gain valuable business insight for optimization. We discussed how to utilize a local Hadoop environment as well as the Windows Azure HDInsight service available in the Cloud. Please learn more about Windows Azure here: http://www.windowsazure.com.
Tags: Big Data, Windows Azure, HDInsight, Hadoop, MapReduce, Manufacturing, Microsoft Dynamics AX 2012, OEE, Overall Equipment Effectiveness, Java.
Note: This document is intended for information purposes only, presented as it is with no warranties from the author. This document may be updated with more content to better outline the issues and describe the solutions.
Author: Alex Anikiev, PhD, MCP