The Application of Machine Learning Tools on Complex and Big Data Projects


David J. Patrishkoff, President, E3 Extreme Enterprise Efficiency, Orlando, FL, USA


Keywords: Machine Learning, Big Data, Predictive Modeling





Six Sigma statistical analysis techniques, performed within the DMAIC process, provide a robust, generalized path to identify, verify and address the root causes of many business issues. However, as companies challenge their employees to produce a continuous flow of improvements and cost reductions, low- and medium-hanging-fruit projects quickly disappear. The next higher class of improvement projects may require the resolution of mission-critical and systemic business issues, where the root causes and solutions may be deeply hidden in complex or big data sets. In such cases, standard statistical analysis software may be limited in its ability to provide the needed level of support. A new and growing class of machine learning analysis tools has been developed that can identify hidden groupings, correlations, patterns and information in complex or big data sets with automated modes of operation.

Machine learning is a branch of artificial intelligence that involves automated, predictive analytical model building with minimal human intervention. The term “machine learning” is now more popular in the USA as a Google search term than “six sigma” and “lean manufacturing” combined. Machine learning does not imply that computers can think on their own, but it does represent a new ability of machines to categorize, identify, detect, learn and then predict patterns from complex and big data sets. This represents a great advantage for Six Sigma belts, researchers and other analysts who want to move up to the next level of data analysis capability. These modern tools can protect against superficial analysis and against the other extreme, commonly referred to as analysis-paralysis.

Superficial analysis can result in partial solutions and flawed conclusions. Unfortunately, classic data analysis software and methods provide no automated warning flags to indicate when a data set is under-analyzed. Too often, when a regression or hypothesis-test p-value is below 0.05 and the R-squared values or odds ratios seem somewhat respectable for at least a few factors, the analysis is prematurely deemed successful and complete. Too often, conclusions are drafted with no plan to explore the reasons for ever-present and unexplained outliers. Superficial analysis can easily derail the search for further knowledge after a few weak or barely significant root causes are discovered with linear and other limited analysis methods.
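As a minimal illustration of this trap (a numpy sketch on synthetic data, not an example from the presentation): a main-effects-only linear regression can report a respectable-looking fit while explaining almost none of the variation, because the real driver is an interaction term that a purely linear pass never examines.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True process: a weak main effect plus a strong x1*x2 interaction
y = 0.3 * x1 + 1.5 * x1 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(X, y):
    """Ordinary least squares R-squared for design matrix X (with intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_main = r_squared(np.column_stack([x1, x2]), y)           # main effects only
r2_full = r_squared(np.column_stack([x1, x2, x1 * x2]), y)  # plus interaction

print(f"main-effects R^2: {r2_main:.2f}, with interaction: {r2_full:.2f}")
```

Declaring success on the main-effects model would be exactly the superficial analysis described above; automated searches over factor combinations are designed to flag the interaction instead.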

Analysis-paralysis is a stagnated, continual state of investigation that never reaches an acceptable or desired conclusion. It can easily set in when an analyst, problem-solver or researcher underestimates the complexity of the business problem or research project, or when an unstructured analysis strategy or inefficient analysis tools yield no new knowledge or meaningful conclusions. Weak analysis results can create a prolonged spiral of uncertainty and guilt, fueled by self-doubt, more analysis, over-analysis, over-thinking, circular thinking and lengthy but futile analysis paths in the search for missed knowledge and correlations.

Machine learning tools can guide analysts through complex data sets with supervised and unsupervised modes of operation, exploring data sets with millions of rows and billions of data cells in analyses that run on most modern laptop or desktop PCs. Supervised machine learning modes exhaustively search for correlating factors and classifications that can predict and explain a target variable. Unsupervised machine learning modes can independently expose hidden clusters in the data that may be of interest, regardless of the target factor. Machine learning tools can also create a fully automated array of reports, including decision and regression tree charts, predictive model reports, factor importance rankings, 2D and 3D charting for all possible factor combinations, hotspot reports, warnings for over-fitted models and much more. This rapid and exhaustive automated analysis provides the analyst with valuable information about hidden patterns, structure and relationships.
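The two modes can be sketched in a few lines of numpy on synthetic data (the commercial tools automate this at far larger scale and with richer models): a supervised pass ranks factors by their association with a target, while an unsupervised pass, here a hand-rolled two-means clustering, recovers hidden groups without ever referencing the target.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: two hidden clusters, and a target driven by factor 0
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels_true = rng.integers(0, 2, size=200)
X = centers[labels_true] + rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Supervised mode (sketch): rank factors by |correlation| with the target
importance = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]

# Unsupervised mode (sketch): two-means clustering, initialized at the
# extremes of factor 0, recovers the hidden groups without using y at all
c = X[[X[:, 0].argmin(), X[:, 0].argmax()]]
for _ in range(10):
    assign = np.argmin(((X[:, None] - c) ** 2).sum(-1), axis=1)
    c = np.array([X[assign == k].mean(axis=0) for k in range(2)])
```

After the loop, `importance` ranks factor 0 highest (the supervised view) and `assign` matches the hidden cluster labels (the unsupervised view), even though neither pass was told how the data was generated.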

Throughout this presentation, a data analysis case study will demonstrate the combined application of advanced analysis strategies and machine learning tools on a large, ongoing research project involving a complex data set. The project analyzes data from the United States National Highway Traffic Safety Administration (NHTSA) on vehicle accidents, injuries and fatalities going back to 1975, with the analysis updated whenever new annual data becomes available. The purpose of this ongoing research project is to identify the multiple interacting factors that correlate with roadway injury severities and fatalities, recognizing that these correlations may change over time for various reasons.
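The time-varying aspect of that workflow can be sketched as follows. This is a hypothetical stand-in: the real project reads NHTSA annual files, while here each "year" is a synthetic sample in which the relationship between one factor and injury severity is made to drift, so re-estimating per year surfaces the change instead of averaging it away.

```python
import numpy as np

rng = np.random.default_rng(2)
# One correlation estimate per year; in the real project this loop would
# read each NHTSA annual data file as it becomes available
trend = {}
for i, year in enumerate(range(1975, 1980)):
    factor = rng.normal(size=1000)                       # e.g. a crash factor
    severity = (0.2 + 0.1 * i) * factor + rng.normal(size=1000)
    trend[year] = np.corrcoef(factor, severity)[0, 1]

for year, r in trend.items():
    print(f"{year}: correlation with severity = {r:.2f}")
```

Because each year is estimated separately, a strengthening (or weakening) correlation shows up as a visible trend across the annual results, which is the motivation for rerunning the analysis whenever new data arrives.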

Submit your PowerPoint by January 31, 2019.
Speakers’ Orientation Meeting: Tuesday, March 12, 2019, 6 PM-7 PM.

Conference Chair's Message 

Joel Smith

This year’s conference is focused on making sure we are “continuing” our journey as practitioners. Whatever your level of experience, you will learn new concepts, gain new perspectives and network with the best in the industry.
