Data Lake in industry

Creating a data lake: can a general data lake be used for industrial data?

The need to process large volumes of data for informed decision-making has been an issue for industry for years. Technologies such as BI, data warehouses, Big Data, and Data Lakes all have the same goal: to store data over time and process it so that organizations can run their businesses more effectively and improve their performance.

It is tempting to centralize all business data, regardless of its source (sales, marketing, R&D, or production), in a single business data lake. This approach is appealing for many reasons, including centralization in a single information system and possibilities for comparison and standardization.

Data Lakes provide operational staff and data scientists with quick access to massive amounts of data. Yet many industrial data initiatives run into obstacles, including incomplete data, lack of structure, and insufficient performance. This article discusses the particularities of industrial data and what they imply for implementing a Data Lake that makes this data usable.

Particularities of data in process industries

Data from industrial production processes is relatively varied but there are structural consistencies. Industry data includes:

  • time series (sensors, line controls, etc.);
  • batches, operations, campaigns, cycles (recipes, quality control, indicators, teams, tools, etc.);
  • events (scheduled or unscheduled downtime, alerts, tool or consumable changes, etc.);
  • traceability (where, when, and how an operation or a batch occurred) and relational (the connections between different stages, how an operation consumes one or more batches from previous operations, etc.).

Each of these data types calls for its own handling methods.

Note that this data is very different from the data generated by other departments in the business, which is mostly transactional.

Consequences for storage

Storing time series or traceability data, for example, requires specific approaches to meet performance, cost, and usage requirements.

For time series, it must be possible to process requests over long periods with potentially fine-grained data. For example, one year of one-minute samples from a single sensor is already 525,600 points. High volumes need to be processed in a short time, with efficient storage that limits the amount of space occupied.
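As a rough back-of-the-envelope sketch of these volumes, the calculation below counts the points one sensor produces per year at two sampling periods. The 16 bytes per point (an 8-byte timestamp plus an 8-byte value) and the 1,000-sensor site are illustrative assumptions, not figures from any particular database.

```python
# Illustrative sizing sketch; byte size and sensor count are assumptions.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def points_per_year(sampling_period_s: int) -> int:
    """Number of samples one sensor produces in a year at a fixed period."""
    return SECONDS_PER_YEAR // sampling_period_s


def raw_size_bytes(n_points: int, n_sensors: int = 1, bytes_per_point: int = 16) -> int:
    """Uncompressed size, assuming an 8-byte timestamp and an 8-byte value."""
    return n_points * n_sensors * bytes_per_point


for label, period_s in [("1 min", 60), ("1 s", 1)]:
    n = points_per_year(period_s)
    size_gb = raw_size_bytes(n, n_sensors=1000) / 1e9
    print(f"{label}: {n:,} points per sensor, about {size_gb:.1f} GB raw for 1,000 sensors")
```

Real time series databases compress this considerably, so the figures are only an upper bound on raw volume, but they show why sampling period and sensor count dominate the sizing.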

Specialized databases for time series have been developed to that end, such as TimescaleDB, InfluxDB, kdb+, OpenTSDB (HBase-Hadoop), QuasarDB, Warp 10, Azure Time Series Insights, and Amazon Timestream. The choice in the field is vast. That said, it is not easy to choose the right database for time series: the choice largely depends on the intended uses.

For traceability data, the ability to carry out effective searches for requested elements and to rebuild trees and relations prevails. That is what is usually expected of a strong relational database.
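To illustrate what rebuilding a tree means in practice, here is a minimal sketch using SQLite's recursive queries from Python's standard library. The `batch_link` table, its columns, and the batch identifiers are hypothetical and only stand in for a real traceability schema.

```python
import sqlite3

# Hypothetical schema: each row says that `child` was made from `parent`.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batch_link (parent TEXT, child TEXT);
    INSERT INTO batch_link VALUES
        ('RAW-001', 'MIX-010'),
        ('RAW-002', 'MIX-010'),
        ('MIX-010', 'PACK-100'),
        ('RAW-003', 'MIX-011');
""")


def ancestors(batch_id: str) -> list[str]:
    """Return every upstream batch that went into `batch_id`, at any depth."""
    rows = conn.execute(
        """
        WITH RECURSIVE upstream(batch) AS (
            SELECT parent FROM batch_link WHERE child = ?
            UNION
            SELECT l.parent
            FROM batch_link AS l
            JOIN upstream AS u ON l.child = u.batch
        )
        SELECT batch FROM upstream ORDER BY batch
        """,
        (batch_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(ancestors("PACK-100"))
```

The same query shape, with `parent` and `child` swapped, walks the tree downstream, which is the typical direction for a recall investigation.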

Clearly, a hybrid storage strategy depending on data types is necessary to make the right compromise between features, performance, and cost.

The impact on processing

Finally, we have our lake full of data. But now we are in danger of drowning. One of the first priorities is to structure the data by linking it to a business context. There are several different approaches, some of which encourage working with fairly unstructured data. We recommend structuring data early in the process, as soon as it is stored. Linking a business context makes life easier for all the Data Lake users. It also fundamentally increases the level of information, and in turn, what can be extracted from the data.

But this is still not enough to extract the required information. The information needed is built by cross-referencing the different data types and transforming them. Industrial process data calls for a set of recurring treatments:

  • Re-sampling, extrapolation of asynchronous time data, or changes of sampling rate so that series can be combined in calculations;
  • Aggregation of time data according to traceability elements (e.g. average of a parameter on the specific phase of a production batch);
  • Statistical calculations to understand data inconsistencies;
  • Calculations for stock management, yield, or specific consumption for example.
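The first two treatments in the list can be sketched with the standard library alone. The example below resamples irregular readings onto a regular grid by holding the last known value (one possible strategy among several), then averages the result over the phases of a batch. The readings, phase names, and time bounds are made up for illustration.

```python
from bisect import bisect_right
from statistics import mean

# Irregular (timestamp_s, value) readings from a hypothetical temperature sensor.
readings = [(0, 20.0), (70, 22.0), (130, 26.0), (250, 30.0)]


def resample_hold(samples, start, end, step):
    """Resample onto a regular grid, holding the last known value."""
    times = [t for t, _ in samples]
    grid = []
    for t in range(start, end + 1, step):
        i = bisect_right(times, t) - 1  # last reading at or before t
        if i >= 0:
            grid.append((t, samples[i][1]))
    return grid


def phase_average(samples, phase_start, phase_end):
    """Average of the samples inside one batch phase [start, end)."""
    values = [v for t, v in samples if phase_start <= t < phase_end]
    return mean(values) if values else None


regular = resample_hold(readings, start=0, end=240, step=60)

# Hypothetical phases of one batch: name -> (start_s, end_s).
phases = {"heating": (0, 120), "holding": (120, 300)}
for name, (t0, t1) in phases.items():
    print(name, phase_average(regular, t0, t1))
```

Even in this small case, the result depends on choices such as the hold strategy and whether phase bounds are inclusive, which is the kind of subtlety that accumulates in a real implementation.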

These transformations are often specific to production in process industries. Results require the consideration of many subtleties which can make creating and using a Data Lake very complex. As is the case for data storage, performance must also be a priority to provide a smooth experience for users.

What about a global data lake including industrial data?

Standardizing all data from a production site in a single Data Lake is still tempting. There are indeed several advantages:

  • only one Data Lake to manage;
  • centralized data storage; and
  • standard tools for all the business’s departments.

Yet, as discussed above, particularities about the use of production process data can make this approach less attractive:

  • data infrastructure and architecture ill-suited to data types being processed or performance required;
  • data processing requiring significant specific developments to produce the required information, creating a complex system that is difficult to maintain;
  • a lack of business tools that meet the specific needs of production, quality, and industrial processes, hence the need for specific development;
  • managing a large complex set rather than dividing it into more coherent and manageable sets; and
  • high implementation and maintenance costs and long rollout times to meet needs.

The advantage of a specialized data lake or “Process Data Lake”

A “Process Data Lake” specializes in industrial data with appropriate tools that respond quickly to operational needs.

It is adapted to the storage, processing, and use of industry data. It features:

  • architecture suited to industrial data and uses: as explained earlier, the juxtaposition of different database types that meet the constraints of both time series (NoSQL databases optimized for time series) and traceability data (for relational management);
  • performance in line with end-user needs: directly related to how data is structured and stored according to specific characteristics;
  • rapid rollout: the most frequent uses in the process industry mentioned above (re-sampling, extrapolation, aggregation, material assessment calculations, yield, relational, etc.) are already configured;
  • controlled implementation and operating costs: benefit from our experience working with multiple industry customers and from the maintenance and development functions provided by our SaaS architecture.

Why not put your industrial data in a generic Hadoop Data Lake?

Some examples show the impact of using a generic Data Lake for industrial data. Some industry groups have used Hadoop technology to implement a global database, but there are pitfalls.

One of the first problems was managing time series in HBase, the Hadoop database: data has to be grouped at a fine time granularity, and pre-calculated temporal aggregates (min, max, average, etc.) have to be provided for useful intervals (15 minutes, 1 hour, 1 day, etc.). This makes the calculation logic complex, and the duplicated data makes storage inefficient. Systems such as HBase are not optimized to compress time data efficiently.
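The pre-aggregation pattern can be sketched in plain Python. This is not HBase code; it only illustrates the extra logic and the extra stored rows that each additional interval brings. The `rollup` function and the synthetic data are assumptions for the example.

```python
from collections import defaultdict


def rollup(samples, bucket_s):
    """Group (timestamp_s, value) samples into fixed buckets with min/max/avg."""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[t - t % bucket_s].append(v)  # key = start of the bucket
    return {
        start: {"min": min(vs), "max": max(vs), "avg": sum(vs) / len(vs)}
        for start, vs in sorted(buckets.items())
    }


# One hour of synthetic one-minute samples (value = minute index).
raw = [(minute * 60, float(minute)) for minute in range(60)]

by_15_min = rollup(raw, 15 * 60)
by_hour = rollup(raw, 60 * 60)

# Every interval is stored in addition to the raw points, not instead of them.
stored_rows = len(raw) + len(by_15_min) + len(by_hour)
print(stored_rows, by_15_min[0], by_hour[0])
```

Each rollup also has to be recomputed whenever late or corrected samples arrive in its bucket, which is where much of the calculation complexity comes from.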

Even with a dedicated management layer such as OpenTSDB, time series processing suffers from the overhead of the many technical layers and from the complexity of Hadoop.

The second difficulty was the choice of indexing for different uses. This type of column-oriented database is very sensitive to such choices because the design of the row key (RowKey) is essential for optimizing queries. When a search uses data characteristics that are not in the RowKey (tags), performance deteriorates significantly. For example, if you want data from a sensor over a period of time, a “SensorID-TimeStamp” RowKey is ideal. It will prove ineffective for other search types unless duplicate tables are created with a RowKey adapted to each one. This, however, is likely to increase complexity and storage costs.
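The effect of the RowKey choice can be illustrated with a small simulation of a store sorted by key. This is a simplified model in plain Python, not the HBase API; the `row_key` format, the sensor names, and the `line` tag are hypothetical.

```python
from bisect import bisect_left


def row_key(sensor_id: str, ts: int) -> str:
    # Zero-pad the timestamp so lexicographic order matches time order.
    return f"{sensor_id}-{ts:010d}"


# Rows kept sorted by key, as a sorted key-value store would keep them.
# Each row also carries a tag ("line") that is NOT part of the key.
rows = sorted(
    (
        (row_key(sensor, t), {"line": line, "value": float(t)})
        for sensor, line in [("S1", "A"), ("S2", "B"), ("S3", "A")]
        for t in range(0, 600, 60)
    ),
    key=lambda row: row[0],
)
keys = [k for k, _ in rows]


def scan_sensor(sensor_id, t_start, t_end):
    """Range scan on the key: only rows in [t_start, t_end) are touched."""
    lo = bisect_left(keys, row_key(sensor_id, t_start))
    hi = bisect_left(keys, row_key(sensor_id, t_end))
    return rows[lo:hi]


def scan_by_line(line):
    """Tag filter: every row must be examined because 'line' is not in the key."""
    hits, examined = [], 0
    for key, data in rows:
        examined += 1
        if data["line"] == line:
            hits.append(key)
    return hits, examined


print(len(scan_sensor("S2", 120, 300)))
print(scan_by_line("A")[1], "rows examined for a tag query")
```

The sensor query reads a contiguous slice of the table, while the tag query walks all of it; a second table keyed on the tag would fix the latter at the price of storing the data twice.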

In conclusion, the complexity of Hadoop and the extensive custom design needed to ensure quality data processing require significant investment. The cost of managing and maintaining this type of complex architecture is also very high.

Authors: Mathieu Cura and Jean-François Hénon