Data lifecycle in a big data world

Jeff Richmond

Big data brings new opportunities but raises significant data governance challenges.

Big data introduces new challenges to data management that extend beyond managing vast volumes of data. One challenge that is often overlooked is data governance, both of the source data and of its outputs. Comparing the data lifecycle of traditional data warehousing with that of big data helps in understanding one of the more complex challenges of data governance in this new data world.

The data lifecycle

A typical data lifecycle will consist of four stages: Ingestion (landing the data); Identify/Cleanse/Enrich (tabularisation of the data); Normalisation (building a common, integrated business-neutral data model); and Presentation (transforming data from the normalised model into a business-specific model for query and analysis).

Ingestion

At this stage the various data sources are brought onto the data platform. This data can include structured data, such as spreadsheets and outputs from databases, and unstructured data, such as documents and social media content, as well as audio and video.

Ingestion will typically include some form of basic data checking and validation (row count, checksum check, etc.), but essentially all this is doing is dumping all the available data into one central location. In the big data world this is often referred to as a Data Lake, Data Pool or Data Reservoir.
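
Purely as an illustration, a basic check of this kind might look like the following Python sketch; the file name, row count and checksum values are hypothetical placeholders supplied by the source system:

    import hashlib

    def basic_ingestion_checks(path, expected_rows, expected_sha256):
        # Count rows and compute a checksum before accepting the file into the lake.
        digest = hashlib.sha256()
        rows = 0
        with open(path, "rb") as f:
            for line in f:
                digest.update(line)
                rows += 1
        return rows == expected_rows and digest.hexdigest() == expected_sha256

    # Usage (values would come from the source system's manifest):
    # ok = basic_ingestion_checks("trades_2024-01-31.csv", expected_rows=1_000_000,
    #                             expected_sha256="<hash supplied by the source>")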

Identify/Cleanse/Enrich

The next stage is the process of identifying the ingested data by tabularising it – that is, identifying column names and data types. At this stage the data may also be enriched (for example, adding the full company name ‘Apple’ against the ticker ‘AAPL’) and cleansed: elements that fail basic integrity checks (for example, a row with too few mandatory columns) may be removed. Both of these actions are optional, however – they may instead occur in the next phase.
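
A minimal Python sketch of this stage might look like the following; the column names, mandatory-column rule and ticker lookup are invented for illustration:

    import csv

    # Hypothetical reference data used for enrichment.
    TICKER_NAMES = {"AAPL": "Apple", "ORCL": "Oracle"}

    def identify_cleanse_enrich(path):
        rows = []
        with open(path, newline="") as f:
            for record in csv.DictReader(f):              # identify: columns gain names
                if not record.get("ticker") or not record.get("price"):
                    continue                               # cleanse: drop rows missing mandatory columns
                record["price"] = float(record["price"])   # identify: assign a data type
                record["company"] = TICKER_NAMES.get(record["ticker"], "UNKNOWN")  # enrich
                rows.append(record)
        return rows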

Normalisation

The normalisation of the data entails transforming the data to an agreed business-neutral data model, sometimes referred to as a canonical model or, more technically, a Third Normal Form model (3NF). This involves building relations between the different data entities, essentially codifying internal knowledge and structure for the data. Sometimes this phase is described as data integration, because building up the model typically requires matching and combining data from multiple sources.  

This is the point at which business rules and domain checks would typically be introduced, as well as validating against master or reference data.  For example, a customer address may be validated against a Master Data Management service, which will provide an enterprise-wide best fit, based on all variations of known addresses. 
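
For example, a much-simplified version of that address validation could be sketched as follows; the master-data dictionary and matching rule are stand-ins for a real Master Data Management service:

    # Hypothetical master data: the enterprise-wide "best fit" address per customer.
    MASTER_ADDRESSES = {
        "C001": "1 High Street, London, EC1A 1AA",
    }

    def normalise_address(customer_id, ingested_address):
        # Prefer the mastered address; flag the record when no master entry exists.
        master = MASTER_ADDRESSES.get(customer_id)
        if master is None:
            return {"customer_id": customer_id, "address": ingested_address, "mastered": False}
        return {"customer_id": customer_id, "address": master, "mastered": True}

    # normalise_address("C001", "1 High St., London")
    # -> {"customer_id": "C001", "address": "1 High Street, London, EC1A 1AA", "mastered": True}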

Presentation

The final step of the process, Presentation, transforms the business-neutral model created in the previous step into one or more business-specific data representations. This model is often referred to as a dimensional model or snowflake schema, since its shape consists of multiple dimensions related to a central fact entity. Further business rules may be applied at this point, as well as aggregations and the creation of derived data.

This form of model lends itself to efficient querying by business intelligence tools and can support high complexity and fairly large volumes. Apart from existing in relational forms, it can also be the basis for OLAP cubes and in-memory columnar stores.
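
To make that shape concrete, here is a toy Python sketch of a fact entity keyed to two dimensions, together with the kind of aggregate query such a model supports; all table names, columns and figures are invented:

    # Illustrative dimensional model: a central fact table keyed to two dimensions.
    dim_product = {1: {"name": "Widget", "category": "Hardware"}}
    dim_date = {20240131: {"year": 2024, "month": 1}}

    fact_sales = [
        {"product_key": 1, "date_key": 20240131, "quantity": 3, "amount": 30.0},
        {"product_key": 1, "date_key": 20240131, "quantity": 1, "amount": 10.0},
    ]

    # A typical query shape: total sales amount by product category.
    totals = {}
    for row in fact_sales:
        category = dim_product[row["product_key"]]["category"]
        totals[category] = totals.get(category, 0.0) + row["amount"]
    # totals == {"Hardware": 40.0}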

Schema on Read / Schema on Write

A primary difference between traditional data warehousing and big data is the point at which the end user (the consumer of the data) begins to use the data.

In the traditional data-warehousing environment the consumer would usually only enter the picture after the Presentation stage, where the schema is well defined and populated – this is known as ‘schema on write’. Business Intelligence platforms and advanced analytics tooling consume data from the Presentation layer to provide reporting, dashboards and predictive analytics.

In big data, the consumer accesses the information much earlier, somewhere between the Identify/Cleanse/Enrich stage and the Normalisation stage (see chart). Each consumer group performs its own normalisation of the data at query time, building relationships between the data that fit its own business or research needs – this is known as ‘schema on read’. Essentially, each group begins to create internal knowledge and structure that makes sense to it.
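
A small Python sketch of schema on read, with two hypothetical consumer groups applying different schemas to the same raw records at query time; the field names and values are invented:

    import json

    # The same raw, semi-structured records sit in the data lake for every consumer group.
    raw_lines = [
        '{"ticker": "AAPL", "px": "191.20", "ts": "2024-01-31T16:00:00Z", "venue": "XNAS"}',
    ]

    # Consumer group A (pricing) imposes its schema at query time: a ticker and a numeric price.
    prices = []
    for line in raw_lines:
        record = json.loads(line)
        prices.append({"ticker": record["ticker"], "price": float(record["px"])})

    # Consumer group B (compliance) reads the very same lines with a different schema.
    venue_activity = []
    for line in raw_lines:
        record = json.loads(line)
        venue_activity.append({"venue": record["venue"], "traded_at": record["ts"]})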

Big Data versus traditional data warehousing

The different processing approaches of traditional data warehousing and big data lead to a number of differences in effort and cost across the data lifecycle.

In big data, the first two stages are high volume and low cost/effort. Data (from social media, for example) is abundant and cheap and the ingestion, identification and cleansing of the data is relatively simple; the challenge lies in managing the vast volumes of data.

The difficulty lies in the latter two processes of the data lifecycle (Normalisation and Presentation) when trying to create meaning from such a vast and largely unorganised data set (schema on read).

In contrast, data warehousing requires a substantial amount of effort to ensure the quality of the data ingested and to transform data into the appropriate data models (schema on write), as well as the consistent application of business rules. However, the advantage is that all consumers have the same view of the data universe, at least up to the point of using the data. The maturity of the tooling used to access the data delivers a very rich, high-performance query capability (SQL is the lingua franca of data query).

The data value-density (i.e. the ratio of quality and value of the data relative to the entirety of the data) is much higher in the traditional data warehouse – every row has intrinsic value.  This contrasts with big data, which often has low value-density since it has not been through similar processes and so the potential value of any given row is unknown.

Uncovering the value of big data is driven by each consumer group trying to answer or solve a given problem. Each group must apply an interpretation to the data, typically without much or any knowledge of what other groups have done.

Furthermore, the relative immaturity of big data query tooling means that consumer groups may use differing technologies. 

One distinct advantage that big data has is agility. While data warehouses are notoriously difficult, time consuming and expensive to modify, data consumers set their own criteria and schedule in a big data world. There is no need to agree changes to the normalised data model with other consumer groups, and no need to reflect changes in the relevant business-specific presentation model(s).

Big data governance issues

The ability of consumers of big data to access and interpret the data at a much earlier stage of the data lifecycle raises serious questions about data validation and the ability to verify the outputs of a model.

In data warehousing there is a single universal data set against which results can be tested (e.g. in economic data or results from medical trials) – that stability of the data source is eroded significantly in big data.

This poses serious problems when trying to test and validate models, as it is impossible to compare the interpretations of the data made by different consumers. Similarly, validating data lineage becomes problematic – not for any one consumer group, but for the organisation as a whole.

Models and outputs from data warehouses can trace back any item of data to its original source, including any transformation and business logic – this is relatively established practice. With big data, because different consumers are doing their own normalisation, they may have slightly different interpretations of, for example, P&L data, and this makes it very difficult, if not impossible, to demonstrate the provenance of what should be an authoritative figure.
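
To illustrate what demonstrating provenance involves, each derived figure would need to carry some record of its sources and transformations; a minimal sketch, with entirely invented field names and values, might be:

    # Illustrative lineage record attached to a derived P&L figure (structure invented).
    pnl_figure = {
        "value": 1_250_000.00,
        "lineage": {
            "sources": ["trades_2024-01-31.csv", "fx_rates_2024-01-31.csv"],
            "transformations": ["identify_cleanse_enrich", "normalise_to_canonical",
                                "aggregate_by_desk"],
            "business_rules": ["fx_conversion_at_end_of_day_rate"],
            "produced_by": "consumer_group_A",
        },
    }

    # Two consumer groups producing the "same" figure can then be compared by
    # diffing their lineage records rather than only their final numbers.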

Big opportunities and challenges

Big data provides opportunities for cheaper and more agile delivery to data consumers as compared to traditional data warehousing, but by its nature it sacrifices enterprise-wide agreed data models and data provenance.  The inherent modelling mechanisms of data warehousing must be augmented by stronger non-systemic controls in a big data world.

Jeff Richmond is a member of the Oracle Cloud Enterprise Architecture team in the UK. The views expressed in this article are the author’s own.
