Synergy Codes

What is Data Lineage and why it’s important in data management?

Anyone who has ever had to run an audit for regulatory compliance, or to optimize or debug a data-driven process, has probably found themselves asking: “Where did this data come from?” In other words, they were looking for data lineage. Chances are that this search, when done manually, meant hours wasted reading code and reverse-engineering algorithms.

Data lineage is a fundamental feature of any reliable data governance system, but it is not always included. This article explains what data lineage means, how it works, and how it helps ensure data quality in complex enterprise systems.

What is data lineage in data governance?

The diagram showing the basic data lineage representation
The data lineage diagram drills down into everything that happens in the data life cycle – horizontally across the organization, or vertically on different system levels.

Data lineage lets you track down everything that happened to a piece of data from the point of origin to its point of use. Data lineage analysis answers six questions about a given piece of data and its whole life cycle – whether it be one variable, a file, or an entire database table:

  • Who is responsible for this data? Who creates, edits, uses, owns it?
  • What information is in this data? What type of data is it and which policies apply to it?
  • Why is this data stored, used, needed? If it’s redundant, should it be deleted?
  • Where is this data stored, processed and used? Where did it come from?
  • When was this data created, modified, deleted, used?
  • Which processes, applications, reports, departments rely on this data?
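One way to picture these six questions is as the fields of a single lineage record. The sketch below is only an illustration: the `LineageRecord` class, its field names, and the sample values are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """One hypothetical lineage entry covering the six questions."""
    asset: str            # the variable, file, or table being described
    owner: str            # Who is responsible for the data
    data_type: str        # What kind of data it is
    policies: list[str]   # What policies apply to it
    purpose: str          # Why it is stored or used
    source: str           # Where it came from
    location: str         # Where it is stored now
    created_at: datetime  # When it was created
    consumers: list[str] = field(default_factory=list)  # Which processes rely on it

    def is_orphaned(self) -> bool:
        """Data that nothing relies on is a candidate for review or deletion."""
        return not self.consumers


record = LineageRecord(
    asset="crm.customers.email",
    owner="sales-ops",
    data_type="personal data (email address)",
    policies=["GDPR"],
    purpose="order confirmations",
    source="web_shop.signup_form",
    location="warehouse.dim_customer",
    created_at=datetime(2023, 1, 15, tzinfo=timezone.utc),
    consumers=["report.newsletter_reach"],
)

# The same asset with no remaining consumers answers "Why is it stored?" differently.
stale = replace(record, asset="crm.customers.fax_number", consumers=[])
```

Here `record.is_orphaned()` is false because a report still depends on the asset, while `stale.is_orphaned()` is true, which flags it for a "should it be deleted?" review.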

What does data lineage do?

While it may seem like a minor, technical process for data analysts, the ability to trace data lineage is more than a compliance necessity for regulated sectors like medicine or law. It can also give a powerful advantage to managers who want to enhance processes and cooperation across the organization, or empower data scientists to speed up debugging and optimization.

The most important uses of data lineage are:

  • Troubleshooting (figuring out where wrong or unexpected data came from)
  • Reporting (providing data provenance and proving regulatory compliance)
  • Optimization and improvement (impact analysis, improving processes, applications, big data systems, data migration)
  • Explainability (better understanding of what influences specific outcomes)
  • Data quality (removing redundant, outdated, or false data)
  • Data security (tracking potential vulnerabilities in a processing pipeline)

Data lineage in business operations

Tracking the life cycle of any data has particular benefits for process optimization. It helps any data-driven manager perform root cause analysis, troubleshoot when they receive incorrect information, and prepare detailed reporting. It’s also fundamental for impact analysis, that is, answering the question: “If I change this process, how will it impact the rest of the data ecosystem?”
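Both root cause analysis and impact analysis can be thought of as walks over a lineage graph: upstream to find where a value came from, downstream to find what a change would touch. The following is a minimal sketch under that assumption. The graph contents and the function names `invert` and `reachable` are made up for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
DOWNSTREAM = {
    "web_shop.orders": ["etl.clean_orders"],
    "etl.clean_orders": ["warehouse.sales"],
    "crm.customers": ["warehouse.sales"],
    "warehouse.sales": ["report.revenue", "model.churn"],
}


def invert(graph):
    """Build the upstream view (consumer -> producers) from the downstream one."""
    upstream = {}
    for producer, consumers in graph.items():
        for consumer in consumers:
            upstream.setdefault(consumer, []).append(producer)
    return upstream


def reachable(graph, start):
    """Breadth-first search returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Impact analysis: what depends on the cleaning step?
impacted = reachable(DOWNSTREAM, "etl.clean_orders")

# Root cause analysis: what could have influenced the revenue report?
root_causes = reachable(invert(DOWNSTREAM), "report.revenue")
```

In this toy graph, changing `etl.clean_orders` affects the sales table, the revenue report, and the churn model, while a wrong number in `report.revenue` can be traced back to four upstream assets.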

But data lineage can also be leveraged in business intelligence for powerful insight. For example, sales managers can identify the most profitable sources of customer journeys. Marketing managers can decide which advertising channel needs optimization. Product designers can use data lineage to make informed decisions on popular or problematic areas in the product design.

In enterprise organizations that manage massive and complex data sets across different departments and silos, data lineage can save valuable time by directing teams to the right people and processes responsible for a given business result.

Data lineage in data governance

Keeping track of data lineage is a fundamental element of maintaining data quality and performing effective data audits.

Under regulations such as HIPAA (Health Insurance Portability and Accountability Act) in the healthcare sector, ESG (Environmental, Social, and Governance) reporting requirements in corporate governance, or GDPR (General Data Protection Regulation), which applies across industries that handle personal data, tracking the origin of data and access to it is necessary to provide compliance documentation to regulators, customers, business partners, and shareholders. It’s not unusual to hear news stories about a company being fined for giving unauthorized access to customers’ data at some point in its operations. GDPR also requires whoever processes the data to honor valid requests to remove it from every point the data touches in their systems.

Data lineage is also the lifeblood of data management, especially in maintaining high data quality, optimizing storage and structure, and cleaning data up. Big organizations can achieve significant optimization by keeping a single source of truth for every vital piece of information. Data lineage is necessary to ensure that there are no conflicting redundancies and that the essential information can be traced to its source and stakeholders.

Data lineage in IT

A few years ago, being a data-driven organization was something companies could brag about and enjoy a competitive edge in the market. Nowadays, processing massive amounts of data and multi-level data analytics has become standard in almost every industry.

In data science and advanced analytics, data lineage is necessary to build and improve analytical models. As is evident to companies attempting machine learning and data science, the algorithms and models are only as good as the quality and management of their input data.

In cybersecurity, both at the monitoring and system development level, the whole data processing pipeline must be investigated to address vulnerabilities and suspicious activity. Reliable data lineage makes the entire pipeline visible, allowing troubleshooting, debugging, investigating security breaches, and experimentation with maximum awareness of dependencies and issues that come up along the way.

In ETL (Extract, Transform, Load) processes, unstructured data from multiple sources is transformed in a data warehouse and further in the enterprise software ecosystem. Data lineage analysis allows ETL developers and other engineers downstream to debug the ETL process and create reports on it.

Data lineage is a building block of every IT process that leads to reporting or changes in data processing: from ETL to impact analysis, data migration, or system modernization.

How to perform data lineage mapping?

To make data lineage mapping useful, the first question we need to ask is: Why do we need it? Is it for business intelligence, such as analyzing which process or data source contributes to a business outcome? Or is it for IT system management? A comprehensive data lineage diagram tool will allow a seamless transition from one level of granularity to another and enable switching between different contexts.

Data lineage provides different levels of detail, scope, and filtering depending on need:

Business-oriented data lineage

This type of data lineage visualization or report is also known as business lineage, or horizontal data lineage. It is a cross-system, helicopter view that resembles a black-box approach to how data is processed in an enterprise system. It might show us which application or platform collected the data – e.g., an online shop where users registered – and how that user data moves on through other systems.

A screenshot with a simple bubble diagram – this is data lineage in a simplified, zoomed-out view

IT system-oriented data lineage

Vertical data lineage diagram with a view of database columns

Also known as vertical data lineage, this approach drills down into the columns of database entries and dependencies between their elements across different parts of the IT system. In vertical data lineage, you will see the names and properties of specific processes in the data pipeline: software services, database table transformations, ETL information, API parameters, and so on.
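The relationship between the two views can be sketched as a roll-up: if lineage is captured at column level, the zoomed-out business view can be derived by collapsing columns into tables or whole systems. The example below is a simplified illustration. The edge list, the `system.table.column` naming scheme, and the `roll_up` function are assumptions made for this sketch.

```python
# Hypothetical column-level (vertical) lineage edges: (source, target),
# each written as "system.table.column".
COLUMN_EDGES = [
    ("shop.users.email", "warehouse.dim_customer.email"),
    ("shop.users.created", "warehouse.dim_customer.signup_date"),
    ("shop.orders.total", "warehouse.fact_sales.amount"),
    ("warehouse.fact_sales.amount", "bi.revenue_report.revenue"),
    ("warehouse.dim_customer.email", "warehouse.dim_customer.email_hash"),
]


def roll_up(edges, level):
    """Collapse column-level edges to a coarser level.

    level=1 keeps only the system name, level=2 keeps system.table.
    Edges that start and end in the same node at that level are dropped.
    """
    coarse = set()
    for source, target in edges:
        src = ".".join(source.split(".")[:level])
        dst = ".".join(target.split(".")[:level])
        if src != dst:
            coarse.add((src, dst))
    return sorted(coarse)


system_view = roll_up(COLUMN_EDGES, level=1)  # horizontal, zoomed-out view
table_view = roll_up(COLUMN_EDGES, level=2)   # intermediate granularity
```

With this toy data, the system view reduces five column-level edges to two arrows (shop to warehouse, warehouse to BI), which is roughly what a business-oriented diagram would show.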

Data lineage techniques

There are a few techniques for data lineage. Each one has its strengths and weaknesses, so if there is no comprehensive automated data lineage solution implemented already, it is usually best to apply a combination of these techniques:

Data lineage through Data Tagging

Also called self-contained data lineage, this type of data lineage can be extracted within a single, self-contained system, which tags data as it is moved and processed inside that system.


Pros: This data lineage analysis is usually reliable and comprehensive.


Cons: it only works on data within a transformation system. The tagging mechanism would not cover the data life cycle outside the system.
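To make the idea concrete, here is a minimal sketch of tagging, assuming a hypothetical `Tagged` wrapper that travels with the value and records each processing step. It is not how any particular product implements tagging. It only illustrates why the technique is reliable inside the system and blind outside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A value that carries its own lineage tags (hypothetical wrapper)."""
    value: object
    tags: tuple = ()

    def apply(self, step_name, func):
        """Transform the value and append the step name to its tags."""
        return Tagged(func(self.value), self.tags + (step_name,))


raw = Tagged(" Alice@Example.COM ", tags=("source:signup_form",))
clean = raw.apply("strip", str.strip).apply("lowercase", str.lower)

# Once the bare value leaves the tagging system, the tags no longer travel with it.
exported = clean.value
```

After the two steps, `clean.tags` lists the source and both transformations in order. The `exported` string, however, is a plain value with no history attached, which is the limitation described in the cons above.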

Data lineage through parsing

Data lineage by parsing reverse-engineers transformation logic of analyzed data, i.e., how the tools and systems processed it in its life cycle.


Pros: This data lineage technique accurately captures change across systems.


Cons: It is very challenging to implement it for a complete life cycle because it requires an understanding of the structure of all applications and tools in the data life cycle.
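As a rough illustration of lineage by parsing, the sketch below reads one SQL statement and extracts which tables feed which target. It is deliberately naive: the regular expressions, the `parse_lineage` function, and the sample statement are assumptions for this example, and they only cover plain `INSERT INTO ... SELECT ... FROM ... JOIN` statements. A real parser would need to understand the full grammar of every tool in the pipeline, which is exactly the difficulty mentioned above.

```python
import re

# Naive patterns: they ignore subqueries, CTEs, comments, quoted identifiers,
# and every SQL dialect feature beyond simple FROM/JOIN clauses.
TARGET_RE = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql):
    """Return (sources, target) table names found in one SQL statement."""
    match = TARGET_RE.search(sql)
    target = match.group(1) if match else None
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return sources, target


SQL = """
INSERT INTO warehouse.sales (customer_id, amount)
SELECT c.id, o.total
FROM shop.orders o
JOIN crm.customers c ON c.id = o.customer_id
"""

sources, target = parse_lineage(SQL)
```

For the sample statement this yields `crm.customers` and `shop.orders` as sources of `warehouse.sales`.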

Pattern-based data lineage

Pattern-based lineage is done by leveraging patterns and similarities in the metadata of tables, reports, and columns.


Pros: It can be carried out without understanding the system’s data transformation tools’ programming logic. It can be used universally on different databases and in places where application code is unavailable.


Cons: It is less reliable, since it doesn’t look into the data transformation logic and can therefore miss or misattribute dependencies.
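A pattern-based approach can be sketched as a similarity score between columns. The example below assumes a simple heuristic chosen for illustration only: it averages the overlap of name tokens and the overlap of sample values (both as Jaccard indexes) and proposes a link when the score reaches a threshold. The function names, the threshold, and the data are hypothetical.

```python
def name_tokens(column):
    """Split a column name like 'customer_email' into lowercase tokens."""
    return set(column.lower().split("_"))


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(col_a, values_a, col_b, values_b):
    """Average of name-token overlap and sample-value overlap."""
    name_score = jaccard(name_tokens(col_a), name_tokens(col_b))
    value_score = jaccard(set(values_a), set(values_b))
    return (name_score + value_score) / 2


def guess_links(source_cols, target_cols, threshold=0.5):
    """Propose source -> target links whose similarity reaches the threshold."""
    links = []
    for s_name, s_vals in source_cols.items():
        for t_name, t_vals in target_cols.items():
            score = similarity(s_name, s_vals, t_name, t_vals)
            if score >= threshold:
                links.append((s_name, t_name, round(score, 2)))
    return links


SOURCE = {"customer_email": ["a@x.com", "b@x.com"], "order_total": [10, 25]}
TARGET = {"email": ["a@x.com", "b@x.com"], "total_eur": [10, 25], "qty": [1, 2]}

links = guess_links(SOURCE, TARGET)
```

The heuristic never inspects how `total_eur` was actually computed, so two unrelated columns that happen to share a name fragment and a few values would be linked just as confidently. That is the reliability trade-off noted in the cons.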

How to create a data lineage process: best practices

Data lineage analysis can, in theory, be done manually, with a combination of interviews with employees, spreadsheets, reports from software architects, and reading application code. But in enterprise-level companies, this would be a tedious, costly process that risks producing incomplete results, which is unacceptable if the analysis is required for a regulatory audit.


That’s why this analysis is nowadays done with a dedicated data lineage tool. This is either a standalone solution carefully integrated into a company’s IT architecture, or a feature included in an existing data platform, such as a data warehouse, cloud architecture management system, or data analytics solution.


Here are a few best practices to ensure a successful data lineage tool implementation:

1. Stakeholder buy-in and alignment with business requirements. 

Implementing any data policy can only be effective if people working with the data ecosystem see value in it and understand how it improves their outcomes, so they put the requisite effort into consistent data management.

2. Solid data and metadata governance. 

Once system engineers have painstakingly integrated the data lineage tool into the organization’s data ecosystem, new additions to the system must meet compatibility requirements so that reporting can be reliable. It is also vital that data and metadata policies are consistent with regulatory requirements, and that data stewards have a complete overview of who is authorized to view and modify data.

3. Data lineage automation and progressive extraction. 

Although manual data lineage capture is possible, it is very resource-intensive and time-consuming. Integrating an automated solution makes data lineage analysis convenient and available on demand. Additionally, progressive extraction with automatic tracking ensures that data lineage can be validated in real time, without the need to reverse-engineer missing metadata at the time of reporting.
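One simple way to picture progressive extraction is a pipeline that records a lineage event every time a step runs, so the metadata already exists when a report is needed. The sketch below assumes a hypothetical `track_lineage` decorator and an in-memory `LINEAGE_LOG`. In practice the events would go to a durable metadata store, and the step and dataset names here are invented.

```python
import functools
from datetime import datetime, timezone

LINEAGE_LOG = []  # stand-in for a durable metadata store


def track_lineage(inputs, output):
    """Decorator recording a lineage event each time the wrapped step runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            LINEAGE_LOG.append({
                "step": func.__name__,
                "inputs": list(inputs),
                "output": output,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator


@track_lineage(inputs=["shop.orders"], output="staging.orders_clean")
def clean_orders(rows):
    """Keep only rows with a positive total."""
    return [r for r in rows if r["total"] > 0]


@track_lineage(inputs=["staging.orders_clean"], output="warehouse.daily_revenue")
def daily_revenue(rows):
    """Sum the totals of the cleaned orders."""
    return sum(r["total"] for r in rows)


revenue = daily_revenue(clean_orders([{"total": 20}, {"total": -5}, {"total": 30}]))
```

After the run, `LINEAGE_LOG` holds one event per executed step, in execution order, linking each output to its inputs without any after-the-fact reverse engineering.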

4. Dynamic real-time diagrams with vertical and horizontal data lineage. 

A truly comprehensive data lineage tool covers all data creation, transformation, and access touchpoints and serves various stakeholders’ reports of different granularity levels and contexts. Well-designed visual data dashboards combine detailed lineage records in visual or text format, depending on the user’s needs.

Examples of Data Lineage diagram tools

Data lineage mapping has a simple function, but it requires profound understanding and meticulous integration into any IT system to work correctly. It then takes a lot of optimization to make this rich but straightforward functionality work smoothly and intuitively. The data lineage examples below show how we accomplished that to support better data quality for our clients.

Data mining platform for big data enablement

Data lineage visualization tool for big data enablement company

We created an efficient and flexible implementation for a company that helps other companies become data-driven organizations. Data lineage visualization is the backbone of ensuring data quality in big data platforms: it shows the whole pathway from the data source through the ETL pipeline into the data warehouse, the transformations across the IT system, and the presentation layer for reports and visualizations.

Dynamic data diagrams for data governance

Data mining platform for big data enablement

We developed a set of dynamically adjusted visual dashboards for a robust data governance platform that shows the dependencies and life cycle of data assets in large-scale ecosystems. This solution traces data lineage on multiple horizontal and vertical layers: software architecture, data infrastructure, business processes, access management.

Summary

Data lineage mapping is the foundation for maintaining good data quality in an organization. Complete insight into data origin and transformation life cycle makes reporting reliable for compliance and business intelligence.


It’s hard to imagine a troubleshooting or debugging process without it. It’s also crucial for data security, exposing every place in the IT system that any data object has moved through.


Deploying a customized, robust, and responsive visualization tool for data lineage empowers any employee or manager to fully understand and optimize processes, whether the goal is smoother internal operations or better business results.