How custom diagram visualizations support the data lineage in data warehouses

Content team • Mar 30, 2022

Discover how custom diagram visualizations, using GoJS and Angular, support data lineage by improving transparency, analysis, and governance in data warehouses.

Companies that work with multiple data sources need transparency into where their data comes from, both to simplify analysis and to eliminate errors rooted in the data's origin.

Our previous article introduced the basics of data lineage and the business value it delivers. Now it's time to look at examples of custom diagrams and visualizations that support data analysis and work with data warehouses.

Let's look at some of our projects that use custom diagram visualization to help business users investigate and analyze big data, which is exactly what data lineage requires.

The pros of diagram visualizations to improve data quality

To understand how data visualization and custom diagrams support data analysis and data processing in data warehouses, it helps to start with graph visualization, also called 'link analysis' or 'network visualization.' Diagram visualization is the process of visually representing connected entities as networks of objects and links.

The visualizations are interactive, so the user can intuitively explore complex connected data at scale.

Diagram visualization answers the growing need to present data and data relationships in a readable, easy-to-grasp way. It meets that need through:

1. Intuitive use

An object-link model lets the user grasp the idea at a glance. The diagram is understandable even to non-technical users or those unfamiliar with diagrams.

2. Fast operating

Diagram visualization makes patterns easy to spot and helps identify trends and outliers quickly.

3. Scalability

The helicopter view lets the user examine the overview and then dive into the details within a single diagram. Interactive visualization draws attention to individual data points and reveals the context and structure around single connections.

4. Insightfulness

A high-level overview first draws the user's attention and then leads to deeper knowledge of the data and its connections. Alternative interactive views help the user gain complete insights.

Intelligent data processing and data mining tool

Modern business organizations operate within processes that generate large amounts of valuable data. Data mining is about analyzing that data to gain accurate insight into the semantics of the actual business process.

The visual components presented here help extract maximum value in the shortest possible time by focusing on the data and the process content.

Our team has been building database integration functionalities based on flow diagrams. To support data management through interactive flow diagrams, database diagrams, and an interactive dashboard, we implemented example functionalities that enable data mining and thereby give the information a structure.

Database flow

The database flow provides a helicopter view of the data systems. It consists of a central diagram that depicts the correlations between databases, so the user knows exactly where the data originates and where it ends up. Arrows indicate the direction of the data journey. The graph presents the information flow and the dependencies between the databases, and each data flow can be followed through filters before taking action.

Database correlations

The user can dive into a specific information flow at any time. Expandable nodes show the details they contain: by clicking a node, the user sees data insights, discovers dependencies between databases, applies filters, and takes action. Highlighted arrows, in turn, make it easier to investigate input and output correlations and their origins.

The graphic shows the diagram with detailed node connections in UML
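To illustrate how input and output correlations can be derived from the links of a flow diagram, here is a minimal sketch. It is not the project's code: the `FlowLink` shape, the `trace` function, and the sample database names are assumptions made for illustration. It walks the links upstream (towards the origins) or downstream (towards the targets) from a selected node.

```typescript
// Hypothetical link shape: data flows from `from` to `to`.
interface FlowLink {
  from: string;
  to: string;
}

type Direction = "upstream" | "downstream";

// Collect every database reachable from `start` in the given direction.
// An explicit stack replaces recursion, so deep graphs cannot overflow the call stack.
function trace(links: FlowLink[], start: string, direction: Direction): string[] {
  const seen = new Set<string>([start]);
  const stack: string[] = [start];
  const reached: string[] = [];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    for (const link of links) {
      // Follow the link forwards (downstream) or backwards (upstream).
      const [here, there] =
        direction === "downstream" ? [link.from, link.to] : [link.to, link.from];
      if (here === current && !seen.has(there)) {
        seen.add(there);
        reached.push(there);
        stack.push(there);
      }
    }
  }
  return reached.sort();
}

// Made-up sample flow: two sources feed a staging database, then a warehouse.
const sampleLinks: FlowLink[] = [
  { from: "crm", to: "staging" },
  { from: "erp", to: "staging" },
  { from: "staging", to: "warehouse" },
  { from: "warehouse", to: "reports" },
];

const origins = trace(sampleLinks, "warehouse", "upstream");
const targets = trace(sampleLinks, "staging", "downstream");
```

The returned keys could then drive the highlighting of nodes and arrows in the diagram. This sketch scans every link on each step, which is fine for small models; a larger graph would likely use an adjacency map instead.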

Database timeline

The timeline shows, for each record, how the data changed over a specific period. By clicking a node, the user can spot any change within the data stream, view the change history, or follow the flow. The integrated table component lets the user review a separate data sheet without leaving the app.

For the features described above, our team chose technologies that could deliver the required functionality and make the tool behave as intended.

The graphic shows the timeline correlations within the UML diagram

The dev team wrote the project in React and used GoJS to create the diagrams. This made it possible to build data lineage features such as:
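As a rough illustration of what a timeline lookup might involve, the sketch below filters a node's change history to a time window. The `ChangeRecord` shape, the `changesInWindow` function, and the sample entries are hypothetical; the article does not describe the actual data model.

```typescript
// Hypothetical shape of one entry in a node's change history.
interface ChangeRecord {
  nodeId: string;
  changedAt: string; // ISO 8601 timestamp
  description: string;
}

// Return the changes of one node that fall inside [from, to], oldest first.
function changesInWindow(
  history: ChangeRecord[],
  nodeId: string,
  from: string,
  to: string
): ChangeRecord[] {
  const start = Date.parse(from);
  const end = Date.parse(to);
  return history
    .filter((record) => record.nodeId === nodeId)
    .filter((record) => {
      const time = Date.parse(record.changedAt);
      return time >= start && time <= end;
    })
    .sort((a, b) => Date.parse(a.changedAt) - Date.parse(b.changedAt));
}

// Made-up history for two nodes.
const sampleHistory: ChangeRecord[] = [
  { nodeId: "orders", changedAt: "2022-03-20T10:00:00Z", description: "column renamed" },
  { nodeId: "orders", changedAt: "2022-03-05T08:30:00Z", description: "table created" },
  { nodeId: "customers", changedAt: "2022-03-10T12:00:00Z", description: "index added" },
  { nodeId: "orders", changedAt: "2022-04-02T09:00:00Z", description: "column dropped" },
];

const marchChanges = changesInWindow(
  sampleHistory,
  "orders",
  "2022-03-01T00:00:00Z",
  "2022-03-31T23:59:59Z"
);
```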

Highlighting the entire path from the root of the diagram to a specific node of the item;
Filtering the diagram based on the above routes;
Fetching data from the API by expanding the node from the selected page and caching this data;
Topological sort, including the depth, thus reflecting the arrangement of nodes from Digraph;
Dropdown with a search engine on the base (with checkboxes) for filtering node items;
Optimized (without recursion) algorithms for searching and graph traversal.
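The topological sort with depth and the non-recursive traversal from the list above can be sketched as follows. This is one possible approach using Kahn's algorithm, not necessarily the one used in the project; the `Edge` shape and the `topologicalDepths` name are assumptions. The depth of a node is taken here as the length of the longest path from any root, which is what a layered arrangement typically needs.

```typescript
// Hypothetical edge shape: data flows from `from` to `to`.
interface Edge {
  from: string;
  to: string;
}

// Kahn's algorithm with a queue instead of recursion.
// Assumes every edge endpoint is listed in `nodes`.
function topologicalDepths(
  nodes: string[],
  edges: Edge[]
): { order: string[]; depth: Map<string, number> } {
  const indegree = new Map<string, number>();
  const outgoing = new Map<string, string[]>();
  for (const node of nodes) {
    indegree.set(node, 0);
    outgoing.set(node, []);
  }
  for (const edge of edges) {
    (outgoing.get(edge.from) as string[]).push(edge.to);
    indegree.set(edge.to, (indegree.get(edge.to) as number) + 1);
  }

  // Roots (no incoming edges) start at depth 0.
  const depth = new Map<string, number>();
  const order: string[] = nodes.filter((node) => indegree.get(node) === 0);
  for (const root of order) depth.set(root, 0);

  // `order` doubles as the work queue; `head` marks the next node to process.
  let head = 0;
  while (head < order.length) {
    const current = order[head++];
    const currentDepth = depth.get(current) as number;
    for (const next of outgoing.get(current) as string[]) {
      // Keep the longest distance from a root seen so far.
      depth.set(next, Math.max(depth.get(next) ?? 0, currentDepth + 1));
      const remaining = (indegree.get(next) as number) - 1;
      indegree.set(next, remaining);
      if (remaining === 0) order.push(next);
    }
  }
  if (order.length !== nodes.length) throw new Error("graph contains a cycle");
  return { order, depth };
}

// Made-up graph: `d` is reachable both directly from `a` and through `b` and `c`.
const sorted = topologicalDepths(
  ["a", "b", "c", "d"],
  [
    { from: "a", to: "b" },
    { from: "a", to: "c" },
    { from: "b", to: "d" },
    { from: "c", to: "d" },
    { from: "a", to: "d" },
  ]
);
```

In this sample, `d` ends up at depth 2 even though it also has a direct link from the root, so it would be placed in the third layer.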

The tool can then be used by data scientists, data architects, or data managers.

Bulletproof and accessible solution for data management and data provenance

Working with database storage calls for tools that can visualize large amounts of collected data. The Synergy Codes specialists were therefore asked to provide customized visualizations of data from DBML or from files generated as part of the application's DBT process, in order to investigate the connections between them and make data analysis easier and more accessible.

The user can also perform more complex operations within the diagrams. When data is stored in spreadsheets, the tool gathers them all and groups them within nodes. Each node carries information and can be expanded to investigate the details. This makes it possible to work with large databases and view several files within one diagram, which keeps the work well-balanced and intuitive.

With data lineage in mind, the Synergy Codes specialists built the following functionalities for correlating analytical and transactional databases:

Data management with ETL processes

ETL stands for Extract, Transform, and Load. It is a data integration process that combines data from multiple sources into one consistent data store, which is then loaded into a data warehouse or another target system.

During extraction, the essential information is copied or exported from the source locations to a staging area. The transform phase takes place in the staging area, where the raw data is processed: it is transformed and consolidated for its analytical use case. The load stage then moves the transformed data from the staging area into the target data warehouse. This usually involves an initial load of all data, followed by periodic loads of incremental changes and, less often, full refreshes that erase and replace the data in the warehouse.

For most organizations that use ETL, the process is automated, well-defined, continuous, and batch-driven.
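The three stages and the incremental loading described above can be sketched in a few lines. This is a simplified, in-memory illustration under stated assumptions: the row shapes, the function names, and the use of an `updatedAt` watermark to detect changes are all hypothetical, and a real pipeline would read from and write to actual data stores.

```typescript
// Hypothetical row shapes for the source system and the warehouse.
interface SourceRow {
  id: number;
  name: string;
  updatedAt: number; // e.g. a timestamp or version counter
}
interface WarehouseRow {
  id: number;
  name: string;
}

// Extract: copy rows changed since the last run into a staging area.
function extract(source: SourceRow[], since: number): SourceRow[] {
  return source.filter((row) => row.updatedAt > since).map((row) => ({ ...row }));
}

// Transform: process the staged rows (here, only whitespace is trimmed).
function transform(staged: SourceRow[]): WarehouseRow[] {
  return staged.map((row) => ({ id: row.id, name: row.name.trim() }));
}

// Load: insert new rows and overwrite changed ones, keyed by id.
function load(warehouse: Map<number, WarehouseRow>, rows: WarehouseRow[]): void {
  for (const row of rows) warehouse.set(row.id, row);
}

// One ETL run; returns the new watermark to pass to the next run.
function runEtl(
  source: SourceRow[],
  warehouse: Map<number, WarehouseRow>,
  lastRun: number
): number {
  const staged = extract(source, lastRun);
  load(warehouse, transform(staged));
  return staged.reduce((latest, row) => Math.max(latest, row.updatedAt), lastRun);
}

// Initial load of all data, followed by one incremental load.
const sourceRows: SourceRow[] = [
  { id: 1, name: " Ada ", updatedAt: 10 },
  { id: 2, name: "Bob", updatedAt: 20 },
];
const warehouseRows = new Map<number, WarehouseRow>();
const firstWatermark = runEtl(sourceRows, warehouseRows, 0);

sourceRows.push({ id: 3, name: "Cy", updatedAt: 30 });
sourceRows[1] = { id: 2, name: "Robert", updatedAt: 35 };
const secondWatermark = runEtl(sourceRows, warehouseRows, firstWatermark);
```

The second run picks up only the new row and the changed row, which mirrors the periodic loading of incremental changes.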

The diagram showing the basic data lineage representation

Data change through the flow

Most data changes happen during the transform stage of the ETL process. The diagram lets you analyze the typical operations of this stage:

Filtering, cleansing, de-duplicating, validating, and authenticating the data.

Performing calculations, translations, or summarizations based on the raw data. These can include changing row and column headers for consistency, converting currencies or other units of measurement, or editing text strings.

Conducting audits to ensure data quality and compliance.

Removing, encrypting, or protecting data governed by industry or governmental regulators.

Formatting the data into tables or joined tables to match the schema of the target data warehouse.
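A few of the transform operations listed above (header normalization, validation, de-duplication, and currency conversion) can be illustrated with a short sketch. Everything specific in it is an assumption made for the example: the column names, the `transformRows` function, and especially the conversion rates, which are placeholder values rather than real exchange rates.

```typescript
type RawRow = Record<string, string>;

interface CleanRow {
  orderId: string;
  amountEur: number;
}

// Placeholder conversion rates, NOT real exchange rates.
const RATES_TO_EUR = new Map<string, number>([
  ["EUR", 1],
  ["USD", 0.5],
]);

// Make headers consistent: "Order ID" and " order id " both become "order_id".
function normalizeHeader(header: string): string {
  return header.trim().toLowerCase().replace(/\s+/g, "_");
}

function transformRows(rows: RawRow[]): CleanRow[] {
  const seenOrders = new Set<string>();
  const clean: CleanRow[] = [];
  for (const raw of rows) {
    // Normalize headers and trim every value.
    const row: RawRow = {};
    for (const key of Object.keys(raw)) {
      row[normalizeHeader(key)] = raw[key].trim();
    }
    const orderId = row["order_id"];
    const amount = parseFloat(row["amount"]);
    const rate = RATES_TO_EUR.get(row["currency"]);
    // Validate: skip rows with a missing id, a non-numeric amount, or an unknown currency.
    if (!orderId || isNaN(amount) || rate === undefined) continue;
    // De-duplicate: keep only the first row per order id.
    if (seenOrders.has(orderId)) continue;
    seenOrders.add(orderId);
    // Convert the amount to a single currency.
    clean.push({ orderId, amountEur: amount * rate });
  }
  return clean;
}

// Made-up input with inconsistent headers, a duplicate, and an invalid row.
const cleanedRows = transformRows([
  { "Order ID": "A1", " Amount ": "10", Currency: "USD" },
  { "order id": "A1", amount: "10", currency: "USD" },
  { "order id": "A2", amount: "n/a", currency: "EUR" },
  { "order id": "A3", amount: "3", currency: "EUR" },
]);
```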

Live interactions within links

To build data lineage into advanced interactive data management tools, Synergy Codes specialists use various algorithms and visual libraries, including GoJS. On this project, the dev team also used React and Redux.

These technologies improve the application's performance and, above all, make it possible to render extensive networks of diagrams efficiently, so that large amounts of data are displayed in a short time. They also make it possible to visualize the data warehouse architecture.

Thanks to highlighting and colors, the user can follow the data path from the source to the destination. These visual cues make it possible, first, to recognize specific data sets in a diagram and, above all, to track changes and protect data during the transform phase.

Another visual aid is expandable nodes, which let the user look up details within a large volume of data.

It is also worth mentioning that diagrams can be removed from the list when they become outdated or when specific data is eliminated from the database. As a result, the diagrams stay up to date and contain only the data needed at a given stage of work.
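Expandable nodes pair naturally with the fetch-and-cache behavior mentioned earlier in this article: details are loaded when a node is expanded for the first time and reused afterwards. The sketch below shows one way this could work. The `ExpandCache` class and the `Loader` type are hypothetical, and the loader in the example is a stub standing in for a real API call.

```typescript
// Hypothetical loader: fetches the details of one node, e.g. from an API.
type Loader<T> = (key: string) => Promise<T>;

class ExpandCache<T> {
  private readonly pending = new Map<string, Promise<T>>();
  private readonly load: Loader<T>;

  constructor(load: Loader<T>) {
    this.load = load;
  }

  // Return the cached request for `key`, or start one on first expansion.
  expand(key: string): Promise<T> {
    let request = this.pending.get(key);
    if (!request) {
      request = this.load(key);
      this.pending.set(key, request);
      // Forget failed requests so a later expansion can retry.
      request.catch(() => this.pending.delete(key));
    }
    return request;
  }
}

// Stub loader that counts how often it is called.
let loaderCalls = 0;
const detailsCache = new ExpandCache<string[]>((key) => {
  loaderCalls++;
  return Promise.resolve([key + ".orders", key + ".customers"]);
});

const firstExpand = detailsCache.expand("sales_db");
const secondExpand = detailsCache.expand("sales_db");
```

Caching the promise rather than the resolved value means that two quick clicks on the same node share a single request.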

What is Data Lineage and why it's important in data management

Data management diagrams for robust organizations

Data management diagrams matter most for enterprise organizations. As data-related demands increase, the data landscape becomes more complex and distributed, pushing organizations to shift their practices from managing data to managing metadata. Unstructured data then becomes a challenge to conquer.

We equipped the Data Catalog Software of a leading data governance company with advanced custom diagrams such as dependency diagrams, inheritance diagrams, and knowledge graphs, to list just a few. The aim is better understanding of, easier access to, and effective communication around data.

The core of the tool is a custom layout with dynamically counted swimlanes. The dev team used a layered digraph layout and, for knowledge graphs, virtualization. The user also benefits from custom logic for expanding and collapsing objects. The following functionalities were built with TypeScript, Angular, RxJS, and GoJS.

Interactive dependency diagrams

The visualization below presents information stored in various databases and data warehouses. If the company relies on AI implementations, the data is categorized into sets and presented visually in swimlanes. Each object comes with a collapsible property section, and the mechanism includes a detection algorithm that ensures a proper re-layout in any view.
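To give an idea of how swimlanes might be counted and re-laid out dynamically, here is a minimal sketch. It is an assumption-laden illustration, not the custom layout used in the product: the `CatalogObject` and `Lane` shapes, the `buildSwimlanes` function, and the rule that a lane is as tall as its tallest object plus padding are all made up for the example.

```typescript
// Hypothetical object shape: `height` changes when the property section collapses.
interface CatalogObject {
  key: string;
  category: string;
  height: number;
}

interface Lane {
  category: string;
  y: number;
  height: number;
  keys: string[];
}

// Create one lane per category (in order of first appearance) and stack the lanes vertically.
function buildSwimlanes(objects: CatalogObject[], padding = 10): Lane[] {
  const byCategory = new Map<string, Lane>();
  const lanes: Lane[] = [];
  for (const object of objects) {
    let lane = byCategory.get(object.category);
    if (!lane) {
      lane = { category: object.category, y: 0, height: 2 * padding, keys: [] };
      byCategory.set(object.category, lane);
      lanes.push(lane);
    }
    lane.keys.push(object.key);
    // A lane is as tall as its tallest object plus padding above and below.
    lane.height = Math.max(lane.height, object.height + 2 * padding);
  }
  // Stack the lanes from top to bottom.
  let y = 0;
  for (const lane of lanes) {
    lane.y = y;
    y += lane.height;
  }
  return lanes;
}

// Made-up objects in two categories.
const catalogObjects: CatalogObject[] = [
  { key: "orders", category: "sales", height: 40 },
  { key: "invoices", category: "finance", height: 100 },
  { key: "customers", category: "sales", height: 60 },
];
const expandedLanes = buildSwimlanes(catalogObjects);

// Collapsing the property section of "customers" shrinks its lane; re-running re-lays out the rest.
catalogObjects[2].height = 30;
const collapsedLanes = buildSwimlanes(catalogObjects);
```

Re-running the function after an object collapses moves every lane below it, which is the kind of re-layout that a change detection mechanism would need to trigger.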

Inheritance diagrams

This diagram visually represents how data is organized and how it relates to various databases and data warehouses. Each parent object can lead to many child nodes, and each child can have many parents. The advanced arrangement is based on a custom diagram layout.

The illustration depicts the inheritance diagram for data lineage

The extra feature: Knowledge graph

One of the functions is an indefinitely expandable diagram built on a GraphQL API. Thanks to virtualization, the user can examine and work on several thousand nodes: only the nodes currently visible on the screen need to be rendered, not the entire graph. A minimap facilitates navigation.
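The core idea of virtualization can be sketched as a viewport test: only nodes whose bounds intersect the visible area are handed to the renderer. The code below is a simplified illustration with hypothetical names (`Rect`, `GraphNode`, `visibleNodes`) and a linear scan; it is not the product's implementation, which may rely on a spatial index or on the diagram library's own mechanisms.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical node shape: a key plus its position in document coordinates.
interface GraphNode {
  key: string;
  bounds: Rect;
}

// True when the two rectangles overlap (touching edges do not count).
function intersects(a: Rect, b: Rect): boolean {
  return (
    a.x < b.x + b.width &&
    b.x < a.x + a.width &&
    a.y < b.y + b.height &&
    b.y < a.y + a.height
  );
}

// Select the nodes to render: those inside the viewport, optionally widened by a margin
// so that nodes just outside the screen are ready before the user scrolls to them.
function visibleNodes(all: GraphNode[], viewport: Rect, margin = 0): GraphNode[] {
  const padded: Rect = {
    x: viewport.x - margin,
    y: viewport.y - margin,
    width: viewport.width + 2 * margin,
    height: viewport.height + 2 * margin,
  };
  return all.filter((node) => intersects(node.bounds, padded));
}

// Made-up model: 5,000 nodes laid out in a single row, 100 units apart.
const allNodes: GraphNode[] = [];
for (let i = 0; i < 5000; i++) {
  allNodes.push({ key: "n" + i, bounds: { x: i * 100, y: 0, width: 50, height: 30 } });
}

const onScreen = visibleNodes(allNodes, { x: 0, y: 0, width: 250, height: 200 });
const withMargin = visibleNodes(allNodes, { x: 0, y: 0, width: 250, height: 200 }, 100);
```

With this sample viewport, only a handful of the 5,000 nodes are selected for rendering; the rest stay in the model until the user pans towards them.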

The illustration shows the knowledge graph for data lineage

Roll-up

The examples above show data visualization built for data lineage and, more specifically, for data analysis in data warehouses. They help users make processes that involve large amounts of data stored in tables and spreadsheets more reliable. Custom diagram visualizations are the key to understanding the sources and routes of data, and they encourage users to dive deep into specific information through readable and intuitive diagramming solutions.

The functionalities presented here apply data lineage to make extensive work with data smoother. With a range of technologies at hand, we can build tools that deliver the data governance functionalities a large company needs.

If you find this topic relevant to your business needs, contact us to discuss it further.
