What is Data Lineage?

Data is plentiful, and data-driven insights are the gold standard for business strategies and decisions. Yet for many organizations, which data sources feed their analytics (and what information is extracted from them) remains a mystery. That gap raises questions about integrity and accuracy, because a lossy data conversion can remove details that, particularly in security, have larger repercussions for data analytics and key performance indicators (KPIs).

Tracing data lineage across the security and IT data lifecycle lets organizations create consistent, accurate, and reliable security analytics that support strategic decision-making across security, IT, compliance, and operations functions. Without appropriate insight into data sources, organizations increase data storage and maintenance costs while reducing the value of their analytics.

What is data lineage?

Data lineage is the process of tracking, and ideally documenting, the journey of data over time. This begins with its creation at the source, includes the various transformations it undergoes as it moves through data pipelines, workflow engines, and ETL/ELT processes, and ends at the final application. By tracking how data travels from upstream producers to downstream consumers, organizations can more effectively identify the data’s source, understand how it has been manipulated, and verify its reliability for use in analytics models that drive decision-making and reporting.

Data lineage vs. data provenance vs. data governance

Although these three terms are interrelated, they apply to different processes:

Data lineage: the dataset’s complete history
Data provenance: the data’s originating source system and the transformations applied to it
Data governance: data management that facilitates appropriate access, storage, and handling

Data governance relies on data lineage and data provenance. To facilitate appropriate data handling, organizations need visibility into where data originates and how it has been processed. For example, with security data, organizations need visibility into:

Technology generating the telemetry, like Endpoint Detection and Response (EDR) or network monitoring tools
Normalization of the data across the pipeline, like transforming data to an extended version of the Open Cybersecurity Schema Framework (OCSF)
Users accessing and using the data, like security, compliance, IT, and operations teams

Why is Data Lineage Important?

As organizations transform data and move it throughout the data infrastructure, tracing data lineage supports:

Trust and transparency: documenting data’s journey from source to consumption layer so data users save time when vetting data
Data quality and reliability: identifying intentional or unintentional loss of data or precision that can create errors or bugs for downstream data consumers
Data and application debugging: identifying the root cause of data errors so they can be fixed at the source
Regulatory compliance: implementing appropriate controls over protected information across the entire data lifecycle
Data modeling: revealing unknown or accidentally bypassed relationships between data elements and gaining real-time context into data flows to update models or make them more precise
Strategic decision-making: ensuring data remains updated and accurate so data users can trust the analytics used in decision-making
Impact analysis: identifying upstream and downstream impacts that changes to tables, columns, or reports can have
Self-service analytics: providing necessary context into upstream and downstream lineage for data analysts
Data exploration: improving discovery capabilities for more accurate analytics
Data modernization: identifying and documenting data elements that are critical for cloud migration
Asset management: identifying the least-used, most-used, and certified data assets
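
Impact analysis, for example, comes down to walking a lineage graph in one direction or the other. The following is a minimal Python sketch, assuming lineage is already stored as a mapping from each asset to its direct downstream consumers. The asset names and the function names are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "edr_raw_events": ["normalized_events"],
    "firewall_logs": ["normalized_events"],
    "normalized_events": ["threat_hunting_view", "compliance_report"],
    "threat_hunting_view": ["soc_dashboard"],
    "compliance_report": [],
    "soc_dashboard": [],
}


def downstream_impact(graph, changed_asset):
    """Return every asset reachable downstream of changed_asset (breadth-first)."""
    seen = set()
    queue = deque([changed_asset])
    while queue:
        current = queue.popleft()
        for consumer in graph.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


def upstream_sources(graph, asset):
    """Return every asset that feeds into asset, by inverting the edges first."""
    reverse = {}
    for producer, consumers in graph.items():
        for consumer in consumers:
            reverse.setdefault(consumer, []).append(producer)
    return downstream_impact(reverse, asset)
```

Calling downstream_impact(DOWNSTREAM, "firewall_logs") lists everything that a change to the firewall logs could affect, while upstream_sources answers the opposite question for a report or dashboard.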

Data lineage supports enterprise use of security telemetry for activities like:

Continuous controls monitoring (CCM): identifying and normalizing the data that enterprise technologies and cybersecurity tools generate to drive accuracy
Threat hunting: trusting analytics that enable the operationalization and automation of activities that proactively identify security incidents
Security hygiene: correlating and analyzing data to discover assets and suggest missing owners

How does data lineage work?

When automating data lineage processes, tools typically collect metadata, the data about data, like:

Type
Format
Structure
Source
Date created
Date modified
File size
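
As a rough illustration of metadata collection, the Python sketch below gathers several of these attributes for a single file using only the standard library. It assumes the source is a file on disk, and the function name and dictionary keys are hypothetical. Creation date is left out because it is not reported consistently across operating systems, and structure is left out because it depends on parsing the file’s contents.

```python
import mimetypes
import os
from datetime import datetime, timezone


def collect_file_metadata(path):
    """Gather basic descriptive metadata for one file on disk."""
    info = os.stat(path)
    guessed_type, _ = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "source": os.path.abspath(path),
        # File extension without the dot, or None when there is no extension.
        "format": os.path.splitext(path)[1].lstrip(".").lower() or None,
        # MIME type guessed from the file name; None when it is not recognized.
        "type": guessed_type,
        "size_bytes": info.st_size,
        "date_modified": modified.isoformat(),
        # Creation time is intentionally omitted: on Unix, st_ctime is the
        # time of the last metadata change, not the time of creation.
    }
```

A real lineage tool would collect the same kind of information from database catalogs, pipeline definitions, and APIs rather than from the file system alone.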

After collecting this information, the tools organize it in a hierarchical structure using the following concepts:

Process: the data transformation operation that the system supports
Run: the execution of a process, often containing details like start time, end time, state, or other attributes
Event: the moment in time when a transformation operation occurred, causing data to move between a source and a target entity
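
The hierarchy of process, run, and event can be sketched as three nested record types. This is an illustrative Python model only, not the data model of any particular lineage product. The class layout, the state values "RUNNING" and "COMPLETED", and the entity names are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Event:
    """One movement of data from a source entity to a target entity."""
    source: str
    target: str
    occurred_at: datetime


@dataclass
class Run:
    """A single execution of a process."""
    started_at: datetime
    ended_at: Optional[datetime] = None
    state: str = "RUNNING"  # hypothetical state name
    events: List[Event] = field(default_factory=list)

    def record(self, source, target):
        """Append an event stamped with the current UTC time."""
        event = Event(source, target, datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def finish(self, state="COMPLETED"):
        """Close the run and store its final state."""
        self.ended_at = datetime.now(timezone.utc)
        self.state = state


@dataclass
class Process:
    """A named transformation operation that can be executed many times."""
    name: str
    runs: List[Run] = field(default_factory=list)

    def start_run(self):
        run = Run(started_at=datetime.now(timezone.utc))
        self.runs.append(run)
        return run


# Usage: one process, one run, one event.
process = Process("normalize_edr_events")
run = process.start_run()
run.record(source="edr_raw_events", target="normalized_events")
run.finish()
```

Keeping events under runs, and runs under processes, makes it possible to ask which execution of which transformation produced a given dataset.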

When managing security data across multiple tools, like security information and event management (SIEM) solutions, data lineage becomes complicated. For example, at the enterprise level, organizations struggle to trace data from its origin and throughout various transformations because they may:

Maintain more than 100 different cybersecurity tools
Incorporate multiple SIEMs or centralized log management solutions that all use different schemas
Lose data quality as data cascades through the hierarchy

Types of data lineage

Organizations struggle with data lineage because they often find that they need to choose between purpose and granularity when engaging in the process.

Business versus technical lineage

At a high level, the choice between these two means determining which is more important to the use case:

Business: context into business purpose and daily use, like comments, data classifications, justifications, data consumer notes
Technical: end-to-end visibility for data engineers and technical analysts about how data reached its destination team

When managing security data and analytics, data teams struggle to identify all internal data consumers across various teams, including IT, security, compliance, and operations. Because these teams have different needs, business lineage may be more important. Meanwhile, traditional data orchestration tools may not be able to extract technical lineage from the complex and diverse schemas that the IT and security stack uses.

Table-level versus column-level lineage

At this level, the choice between these two means determining whether location or granularity is more important:

Table-level: ways tables map to each other by using the metadata from the relational database or data lake
Column-level: ways that changing data in a table’s columns affects attributes like data type and precision, or how combining columns creates a new column
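
To make column-level lineage concrete, the Python sketch below records, for each derived column, the columns it was built from and the transformation applied, then traces a column back to its origins. The table names, column names, and transformation labels are hypothetical, and the recursion assumes the lineage contains no cycles.

```python
# Hypothetical column-level lineage:
# target column -> (source columns, transformation applied).
COLUMN_LINEAGE = {
    "events.user": (["raw.first_name", "raw.last_name"], "concatenate"),
    "events.bytes_sent": (["raw.bytes_sent"], "cast float64 -> int32"),
    "report.top_user": (["events.user"], "copy"),
}


def trace_column(lineage, column):
    """Return the original source columns that a column ultimately derives from."""
    if column not in lineage:
        # No recorded parents, so treat the column as an origin.
        return {column}
    sources, _transformation = lineage[column]
    origins = set()
    for parent in sources:
        origins |= trace_column(lineage, parent)
    return origins
```

Here trace_column(COLUMN_LINEAGE, "report.top_user") resolves to the two raw name columns, and the stored transformation label on events.bytes_sent documents a cast that could reduce precision.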

When managing security data analytics, teams often struggle because the data may be semi-structured, not easily lending itself to table mapping or column granularity.
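
One possible way to approximate column granularity for semi-structured records is to flatten each record into dotted field paths and track lineage per path. The Python sketch below shows that idea under stated assumptions: the path notation (dots for nesting, "[]" for list items) and the sample event are hypothetical and do not follow any specific schema.

```python
def field_paths(record, prefix=""):
    """Yield a dotted path for every leaf field in a nested record."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from field_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(record, list):
        # List items share one path, marked with "[]".
        for item in record:
            yield from field_paths(item, f"{prefix}[]")
    else:
        yield prefix


# Hypothetical nested security event.
event = {
    "device": {"hostname": "laptop-01", "os": {"name": "linux"}},
    "alerts": [{"severity": "high"}, {"severity": "low"}],
}
paths = sorted(set(field_paths(event)))
# paths == ["alerts[].severity", "device.hostname", "device.os.name"]
```

Each resulting path can then play the role that a column plays in a relational table when recording which fields a transformation reads and writes.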