Unveiling the Path: Why Data Lineage is Crucial for Building Effective AI Products
Understanding Data Lineage: Exploring Its Definition and Growing Adoption in Organizations
In today’s data-driven world, understanding the journey of data—from its origin to its final destination—is more crucial than ever. This capability, referred to as data lineage, provides a comprehensive view of how data flows through an organization, detailing its transformations and dependencies along the way. Data lineage varies in terms of levels of complexity, with “coarse lineage” demonstrating the table to table transformations, and “fine lineage” being at the attribute level. These assets can be mapped in tools such as Solidatus, providing an automated method for creating a clear overview of data sources, transformation, and usage. At Artefact, our teams design & build data & AI products for our clients day in and data out, and lineage helps our clients answer questions like: “Which systems are giving us this client balance, since it looks inaccurate?” or “Why is my client lending propensity model having different results specifically on Wednesdays?”
Companies leverage data lineage for several key reasons, with regulatory compliance and data quality management being at the forefront. In the financial services industry, robust data lineage is essential for meeting stringent auditing requirements and principles such as the BCBS 239, ensuring adherence to regulations around governance, data architecture, risk data aggregation, accuracy, integrity and frequency of risk reporting. Looking beyond compliance, data lineage is a powerful tool for enhancing data quality, enabling organizations to track data issues, validate accuracy, and maintain trust in their information systems. This article will delve into the intricacies of data lineage, specifically coarse lineage, and explore why it has become a cornerstone of modern data management strategies.
"AI’s Rapid Rise in Financial Services: Opportunities, Challenges, and the Path Forward"
Building on the importance of understanding data, Artificial Intelligence (AI) is transforming the modern financial services landscape, simulating human intelligence to perform tasks requiring learning and decision-making. AI’s applications are diverse and impactful: conversational AI, like chatbots, enhances customer interactions; productivity assistants streamline workflows and automate tasks; and automated data analysis accelerates insights from complex datasets. In August 2024, the European Union’s AI Act introduced new regulations aimed at ensuring ethical AI use and protecting user rights, highlighting the global shift towards responsible AI implementation. This development underscores the growing need for organizations to not only harness AI’s power but also manage it with careful oversight, complementing their efforts in data lineage and quality management.
While the use of open-source Generative AI like ChatGPT for personal use, integrating AI into an organization and generating real value for the business is a different ball game. Most financial institutions are in the rapid race of churning out Gen AI pilots and POCs however real dollars are only committed when it’s proved that they believe the potential benefits are dependable and the product is suitable for both business and technical users. Many institutions are still struggling to scale these technologies due to concerns about reliability (74%), user adoption (60%), and insufficient technical expertise (60%). A Gen AI scalability framework is what has been built by Artefact to address the core scalability dimensions: Output Relevancy, Explainability, Fairness/Bias, Latency, Infrastructure, Organizational Efficiency, and User Experience/Adoption.
In the context of AI, data lineage offers significant business value by ensuring transparency and reliability in data-driven decisions. Today, over 75% of consumers are concerned about misinformation from AI. AI is often referred to as a “black box”, meaning end users frequently do not understand the inner workings that produce the output they are regularly using. As AI systems increasingly rely on vast and complex datasets, understanding the origins and transformations of this data is crucial for maintaining accuracy and trustworthiness. Data lineage helps organizations track and validate the data feeding into AI models, which is essential for optimizing model performance and addressing issues like bias or errors. By providing a clear audit trail, data lineage also supports compliance with regulations and enhances data governance, ultimately leading to more informed, reliable, and ethical AI applications that drive better business outcomes.
Data Lineage in Action: How It Could Have Supercharged Real-World AI Development
Data lineage is crucial for meeting regulatory and legal requirements in AI, especially under policies like the California Consumer Privacy Act (CCPA) and the Gramm-Leach-Bliley Act (GLBA). For example, consider a use case involving customer turnover within a financial services firm. In this case, the system lacked standardized practices for anonymizing private information and had no data lineage to track data flows. As a result, data enrichment to mask sensitive details was performed as a last step with minimal governance. This approach not only compromised data privacy but also exposed the system to compliance risks. If our partnering organization had robust data lineage in Solidatus, the organization could have tracked where data was being used, captured data transformations, ensured proper anonymization at each stage, and met regulatory requirements more effectively, thereby safeguarding privacy and enhancing data governance.
The majority of organizations (80%) have claimed their data is ready to use in AI, however more than half (52%) experienced issues with implementation based off the quality of their data. Data lineage is vital for ensuring data quality in AI development, as it provides a clear view of how data is sourced, transformed, and utilized. At Artefact, we understand the imperative of data readiness and quality. We believe in an AI operating model that develops the technical requirements simultaneously with the data preparation and governance required to deploy large scale reliable AI. Our teams worked on a credit risk prediction model that relied on multiple data tables to assess borrower risk. The team discovered inconsistencies between these tables in their preliminary investigations—such as discrepancies in data formats or outdated information. This would cause the model to be skewed, and an inaccurate risk assessment to be generated. By implementing data lineage, the organization could trace the origins of data, identify where inconsistencies arise, and ensure that data transformations align with quality standards. This transparency helps in correcting issues before they impact the model, ultimately leading to more reliable and accurate predictions, and maintaining the overall integrity of the AI system.
The majority of organizations (80%) have claimed their data is ready to use in AI, however more than half (52%) experienced issues with implementation based off the quality of their data. Data lineage is vital for ensuring data quality in AI development, as it provides a clear view of how data is sourced, transformed, and utilized. At Artefact, we understand the imperative of data readiness and quality. We believe in an AI operating model that develops the technical requirements simultaneously with the data preparation and governance required to deploy large scale reliable AI. Our teams worked on a credit risk prediction model that relied on multiple data tables to assess borrower risk. The team discovered inconsistencies between these tables in their preliminary investigations—such as discrepancies in data formats or outdated information. This would cause the model to be skewed, and an inaccurate risk assessment to be generated. By implementing data lineage, the organization could trace the origins of data, identify where inconsistencies arise, and ensure that data transformations align with quality standards. This transparency helps in correcting issues before they impact the model, ultimately leading to more reliable and accurate predictions, and maintaining the overall integrity of the AI system.
Summary
Data lineage is transforming how organizations manage and leverage their data. From ensuring regulatory compliance to enhancing data quality, understanding the flow of data is no longer optional—it’s essential.
At Artefact, they design and build AI solutions that depend on reliable, traceable data to drive impactful outcomes. Tools like Solidatus allow them to map data’s journey and provide clarity on its sources, transformations, and uses. This clarity is the key to answering tough questions like why a model’s results vary or which systems contribute to inaccuracies.
In the financial services industry, where compliance standards like BCBS 239 demand precision, robust data lineage ensures that organizations can meet regulatory requirements while enhancing trust in their AI models. Beyond compliance, it is a cornerstone for building scalable, reliable AI systems that deliver real business value.