Optimizing Data Quality and Interpretability with Data Valuation


In the fast-evolving area of AI, the AI-DAPT framework stands out for emphasizing data quality and valuation as
driving factors for more accurate and robust Machine Learning (ML) models. Data valuation is a critical process
that assesses the quality and the contribution of data in relation to the performance of ML models or of an entire
AI system. Data valuation encompasses several aspects [6], including:

  • Feature Importance: Evaluating which features have the most impact on model results.
  • Identification of Relevant Data Points: Assessing how each data point contributes to the overall
    performance of the model. When the application at hand involves images or videos, data valuation
    identifies how much each image or frame contributes to the overall prediction.
  • Bias Detection and Fairness: Identifying any bias in the data that may influence the model’s predictions in
    an unfair way.
  • Data Quality: Assessing how accurate, complete, and reliable the data is.
  • Economic and Business Value: Evaluating the economic impact of the data (cost of collecting the data,
    cost of determining which data points are the most valuable, etc.).
  • Ethical and Legal Compliance: Ensuring that the data complies with ethical standards and legal
    regulations.
  • Data Traceability: Identifying where the data comes from, how it was collected, and whether it has been
    modified.

In the context of the AI-DAPT framework, data valuation is considered multi-dimensional and supportive of the
whole lifecycle of ML models. It features methods for assessing data quality, improving feature selection,
detecting biases and optimizing model interpretability. This blog post discusses state-of-the-art data valuation
methods and outlines how the AI-DAPT project will apply these methodologies to assess and enhance data
quality.

DATA VALUATION METHODS AND PURPOSES

Data valuation is crucial for model performance: the higher the data quality, the more accurate and precise the
model’s predictions will be [1]. A key aspect of assessing data quality is ensuring that models are not trained on
irrelevant features. One common approach is to systematically retrain the model while excluding one feature at a
time [2]. This allows the identification of the features that have the most significant impact, as well as of those that
might cause overfitting. However, such methods are computationally expensive because they require retraining
multiple models.
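
As a rough illustration of the leave-one-feature-out idea described above, the sketch below retrains a model once
per excluded feature and compares validation scores against a full-feature baseline. The synthetic dataset,
random-forest model, and cross-validation setup are illustrative assumptions, not AI-DAPT specifics.

```python
# Minimal leave-one-feature-out sketch: retrain once per excluded feature and
# compare cross-validated accuracy against the full-feature baseline.
# Dataset, model, and metric are illustrative choices only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

def mean_cv_score(features, labels):
    """Mean 5-fold cross-validation accuracy of a random forest."""
    return cross_val_score(RandomForestClassifier(random_state=0), features, labels, cv=5).mean()

baseline = mean_cv_score(X, y)
for j in range(X.shape[1]):
    reduced = np.delete(X, j, axis=1)             # drop feature j
    drop = baseline - mean_cv_score(reduced, y)   # positive drop => feature j carries signal
    print(f"feature {j}: score drop {drop:+.4f}")
```

Note that the cost grows linearly with the number of features, since each exclusion requires a full retraining pass,
which is exactly what makes such approaches expensive.
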
One of the most widely used techniques for evaluating feature importance and data relevance is Shapley values,
a game-theoretic approach that fairly attributes contributions to individual features [3]. This method addresses
the “black-box” challenge associated with deep learning models, enhancing data and model explainability
while also supporting tasks such as feature importance analysis and bias detection, which are crucial in the
context of AI-DAPT.
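
As one possible way of computing such attributions, the sketch below uses the open-source shap package with a
tree-based model. The library, model, and synthetic data are assumptions for illustration, not tooling mandated by
AI-DAPT.

```python
# Shapley-value feature attributions with the open-source `shap` package
# (one possible tooling choice; data and model are synthetic and illustrative).
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)                  # efficient Shapley attributions for tree ensembles
shap_values = explainer.shap_values(X)                 # one attribution per sample and feature
global_importance = np.abs(shap_values).mean(axis=0)   # mean |attribution| per feature = global importance
for j, v in enumerate(global_importance):
    print(f"feature {j}: mean |SHAP| = {v:.4f}")
```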

DATA VALUATION IN AI-DAPT

The AI-DAPT framework is built on the principle that data quality plays a critical role in the performance of ML
models. In domains where data quality and interpretability are particularly crucial, robust data valuation methods
are essential. Additionally, AI-DAPT emphasizes computational efficiency and scalability, recognizing that effective
data valuation must account for the resources required to process, analyze, and interpret large datasets in real
time. By combining computational techniques, statistical methods, and domain-specific considerations, AI-DAPT
employs a multifaceted approach that ensures both the integrity and effectiveness of its models, while optimizing
for performance and scalability in complex, data-driven environments.

  • The AI-DAPT project will collect datasets from different domains (healthcare, robotics, energy, and
    manufacturing). The value of each data point, as well as of each dataset as a whole, will need to be
    measured. To quantify the contribution of each data point toward specific tasks, we will use Shapley
    techniques (a Monte Carlo sketch of this idea follows the list below). Moreover, we will assess feature
    relevance and, where comparable public datasets exist, analyze and compare the overall impact and
    value that each dataset offers.
  • AI-DAPT’s data valuation will assess the quality and fairness of datasets, aiming to identify potential
    biases that may affect outcomes. The motivation behind this is that if the data input is biased, the output
    is likely to be biased as well [4]. Several methods will be used for this. Initially, exploratory data analysis
    will help detect anomalies and missing values. Moreover, class imbalances will be assessed, and
    reweighting or resampling techniques will be applied when necessary. Finally, we will use state-of-the-art
    open-source tools like IBM AI Fairness 360 [5], which offers metrics and algorithms that detect and
    mitigate biases in data, helping ensure fairness across different attributes (a sketch of such a check also
    follows below).
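
For the per-point valuation mentioned in the first bullet, the following is a minimal Monte Carlo sketch in the spirit
of Data Shapley [2]: it averages each training point’s marginal contribution to validation accuracy over random
permutations of the training set. The logistic-regression model, permutation count, and synthetic data are
illustrative assumptions rather than the exact AI-DAPT setup.

```python
# Monte Carlo estimate of per-point Shapley values in the spirit of Data Shapley [2]:
# average the marginal gain in validation accuracy of adding each point, over
# random permutations of the training set. All choices here are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

def subset_score(idx):
    """Validation accuracy of a model trained on subset `idx` (0.5 if the subset is unusable)."""
    if len(idx) < 2 or len(set(y_tr[idx])) < 2:
        return 0.5
    return LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx]).score(X_val, y_val)

rng = np.random.default_rng(0)
n, n_perm = len(X_tr), 50
values = np.zeros(n)
for _ in range(n_perm):
    perm = rng.permutation(n)
    prev = 0.5                                  # score of the empty set (random guess on balanced classes)
    for k, i in enumerate(perm, start=1):
        curr = subset_score(perm[:k])
        values[i] += curr - prev                # marginal contribution of point i in this permutation
        prev = curr
values /= n_perm
print("most valuable training points:", np.argsort(values)[::-1][:5])
```
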
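For the fairness assessment in the second bullet, a minimal check-and-reweight sketch with IBM AI Fairness
360 [5] might look as follows. The toy DataFrame, the protected attribute “sex”, and the group definitions are
purely illustrative assumptions.

```python
# Fairness check and pre-processing reweighting with IBM AI Fairness 360 [5].
# The toy data, protected attribute ("sex"), and group encoding are illustrative only.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

df = pd.DataFrame({
    "sex":     [0, 0, 0, 1, 1, 1, 1, 0],   # protected attribute (1 = privileged group)
    "feature": [3, 5, 2, 8, 7, 6, 9, 1],
    "label":   [0, 0, 1, 1, 1, 1, 0, 0],   # favourable outcome = 1
})
dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])

privileged, unprivileged = [{"sex": 1}], [{"sex": 0}]
metric = BinaryLabelDatasetMetric(dataset, privileged_groups=privileged,
                                  unprivileged_groups=unprivileged)
print("disparate impact before reweighing:", metric.disparate_impact())

# Reweighing assigns instance weights that balance outcome rates across groups.
reweighted = Reweighing(unprivileged_groups=unprivileged,
                        privileged_groups=privileged).fit_transform(dataset)
print("instance weights:", reweighted.instance_weights)
```
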
Our findings will be instrumental in optimizing data collection strategies in the domains of our demonstrators.
Therefore, the datasets collected and used in the context of AI-DAPT in the domains of healthcare, robotics,
energy, and manufacturing will contribute to unbiased, high-quality findings and support future research in these
emerging fields.

CONCLUSION

AI-DAPT’s focus on data valuation reflects a simple yet important reality of AI: what comes out is only as good as
what goes in. By improving the methodology behind data valuation, AI-DAPT is setting the stage for future AI
applications that are not only powerful and efficient but also reliable and fair.

REFERENCES

[1] K. Jiang, W. Liang, J. Y. Zou, and Y. Kwon, “OpenDataVal: a unified benchmark for data valuation,” Advances in
Neural Information Processing Systems, vol. 36, 2023.
[2] A. Ghorbani and J. Zou, “Data Shapley: Equitable valuation of data for machine learning,” International
Conference on Machine Learning, PMLR, 2019.
[3] R. Jia et al., “Towards efficient data valuation based on the Shapley value,” The 22nd International Conference
on Artificial Intelligence and Statistics, PMLR, 2019.
[4] M. Huang and R. Rust, “A strategic framework for artificial intelligence in marketing,” Journal of the Academy of
Marketing Science, vol. 49, pp. 30-50, 2021.
[5] “AI Fairness 360 – IBM,” [Online]. Available: https://aif360.res.ibm.com/
[6] R. Miller et al., “A Framework for Current and New Data Quality Dimensions: An Overview,” Data, vol. 9, no. 12,
p. 151, 2024.