Data consistency assessment facilitates transfer learning in ADME modeling

AssayInspector: A Python package for diagnostic assessment of data consistency in molecular datasets

Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments between benchmark and gold-standard sources that degrade model performance. Our analyses further revealed that dataset discrepancies arise from differences in various factors, from experimental conditions in data collection to chemical space coverage. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate a systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.

Keywords: data reporting, molecular property, ADME, physicochemical, machine learning, data aggregation, predictive accuracy, benchmark

Installation

To install and use the package, first create the conda environment as follows:

conda env create -f AssayInspector_env.yml

Then, activate the environment:

conda activate assay_inspector

Finally, install the package from PyPI using pip:

pip install assay_inspector

Getting Started

To run AssayInspector, you first need to prepare your input data. The file should be in .tsv or .csv format and include the following required columns:

smiles: The SMILES string representation of each molecule in the dataset.
value: The annotated value for each molecule — use a numerical value for regression tasks or a binary label (0 or 1) for classification tasks.
ref: The reference source name from which each value-molecule annotation was obtained.
endpoint: The name of the endpoint to analyze.

You can find two example input files for the half-life and clearance datasets.

Usage

Once the input data file has been prepared, you can run AssayInspector in the following way:

from assay_inspector import AssayInspector

# Prepare AssayInspector report
report = AssayInspector(
	data_path='path/to/dataset/file.tsv',
	endpoint_name='endpoint',
	task='regression',
	feature_type='ecfp4',
	reference_set='path/to/reference_set.tsv' # optional
)

# Run AssayInspector report
report.get_individual_reporting()
report.get_comparative_reporting()

AssayInspector arguments

Argument	Type	Description
`data_path`	`str`	Path to the input dataset file (`.csv` or `.tsv` format).
`endpoint_name`	`str`	Name of the endpoint to analyze.
`task`	`str`	Type of task: either `'regression'` or `'classification'`.
`feature_type`	`str`	Type of features to use: one of `'ecfp4'`, `'rdkit'`, or `'custom'`.
`outliers_method`	`str`	(Optional) Method to detect outliers: `'zscore'` (default) or `'iqr'`.
`distance_metric`	`str`	(Optional) Distance metric for custom descriptors: `'euclidean'` (default).
`descriptors_df`	`pd.DataFrame`	(Optional) DataFrame containing molecular descriptors for dataset molecules (required when `feature_type='custom'`).
`reference_set`	`str`	(Optional) Path to an additional dataset used for comparative analysis.
`lower_bound`	`int` or `float`	(Optional) Lower bound to define the endpoint applicability domain.
`upper_bound`	`int` or `float`	(Optional) Upper bound to define the endpoint applicability domain.

The resulting output will be saved in a folder named AssayInspector_YYYYMMDD, which will contain:

A tabular file that summarizes key descriptive parameters for each data source.
A comprehensive set of visualization plots that facilitate the detection of inconsistencies across data sources.
An insight report containing multiple alerts and recommendations to guide data cleaning and preprocessing.

Examples

Below are a few sample outputs generated by AssayInspector.

Endpoint	Outlier Visualization	Endpoint Distribution Comparative Visualization
Half-life
Clearance

License

AssayInspector is licensed under the MIT License. See the LICENSE file.

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
.github/workflows		.github/workflows
assay_inspector.egg-info		assay_inspector.egg-info
assay_inspector		assay_inspector
data		data
dist		dist
examples		examples
notebooks		notebooks
test_scripts		test_scripts
tests		tests
.gitignore		.gitignore
AssayInspector.svg		AssayInspector.svg
AssayInspector_env.yml		AssayInspector_env.yml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Data consistency assessment facilitates transfer learning in ADME modeling

Installation

Getting Started

Usage

AssayInspector arguments

Examples

License

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

chemotargets/assay_inspector

Folders and files

Latest commit

History

Repository files navigation

Data consistency assessment facilitates transfer learning in ADME modeling

Installation

Getting Started

Usage

AssayInspector arguments

Examples

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages