Paper ingested → Study type classified → PICO extracted → Claims atomized → Evidence graded → Claims indexed → Quality verified

Overview

Our system processes published research through an automated pipeline that transforms unstructured abstracts and metadata into structured, graded, and interlinked evidence. The pipeline has seven stages, each producing versioned outputs that can be re-run independently.

This page documents each stage in detail. If you have questions or concerns about our methodology, we welcome them — methodology@onehealth.science.

1. Ingestion

We continuously index research from cross-publisher metadata feeds, open-access repositories, and publisher-provided endpoints. Our ingestion prioritizes metadata and abstracts, which are available without paywall restrictions for the vast majority of published research.

Every paper receives a canonical identifier, using DOI as the primary key where available, with fallbacks to PMID, publisher ID, or a title+authors+year fingerprint. Deduplication runs on every ingestion cycle using DOI matching and fuzzy title comparison.
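The fallback chain can be sketched as follows. This is a minimal illustration, not our production schema: the field names and the fingerprint recipe are assumptions.

```python
import hashlib

def canonical_id(record: dict) -> str:
    """Pick the best available identifier, falling back in priority order:
    DOI, then PMID, then publisher ID, then a title+authors+year fingerprint."""
    for key in ("doi", "pmid", "publisher_id"):
        if record.get(key):
            return f"{key}:{str(record[key]).strip().lower()}"
    # Last resort: a stable fingerprint of title + authors + year.
    basis = "|".join([
        record.get("title", "").strip().lower(),
        ",".join(a.strip().lower() for a in record.get("authors", [])),
        str(record.get("year", "")),
    ])
    return "fp:" + hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]
```

Because the fingerprint is deterministic, the same title/authors/year always yields the same key, so re-ingesting a paper without a DOI still deduplicates cleanly.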

We store three layers of data separately: the raw source payload (immutable), the normalized canonical record, and the derived AI outputs. This separation means we can re-run extraction models, improve rubrics, or update our pipeline without corrupting the underlying data.

For details on which journals and databases we currently index, see Sources & Coverage.

2. Study Type Classification

Each paper is classified into one of the following study types based on its title and abstract:

| Category | Study Types |
| --- | --- |
| Evidence synthesis | Systematic review, meta-analysis |
| Experimental | Randomized controlled trial, clinical trial (non-randomized), crossover study |
| Observational | Cohort study, case-control study, cross-sectional study, diagnostic accuracy study |
| Descriptive | Case series, case report, field study, surveillance study |
| Laboratory | In vitro study, animal model (experimental), pharmacokinetic study |
| Narrative | Narrative review, guideline, expert opinion, editorial |

Classification uses a large language model with constrained output labels. A confidence score is stored alongside each classification. Papers classified below our confidence threshold are flagged for secondary review by a lightweight ML fallback model.
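The constrained-label and threshold logic amounts to the following sketch. The label set is abridged and the 0.8 threshold is an illustrative value, not our production setting.

```python
from dataclasses import dataclass

# Abridged subset of the constrained label set from the table above.
STUDY_TYPES = {
    "systematic review", "randomized controlled trial", "cohort study",
    "case-control study", "case report", "in vitro study", "narrative review",
}

CONFIDENCE_THRESHOLD = 0.8  # illustrative, not the production value

@dataclass
class Classification:
    label: str
    confidence: float
    needs_review: bool  # True => routed to the fallback model

def finalize(label: str, confidence: float) -> Classification:
    """Reject labels outside the constrained set; flag low-confidence outputs
    for secondary review."""
    if label not in STUDY_TYPES:
        raise ValueError(f"label outside constrained set: {label!r}")
    return Classification(label, confidence,
                          needs_review=confidence < CONFIDENCE_THRESHOLD)
```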

3. PICO Extraction

For each paper, we perform PICO-aligned structured extraction of Population, Intervention, Comparator, and Outcome fields. When no comparator exists (as in many observational and descriptive studies), the field is left empty rather than forcing an artificial structure.

| Field | What we extract | Example |
| --- | --- | --- |
| Population | Species, breed (if stated), age range, clinical setting | Dogs, Labrador Retriever, 2–8 years, referral hospital |
| Intervention / Exposure | Treatment, drug, procedure, or risk factor | Maropitant 1 mg/kg IV q24h |
| Comparator | Control group or alternative treatment (if any) | Ondansetron 0.5 mg/kg IV q12h |
| Outcome | Primary and secondary outcomes measured | Episodes of emesis per 24 h, appetite score |

PICO fields use free-form extraction rather than controlled vocabulary. This is deliberate: veterinary papers rarely follow clean PICO structure, and constrained extraction would either lose information or hallucinate structure that isn't there. Comparability across papers is achieved through semantic search — PubMedBERT embeddings ensure that "Dogs, Labrador Retriever, 2–8 years" and "Canine, Labs, adult" match by meaning, not string. Where controlled vocabulary matters (species, drugs, conditions), entity linking to SNOMED-VT handles normalization at the entity level.

4. Claim Extraction

This is the core of our extraction pipeline. We use a triple-extraction consensus process: three independent AI extractions run on every abstract at temperature 1.0, producing diverse candidate claims. These are then clustered by semantic similarity (PubMedBERT embeddings, cosine threshold ≥ 0.85), and only claims that appear in at least 2 of 3 extractions survive into the final output. This dramatically reduces hallucinated or idiosyncratic claims.
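The consensus step can be sketched as a greedy clustering over candidate claims from the three runs. This is an illustrative simplification: the toy embeddings here are two-dimensional (real ones are 768-dimensional PubMedBERT vectors), and the greedy assignment stands in for whatever clustering strategy is actually used.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def consensus(runs, threshold=0.85, min_votes=2):
    """runs: one list of (claim_text, embedding) pairs per extraction run.
    Greedily cluster claims by cosine similarity; a cluster survives only if
    it drew members from at least `min_votes` distinct runs."""
    clusters = []  # each: {"rep": embedding, "texts": [...], "runs": set()}
    for run_idx, claims in enumerate(runs):
        for text, emb in claims:
            for c in clusters:
                if cosine(emb, c["rep"]) >= threshold:
                    c["texts"].append(text)
                    c["runs"].add(run_idx)
                    break
            else:
                clusters.append({"rep": emb, "texts": [text], "runs": {run_idx}})
    return [c["texts"][0] for c in clusters if len(c["runs"]) >= min_votes]
```

A claim phrased differently in two runs still clusters together (its embeddings are near-identical), while a claim hallucinated by a single run forms a one-vote cluster and is dropped.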

Each surviving claim is structured with the following attributes:

| Attribute | Description | Values |
| --- | --- | --- |
| Claim text | Normalized, atomic statement of the finding | Free text (standardized format) |
| Polarity | Direction of the finding | Positive, Negative, Null |
| Causality strength | Language reflecting causal confidence | "caused," "associated with," "no evidence" |
| Claim type | The kind of assertion | Efficacy, Harm, Diagnostic, Epidemiological, Mechanistic |
| Provenance | Source sentence index in the abstract | Integer offset(s) or character span(s) |

Provenance is non-negotiable. Every claim must map to one or more specific sentences in the source abstract. We validate this with source passage verification: the cited text must appear as a verbatim substring of the original abstract, and any numbers in the claim are cross-checked against the source passage. Claims that fail verification are excluded. This is our primary defense against hallucinated findings.
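Source passage verification reduces to two checks, sketched below. The number-matching regex is a simplification of whatever matching is actually used; it ignores formatting variants like "1,200" or percentages written out in words.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def verify_claim(claim_text: str, cited_passage: str, abstract: str) -> bool:
    """A claim passes only if (1) its cited passage is a verbatim substring of
    the abstract, and (2) every number in the claim also appears in the
    cited passage."""
    if cited_passage not in abstract:
        return False
    passage_numbers = set(NUMBER.findall(cited_passage))
    return all(n in passage_numbers for n in NUMBER.findall(claim_text))
```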

5. Evidence Grading — GRADE Framework

We use the GRADE (Grading of Recommendations Assessment, Development and Evaluation) framework, the international standard used by the WHO, Cochrane, and over 100 organizations worldwide for rating evidence quality. Each source receives a GRADE quality level based on study design, with adjustments for methodological factors.

This is not a substitute for expert systematic review — it is a transparent, reproducible triage that helps users prioritize and contextualize evidence using a recognized clinical standard.

Quality levels by study design

| Quality Level | Starting Study Designs | What this means |
| --- | --- | --- |
| High | Systematic reviews, meta-analyses, well-designed RCTs | High confidence that the true effect is close to the estimate. Further research is unlikely to change our confidence. |
| Moderate | RCTs with limitations, well-designed cohort studies | Moderate confidence. The true effect is likely close to the estimate, but there is a possibility it could be substantially different. |
| Low | Observational studies (cohort, case-control, cross-sectional) | Limited confidence. The true effect may be substantially different from the estimate. |
| Very Low | Case series, case reports, in vitro studies, expert opinion | Very little confidence. The estimate is very uncertain. Useful for hypothesis generation. |

GRADE adjustments

Following GRADE methodology, the initial quality level can be adjusted based on factors that increase or decrease confidence:

| Factor | Direction | When applied |
| --- | --- | --- |
| Risk of bias | Downgrade | No blinding, allocation concealment issues, or high attrition detected |
| Imprecision | Downgrade | Very small sample sizes or wide confidence intervals |
| Indirectness | Downgrade | Surrogate endpoints rather than clinical outcomes |
| Large effect | Upgrade | Observational studies with large, consistent effect sizes |
| Dose-response | Upgrade | Clear dose-response gradient detected in language cues |
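The adjustment arithmetic amounts to moving along the four-level scale, one step per applicable factor, clamped at both ends. A minimal sketch:

```python
LEVELS = ["Very Low", "Low", "Moderate", "High"]

def adjust_grade(start: str, downgrades: int, upgrades: int) -> str:
    """Shift the starting GRADE level down one step per downgrade factor and
    up one step per upgrade factor, clamped to the scale."""
    idx = LEVELS.index(start) - downgrades + upgrades
    return LEVELS[max(0, min(idx, len(LEVELS) - 1))]
```

So an RCT with detected risk of bias lands at Moderate, while an observational study with a large, consistent effect can climb from Low to Moderate.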

Source quality score

Alongside the GRADE quality level, each source receives a composite quality score (0–10) that captures additional quality signals in a single number. The formula is:

Quality Score = (Study Design × 0.40 + Journal Tier × 0.25 + Sample Size × 0.20 + Recency × 0.15) × 10

| Component | Weight | How it's calculated |
| --- | --- | --- |
| Study design | 40% | Mapped from study type classification: systematic reviews and RCTs score highest, case reports and expert opinion score lowest. |
| Journal tier | 25% | Inverse ranking of the source journal within its discipline. Higher-tier journals receive higher scores. |
| Sample size | 20% | Log-scaled: log(n) normalized to 0–1. This rewards larger studies while preventing extreme values from dominating. |
| Recency | 15% | More recent publications score higher, reflecting current best evidence. Older landmark studies retain value through high citation counts. |

The final score is clamped to the 0–10 range. Both the GRADE quality level and the composite score are displayed on every Evidence Card, so you always have two complementary views of evidence quality.
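A minimal sketch of the scoring arithmetic, assuming each component input is already normalized to [0, 1]. The sample-size saturation cap of 10,000 is an assumption for illustration, not a documented constant.

```python
import math

def quality_score(design, journal_tier, sample_size, recency):
    """Weighted sum of components (each in [0, 1]), scaled to 0-10 and clamped."""
    raw = (design * 0.40 + journal_tier * 0.25
           + sample_size * 0.20 + recency * 0.15) * 10
    return max(0.0, min(10.0, round(raw, 1)))

def sample_size_component(n, n_cap=10_000):
    """Log-scaled sample size normalized to [0, 1]; n_cap is an assumed
    saturation point above which larger n no longer adds score."""
    if n <= 1:
        return 0.0
    return min(1.0, math.log(n) / math.log(n_cap))
```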

Two views of quality. The GRADE level tells you the type of evidence and how much confidence to place in it. The composite score integrates additional signals (journal quality, sample size, recency) into a single comparable number. Together, they give you both the categorical judgment and the granular detail.

6. Entity Normalization

Without normalization, the evidence index fragments. "Dogs," "canines," "Canis lupus familiaris," and "canine" must all resolve to the same entity. The same applies to pathogens, drugs, diseases, and procedures.

Our entity linking is RAG-grounded to SNOMED-VT (Systematized Nomenclature of Medicine — Veterinary Terminology), a curated ontology of 2,000+ veterinary concepts. Rather than maintaining separate dictionaries, we embed the ontology using PubMedBERT and use retrieval-augmented generation to link extracted terms to canonical concepts.

The linking process works in two steps: first, candidate concepts are retrieved by embedding similarity; then an LLM selects the best match given the surrounding abstract context. This approach handles synonyms, abbreviations, and ambiguous terms (for example, "parvo" in a vaccine context versus a disease context) without brittle string matching.
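Step 1, candidate retrieval, can be sketched with toy embeddings standing in for PubMedBERT vectors; step 2 would pass the returned candidates plus the surrounding abstract context to an LLM for final selection. The ontology entries below are illustrative stand-ins.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_candidates(term_embedding, ontology, k=3):
    """Step 1: nearest ontology concepts by embedding similarity.
    `ontology` maps concept name -> embedding (toy 2-D stand-ins here)."""
    ranked = sorted(ontology.items(),
                    key=lambda kv: cosine(term_embedding, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The point of the two-step design is that retrieval narrows thousands of concepts to a handful cheaply, and the LLM only has to disambiguate among those few with full context.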

7. Semantic Linking

Evidence linking is powered by PubMedBERT-based semantic similarity search across 768-dimensional vector embeddings. Every claim, entity, and source is embedded into a shared vector space optimized for biomedical language. This enables finding related claims across papers and species — even when different terminology is used.

How it works. When a claim is extracted, it is embedded using PubMedBERT and stored in a vector index (LanceDB). At query time, the system retrieves semantically similar claims from across the corpus using cosine similarity. This means a search for "maropitant efficacy in canine chemotherapy-induced emesis" will also surface related findings about antiemetics in other species or different clinical contexts — without requiring exact keyword matches.

Cross-species discovery. Because entities are normalized to SNOMED-VT concepts and claims are embedded in a shared biomedical vector space, the system can surface evidence connections across species boundaries. A search about a zoonotic pathogen will return relevant findings from veterinary, wildlife, and human health literature in a single query.

This architecture delivers the core user-facing promise — find related claims across papers and species — without the complexity of a dedicated graph database. The vector similarity approach is both simpler to maintain and more robust to the terminological variation common in veterinary literature.

8. Quality Controls

Automated evidence synthesis without human review requires strong automated safeguards. Our quality control operates at multiple levels:

Triple-extraction consensus. Every abstract is processed three times independently (at temperature 1.0 for maximum diversity). The resulting claims are embedded using PubMedBERT and clustered by cosine similarity (≥ 0.85 threshold). Only claims that appear in at least 2 of 3 extractions survive into the final output. This is our most powerful defense against hallucinated or idiosyncratic findings.

Source passage validation. Every claim must cite a supporting passage from the original abstract. We verify this computationally: the cited text must appear as a verbatim substring of the abstract. Additionally, any numbers referenced in the claim (sample sizes, p-values, effect sizes) are cross-checked against the source passage. Claims that fail either check are excluded.

Schema validation. Every model output must validate against strict Pydantic schemas. Outputs that fail validation are rejected and re-processed. This ensures structural integrity for every field in every Evidence Card.
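In production this validation runs through Pydantic; the same idea can be sketched with a standard-library dataclass whose constructor rejects out-of-schema values. The field names follow the claim attribute table in section 4.

```python
from dataclasses import dataclass

POLARITIES = {"Positive", "Negative", "Null"}
CLAIM_TYPES = {"Efficacy", "Harm", "Diagnostic", "Epidemiological", "Mechanistic"}

@dataclass(frozen=True)
class ClaimRecord:
    text: str
    polarity: str
    claim_type: str
    provenance: tuple  # sentence indices in the abstract

    def __post_init__(self):
        if not self.text.strip():
            raise ValueError("empty claim text")
        if self.polarity not in POLARITIES:
            raise ValueError(f"invalid polarity: {self.polarity!r}")
        if self.claim_type not in CLAIM_TYPES:
            raise ValueError(f"invalid claim type: {self.claim_type!r}")
        if not self.provenance:
            raise ValueError("claim must carry provenance")
```

An output missing provenance or carrying an out-of-vocabulary polarity never constructs at all, which is what "rejected and re-processed" means in practice.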

Veterinary review workflow. The platform includes an API-driven review workflow where qualified veterinary professionals can review, flag, and annotate Evidence Cards. Reviewed cards carry a verification status that is displayed alongside automated quality indicators.

9. Versioning

We treat every output as a scientific artifact. Each claim, grade, and extraction stores:

| Metadata | Purpose |
| --- | --- |
| Model version | Which extraction model produced this output |
| Prompt version | Which prompt template was used |
| Rubric version | Which grading rubric was applied |
| Extraction timestamp | When this output was generated |

This means every evidence grade can be traced to a specific rubric version. When we update our rubric, we re-grade affected claims and clearly indicate the rubric version in the output. You can always see: "This source received Moderate quality under Rubric v1.0."
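A sketch of the stored provenance record (field names and version strings are illustrative, not our actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ExtractionProvenance:
    """Version metadata attached to every derived output."""
    model_version: str
    prompt_version: str
    rubric_version: str
    extracted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def stale_against(self, current_rubric: str) -> bool:
        """True if this output was graded under an older rubric and should
        be queued for re-grading."""
        return self.rubric_version != current_rubric
```

When the rubric version bumps, a single scan over stored records identifies everything that needs re-grading, without touching the immutable source payloads.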

10. Known Limitations

We are transparent about what our system can and cannot do:

Abstract-only extraction (v1). Our current pipeline extracts from abstracts and metadata only, not full-text articles. This means we may miss important nuance, subgroup analyses, or detailed methodology that appears only in the full text. We plan to add open-access full-text ingestion in a future release.

Automated grading is not peer review. Our evidence grades reflect study design and detectable quality signals. They do not assess internal validity, risk of bias, or methodological rigor at the level of a formal systematic review. The grades are a starting point for evidence triage, not a final judgment.

Coverage gaps. Our corpus depends on the availability of metadata and abstracts from our source feeds. Some journals, particularly regional or non-English publications, may be underrepresented. We are actively expanding coverage and welcome suggestions.

Entity normalization is imperfect. Despite ontology grounding and context-aware matching, some terms may not resolve correctly, particularly for rare species, emerging pathogens, or newly named drugs. We version our ontology snapshots and linking models and improve them continuously.

No clinical recommendations. We never say "you should" or "the evidence recommends." Clinical decisions involve patient-specific factors, practitioner expertise, and contextual judgment that are beyond the scope of any automated evidence tool.