Biomedical Knowledge Graph · MINERVA Extension

Imaging phenotypes as a first-class layer in the microbiome graph

SanoMap's Radiomics Layer closes the gap between gut microbiome evidence and imaging-derived disease markers. Radiomic features and body-composition measurements are explicit intermediate nodes, neither inferred nor implied. Each is grounded in direct evidence extracted from the literature and in verified quantitative correlations read from figures.

1,016 Papers in corpus (6 query lanes)
8 Node types
12 Edge types
29 Signed microbe–disease pairs
156 Passing tests
r=0.95 Verified Vision Track correlation
Graph Schema

The professor's four-part chain — complete

Every edge in the graph is grounded in direct evidence. Inferred bridge matches are never promoted to asserted relationships. The imaging backbone (BodyLocation, ImagingModality, ImageRef) is now fully wired.

Microbe ─[:CORRELATES_WITH]─▶ RadiomicFeature ─[:ASSOCIATED_WITH]─▶ Disease
Microbe ─[:CORRELATES_WITH]─▶ BodyCompositionFeature ─[:ASSOCIATED_WITH]─▶ Disease
MicrobialSignature ─[:CORRELATES_WITH]─▶ RadiomicFeature ─[:MEASURED_AT]─▶ BodyLocation
RadiomicFeature ─[:ACQUIRED_VIA]─▶ ImagingModality ─[:REPRESENTED_BY]─▶ ImageRef
Microbe ─[:POSITIVELY_CORRELATED_WITH|NEGATIVELY_CORRELATED_WITH]─▶ Disease
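
The five patterns above can be read as a whitelist of (source label, relationship, target label) triples. The following is a minimal sketch of such a whitelist check, not the project's actual code: the names `ALLOWED_TRIPLES`, `is_allowed_edge`, `distinct_relationships`, and `node_labels` are hypothetical, and only the labels and relationship names are taken from the patterns.

```python
# Triples transcribed from the five schema patterns above.
# The container and helper names are illustrative assumptions.
ALLOWED_TRIPLES = {
    ("Microbe", "CORRELATES_WITH", "RadiomicFeature"),
    ("Microbe", "CORRELATES_WITH", "BodyCompositionFeature"),
    ("MicrobialSignature", "CORRELATES_WITH", "RadiomicFeature"),
    ("RadiomicFeature", "ASSOCIATED_WITH", "Disease"),
    ("BodyCompositionFeature", "ASSOCIATED_WITH", "Disease"),
    ("RadiomicFeature", "MEASURED_AT", "BodyLocation"),
    ("RadiomicFeature", "ACQUIRED_VIA", "ImagingModality"),
    ("ImagingModality", "REPRESENTED_BY", "ImageRef"),
    ("Microbe", "POSITIVELY_CORRELATED_WITH", "Disease"),
    ("Microbe", "NEGATIVELY_CORRELATED_WITH", "Disease"),
}


def is_allowed_edge(source_label, relationship, target_label):
    """Return True only if the triple appears in the schema patterns."""
    return (source_label, relationship, target_label) in ALLOWED_TRIPLES


def distinct_relationships():
    """Relationship names used by the patterns, sorted for stable output."""
    return sorted({rel for _, rel, _ in ALLOWED_TRIPLES})


def node_labels():
    """Node labels that appear on either end of any pattern."""
    return sorted({label for src, _, dst in ALLOWED_TRIPLES for label in (src, dst)})
```

A check like this rejects, for example, a direct `Microbe ─[:ASSOCIATED_WITH]─▶ Disease` edge, because that triple is not among the patterns. The patterns use seven distinct relationship names and cover the eight node labels described in the Node Inventory.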
Node Inventory

Eight node types, grounded in direct evidence

Each node type maps to a specific evidence source. No speculative nodes.

🦠

Microbe

Taxon-specific entities from microbial NER. Examples: Fusobacterium nucleatum, Akkermansia muciniphila.

🔬

MicrobialSignature

Non-taxon microbiome states: dysbiosis, alpha diversity, beta diversity, intratumoral microbiome.

📊

RadiomicFeature

IBSI-backed quantitative imaging features. Examples: glcm_entropy, first_order_kurtosis.

⚖️

BodyCompositionFeature

Imaging-derived phenotype markers: skeletal_muscle_index, visceral_adipose_tissue, sarcopenia, myosteatosis.

🏥

Disease

Disease or outcome concepts grounded in paper text. Filtered through shared span cleanup before edge promotion.

📍

BodyLocation

12 anatomical sites where imaging measurements are taken: liver, lung, colon, abdomen, muscle, bone, and more.

🖥️

ImagingModality

4 modalities with DICOM codes: CT/CT, MRI/MR, PET/PT, DXA/DXA.

🖼️

ImageRef

Verified figure references from the Vision Track. Stores PMCID, figure ID, topology, and image path. Completes the chain to representative evidence.

Vision Track

Quantitative correlation extraction from medical figures

The Vision Track uses a VLM to propose r-values from heatmap figures, then gates every proposal through a deterministic pixel-level verifier before any edge is asserted. Figures that are not continuous gradient heatmaps — or that lack a microbe entity — are correctly rejected.

PMC10605408 · Fig (CT texture heatmap)
Verified ✓
Prevotella nigrescens ↔ GLCM_Correlation
r = 0.95 · distance_metric = 0.05 · support_fraction = 1.0
→ 1 ImageRef node emitted to graph

PMC10176953 · Fig6 (microbiota ↔ radiomics)
Rejected (correct)
VLM proposed Actinomyces r=−0.4 (paper text: R=−0.510)
Rejection reason: dot/bar-plot style, not a continuous gradient heatmap

PMC11924647 · Fig4 (radiomics feature heatmap)
Rejected (correct)
Feature-to-feature correlation only. No microbe entity present.
Verifier correctly gates non-microbe figures.
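
The page does not spell out how the pixel-level verifier works, so the following is only a sketch of one way a VLM-proposed r-value could be checked deterministically against heatmap pixels. It assumes a diverging colorbar that is linear in RGB between three known colors, and it covers only the value check, not the heatmap-topology or microbe-entity gates. All function names, the `tolerance` and `min_support` parameters, and the returned `distance` and `support` fields are hypothetical; they are not claimed to match the `distance_metric` and `support_fraction` values reported above.

```python
from statistics import median


def build_colorbar(low_rgb, mid_rgb, high_rgb, steps=201):
    """Return [(r_value, rgb), ...] for an assumed diverging colormap on [-1, 1]."""
    def lerp(a, b, t):
        return tuple(a[i] + (b[i] - a[i]) * t for i in range(3))

    bar = []
    for k in range(steps):
        r = -1.0 + 2.0 * k / (steps - 1)
        rgb = lerp(low_rgb, mid_rgb, r + 1.0) if r < 0 else lerp(mid_rgb, high_rgb, r)
        bar.append((r, rgb))
    return bar


def decode_pixel(pixel, colorbar):
    """Map one RGB pixel to the r-value of the nearest colorbar color."""
    def sq_dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in range(3))

    return min(colorbar, key=lambda entry: sq_dist(pixel, entry[1]))[0]


def verify_proposal(proposed_r, cell_pixels, colorbar, tolerance=0.1, min_support=0.8):
    """Accept a proposal only if the cell's decoded value agrees with it."""
    if not cell_pixels:
        return {"verified": False, "decoded_r": None, "distance": None, "support": 0.0}
    decoded = [decode_pixel(p, colorbar) for p in cell_pixels]
    cell_r = median(decoded)
    # Gap between the proposed value and the value read from the pixels.
    distance = abs(proposed_r - cell_r)
    # Fraction of sampled pixels that agree with the cell's median value.
    support = sum(abs(d - cell_r) <= tolerance for d in decoded) / len(decoded)
    verified = distance <= tolerance and support >= min_support
    return {"verified": verified, "decoded_r": cell_r, "distance": distance, "support": support}
```

Because every step is a fixed arithmetic rule over pixel values, the same figure and proposal always give the same verdict, which is the property that lets a check of this kind act as a gate on VLM output.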
Pipeline

Nine-stage evidence extraction pipeline

Aligned with MINERVA methodology. Each stage produces validated JSONL artifacts.

1

Literature Retrieval

src/harvest_pubmed.py

Split query profiles: strict radiomics, adjacent imaging, body-composition. 640-paper expanded corpus.

2

Corpus Merge + PMC Full-Text

src/merge_paper_corpora.py · src/download_pmc_fulltext.py

Deduplicated merge. Full-text preferred over abstract-only for downstream NER.

3

Phenotype Text Extraction

src/extract_radiomics_text.py

IBSI-backed radiomics + body-composition vocabulary. Detects BodyLocation (42.8% coverage) and ImagingModality. Stopword-guarded disease detection.

4

Disease + Microbe NER

src/text_ner_minerva.py

MINERVA-aligned sentence-level NER. MPS-accelerated microbe NER on Apple Silicon. BC5CDR disease NER.

5

Relation Input Construction

src/build_relation_input.py

Joins sentence evidence with phenotype context. Threads subject_node_type for graph typing.

6

Relation Extraction

src/relation_extract_stage.py

Self-consistency 3-label classification (Positive / Negative / Unrelated). Hosted via Gemini 2.5 Flash-Lite. 9 clean validated microbe–disease pairs.

7

Span Cleanup

src/span_cleanup.py

Pre-inference entity cleanup. Normalizes genus-containing products, finding-in-disease patterns, clause fragments.

8

Vision Track

src/index_figures.py · src/propose_vision_qwen.py · src/verify_heatmap.py

VLM proposes r-values from PMC figures. Deterministic pixel-level verifier gates every proposal. Only verified figures become ImageRef nodes.

9

Graph Assembly

src/assemble_edges.py

Emits Neo4j-ready edge CSV, BodyLocation/ImagingModality/ImageRef nodes, and audit-only phenotype-axis candidates. Bridge hypotheses are explicitly not ingested.

Validation Baseline

Current graph-ready output snapshot

All artifacts are schema-validated JSONL. The test suite runs in the Conda base environment on Apple Silicon (MPS).

Artifact | Count | Status
Validated microbe–disease pairs (Gemini, self-consistency) | 9 | Graph-ready
Phenotype-to-disease text edges (ASSOCIATED_WITH) | 23 | Graph-ready
BodyLocation nodes | 12 | Graph-ready
ImagingModality nodes (CT, MRI, PET, DXA) | 4 | Graph-ready
MEASURED_AT + ACQUIRED_VIA backbone rows | 50 | Graph-ready
ImageRef nodes (Vision Track verified) | 1 | Graph-ready
Phenotype-axis candidates (audit-only) | 233 | Audit-only
Bridge hypotheses (audit-only, not ingested) | 232 | Audit-only
pytest checks passing | 156 | Green
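
As an illustration of what schema validation of a JSONL artifact can involve, here is a minimal sketch using only the standard library. The field names in `REQUIRED_FIELDS` and the function `validate_jsonl` are assumptions made for this example; the project's real schemas are not shown on this page.

```python
import json

# Hypothetical required fields for an edge-like record; the real schema may differ.
REQUIRED_FIELDS = {
    "subject": str,
    "relation": str,
    "object": str,
    "pmid": str,
    "evidence": str,
}


def validate_jsonl(lines, required=REQUIRED_FIELDS):
    """Split JSONL lines into valid records and (line_number, message) errors."""
    valid, errors = [], []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((line_no, "record is not a JSON object"))
            continue
        # A field fails if it is absent, has the wrong type, or is an empty string.
        bad = [
            name
            for name, expected in required.items()
            if not isinstance(record.get(name), expected) or record[name] == ""
        ]
        if bad:
            errors.append((line_no, "missing or invalid fields: " + ", ".join(sorted(bad))))
        else:
            valid.append(record)
    return valid, errors
```

Returning errors with their line numbers, rather than raising on the first failure, makes it easier to audit a whole artifact in one pass.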
Corpus Growth

Literature coverage by year · 1,016 papers across 6 query lanes

The initial 640-paper corpus spanned 3 query lanes (strict radiomics, adjacent imaging, body composition). Four domain-specific lanes were added: liver radiomics, bone DXA, lung CT phenotypes, and colorectal imaging.
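
Papers can match more than one query lane, so combining lanes requires the deduplicated merge named in pipeline stage 2. The sketch below shows one way to do that while preferring full text over abstract-only records. It is an assumption-laden illustration: the record keys (`pmid`, `full_text`, `lanes`) and the function `merge_lanes` are hypothetical and are not taken from `src/merge_paper_corpora.py`.

```python
def merge_lanes(lanes):
    """Merge per-lane paper lists into one corpus keyed by PMID.

    `lanes` maps a lane name to a list of paper dicts. Each paper dict must
    carry a "pmid" key and may carry a "full_text" key.
    """
    merged = {}
    for lane_name, papers in lanes.items():
        for paper in papers:
            pmid = paper["pmid"]
            current = merged.get(pmid)
            if current is None:
                # Copy the record so the caller's input is never mutated.
                merged[pmid] = {**paper, "lanes": [lane_name]}
                continue
            if lane_name not in current["lanes"]:
                current["lanes"].append(lane_name)
            # Prefer full text: upgrade an abstract-only record when possible.
            if paper.get("full_text") and not current.get("full_text"):
                current["full_text"] = paper["full_text"]
    return list(merged.values())
```

Keeping the list of lanes on each merged record preserves the information about which queries retrieved a paper, which is useful when reporting coverage per lane.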

Graph Statistics

183 graph rows · 7 edge types · direct evidence only

Every edge carries a source PMID and evidence string. Text-derived edges use ASSOCIATED_WITH; Vision Track-verified edges use CORRELATES_WITH. Bridge hypotheses are excluded from graph import.
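
The rules in the paragraph above can be sketched as a small filter that runs before the CSV is written. This is an illustrative example, not the code in `src/assemble_edges.py`: the candidate keys (`kind`, `start`, `end`, `pmid`, `evidence`), the kind names, and the CSV column names are all assumptions, and the signed microbe–disease edge types are left out for brevity.

```python
import csv
import io

# Assumed mapping from candidate kind to relationship type. Kinds that are not
# listed here, including bridge hypotheses, never reach the graph.
RELATION_BY_KIND = {
    "text_edge": "ASSOCIATED_WITH",
    "vision_edge": "CORRELATES_WITH",
}

FIELDNAMES = ["start", "type", "end", "pmid", "evidence"]


def to_graph_rows(candidates):
    """Keep only candidates that qualify as direct-evidence graph rows."""
    rows = []
    for cand in candidates:
        relation = RELATION_BY_KIND.get(cand.get("kind"))
        if relation is None:
            continue  # audit-only material such as bridge hypotheses
        if not cand.get("pmid") or not cand.get("evidence"):
            continue  # every edge must carry a source PMID and evidence string
        rows.append({
            "start": cand["start"],
            "type": relation,
            "end": cand["end"],
            "pmid": cand["pmid"],
            "evidence": cand["evidence"],
        })
    return rows


def write_edge_csv(rows, handle):
    """Write rows with a header line to any text file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Usage with placeholder values; the PMID strings are not real identifiers.
demo_candidates = [
    {"kind": "text_edge", "start": "skeletal_muscle_index", "end": "Cirrhosis",
     "pmid": "PMID_PLACEHOLDER", "evidence": "example sentence"},
    {"kind": "vision_edge", "start": "Prevotella nigrescens", "end": "GLCM_Correlation",
     "pmid": "PMID_PLACEHOLDER", "evidence": "r = 0.95"},
    {"kind": "bridge_hypothesis", "start": "Microbe X", "end": "Disease Y",
     "pmid": "PMID_PLACEHOLDER", "evidence": "inferred"},
    {"kind": "text_edge", "start": "sarcopenia", "end": "Obesity",
     "pmid": "", "evidence": "example sentence"},
]
demo_rows = to_graph_rows(demo_candidates)
demo_buffer = io.StringIO()
write_edge_csv(demo_rows, demo_buffer)
```

In the usage lines, only the first two candidates survive: the bridge hypothesis is dropped by kind, and the last candidate is dropped because it has no PMID.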

Microbe–Disease Associations

12 signed microbe–disease pairs from Gemini self-consistency extraction

Each pair was extracted with Gemini 2.5 Flash-Lite using 7-sample self-consistency (full agreement required). Sign direction: positive (enrichment associated with disease) or negative (depletion / protective association).

Microbe | Direction | Disease | PMID | Confidence
Proteobacteria | POSITIVE | Cirrhosis | 39539377 | 0.70
Proteobacteria | POSITIVE | Cirrhosis | 35978666 | 0.70
Peptostreptococcus stomatis | POSITIVE | Cirrhosis | 36536957 | 0.70
Ruminococcus | POSITIVE | Cirrhosis | 36536957 | 0.70
Lactobacillus-based probiotics | NEGATIVE | Inflammatory bowel disease | 37998334 | 0.70
Bacteroidetes | NEGATIVE | Inflammatory bowel disease | 37998334 | 0.70
Bifidobacterium bifidum | NEGATIVE | Obesity | 36358288 | 0.70
Bifidobacterium lactis | NEGATIVE | Obesity | 36358288 | 0.70
Catenibacterium | NEGATIVE | Cirrhosis | 35978666 | 0.70
Actinobacteria species | NEGATIVE | Obesity | 35126309 | 0.70
Dysosmobacter | NEGATIVE | Obesity | 34108237 | 0.70
Lactobacillus-containing probiotic | NEGATIVE | Systemic inflammation | 33633246 | 0.70
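
The full-agreement rule behind these pairs can be sketched as follows. The example assumes a caller-supplied `classify` callable that returns one of the three labels per call; it stands in for the hosted model and is not a real Gemini API. The function name `self_consistent_label`, the returned fields, and the decision to drop unanimous "Unrelated" results are assumptions made for illustration.

```python
from collections import Counter

LABELS = ("Positive", "Negative", "Unrelated")


def self_consistent_label(classify, sentence, n_samples=7):
    """Sample the classifier n_samples times and accept only unanimous signed labels."""
    votes = [classify(sentence) for _ in range(n_samples)]
    unknown = [v for v in votes if v not in LABELS]
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(set(unknown))}")
    label, top_count = Counter(votes).most_common(1)[0]
    return {
        "label": label,
        "agreement": top_count / n_samples,
        # Full agreement is required; "Unrelated" never yields a signed pair.
        "accepted": top_count == n_samples and label != "Unrelated",
    }
```

With a stub that always answers "Positive", the call is accepted. With six "Positive" votes and one "Negative", the majority label is still "Positive", but the pair is rejected because agreement is not complete.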