FMCEB-CXR
A structured chest X-ray benchmark from a Nigerian tertiary hospital, labeled with a hybrid natural language pipeline and validated by consultant radiologists.
The dataset
FMCEB-CXR is built from 5,517 anonymized radiology reports from Federal Medical Centre Ebute-Metta in Lagos, covering 14 chest pathology categories. It is a single-site dataset today, designed to grow to further sites as the programme expands.
The 14 categories span common chest findings. Validation is in progress, and the full validated label distribution will be published with the dataset.
Publications and outputs
The programme is producing a series of papers. Each paper's status is reported as it stands and updated as the work progresses.
Paper 1 — the dataset and its labeling pipeline
This paper describes the dataset and its hybrid NLP labeling pipeline.
Paper 2 — NLP labeling methodology
A deeper treatment of the NLP labeling methodology. MIRASOL is the Medical Image Computing in Resource-Constrained Settings workshop.
Paper 3 — end of phase findings
Findings from the completed annotation and validation phase. Three or more papers are expected from this phase.
Methodology
Labels are extracted with a hybrid natural language processing (NLP) pipeline that combines a bilingual medical dictionary of 447 phrases with BioClinicalBERT, a clinical language model that we fine-tuned for this task. Negation is detected explicitly, and radiologists validate every label before any model is trained.
Label quality comes first. Modeling follows validation.
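To make the dictionary and negation steps concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the project's pipeline: the phrase entries, the negation cues, the `label_report` function, and the rule that an affirmed mention outranks a negated one are all hypothetical. The fine-tuned BioClinicalBERT component, and how its output is combined with the dictionary, are not shown.

```python
import re

# Hypothetical dictionary entries (phrase -> category). The actual
# 447-phrase bilingual dictionary is not reproduced here.
PHRASES = {
    "cardiomegaly": "Cardiomegaly",
    "enlarged heart": "Cardiomegaly",
    "pleural effusion": "Pleural Effusion",
    "consolidation": "Consolidation",
}

# Assumed negation cues, loosely in the style of rule-based negation detection.
NEGATION_CUES = ("no", "without", "no evidence of", "free of")


def label_report(text):
    """Return {category: 1 or 0} for every dictionary phrase found.

    1 means mentioned and affirmed, 0 means mentioned but negated.
    Categories that are never mentioned are left out of the result.
    """
    labels = {}
    # Treat each sentence separately so a cue cannot negate a later sentence.
    for sentence in re.split(r"[.;\n]+", text.lower()):
        for phrase, category in PHRASES.items():
            match = re.search(r"\b" + re.escape(phrase) + r"\b", sentence)
            if match is None:
                continue
            # A finding counts as negated if a cue precedes it in the sentence.
            prefix = sentence[:match.start()]
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", prefix)
                for cue in NEGATION_CUES
            )
            value = 0 if negated else 1
            # Assumption: an affirmed mention outranks a negated one.
            labels[category] = max(labels.get(category, 0), value)
    return labels


report = (
    "Cardiomegaly is noted. No pleural effusion. "
    "Lung fields are clear without consolidation."
)
print(label_report(report))
# {'Cardiomegaly': 1, 'Pleural Effusion': 0, 'Consolidation': 0}
```

In a workflow like the one described above, output of this kind would be a candidate label set for radiologists to review, not a final label.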
Team and collaborators