Chemical structure-label pair extraction from scientific documents.

structflo/structflo-cser

structflo.cser

Chemical structure and label extraction from scientific documents.

Installation | Quick Start | Step-by-Step | Matchers | Downstream Processing | Notebooks


structflo.cser extracts chemical structure–label pairs from images and PDF pages. It uses a fine-tuned YOLO detector, trained on synthetic chemical structure data, to locate structures and compound labels on a page, then pairs them using either a Learned Pair Scorer (LPS) model or a simpler Hungarian matcher.

The extracted crops can be passed to any structure-to-SMILES converter (DECIMER, MolScribe) and any OCR engine for label text. DECIMER and EasyOCR are bundled for convenience, but any downstream tools can be swapped in.

Two-step process:

  1. Detect — A fine-tuned YOLO detector finds all chemical structures and compound labels in the image
  2. Match — A matcher pairs each structure with its corresponding label, producing cropped image pairs

| | LearnedMatcher (default) | HungarianMatcher |
|---|---|---|
| Approach | Neural Pair Scorer (LPS) | Geometric (centroid distance) |
| Setup | Auto-downloads weights | Zero config |
| Speed | Fast (GPU accelerated) | Instantaneous |
| Accuracy | Better for complex or crowded pages | Good for simple layouts |
| Output | CompoundPair | CompoundPair (identical) |

Installation

pip install structflo-cser
# or with uv
uv add structflo-cser

This also installs DECIMER and EasyOCR for downstream SMILES and text extraction. The core pipeline does not depend on them — any extractor implementation can be swapped in.

Quick Start

One call from image to (SMILES, label) pairs:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.lps import LearnedMatcher

pipeline = ChemPipeline(matcher=LearnedMatcher())
results = pipeline.process("page.png")

for pair in results:
    print(pair.smiles, pair.label_text)

Weights for both the detector and the LPS are auto-downloaded from HuggingFace Hub on first use.

Export to a pandas DataFrame or JSON:

df   = ChemPipeline.to_dataframe(results)
data = ChemPipeline.to_json(results)
   match_distance  match_confidence                              smiles     label_text
0          135.19            0.9844  CN1CCC2=C(C1)SC(=N2)C(=O)NC3=...      7178-39-6
1          208.40            0.9973  C1=CC(=CC=C1C2=C(C(=O)O)N=NN2...     72804-12-9
2          126.25            0.9997  COC1=CC=C(C=C1)C=C2C(=O)N(C3=...   ZINC2978 720
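
As a rough illustration of post-filtering the exported rows, the sketch below keeps only pairs above a confidence cut-off. It assumes each record is a plain dict keyed by the column names shown above; the exact shape returned by to_json is not documented here, so treat the record layout and the helper name filter_confident as hypothetical. The SMILES column is left out because the sample output truncates it.

```python
# Rows copied from the sample output above (SMILES omitted: truncated in the sample).
records = [
    {"match_distance": 135.19, "match_confidence": 0.9844, "label_text": "7178-39-6"},
    {"match_distance": 208.40, "match_confidence": 0.9973, "label_text": "72804-12-9"},
    {"match_distance": 126.25, "match_confidence": 0.9997, "label_text": "ZINC2978 720"},
]


def filter_confident(records, min_confidence):
    """Hypothetical helper: keep records with match_confidence >= min_confidence."""
    return [r for r in records if r["match_confidence"] >= min_confidence]


kept = filter_confident(records, 0.99)
print([r["label_text"] for r in kept])
# ['72804-12-9', 'ZINC2978 720']
```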

PDF input

For PDFs, use process_pdf() — it renders each page and returns one result list per page:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.lps import LearnedMatcher

pipeline = ChemPipeline(matcher=LearnedMatcher())

# Returns list[list[CompoundPair]] — one inner list per page
all_pages = pipeline.process_pdf("paper.pdf")

for page_num, pairs in enumerate(all_pages):
    print(f"Page {page_num + 1}: {len(pairs)} compound pairs")
    for pair in pairs:
        print(f"  {pair.label_text:20s}  {pair.smiles}")
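
Because the result is nested per page, it can be convenient to flatten it into one list of rows tagged with a page number. The sketch below shows one way to do that. PairStub is a hypothetical stand-in for CompoundPair that carries only the two attributes used in this README (smiles and label_text), and flatten_pages is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class PairStub:
    """Hypothetical stand-in for CompoundPair (only smiles and label_text)."""
    smiles: str
    label_text: str


def flatten_pages(all_pages):
    """Turn a list of per-page pair lists into flat rows with a 1-based page number."""
    rows = []
    for page_num, pairs in enumerate(all_pages, start=1):
        for pair in pairs:
            rows.append(
                {"page": page_num, "label_text": pair.label_text, "smiles": pair.smiles}
            )
    return rows


all_pages = [
    [PairStub("CCO", "1a"), PairStub("CCN", "1b")],
    [],  # a page with no detected pairs contributes no rows
    [PairStub("c1ccccc1", "2a")],
]
rows = flatten_pages(all_pages)
print([(r["page"], r["label_text"]) for r in rows])
# [(1, '1a'), (1, '1b'), (3, '2a')]
```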

Pass output_pdf to save an annotated copy with bounding boxes and extracted data overlaid:

pipeline.process_pdf("paper.pdf", output_pdf="paper_annotated.pdf")

Step-by-Step Pipeline

For finer control, each stage is exposed individually.

1. Create the pipeline

from structflo.cser.pipeline import ChemPipeline

# Default: LearnedMatcher — auto-downloads LPS weights on first use
pipeline = ChemPipeline(tile=False, conf=0.70)

For a heuristic-based approach, use HungarianMatcher:

from structflo.cser.pipeline import ChemPipeline, HungarianMatcher

pipeline = ChemPipeline(
    tile=False,
    conf=0.70,
    matcher=HungarianMatcher(max_distance=500),
)

The pipeline is lazy — detector weights, DECIMER, and EasyOCR are loaded on first use only.
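
Lazy loading of this kind is commonly built on a cached property: the expensive object is created on first access and reused afterwards. The sketch below illustrates the general pattern only. LazyPipelineSketch and its attributes are hypothetical and are not taken from structflo.cser's source.

```python
from functools import cached_property


class LazyPipelineSketch:
    """Hypothetical illustration of lazy loading; not structflo.cser's code."""

    def __init__(self):
        self.load_count = 0  # counts how often the expensive load ran

    @cached_property
    def detector(self):
        # Stand-in for loading model weights; runs once, on first access.
        self.load_count += 1
        return object()


p = LazyPipelineSketch()
print(p.load_count)  # 0: nothing is loaded at construction time
first = p.detector
second = p.detector
print(p.load_count, first is second)  # 1 True
```

Constructing the object stays cheap, and the cost of loading is paid only by code paths that actually touch the component.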

2. Detect

detections = pipeline.detect("page.png")

n_struct = sum(1 for d in detections if d.class_id == 0)
n_label  = sum(1 for d in detections if d.class_id == 1)
print(f"Found {n_struct} structures and {n_label} labels")
# Found 6 structures and 6 labels

class_id 0 is a chemical structure; class_id 1 is a compound label.

3. Match

pairs = pipeline.match(detections)
# Matched 6 structure–label pairs
#   Pair 0: distance=135px  structure@(490,421)  label@(489,285)
#   Pair 1: distance=208px  structure@(258,194)  label@(466,195)

4. Visualise

from structflo.cser.viz import plot_detections, plot_pairs, plot_crops, plot_results

# Assumes img holds the loaded page image and results holds the enriched pairs
# (for example, the output of pipeline.process).
fig = plot_detections(img, detections)   # green = structure, blue = label
fig = plot_pairs(img, pairs)             # orange lines connect matched pairs
fig = plot_crops(img, pairs)             # cropped structure and label regions
fig = plot_results(img, results)         # final annotated output

Detection and pairing visualisation

5. Enrich — SMILES and label text

enriched = pipeline.enrich(pairs, "page.png")

for i, p in enumerate(enriched):
    print(f"Pair {i}:")
    print(f"  SMILES:     {p.smiles}")
    print(f"  Label text: {p.label_text}")
Pair 0:
  SMILES:     CN1CCC2=C(C1)SC(=N2)C(=O)NC3=C(C=CC=C3)CNC(=O)C4=CC=CC(=C4)Cl
  Label text: 7178-39-6

Pair 1:
  SMILES:     C1=CC(=CC=C1C2=C(C(=O)O)N=NN2C3=CC=C(C=C3)S(=O)(=O)N)Br
  Label text: 72804-12-9

Matchers

Learned Pair Scorer — LearnedMatcher (default)

A neural matcher trained to score structure–label compatibility using both visual crops and geometric features. It replaces the raw distance cost matrix with a learned association probability, then solves global assignment with the Hungarian algorithm.

Weights are auto-downloaded from HuggingFace Hub on first use — no manual setup needed. Models are hosted at:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.lps import LearnedMatcher

pipeline = ChemPipeline(
    matcher=LearnedMatcher(
        min_score=0.5,      # drop pairs below this confidence
        max_dist_px=None,   # optional centroid pre-filter to save compute
    )
)

min_score — pairs scoring below this threshold are discarded as unlabelled structures.
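
To make the score-then-assign idea concrete, here is a minimal sketch under stated assumptions. The score matrix is hand-written rather than produced by a model, the objective (maximise the summed score) is an assumption about how the probabilities are turned into a cost, and a brute-force search stands in for the Hungarian algorithm, which reaches the same optimum but scales to real pages. assign_by_score is a hypothetical helper, not the library's API.

```python
from itertools import permutations


def assign_by_score(scores, min_score=0.5):
    """Pick the one-to-one assignment with the highest total score.

    scores[i][j] is a (hypothetical) probability that structure i belongs
    to label j. Assumes there are at least as many labels as structures.
    Pairs scoring below min_score are dropped after assignment.
    """
    n_struct = len(scores)
    n_label = len(scores[0]) if scores else 0
    best, best_total = (), float("-inf")
    # Brute force over all assignments: fine for a handful of items only.
    for perm in permutations(range(n_label), n_struct):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return [(i, j, scores[i][j]) for i, j in enumerate(best) if scores[i][j] >= min_score]


scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.1],
    [0.1, 0.2, 0.3],
]
print(assign_by_score(scores, min_score=0.5))
# [(0, 0, 0.9), (1, 1, 0.8)]
```

The third structure is assigned to the third label by the global optimum, but its score of 0.3 falls under min_score, so it is reported as unlabelled.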

Hungarian Matcher — HungarianMatcher (fallback)

Pairs structures and labels by minimising the total centroid-to-centroid distance. It needs no configuration and no weights download, which makes it useful for simple document layouts or as a fast sanity check.

from structflo.cser.pipeline import ChemPipeline, HungarianMatcher

pipeline = ChemPipeline(
    matcher=HungarianMatcher(max_distance=500),
)

max_distance — maximum pixel distance for a valid pair. Increase for large pages; reduce to avoid false pairings on dense layouts.
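
The sketch below shows why a global assignment can differ from picking each structure's nearest label, and how a distance cap removes implausible pairs. It is an illustration, not the library's implementation: centroids are plain (x, y) tuples, brute force stands in for the Hungarian algorithm, and applying max_distance after the assignment (rather than before) is an assumption. match_by_centroid is a hypothetical helper.

```python
from itertools import permutations
from math import hypot


def match_by_centroid(structures, labels, max_distance=500):
    """Pair centroids so the summed distance is minimal, then apply the cap.

    structures and labels are lists of (x, y) centroids. Assumes there are
    at least as many labels as structures.
    """
    dist = [[hypot(sx - lx, sy - ly) for lx, ly in labels] for sx, sy in structures]
    best, best_total = (), float("inf")
    # Brute force over all assignments: same optimum as Hungarian, small inputs only.
    for perm in permutations(range(len(labels)), len(structures)):
        total = sum(dist[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return [(i, j, dist[i][j]) for i, j in enumerate(best) if dist[i][j] <= max_distance]


structures = [(0, 0), (100, 0)]
labels = [(60, 0), (0, 90)]

# Structure 0 is nearest to label 0 (60 px), but the global optimum gives
# label 0 to structure 1 (40 px) and label 1 to structure 0 (90 px).
print(match_by_centroid(structures, labels))
# [(0, 1, 90.0), (1, 0, 40.0)]

# A tighter cap discards the 90 px pair.
print(match_by_centroid(structures, labels, max_distance=50))
# [(1, 0, 40.0)]
```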

Downstream Processing

structflo.cser outputs cropped image pairs. Plug in any converter for SMILES and any OCR for label text.

SMILES extraction

DECIMER is bundled by default. Swap for MolScribe or any custom BaseSmilesExtractor:

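from structflo.cser.pipeline import ChemPipeline
from structflo.cser.pipeline.smiles_extractor import BaseSmilesExtractor

class MyExtractor(BaseSmilesExtractor):
    def extract(self, image) -> str:
        # my_model is a placeholder for your own structure-to-SMILES model
        return my_model.predict(image)

pipeline = ChemPipeline(smiles_extractor=MyExtractor())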

OCR

EasyOCR is bundled by default. Swap for any custom BaseOCR:

from structflo.cser.pipeline import ChemPipeline
from structflo.cser.pipeline.ocr import BaseOCR

class MyOCR(BaseOCR):
    def extract(self, image) -> str:
        # my_ocr is a placeholder for your own OCR engine
        return my_ocr.read(image)

pipeline = ChemPipeline(ocr=MyOCR())

CLI

Run extraction directly from the terminal:

# Detect and pair structures/labels in a directory of images
sf-detect --image_dir data/test_images/ --conf 0.60 --no_tile --pair --max_dist 500

# Full pipeline: detect → match → SMILES + OCR
sf-extract page.png

All available commands:

| Command | Description |
|---|---|
| sf-detect | Run YOLO detection on images |
| sf-extract | Full pipeline: detect → match → extract |
| sf-generate | Generate synthetic training data |
| sf-train | Train the YOLO detection model |
| sf-train-lps | Train the Learned Pair Scorer |
| sf-eval-lps | Evaluate LPS on a test set |
| sf-fetch-smiles | Download SMILES from ChEMBL |
| sf-download-distractors | Download distractor images for generation |
| sf-annotate | Launch the web annotation server |

Notebooks

| Notebook | Description |
|---|---|
| 01-quickstart.ipynb | Step-by-step pipeline walkthrough: detect → match → enrich, then one-call convenience API |
| 02-LPS.ipynb | Using the Learned Pair Scorer for improved matching on complex document pages |

License

Apache License 2.0
