diff --git a/README.md b/README.md
index edfb7ee..01f0145 100644
--- a/README.md
+++ b/README.md
@@ -1,258 +1,305 @@
-# structflo-cser
+# structflo.cser
+
+[Image: structflo.cser — detection and pairing example]
+
+Chemical structure and label extraction from scientific documents.
+
+Installation • Quick Start • Step-by-Step • Matchers • Downstream Processing • Notebooks
+
-YOLO11l-based detector for chemical structures and their compound label IDs in scientific documents.
+---

-Part of the **structflo** library. Import as:
-```python
-from structflo.cser.pipeline import ChemPipeline
-```
+**structflo.cser** extracts chemical structure–label pairs from images and PDF pages. It uses a fine-tuned YOLO detector trained on synthetic chemical structure data to locate structures and compound labels on a page, then pairs them using a Learned Pair Scorer (LPS) model or a simpler Hungarian matcher.

-**Detection target:** A single bounding box (`compound_panel`) enclosing the union of a rendered chemical structure and its nearby label ID (e.g. `CHEMBL12345`).
+The extracted crops can be passed to any structure-to-SMILES converter (DECIMER, MolScribe) and any OCR engine for label text. DECIMER and EasyOCR are bundled for convenience, but any downstream tools can be swapped in.

----
+**Two-step process:**
+
+1. **Detect** — A fine-tuned YOLO detector finds all chemical structures and compound labels in the image
+2. **Match** — A matcher pairs each structure with its corresponding label, producing cropped image pairs
+
+|          | `LearnedMatcher` (default)          | `HungarianMatcher`            |
+| -------- | ----------------------------------- | ----------------------------- |
+| Approach | Learned Pair Scorer (LPS)           | Geometric (centroid distance) |
+| Setup    | Auto-downloads weights              | Zero config                   |
+| Speed    | Fast (GPU accelerated)              | Instantaneous                 |
+| Accuracy | Better for complex or crowded pages | Good for simple layouts       |
+| Output   | `CompoundPair`                      | `CompoundPair` (identical)    |

 ## Installation

 ```bash
-uv pip install -e .
+pip install structflo-cser
 ```

-This installs all dependencies and registers the `sf-*` CLI commands on your PATH.
+```bash
+# or with uv
+uv add structflo-cser
+```

----
+This also installs DECIMER and EasyOCR for downstream SMILES and text extraction. The core pipeline does not depend on them — any extractor implementation can be swapped in.

-## Pipeline
+## Quick Start

-```
-1. Fetch SMILES          → sf-fetch-smiles
-2. Download distractors  → sf-download-distractors  (optional but recommended)
-3. Generate dataset      → sf-generate
-4. Visualize labels      → sf-viz                   (optional QA check)
-5. Train YOLO            → sf-train
-6. Run inference         → sf-detect
-7. Annotate real PDFs    → sf-annotate              (optional)
-```
+One call from image to `(SMILES, label)` pairs:

----
+```python
+from structflo.cser.pipeline import ChemPipeline
+from structflo.cser.lps import LearnedMatcher

-## Commands
+pipeline = ChemPipeline(matcher=LearnedMatcher())
+results = pipeline.process("page.png")

-### 1. Fetch SMILES from ChEMBL
+for pair in results:
+    print(pair.smiles, pair.label_text)
+```

-Extracts ~20 k small-molecule SMILES from a local [ChEMBL SQLite database](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/).
+Weights for both the detector and the LPS are auto-downloaded from HuggingFace Hub on first use.

-```bash
-sf-fetch-smiles \
-  --db chembl_35/chembl_35_sqlite/chembl_35.db \
-  --output data/smiles/chembl_smiles.csv \
-  --n 20000
+Export to a pandas DataFrame or JSON:
+
+```python
+df = ChemPipeline.to_dataframe(results)
+data = ChemPipeline.to_json(results)
+```
+
+```
+   match_distance  match_confidence                            smiles    label_text
+0          135.19            0.9844  CN1CCC2=C(C1)SC(=N2)C(=O)NC3=...     7178-39-6
+1          208.40            0.9973  C1=CC(=CC=C1C2=C(C(=O)O)N=NN2...    72804-12-9
+2          126.25            0.9997  COC1=CC=C(C=C1)C=C2C(=O)N(C3=...  ZINC2978 720
 ```

-Output: `data/smiles/chembl_smiles.csv`
+### PDF input

----
+For PDFs, use `process_pdf()` — it renders each page and returns one result list per page:
+
+```python
+from structflo.cser.pipeline import ChemPipeline
+from structflo.cser.lps import LearnedMatcher

-### 2. Download distractor images
+pipeline = ChemPipeline(matcher=LearnedMatcher())

-Downloads real photographs from [Lorem Picsum](https://picsum.photos/) to use as hard-negative distractors during page generation.
+# Returns list[list[CompoundPair]] — one inner list per page
+all_pages = pipeline.process_pdf("paper.pdf")

-```bash
-sf-download-distractors --out data/distractors --count 1000
+for page_num, pairs in enumerate(all_pages):
+    print(f"Page {page_num + 1}: {len(pairs)} compound pairs")
+    for pair in pairs:
+        print(f" {pair.label_text:20s} {pair.smiles}")
 ```

----
+Pass `output_pdf` to save an annotated copy with bounding boxes and extracted data overlaid:

-### 3. Generate synthetic dataset
+```python
+pipeline.process_pdf("paper.pdf", output_pdf="paper_annotated.pdf")
+```

-Generates document-like pages (A4 @ 300 DPI or slide format) containing chemical structures, compound labels, and distractor elements.
+## Step-by-Step Pipeline

-```bash
-sf-generate \
-  --smiles data/smiles/chembl_smiles.csv \
-  --out data/generated \
-  --num-train 2000 --num-val 400 \
-  --fonts-dir data/fonts \
-  --distractors-dir data/distractors \
-  --dpi 96,144,200,300 \
-  --workers 0
-```
+For finer control, each stage is exposed individually.

-Key options:
+### 1. Create the pipeline

-| Flag | Default | Description |
-|------|---------|-------------|
-| `--num-train` | 2000 | Number of training pages |
-| `--num-val` | 200 | Number of validation pages |
-| `--dpi` | `96,144,200,300` | DPI values randomly sampled per page |
-| `--grayscale` / `--no-grayscale` | on | Convert pages to grayscale |
-| `--workers` | 0 (all CPUs) | Parallel workers; use `1` to disable multiprocessing |
+```python
+from structflo.cser.pipeline import ChemPipeline

-**Output structure:**
-```
-data/generated/
-├── train/
-│   ├── images/        (JPEG pages)
-│   ├── labels/        (YOLO .txt — union bbox per compound panel)
-│   └── ground_truth/  (JSON with split struct_bbox / label_bbox / smiles)
-└── val/
-    ├── images/
-    ├── labels/
-    └── ground_truth/
+# Default: LearnedMatcher — auto-downloads LPS weights on first use
+pipeline = ChemPipeline(tile=False, conf=0.70)
 ```

----
-
-### 4. Visualize labels (QA)
+For a heuristic-based approach, use `HungarianMatcher`:

-Overlays YOLO bounding boxes on a random sample of generated pages.
+```python
+from structflo.cser.pipeline import ChemPipeline, HungarianMatcher

-```bash
-sf-viz --split both --n 30 --out data/viz
+pipeline = ChemPipeline(
+    tile=False,
+    conf=0.70,
+    matcher=HungarianMatcher(max_distance=500),
+)
 ```

-Green boxes = `chemical_structure`, blue boxes = `compound_label`.
+The pipeline is lazy — detector weights, DECIMER, and EasyOCR are loaded on first use only.

----
+### 2. Detect

-### 5. Train
+```python
+detections = pipeline.detect("page.png")

-Fine-tunes YOLO11l on the generated dataset.
+n_struct = sum(1 for d in detections if d.class_id == 0)
+n_label = sum(1 for d in detections if d.class_id == 1)
+print(f"Found {n_struct} structures and {n_label} labels")
+# Found 6 structures and 6 labels
+```

-```bash
-sf-train --epochs 50 --imgsz 1280 --batch 8
+`class_id=0` = chemical structure | `class_id=1` = compound label
+
+### 3. Match
+
+```python
+pairs = pipeline.match(detections)
+# Matched 6 structure–label pairs
+# Pair 0: distance=135px structure@(490,421) label@(489,285)
+# Pair 1: distance=208px structure@(258,194) label@(466,195)
 ```

-Key options:
+### 4. Visualise

-| Flag | Default | Description |
-|------|---------|-------------|
-| `--weights` | `yolo11l.pt` | Pretrained backbone |
-| `--imgsz` | 1280 | Training resolution |
-| `--batch` | 8 | Batch size (safe for A6000 48 GB) |
-| `--resume` | — | Path to `last.pt` to resume an interrupted run |
+```python
+from structflo.cser.viz import plot_detections, plot_pairs, plot_crops, plot_results

-**Output:** `runs/labels_detect/yolo11l_panels/weights/best.pt`
+fig = plot_detections(img, detections)  # green = structure, blue = label
+fig = plot_pairs(img, pairs)            # orange lines connect matched pairs
+fig = plot_crops(img, pairs)            # cropped structure and label regions
+fig = plot_results(img, results)        # final annotated output
+```

----
+![Detection and pairing visualisation](docs/images/example-2.png)

-### 6. Detect
+### 5. Enrich — SMILES and label text

-Runs the trained detector on images using sliding-window tiling (1536 px tiles, 20 % overlap).
+```python
+enriched = pipeline.enrich(pairs, "page.png")

-```bash
-# Single image
-sf-detect --image page.png
+for i, p in enumerate(enriched):
+    print(f"Pair {i}:")
+    print(f" SMILES: {p.smiles}")
+    print(f" Label text: {p.label_text}")
+```

-# Directory of images
-sf-detect --image_dir data/real/images/ --out detections/
+```
+Pair 0:
+ SMILES: CN1CCC2=C(C1)SC(=N2)C(=O)NC3=C(C=CC=C3)CNC(=O)C4=CC=CC(=C4)Cl
+ Label text: 7178-39-6

-# With Hungarian pairing of structures → labels
-sf-detect --image page.png --pair --max_dist 300
+Pair 1:
+ SMILES: C1=CC(=CC=C1C2=C(C(=O)O)N=NN2C3=CC=C(C=C3)S(=O)(=O)N)Br
+ Label text: 72804-12-9
 ```

-Key options:
+## Matchers

-| Flag | Default | Description |
-|------|---------|-------------|
-| `--weights` | `runs/.../best.pt` | Model weights |
-| `--conf` | 0.3 | Confidence threshold |
-| `--tile_size` | 1536 | Tile size in pixels |
-| `--no_tile` | off | Run on full image (skips tiling) |
-| `--grayscale` | off | Convert to grayscale before detection |
-| `--pair` | off | Hungarian match structures → labels |
+### Learned Pair Scorer — `LearnedMatcher` (default)

----
+A neural matcher trained to score structure–label compatibility using both visual crops and geometric features. It replaces the raw distance cost matrix with a learned association probability, then solves global assignment with the Hungarian algorithm.

-### 7. Annotate real PDFs (optional)
+Weights are auto-downloaded from HuggingFace Hub on first use — no manual setup needed. Models are hosted at:

-Web-based annotation tool for creating ground truth from real PDF documents.
+- Detector: [huggingface.co/sidxz/structflo-cser-detector](https://huggingface.co/sidxz/structflo-cser-detector)
+- LPS scorer: [huggingface.co/sidxz/structflo-cser-lps](https://huggingface.co/sidxz/structflo-cser-lps)

-```bash
-sf-annotate --out data/real --port 8000
-# then open http://127.0.0.1:8000 in a browser
+```python
+from structflo.cser.pipeline import ChemPipeline
+from structflo.cser.lps import LearnedMatcher
+
+pipeline = ChemPipeline(
+    matcher=LearnedMatcher(
+        min_score=0.5,     # drop pairs below this confidence
+        max_dist_px=None,  # optional centroid pre-filter to save compute
+    )
+)
 ```

----
+`min_score` — pairs scoring below this threshold are discarded as unlabelled structures.

-## Package layout
+### Hungarian Matcher — `HungarianMatcher` (fallback)

-```
-structflo/cser/                   # importable package (from structflo.cser import ...)
-├── _geometry.py                  # shared bbox utilities (boxes_intersect, try_place_box)
-├── config.py                     # PageConfig dataclass + make_page_config()
-├── data/
-│   ├── smiles.py                 # load_smiles(), fetch_smiles_from_chembl_sqlite()
-│   └── distractor_images.py      # load_distractor_images(), download_picsum()
-├── rendering/
-│   ├── chemistry.py              # render_structure(), place_structure()
-│   └── text.py                   # draw_rotated_text(), add_label_near_structure(), load_font()
-├── distractors/
-│   ├── charts.py                 # bar / scatter / line / pie chart generators
-│   ├── shapes.py                 # geometric shapes, noise patches, gradients
-│   └── text_elements.py          # prose blocks, captions, footnotes, arrows, tables
-├── generation/
-│   ├── page.py                   # make_page(), make_negative_page(), apply_noise()
-│   └── dataset.py                # generate_dataset(), save_sample(), CLI entry point
-├── training/
-│   └── trainer.py                # train(), CLI entry point
-├── inference/
-│   ├── tiling.py                 # generate_tiles()
-│   ├── nms.py                    # nms()
-│   ├── pairing.py                # pair_detections() via Hungarian matching
-│   └── detector.py               # detect_tiled(), detect_full(), draw_boxes(), CLI
-└── viz/
-    └── labels.py                 # visualize_split(), draw_boxes(), CLI entry point
-
-annotate/                         # Flask annotation tool (unchanged)
-config/
-├── data.yaml                     # YOLO dataset paths
-└── pipeline.yaml
-data/                             # data files (gitignored)
-runs/                             # training checkpoints (gitignored)
+Pairs structures and labels by minimising total centroid-to-centroid distance. Zero config, zero weights download. Useful for simple document layouts or as a fast sanity check.
+
+```python
+from structflo.cser.pipeline import ChemPipeline, HungarianMatcher
+
+pipeline = ChemPipeline(
+    matcher=HungarianMatcher(max_distance=500),
+)
 ```

----
+`max_distance` — maximum pixel distance for a valid pair. Increase for large pages; reduce to avoid false pairings on dense layouts.

-## Data directory layout
+## Downstream Processing

-```
-data/
-├── smiles/
-│   └── chembl_smiles.csv   # ~20 k SMILES from ChEMBL
-├── fonts/                  # TTF/OTF fonts for label rendering
-├── distractors/            # ~1 k real photos (sf-download-distractors output)
-├── generated/              # synthetic dataset (sf-generate output)
-│   ├── train/
-│   └── val/
-└── real/                   # manually annotated real pages (sf-annotate output)
-    ├── images/
-    ├── labels/
-    └── ground_truth/
+**structflo.cser** outputs cropped image pairs. Plug in any converter for SMILES and any OCR for label text.
+
+### SMILES extraction
+
+DECIMER is bundled by default. Swap for MolScribe or any custom `BaseSmilesExtractor`:
+
+```python
+from structflo.cser.pipeline import ChemPipeline
+from structflo.cser.pipeline.smiles_extractor import BaseSmilesExtractor
+
+class MyExtractor(BaseSmilesExtractor):
+    def extract(self, image) -> str:
+        # my_model is a placeholder for your own structure-to-SMILES model
+        return my_model.predict(image)
+
+pipeline = ChemPipeline(smiles_extractor=MyExtractor())
 ```

----
+### OCR
+
+EasyOCR is bundled by default. Swap for any custom `BaseOCR`:

-## YOLO label format
+```python
+from structflo.cser.pipeline import ChemPipeline
+from structflo.cser.pipeline.ocr import BaseOCR

-Each `.txt` label file contains one line per annotated object:
+class MyOCR(BaseOCR):
+    def extract(self, image) -> str:
+        # my_ocr is a placeholder for your own OCR engine
+        return my_ocr.read(image)

+pipeline = ChemPipeline(ocr=MyOCR())
 ```
- (all normalised to [0, 1])
+
+## CLI
+
+Run extraction directly from the terminal:
+
+```bash
+# Detect and pair structures/labels in a directory of images
+sf-detect --image_dir data/test_images/ --conf 0.60 --no_tile --pair --max_dist 500
+
+# Full pipeline: detect → match → SMILES + OCR
+sf-extract page.png
 ```

-| class_id | name |
-|----------|------|
-| 0 | chemical_structure |
-| 1 | compound_label |
+All available commands:

-Ground-truth JSON files in `ground_truth/` contain raw pixel coordinates plus `smiles` and `label_text` for downstream analysis.
+| Command                   | Description                               |
+| ------------------------- | ----------------------------------------- |
+| `sf-detect`               | Run YOLO detection on images              |
+| `sf-extract`              | Full pipeline: detect → match → extract   |
+| `sf-generate`             | Generate synthetic training data          |
+| `sf-train`                | Train the YOLO detection model            |
+| `sf-train-lps`            | Train the Learned Pair Scorer             |
+| `sf-eval-lps`             | Evaluate LPS on a test set                |
+| `sf-fetch-smiles`         | Download SMILES from ChEMBL               |
+| `sf-download-distractors` | Download distractor images for generation |
+| `sf-annotate`             | Launch the web annotation server          |

----
+## Notebooks
+
+| Notebook | Description |
+| -------- | ----------- |
+| [01-quickstart.ipynb](notebooks/01-quickstart.ipynb) | Step-by-step pipeline walkthrough: detect → match → enrich, then one-call convenience API |
+| [02-LPS.ipynb](notebooks/02-LPS.ipynb) | Using the Learned Pair Scorer for improved matching on complex document pages |

-## Key design decisions
+## License

-- **Union bounding box** — each compound panel is annotated as the union of structure + label (1 class for YOLO). The GT JSON preserves the individual boxes.
-- **No horizontal flips** — chemical handedness matters; `fliplr=0` is enforced during training.
-- **15 % negative pages** — pages with no structures teach the model to output nothing for non-chemistry content.
-- **Multi-DPI generation** — pages at {96, 144, 200, 300} DPI create scale variance, improving robustness to different scanning resolutions.
-- **Tiled inference** — A4 pages (2480 × 3508 px) are tiled into 1536 px chunks with 20 % overlap to stay within GPU memory.
+Apache License 2.0
diff --git a/docs/images/example-1.png b/docs/images/example-1.png
new file mode 100644
index 0000000..6f7226b
Binary files /dev/null and b/docs/images/example-1.png differ
diff --git a/docs/images/example-2.png b/docs/images/example-2.png
new file mode 100644
index 0000000..650e957
Binary files /dev/null and b/docs/images/example-2.png differ
diff --git a/structflo/cser/pipeline/pipeline.py b/structflo/cser/pipeline/pipeline.py
index cbca452..77b7ce6 100644
--- a/structflo/cser/pipeline/pipeline.py
+++ b/structflo/cser/pipeline/pipeline.py
@@ -12,7 +12,8 @@

 from structflo.cser.inference.detector import detect_full, detect_tiled
 from structflo.cser.weights import resolve_weights
-from structflo.cser.pipeline.matcher import BaseMatcher, HungarianMatcher
+from structflo.cser.lps import LearnedMatcher
+from structflo.cser.pipeline.matcher import BaseMatcher
 from structflo.cser.pipeline.models import BBox, CompoundPair, Detection
 from structflo.cser.pipeline.ocr import BaseOCR, EasyOCRExtractor
 from structflo.cser.pipeline.smiles_extractor import (
@@ -76,7 +77,8 @@ def __init__(
             weights: Weights version tag (e.g. ``"v1.0"``) or path to a local
                 ``.pt`` file. ``None`` auto-downloads the latest
                 published weights.
-            matcher: Pairing strategy. Defaults to HungarianMatcher.
+            matcher: Pairing strategy. Defaults to LearnedMatcher
+                (auto-downloads weights from HuggingFace Hub).
             smiles_extractor: SMILES model. Defaults to DecimerExtractor.
-            ocr: OCR engine. Defaults to PaddleOCRExtractor.
+            ocr: OCR engine. Defaults to EasyOCRExtractor.
             tile: Use sliding-window tiling during detection.
@@ -86,7 +88,7 @@ def __init__(
                 Defaults to True to match training data distribution.
         """
         self._weights = weights  # version tag, local path str/Path, or None
-        self._matcher = matcher or HungarianMatcher()
+        self._matcher = matcher or LearnedMatcher()
         self._smiles = smiles_extractor or DecimerExtractor()
         self._ocr = ocr or EasyOCRExtractor()
         self.tile = tile
@@ -190,6 +192,64 @@ def process(self, image: ImageLike) -> list[CompoundPair]:
         pairs = self.match(detections, image=img)
         return self.enrich(pairs, img)

+    def process_pdf(
+        self,
+        pdf_path: Path | str,
+        *,
+        dpi: int = 150,
+        output_pdf: Path | str | None = None,
+    ) -> list[list[CompoundPair]]:
+        """Run the full pipeline on every page of a PDF.
+
+        Pages are processed one at a time so memory usage stays bounded
+        regardless of document length.
+
+        Args:
+            pdf_path: Path to the input PDF.
+            dpi: Rendering resolution. 150 dpi works well for typical
+                journal pages; use 200-300 for small or dense text.
+            output_pdf: Optional path for an annotated output PDF. When given,
+                each page is rendered with bounding boxes, pairing
+                lines, and extracted SMILES / label text, then saved
+                as a multi-page PDF.
+
+        Returns:
+            A list with one entry per page; each entry is a list of
+            ``CompoundPair`` objects with ``smiles`` and ``label_text``
+            populated.
+        """
+        import fitz  # pymupdf — required dependency
+
+        doc = fitz.open(str(pdf_path))
+        mat = fitz.Matrix(dpi / 72, dpi / 72)
+        all_results: list[list[CompoundPair]] = []
+
+        if output_pdf is not None:
+            import matplotlib.pyplot as plt
+            from matplotlib.backends.backend_pdf import PdfPages
+            from structflo.cser.viz import plot_results
+
+            pdf_out: PdfPages | None = PdfPages(str(output_pdf))
+        else:
+            pdf_out = None
+
+        try:
+            for page in doc:
+                pix = page.get_pixmap(matrix=mat, colorspace=fitz.csRGB)
+                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
+                pairs = self.process(img)
+                all_results.append(pairs)
+                if pdf_out is not None:
+                    fig = plot_results(img, pairs)
+                    pdf_out.savefig(fig, bbox_inches="tight")
+                    plt.close(fig)
+        finally:
+            doc.close()
+            if pdf_out is not None:
+                pdf_out.close()
+
+        return all_results
+
     # ------------------------------------------------------------------
     # Output helpers (static — can also be called on the class directly)
     # ------------------------------------------------------------------
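
Note on the `fitz.Matrix(dpi / 72, dpi / 72)` call in `process_pdf`: PDF page coordinates are measured in points (1/72 inch), so `dpi / 72` is the zoom factor that turns points into pixels. The sketch below is not part of this diff. It uses only the standard library to estimate how large each rendered page gets for a given `dpi`, which is useful when choosing between 150 and 300. The helper names are hypothetical, and the rounding is an assumption that may differ from the real renderer by a pixel.

```python
import math

POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_for_dpi(dpi: int) -> float:
    """Scale factor mapping PDF points to pixels at the given DPI."""
    return dpi / POINTS_PER_INCH


def approx_pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate raster size of a page (hypothetical helper, rounds up)."""
    zoom = zoom_for_dpi(dpi)
    return math.ceil(width_pt * zoom), math.ceil(height_pt * zoom)


# An A4 page is roughly 595 x 842 points.
for dpi in (150, 200, 300):
    width_px, height_px = approx_pixel_size(595, 842, dpi)
    print(f"{dpi} dpi -> about {width_px} x {height_px} px")
```

Doubling `dpi` roughly quadruples the pixel count per page, so higher values trade memory and detection time for legibility of small labels.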