inconsistent behaviour due to loading order #32

@jbecherer

Bug report: MultiDBD.get() returns fewer data points when a second MultiDBD instance (different file type) is created first in the same Python session

Library: dbdreader v0.5.8
Python: 3.14.2
OS: Linux


Summary

Creating a MultiDBD instance for .dbd (flight-computer) files and calling .get() on it before creating a separate MultiDBD instance for .ebd (science-computer) files causes the .ebd instance to return significantly fewer data points — consistently and reproducibly — than when the .ebd instance is created first.

This leads to silently incomplete data and, because the effect size varies somewhat between Python sessions, non-reproducible processing pipelines.


Minimal reproducible example

import dbdreader

DATA = "/path/to/glider/hd/"   # contains echo*.dbd and echo*.ebd

# Case A: load EBD first, then DBD
gl_ebd = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
t_a, _ = gl_ebd.get("sci_ctd41cp_timestamp")

gl_dbd = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
_ = gl_dbd.get("m_gps_lat")

print(f"Case A (EBD first): {len(t_a):,} points")   # → 1,038,686

# Case B: load DBD first, then EBD (typical script order)
gl_dbd2 = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
_ = gl_dbd2.get("m_gps_lat")

gl_ebd2 = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
t_b, _ = gl_ebd2.get("sci_ctd41cp_timestamp")

print(f"Case B (DBD first): {len(t_b):,} points")   # → 1,028,677  (≈ 10k fewer)

Observed behaviour

| Scenario | sci_ctd41cp_timestamp length | Notes |
|---|---|---|
| EBD only (no DBD in session) | 1,038,686 | consistent across runs |
| EBD after DBD loaded | 1,028,677 | consistent within a single Python session, but the exact count varies between separate Python sessions (observed range: ~964k – ~1,036k) |

Both MultiDBD instances are created from the same 281 .ebd files; len(gl.filenames) reports 281 in all cases.
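
The counts alone do not show where the shortfall occurs. As a rough sketch, the two timestamp arrays from the example above (t_a and t_b) could be compared to list the missing samples and group them into contiguous runs, which would indicate whether whole files or scattered samples are dropped. `missing_runs` and its `max_step` parameter are hypothetical helpers for illustration, not part of dbdreader; the toy lists stand in for the real arrays.

```python
def missing_runs(t_full, t_short, max_step=10.0):
    """Group timestamps present in t_full but absent from t_short.

    Returns a list of (start, end, count) tuples; a new run starts whenever
    two consecutive missing timestamps are more than max_step seconds apart.
    """
    present = set(float(t) for t in t_short)
    # NaN never compares equal to itself, so drop it explicitly.
    missing = sorted({float(t) for t in t_full if t == t} - present)
    runs = []
    for t in missing:
        if runs and t - runs[-1][1] <= max_step:
            start, _, count = runs[-1]
            runs[-1] = (start, t, count + 1)  # extend the current run
        else:
            runs.append((t, t, 1))  # start a new run
    return runs


# Toy data standing in for t_a (complete) and t_b (short).
t_full = [float(i) for i in range(20)]
t_short = [t for t in t_full if t not in (3.0, 4.0, 5.0, 15.0)]
print(missing_runs(t_full, t_short, max_step=2.0))
# [(3.0, 5.0, 3), (15.0, 15.0, 1)]
```

With the real data, the call would be `missing_runs(t_a, t_b)`; long runs would point towards entire files being skipped.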


Impact

A data-processing script that (naturally) loads GPS positions from .dbd files before reading CTD data from .ebd files will receive up to ~74,000 fewer data points than if the loading order is reversed. In practice we observed:

  • Downstream dataset produced from the "DBD-first" script had ~964 k time steps
  • The same script with "EBD-first" order produced ~1,025 k time steps
  • The extra ~61 k points recovered by reordering were not QC failures — they were valid science data

Because the magnitude of the shortfall varies between Python sessions (likely depending on whether certain .ccc cache files have already been decompressed to .cac in a prior run), the pipeline is non-reproducible: re-running the same script on the same input files can yield different output files.
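
One way to test the cache hypothesis would be to snapshot the cache directory before and after the .dbd load and compare the two listings. The sketch below uses only the standard library; `snapshot_dir` and `diff_snapshots` are hypothetical helpers, and which directory dbdreader actually writes .cac files to on a given system is an assumption left for the caller to fill in.

```python
import os


def snapshot_dir(path):
    """Map each regular file name in `path` to (size, mtime_ns)."""
    snap = {}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                st = entry.stat()
                snap[entry.name] = (st.st_size, st.st_mtime_ns)
    return snap


def diff_snapshots(before, after):
    """Report files added, removed, or modified between two snapshots."""
    common = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(n for n in common if before[n] != after[n]),
    }


# Intended use (cache_dir is an assumption: the directory holding .cac files):
#   before = snapshot_dir(cache_dir)
#   ... create the .dbd MultiDBD instance and call .get() ...
#   after = snapshot_dir(cache_dir)
#   print(diff_snapshots(before, after))
```

If the "added" list is non-empty on the first run and empty on later runs, that would be consistent with the count depending on what an earlier session left on disk.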


Suspected cause

The issue appears to involve shared state between MultiDBD instances. Candidate locations in the source:

  1. DBDCache.CACHEDIR (class-level attribute): This is shared across all MultiDBD instances. Reading .dbd files first triggers decompress_file() calls that convert .ccc files to .cac files. On subsequent .ebd reads, the newly present .cac files change which files pass the _safely_open_dbd_file logic, potentially altering the set of files classified as "ok" vs "failed".

  2. DBDPatternSelect.cache = {} (class-level dict): This timestamp-keyed cache is shared across all instances and could mix up file-open-time metadata between DBD and EBD instances.

  3. DBDCache decompression race / state: .ccc to .cac decompression during one instance's __init__ modifies the filesystem in a way that changes what the next instance finds.
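
For readers less familiar with the mechanism behind candidate 2, the following generic illustration shows how a class-level dict is shared by every instance, while a dict created in `__init__` is not. The class names are invented for this example and do not reproduce dbdreader's code.

```python
class SharedCacheDemo:
    cache = {}  # class-level attribute: one dict for all instances

    def remember(self, key, value):
        self.cache[key] = value  # mutates the shared class-level dict


class PerInstanceCacheDemo:
    def __init__(self):
        self.cache = {}  # instance attribute: a fresh dict per instance

    def remember(self, key, value):
        self.cache[key] = value


dbd_like, ebd_like = SharedCacheDemo(), SharedCacheDemo()
dbd_like.remember(1700000000, "flight-file metadata")
print(ebd_like.cache)  # the entry written via dbd_like is visible here

a, b = PerInstanceCacheDemo(), PerInstanceCacheDemo()
a.remember(1700000000, "flight-file metadata")
print(b.cache)  # stays empty
```

Whether this pattern actually causes the missing points is unverified; it only shows why a class-level cache is a plausible place for state to leak between instances.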


Workaround

Load .ebd (science) files before .dbd (flight) files in the same Python session. After this reordering, MultiDBD.get() gives consistent, reproducible results across repeated runs.


Steps to confirm

# Reuses dbdreader and DATA from the minimal example above.

# Verify consistency when EBD is always first:
for _ in range(5):
    gl = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
    t, _values = gl.get("sci_ctd41cp_timestamp")
    print(len(t))   # prints 1038686 every time

# Verify inconsistency when DBD comes first:
gl_dbd = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
gl_dbd.get("m_gps_lat")
for _ in range(3):
    gl = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
    t, _values = gl.get("sci_ctd41cp_timestamp")
    print(len(t))   # same value within a session, but differs between sessions
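
Because the count differs between sessions rather than within one, a cross-session check needs a fresh interpreter for each run. A minimal sketch using the standard library: `run_in_fresh_sessions` is a hypothetical helper that executes a code string in separate Python processes and collects what each one prints. The snippet shown is a placeholder; in practice it would contain the DBD-first sequence above and print `len(t)`.

```python
import subprocess
import sys


def run_in_fresh_sessions(code, runs=3):
    """Run `code` in `runs` separate Python processes; return each stdout, stripped."""
    outputs = []
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            check=True,  # raise if the child process fails
        )
        outputs.append(result.stdout.strip())
    return outputs


# Placeholder snippet; replace with the dbdreader code under test.
snippet = "print(6 * 7)"
outputs = run_in_fresh_sessions(snippet)
print(outputs)
print("reproducible" if len(set(outputs)) == 1 else "differs between sessions")
```

Separate processes reset only in-memory state; files written to disk by an earlier run persist, so differing results across runs would be consistent with the cache-file explanation suggested above.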
