Description
Bug report: MultiDBD.get() returns fewer data points when a second MultiDBD instance (different file type) is created first in the same Python session
Library: dbdreader v0.5.8
Python: 3.14.2
OS: Linux
Summary
Creating a MultiDBD instance for .dbd (flight-computer) files and calling .get() on it before creating a separate MultiDBD instance for .ebd (science-computer) files causes the .ebd instance to return significantly fewer data points than when the .ebd instance is created first. The shortfall appears on every run in which the .dbd instance is loaded first.
This leads to silently incomplete data and, because the effect size varies somewhat between Python sessions, non-reproducible processing pipelines.
Minimal reproducible example
import dbdreader
DATA = "/path/to/glider/hd/" # contains echo*.dbd and echo*.ebd
# Case A: load EBD first, then DBD
gl_ebd = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
t_a, _ = gl_ebd.get("sci_ctd41cp_timestamp")
gl_dbd = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
_ = gl_dbd.get("m_gps_lat")
print(f"Case A (EBD first): {len(t_a):,} points") # → 1,038,686
# Case B: load DBD first, then EBD (typical script order)
gl_dbd2 = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
_ = gl_dbd2.get("m_gps_lat")
gl_ebd2 = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
t_b, _ = gl_ebd2.get("sci_ctd41cp_timestamp")
print(f"Case B (DBD first): {len(t_b):,} points") # → 1,028,677 (≈ 10k fewer)
Observed behaviour
| Scenario | sci_ctd41cp_timestamp length | Notes |
|---|---|---|
| EBD only (no DBD in session) | 1,038,686 | consistent across runs |
| EBD after DBD loaded | 1,028,677 | consistent within a single Python session, but the exact count varies between separate Python sessions (observed range: ~964k – ~1,036k) |
Both MultiDBD instances are created from the same 281 .ebd files; len(gl.filenames) reports 281 in all cases.
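To narrow down where the points go missing, it may help to compare the two timestamp arrays directly. The following is a minimal sketch, not part of dbdreader: `missing_samples` and `summarise_gaps` are hypothetical helper names, and the 60-second grouping threshold is an arbitrary assumption.

```python
def missing_samples(reference, candidate):
    """Return sorted timestamps present in `reference` but absent from `candidate`.

    NaN values are ignored (t == t is False for NaN).
    """
    candidate_set = {float(t) for t in candidate if t == t}
    reference_set = {float(t) for t in reference if t == t}
    return sorted(reference_set - candidate_set)


def summarise_gaps(missing, max_step=60.0):
    """Group sorted missing timestamps into (start, end, count) runs.

    Consecutive missing timestamps are placed in the same run when they are
    no more than `max_step` seconds apart (an arbitrary threshold).
    """
    runs = []
    for t in missing:
        if runs and t - runs[-1][1] <= max_step:
            start, _, count = runs[-1]
            runs[-1] = (start, t, count + 1)
        else:
            runs.append((t, t, 1))
    return runs


# Usage with the arrays from the example above (t_a: EBD first, t_b: DBD first):
# for start, end, n in summarise_gaps(missing_samples(t_a, t_b)):
#     print(f"{start:.0f} to {end:.0f}: {n} samples missing")
```

If the missing samples fall into a small number of long runs, that would be consistent with whole files being skipped rather than individual samples being dropped.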
Impact
A data-processing script that (naturally) loads GPS positions from .dbd files before reading CTD data from .ebd files will receive up to ~74,000 fewer data points than if the loading order is reversed. In practice we observed:
- Downstream dataset produced from the "DBD-first" script had ~964 k time steps
- The same script with "EBD-first" order produced ~1,025 k time steps
- The extra ~61 k points recovered by reordering were not QC failures — they were valid science data
Because the magnitude of the shortfall varies between Python sessions (likely depending on whether certain .ccc cache files have already been decompressed to .cac in a prior run), the pipeline is non-reproducible: re-running the same script on the same input files can yield different output files.
Suspected cause
The issue appears to involve shared state between MultiDBD instances. Candidate locations in the source:
- DBDCache.CACHEDIR (class-level attribute): This is shared across all MultiDBD instances. Reading .dbd files first triggers decompress_file() calls that convert .ccc → .cac files. On subsequent .ebd reads, the newly present .cac files change which files pass the _safely_open_dbd_file logic, potentially altering the set of files classified as "ok" vs "failed".
- DBDPatternSelect.cache = {} (class-level dict): This timestamp-keyed cache is shared across all instances and could mix up file-open-time metadata between DBD and EBD instances.
- DBDCache decompression race / state: .ccc → .cac decompression during one instance's __init__ modifies the filesystem in a way that changes what the next instance finds.
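One way to test the filesystem-related hypotheses is to snapshot the cache directory before and after the .dbd instance is loaded and see whether any .cac files appear or change. This is a sketch using only the standard library; `snapshot_cache` and `diff_snapshots` are hypothetical helpers, and the cache directory location is assumed to be whatever DBDCache.CACHEDIR holds in your installation.

```python
from pathlib import Path


def snapshot_cache(cache_dir):
    """Map each file name in `cache_dir` to its (size, mtime_ns)."""
    result = {}
    for path in Path(cache_dir).iterdir():
        if path.is_file():
            stat = path.stat()
            result[path.name] = (stat.st_size, stat.st_mtime_ns)
    return result


def diff_snapshots(before, after):
    """Return (added, removed, changed) file-name lists between two snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(n for n in set(before) & set(after) if before[n] != after[n])
    return added, removed, changed


# Usage (assumes the attribute named in this report points at the cache directory):
# cache_dir = dbdreader.DBDCache.CACHEDIR
# before = snapshot_cache(cache_dir)
# ... create the .dbd MultiDBD instance and call .get() ...
# after = snapshot_cache(cache_dir)
# print(diff_snapshots(before, after))
```

An empty diff would point away from the decompression hypotheses and towards the in-memory shared state.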
Workaround
Load .ebd (science) files before .dbd (flight) files in the same Python session. After this reordering, MultiDBD.get() gives consistent, reproducible results across repeated runs.
Steps to confirm
# Verify consistency when EBD is always first:
for _ in range(5):
    gl = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
    t, _ = gl.get("sci_ctd41cp_timestamp")
    print(len(t))  # prints 1038686 every time
# Verify inconsistency when DBD comes first:
gl_dbd = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
gl_dbd.get("m_gps_lat")
for _ in range(3):
    gl = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
    t, _ = gl.get("sci_ctd41cp_timestamp")
    print(len(t))  # same value within a session, but differs between sessions
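Since the count differs between sessions, the between-session variation can be observed without manually restarting Python by running the snippet in fresh interpreter processes. This is a sketch: `run_in_fresh_session` is a hypothetical helper, and the path inside `DBD_FIRST` is the placeholder from the example above.

```python
import subprocess
import sys

# Snippet executed in each fresh interpreter: DBD first, then EBD.
DBD_FIRST = """
import dbdreader
DATA = "/path/to/glider/hd/"  # placeholder path, adjust to your data
gl_dbd = dbdreader.MultiDBD(pattern=DATA + "echo*.dbd")
gl_dbd.get("m_gps_lat")
gl_ebd = dbdreader.MultiDBD(pattern=DATA + "echo*.ebd")
t, _ = gl_ebd.get("sci_ctd41cp_timestamp")
print(len(t))
"""


def run_in_fresh_session(snippet, runs=3):
    """Execute `snippet` in `runs` separate Python processes.

    Returns the stripped stdout of each run; raises CalledProcessError
    if any run exits with a non-zero status.
    """
    outputs = []
    for _ in range(runs):
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(proc.stdout.strip())
    return outputs


# Usage:
# print(run_in_fresh_session(DBD_FIRST, runs=5))
```

Note that if the cache state on disk is part of the cause, earlier runs may influence later ones, so the order of runs can matter.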