Fast loading of multi-file BOUT++ datasets#336

Open
bendudson wants to merge 7 commits into master from feature/lazy-load

@bendudson (Contributor) commented Feb 27, 2026

Adds an xbout.lazy_open_boutdataset function that reads collections of dmp or restart files by opening one file, then using its metadata to construct Dask chunks for all other files without opening them. No merging of datasets is needed.
This greatly reduces the time needed to open a dataset.
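
As a rough illustration of why only one file needs to be opened, the sketch below derives the position of every file's chunk in the global array from decomposition metadata alone. The `Layout` class and `chunk_slices` helper are hypothetical and not part of xbout; the field names mirror the BOUT++ variables NXPE, NYPE, MXSUB and MYSUB. The ordering of processor indices (x varying fastest), the absence of guard cells, and the example decomposition are assumptions. In a Dask-based loader these slices would correspond to the chunk grid, with each chunk backed by a deferred read of one file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Decomposition metadata, as it might be read from a single dump file."""

    nxpe: int   # number of processors in x
    nype: int   # number of processors in y
    mxsub: int  # x points per processor, excluding guard cells
    mysub: int  # y points per processor, excluding guard cells


def chunk_slices(layout, prefix="BOUT.dmp"):
    """Map each expected file name to the (x, y) slices it fills globally.

    No file is opened here: everything follows from the layout alone.
    Assumes processor index = y_index * nxpe + x_index (x varies fastest).
    """
    slices = {}
    for index in range(layout.nxpe * layout.nype):
        y_index, x_index = divmod(index, layout.nxpe)
        x_start = x_index * layout.mxsub
        y_start = y_index * layout.mysub
        slices[f"{prefix}.{index}.nc"] = (
            slice(x_start, x_start + layout.mxsub),
            slice(y_start, y_start + layout.mysub),
        )
    return slices


# Hypothetical decomposition of a 16 x 50 grid over 10 files (assumed values).
layout = Layout(nxpe=1, nype=10, mxsub=16, mysub=5)
slices = chunk_slices(layout)
```

Because the mapping is pure arithmetic, its cost does not depend on how many files there are to open, which is consistent with the speed-up reported in this PR.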

xbout.open_boutdataset is modified so that if we are opening a collection of NetCDF files that are all in the same directory (a common use case) then lazy_open_boutdataset will be used unless disabled with lazy_load=False.
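
The dispatch rule described above could look something like the following sketch. `can_lazy_load` is a hypothetical helper written for illustration; the actual check inside xbout is not shown in this thread and may differ (for example in how it recognises NetCDF files).

```python
from pathlib import Path


def can_lazy_load(paths, lazy_load=True):
    """Return True if the fast path described in this PR would apply.

    Hypothetical helper: requires every path to end in ".nc" and all
    paths to share one parent directory, unless lazy_load is False.
    """
    if not lazy_load or not paths:
        return False
    paths = [Path(p) for p in paths]
    all_netcdf = all(p.suffix == ".nc" for p in paths)
    same_directory = len({p.parent for p in paths}) == 1
    return all_netcdf and same_directory
```

Anything that fails this check, such as files spread across several run directories, would take the existing slower path.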

Simple test case from hermes-perftest: 10 files, with (t, x, y, z) sizes of (1, 16, 50, 1), without X or Y boundaries.

Current approach:

import xbout
%time ds = xbout.open_boutdataset("./BOUT.dmp.*.nc", lazy_load=False)     # Time: 16.4 seconds

New default approach:

import xbout
%time ds = xbout.open_boutdataset("./BOUT.dmp.*.nc")     # Time: 1.6 seconds
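
The `%time` magic only works in IPython or Jupyter. To repeat the comparison from a plain Python script, a small timing wrapper along these lines could be used. `time_call` is a hypothetical helper, and the commented usage assumes xbout is installed and the dump files are present.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage with xbout (not executed here; needs xbout and the dump files):
# import xbout
# ds, seconds = time_call(xbout.open_boutdataset, "./BOUT.dmp.*.nc")
# ds_old, seconds_old = time_call(
#     xbout.open_boutdataset, "./BOUT.dmp.*.nc", lazy_load=False
# )
# print(f"speed-up: {seconds_old / seconds:.1f}x")
```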

Gridfile and geometry loading is handled in the same way as before.

Only opens one file, using metadata to construct Dask chunks
for all other files. This greatly reduces the time needed
to open a dataset.
Sorting out imports
@mikekryjak (Collaborator)

Wow, very cool! I see you made a new function for this. How come we can't just modify the original one?

@bendudson (Contributor, Author)

> Wow, very cool! I see you made a new function for this. How come we can't just modify the original one?

The next step is to modify the original open_boutdataset to call this function, then perform all the geometry stuff. I wanted to get this working first because open_boutdataset handles many different cases, e.g. pre-squashed files.

If opening a set of NetCDF files that are all in the same directory,
use lazy_open_boutdataset. This is a common use-case and is
significantly faster this way.

For more complicated cases (e.g. concatenating multiple BOUT++ runs),
or if `lazy_load = False`, fall back to the old method.
@bendudson changed the title from "WIP: Fast loading of multi-file BOUT++ datasets" to "Fast loading of multi-file BOUT++ datasets" on Mar 4, 2026
Merge ds.metadata only if it exists.
Testing uses lists of datasets rather than glob string input.
@mikekryjak (Collaborator)

I load with absolute paths most of the time, and the glob you used only works for relative paths. I just pushed a fix.
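
The pushed fix is not shown in this thread. As a minimal sketch of one way to expand a pattern so that relative and absolute paths behave the same, `glob.glob` from the standard library accepts both. `expand_dump_files` is a hypothetical helper; the numeric sort on the processor index is an added assumption, included because plain string sorting would place BOUT.dmp.10.nc ahead of BOUT.dmp.2.nc.

```python
import glob
import os
import re


def expand_dump_files(pattern):
    """Expand a glob pattern and sort the matches by processor index.

    Works for relative patterns ("./BOUT.dmp.*.nc") and absolute ones
    ("/data/run/BOUT.dmp.*.nc"). Files without a trailing ".<N>.nc"
    are sorted first.
    """

    def processor_index(path):
        match = re.search(r"\.(\d+)\.nc$", path)
        return int(match.group(1)) if match else -1

    matches = glob.glob(os.path.expanduser(pattern))
    return sorted(matches, key=processor_index)
```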


2 participants