Hi @shuds13,
As discussed today: we got a nice HPC workflow up with optimas/libEnsemble that can make use of containers for the individually executed commands/runs in TemplateEvaluator. Currently, we use podman-hpc but it would work with other HPC-focused container managers, too.
The best practice for doing many runs inside the same container is to:
- start (`run -d`) a detached container
- `exec` individual simulations (1-N times)
- finally stop the container
That way, a persistent container is spun up once and stays up for as long as the whole optimas/libEnsemble run is ongoing, and all the fragile and costly resource work like mounting file systems only happens once. The rest is then done by changing the work-dir (in-container, thus with a different base path) for each exec.
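The lifecycle above can be sketched as follows. This is only a minimal sketch under assumptions, not the setup from our job scripts: the image name `my_image`, the host path `/scratch/run` and the `sleep infinity` keep-alive command are hypothetical, and it assumes podman-hpc accepts the usual podman `run`, `exec` and `stop` subcommands. The helper only builds the command lines, so it can be inspected without a container runtime.

```python
import shlex


def container_commands(name, image, sim_dirs, host_dir, base_dir="/data",
                       manager="podman-hpc"):
    """Build the start / exec / stop command lines for one persistent container.

    All argument values are illustrative; nothing is executed here.
    """
    # 1) start the container once, detached, with the run directory mounted
    start = [manager, "run", "-d", "--name", name,
             "--volume", f"{host_dir}:{base_dir}",
             image, "sleep", "infinity"]  # hypothetical keep-alive command
    # 2) one exec per simulation, each with its own in-container work-dir
    execs = [
        [manager, "exec", "--workdir", f"{base_dir}/{sim_dir}", name,
         "/opt/entrypoint.sh"]
        for sim_dir in sim_dirs
    ]
    # 3) stop the container after the last simulation
    stop = [manager, "stop", name]
    return [start, *execs, stop]


if __name__ == "__main__":
    for cmd in container_commands(
        "my_container_name", "my_image",
        ["evaluations/sim0000", "evaluations/sim0001"],
        host_dir="/scratch/run",
    ):
        print(shlex.join(cmd))
```

Only step 2 depends on the individual simulation, which is why the per-run directory is the one piece of information still missing.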
The last challenge we have now: we need to know the current, relative simulation evaluation directory at the moment an individual run is evaluated, so that the precedent can change the workdir (inside the container) to the evaluations/simXYZW/ directory, i.e. the equivalent of cd evaluations/simXYZW/.
Code snippet (from run_grid_scan.py below):
    import re

    from optimas.evaluators import TemplateEvaluator

    precedent = "podman-hpc exec my_container_name /opt/entrypoint.sh"  # usually from an environment variable in the job script

    # base dir of the optimas/libEnsemble run
    base_dir = "/data/"  # this is a mount point inside the container and generally differs from the host path
    rel_sim_dir = "evaluations/sim0000/"  # TODO: generalize to the PWD sim folder that the TemplateEvaluator picks
    rel_sim_dir = "%LIBENSEMBLE_SIM_DIR%"  # TODO: before calling srun, libensemble would replace `%LIBENSEMBLE_SIM_DIR%` with the sim's run dir

    # inject into pre-defined precedent: add `--workdir ...` as needed inside the container
    extra_options = f"--workdir {base_dir}/{rel_sim_dir}"
    precedent = re.sub(r'(\s+exec)\s+', rf'\1 {extra_options} ', precedent)

    ev_main = TemplateEvaluator(
        sim_template="templates/warpx_input_script",
        analysis_func=analysis_func_main,
        executable="templates/warpx",
        precedent=precedent,
        n_gpus=1,  # GPUs per individual evaluation
        env_mpi="srun",
    )

Full Example / Private Repo Context
- https://github.com/BLAST-AI-ML/synapse-bella-staging-injector/pull/2/files 🔒
- files: simulation_scripts/templates/run_grid_scan.py is the optimas script, simulation_scripts/submission_script_* are the "outer" job scripts
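To make the requested behavior concrete, here is a rough sketch of the substitution we have in mind. It is not existing libEnsemble or optimas API: the function names `add_workdir` and `resolve_precedent` are hypothetical, the `%LIBENSEMBLE_SIM_DIR%` placeholder is the proposal from this issue, and the host path `/scratch/run` is made up. The first helper mirrors what the user script does once, the second shows what would need to happen per simulation just before the command is launched.

```python
import re
from pathlib import PurePosixPath

PLACEHOLDER = "%LIBENSEMBLE_SIM_DIR%"  # proposed placeholder, not an existing feature


def add_workdir(precedent, base_dir, rel_sim_dir=PLACEHOLDER):
    """Insert `--workdir <base_dir>/<rel_sim_dir>` right after the first `exec`."""
    workdir = PurePosixPath(base_dir) / rel_sim_dir
    # a callable replacement avoids any escape handling in the inserted path
    return re.sub(
        r"(\s+exec)\s+",
        lambda m: f"{m.group(1)} --workdir {workdir} ",
        precedent,
        count=1,
    )


def resolve_precedent(precedent, sim_dir, ensemble_dir):
    """Replace the placeholder with the sim dir relative to the ensemble dir."""
    rel = PurePosixPath(sim_dir).relative_to(ensemble_dir)
    return precedent.replace(PLACEHOLDER, rel.as_posix())


if __name__ == "__main__":
    template = add_workdir(
        "podman-hpc exec my_container_name /opt/entrypoint.sh", "/data/"
    )
    # per simulation, on the host: /scratch/run is a hypothetical ensemble dir
    print(resolve_precedent(template, "/scratch/run/evaluations/sim0003", "/scratch/run"))
```

Keeping the path relative to the ensemble directory matters here because the host path and the in-container mount point differ, so only the relative part is valid on both sides.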