A C++ library that embeds a single Python interpreter, providing thread‑pooled, asynchronous, and synchronous invocation of Python functions from C++ with optimized GIL usage.
Optimized as an execution engine for Machine Learning pipelines.
- Thread-Safe Implementation
  Ensures only one Python interpreter is ever initialized per process on Pythons before 3.13.
  - Pre-Python 3.13: a single shared interpreter
  - Python 3.13+: no-GIL sub-interpreters
- Thread‑Pooled Execution
  Uses a high-performance round-robin queue to minimize latency during high-throughput workloads.
- Optimized GIL Management
  Acquires/releases the Global Interpreter Lock only around actual Python execution.
- Synchronous & Asynchronous APIs
  Easily call Python functions synchronously or schedule them with callbacks returning `std::future`.
- Opportunistic Batching
  Batches similar workloads together to minimize up-call latencies into Python.
System Dependencies
- CMake ≥ 3.27
- C++20 (or later)
- Python 3 development headers
- CUDAToolkit
You can add pyscheduler to your project as a submodule or install it as a system library.
- Clone the repository:
  `git clone --recurse-submodules`
- Configure and build with `configure.sh` (use `-h | --help` for all options):
```sh
# Configure + build default debug build
./configure.sh --build

# Configure + build tests with ASan only
./configure.sh --tests --asan --build

# Configure + build tests with TSan only
./configure.sh --tests --tsan --build

# Configure + build release
./configure.sh --mode release --build
```
- Install with either:
```sh
./configure.sh --mode Release --build --install
# or
cmake --install build-Release
```
- Uninstall:
```sh
sudo xargs rm < build-Release/install_manifest.txt
```
The configure script also supports a custom install prefix: use the `-p | --prefix` flag, or set the `PYTHON_UDL_INTERFACE_PREFIX` environment variable.
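For example (illustrative invocations based on the flags above; check `--help` for the exact spelling and argument form):

```sh
# Install under a custom prefix via the flag...
./configure.sh --mode Release --build --install --prefix "$HOME/.local"

# ...or via the environment variable
PYTHON_UDL_INTERFACE_PREFIX="$HOME/.local" ./configure.sh --mode Release --build --install
```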
Pyscheduler is designed to manage the lifecycle of embedding Python within a C++ multithreaded environment. At its core, the library aims to reduce the overhead imposed by Python's Global Interpreter Lock (GIL), allowing for concurrent C++ execution before transitioning data to the Python runtime.
PyManager
The primary lifecycle manager for the embedded Python environment. It safely scopes the interpreter's initialization and finalization, ensuring that Python routines are accessible across the C++ application. PyManager provides thread-safety when starting the Python runtime and serves as the factory for creating specialized function handlers.
InvokeHandler
An execution interface bound to a specific Python function. Handlers isolate execution pipelines, allowing the library to apply localized optimizations like per-handler queuing, opportunistic batching, and data prefetching without impacting other Python tasks.
A key design principle of Pyscheduler is decoupling C++ state preparation from Python execution.
- Commit Phase: When work is dispatched via `queue_invoke`, Pyscheduler calls a user-provided commit function ahead of execution, giving the user an opportunity to pre-process data (e.g. load memory onto the GPU).
- Execution Phase: Once the commit phase has structured the C++ arguments into Python-accessible objects (using pybind11), the handler acquires the GIL and submits the batched payload to the underlying Python interpreter.
- Callback Phase: Results yielded from the Python function are handled by a C++ callback, returning the computed outcomes to the caller asynchronously via a standard `std::future`.
A typical pipeline consists of initializing the runtime, establishing a handler, and dispatching work:
```cpp
// 1. Initialize the global Python manager
PyManager manager;

// 2. Bind a specific Python function to a handler
PyManager::InvokeHandler add_handler = manager.loadPythonModule("math_ops", "add");

// 3. Synchronous invocation
int sync_result = add_handler.invoke<int>(arg1, arg2);

// 4. Asynchronous invocation with GIL-free staging
auto commit_fn = [](int a, int b) {
    // Prepared without holding the GIL
    return pybind11::make_tuple(a, b);
};
auto callback_fn = [](pybind11::object&& result) {
    // Resolves output from Python execution
    return pybind11::cast<int>(result);
};
std::future<int> async_result = add_handler.queue_invoke(commit_fn, callback_fn, arg1, arg2);
```

Call a Python function directly and cast the result.
Python (my_module.py):
```python
def invoke(a, b):
    return a + b
```
C++:
```cpp
pyscheduler::PyManager manager;
auto add = manager.loadPythonModule("my_module", "invoke");
int64_t result = add.invoke<int64_t>(3000, -1234); // 1766
```

Process the raw pybind11::object before it leaves the GIL scope.
```cpp
auto to_string = [](const pybind11::object& obj) { return obj.cast<std::string>(); };
std::string result = add.invoke(to_string, "hello", " world"); // "hello world"
```

Queue work items that get batched into a single Python call. The commit function converts C++ arguments into a `pybind11::object`, and the callback processes each result.
Python (svd.py):
```python
import numpy as np

def generate(items):
    return [np.random.rand(n, n) for n in items]

def compute_svd_rank(matrices):
    def svd_one(M):
        S = np.linalg.svd(M, compute_uv=False)
        tol = np.max(M.shape) * np.spacing(S.max())
        return int(np.sum(S > tol))
    return [svd_one(M) for M in matrices]
```
C++:
```cpp
pyscheduler::PyManager manager;

// batch_size=1 (default) — one matrix per Python call
auto generate = manager.loadPythonModule("svd", "generate");

// batch_size=32, prefetch_depth=3 — up to 32 matrices per call, 96 pre-committed
auto compute = manager.loadPythonModule("svd", "compute_svd_rank", 32, 3);

// Step 1: generate 1000 random matrices
auto commit = [](int n) -> pybind11::object { return pybind11::cast(n); };
auto extract = [](pybind11::object&& obj) { return obj.release().ptr(); };

std::vector<std::future<PyObject*>> gen_futures;
for (int i = 0; i < 1000; i++)
    gen_futures.push_back(generate.queue_invoke(commit, extract, 100));

// Collect raw PyObject* pointers (ownership transferred out of pybind11)
std::vector<PyObject*> matrices;
for (auto& f : gen_futures)
    matrices.push_back(f.get());

// Step 2: compute SVD rank of each matrix (batched 32 at a time)
auto commit2 = [](PyObject* p) -> pybind11::object {
    return pybind11::reinterpret_steal<pybind11::object>(p);
};
auto to_int = [](pybind11::object&& obj) { return obj.cast<int>(); };

std::vector<std::future<int>> rank_futures;
for (int i = 0; i < 1000; i++)
    rank_futures.push_back(compute.queue_invoke(commit2, to_int, matrices[i]));

for (auto& f : rank_futures)
    std::cout << f.get() << "\n";
```

Build the tests with ASan enabled:
```sh
./configure.sh --tests --asan --build
```
This adds `-fsanitize=address -fno-omit-frame-pointer -g` to the compile and link flags.
CPython's embedded interpreter and extension modules (e.g. numpy) intentionally leak interned strings, type caches, and method tables at shutdown. These are not bugs in pyscheduler but will be reported by LeakSanitizer. A suppressions file (lsan_suppressions.txt) in the project root filters them out.
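Such a file uses LeakSanitizer's standard `leak:<pattern>` syntax, matching any leak report whose stack trace contains the pattern. The entries below are purely illustrative of its shape; the repository ships the authoritative list.

```
# Illustrative entries only — see lsan_suppressions.txt in the repo root
leak:_PyObject_Malloc
leak:PyUnicode_InternInPlace
leak:numpy
```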
The build system automatically symlinks this file into the test binary directory. To use it:
```sh
LSAN_OPTIONS=suppressions=lsan_suppressions.txt PYTHONMALLOC=malloc ./build/debug/tests/pyscheduler_tests
```
`PYTHONMALLOC=malloc` is required so that Python's allocations go through the system malloc, which ASan can track. Without it, CPython uses its own arena allocator and ASan cannot see individual allocations.