
[AUTO-MERGE] Polished existing vector database notebook, added to common workflows #753

Merged
drewoldag merged 7 commits into main from issue/730/vdb-notebook
Mar 7, 2026

Conversation

@drewoldag (Collaborator)

Change Description

Closes #730

The vector database notebook was mostly ready to go; I just polished some of the wording. Also added it to the common workflows page.

FWIW, it's a pre-executed notebook, and we could probably export the inference results to a zipped file on Zenodo to save ourselves a bunch of cells at the beginning of the notebook. This could be an approach for a lot of notebooks...

@drewoldag drewoldag self-assigned this Mar 5, 2026
@drewoldag drewoldag requested review from SamSandwich07 and Copilot and removed request for Copilot March 5, 2026 00:32
@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.69%. Comparing base (618be04) to head (0337c13).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #753   +/-   ##
=======================================
  Coverage   64.69%   64.69%           
=======================================
  Files          61       61           
  Lines        5894     5894           
=======================================
  Hits         3813     3813           
  Misses       2081     2081           


@github-actions

github-actions bot commented Mar 5, 2026

Before [618be04] After [a038bc5] Ratio Benchmark (Parameter)
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_hsc1k
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.track_cache_hsc1k_hyrax_size_undercount
failed failed n/a data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
1.94±0.01s 1.97±0.01s 1.02 benchmarks.time_database_connection_help
1.59G 1.62G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
9.60±0ms 9.79±0.08ms 1.02 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'chromadb')
37.9±0.1ms 38.2±0.2ms 1.01 benchmarks.time_nb_obj_construct
1.96±0.01s 1.98±0.03s 1.01 benchmarks.time_prepare_help
1.95±0.01s 1.98±0.02s 1.01 benchmarks.time_save_to_database_help
1.96±0.02s 1.97±0.02s 1.01 benchmarks.time_umap_help


@aritraghsh09 left a comment

Looks good!

A note for the future, for all our notebooks: since we present ourselves as an astronomy framework, we should really move away from CIFAR to something astronomy-related, but we can do that later once we have basic working docs we're happy with!

@@ -5,22 +5,22 @@
"id": "f163b281",

@SamSandwich07 Mar 5, 2026
Line #6.    # h.config["vector_db"]["qdrant"]["vector_size"] = 64  # must match the model's latent dimension

Is the model's latent dimension something the user must know themselves, or is there some way to recover it from the model? If so, it might be good to mention this for users.



@drewoldag (Author)

This is specific to the qdrant vector database. ChromaDB is more forgiving and detects the input size automatically. Qdrant requires it as an input configuration parameter. I'll add a note about it.

@drewoldag (Author)

Also, the user would know what the latent dimension size is for a model that they created. It is also a configurable parameter for the HyraxAutoencoderV2 model. :)
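Since the latent dimension is itself a configurable parameter, one way to keep the two values from drifting apart is to derive vector_size from the model config rather than hard-coding it. A minimal sketch with a plain dict; the nested key names here are assumptions, not necessarily hyrax's actual config layout:

```python
# Hypothetical sketch: keep Qdrant's vector_size in sync with the model's
# configurable latent dimension. The nested key names are illustrative.
config = {
    "model": {"HyraxAutoencoderV2": {"latent_dim": 64}},
    "vector_db": {"qdrant": {"vector_size": None}},
}

# Derive vector_size from the model config instead of typing 64 twice.
latent_dim = config["model"]["HyraxAutoencoderV2"]["latent_dim"]
config["vector_db"]["qdrant"]["vector_size"] = latent_dim

assert config["vector_db"]["qdrant"]["vector_size"] == 64
```

With this pattern, changing latent_dim in one place automatically updates the Qdrant configuration, which sidesteps the mismatch the comment warns about.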

@@ -5,22 +5,22 @@
"id": "f163b281",

@SamSandwich07 Mar 5, 2026

Line #1.    conn = h.database_connection()

Does this use the most recently timestamped directory? If so, is there some way to override the database used for the connection, as there is with save_to_database()? As with the latent dimension, it may be good to mention this for users if true.



@drewoldag (Author)

Good question, yes it does, and yes, the user can specify a different database to connect to. I'll update the comments.
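For readers curious what "most recently timestamped directory" could mean in practice, here is a minimal sketch of that selection logic, assuming run directories carry sortable ISO-style timestamp names; the function name and directory layout are hypothetical, not hyrax's actual implementation:

```python
# Hypothetical sketch of "most recently timestamped directory" selection.
# Assumes run directories use sortable timestamp names such as
# 20260305-003200-infer, so lexical order matches chronological order.
from pathlib import Path


def latest_run_dir(results_root: Path) -> Path:
    """Return the most recent run directory under results_root."""
    runs = sorted(p for p in results_root.iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no run directories under {results_root}")
    return runs[-1]  # lexically greatest sorts last for timestamped names
```

An explicit path override, as discussed above, would then just bypass this default lookup.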

@SamSandwich07 left a comment

I left a couple of questions I'm wondering about. Also, would it be good to include a sentence or two about when a user might want to use ChromaDB vs. Qdrant? Or is that covered elsewhere, or something the user is expected to know?

Copilot AI review requested due to automatic review settings March 6, 2026 04:39
Copilot AI left a comment

Pull request overview

This PR polishes the existing pre-executed vector database demo notebook and adds it to the documentation’s “Common workflows” index, addressing issue #730 (“Demo how to create and query a vector database”).

Changes:

  • Updated the vector DB demo notebook narrative and restructured sections (backend selection, database creation, querying, visualization).
  • Adjusted demo configuration (model name, epochs, and data_request shape) and streamlined some notebook outputs.
  • Added the notebook to docs/common_workflows.rst to make it discoverable.

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.

File Description
docs/pre_executed/vector_db_demo.ipynb Rewords and reorganizes the vector DB demo; updates config/code cells for model training, DB creation, queries, and visualization.
docs/common_workflows.rst Adds the vector DB demo to the “Common workflows” toctree.


Comment on lines +43 to +56
" \"dataset_class\": \"HyraxCifarDataset\",\n",
" \"data_location\": \"./data\",\n",
" \"fields\": [\"image\"],\n",
" \"primary_id_field\": \"object_id\",\n",
" \"split_fraction\": 1.0,\n",
" },\n",
" },\n",
" \"infer\": {\n",
" \"data\": {\n",
" \"dataset_class\": \"HyraxCifarDataset\",\n",
" \"data_location\": \"./data\",\n",
" \"fields\": [\"image\", \"object_id\"],\n",
" \"primary_id_field\": \"object_id\",\n",
" },\n",
Copilot AI Mar 6, 2026

split_fraction is set for the train data_request group but not for the infer group while both share the same data_location ("./data"). Hyrax validates that if any group for a given data_location specifies split_fraction, then all groups for that data_location must specify it, and the fractions must sum to <= 1.0. As written, this configuration is expected to error at runtime when h.set_config("data_request", data_request) is called.

Suggested fix: remove split_fraction here, or switch fully to split_fraction-based partitioning by adding split_fraction to all relevant groups and ensuring the per-data_location fractions sum to <= 1.0.
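The validation rule described above can be sketched in a few lines. The function and key names below are illustrative, not hyrax's actual API; they only demonstrate the all-or-none and sum-at-most-1.0 constraints per data_location:

```python
# Hypothetical sketch of the split_fraction validation rule: for each
# data_location, either no data_request group sets split_fraction or all
# of them do, and the given fractions must sum to at most 1.0.
from collections import defaultdict


def check_split_fractions(data_request: dict) -> None:
    by_location = defaultdict(list)
    for group, spec in data_request.items():
        data = spec["data"]
        by_location[data["data_location"]].append(data.get("split_fraction"))
    for location, fractions in by_location.items():
        given = [f for f in fractions if f is not None]
        if given and len(given) != len(fractions):
            raise ValueError(
                f"{location}: split_fraction set on some groups but not all"
            )
        if sum(given) > 1.0:
            raise ValueError(f"{location}: split fractions sum to > 1.0")


# The configuration flagged in this comment (train sets split_fraction,
# infer does not, same data_location) would fail this check.
bad = {
    "train": {"data": {"data_location": "./data", "split_fraction": 1.0}},
    "infer": {"data": {"data_location": "./data"}},
}
```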

@drewoldag (Author)

@aritraghsh09 The approach I've been taking, without having discussed it, is to use CIFAR for the "common workflow" material just because it's small and not specific to any one area of astronomy. Then the science examples will use real astro data.

But to your point, if that seems wrong or unhelpful, we can definitely change it up.

@drewoldag drewoldag enabled auto-merge (squash) March 7, 2026 00:41
@drewoldag drewoldag changed the title from "Polished existing vector database notebook, added to common workflows" to "[AUTO-MERGE] Polished existing vector database notebook, added to common workflows" Mar 7, 2026
@drewoldag drewoldag merged commit 7448bc9 into main Mar 7, 2026
6 of 7 checks passed
@drewoldag drewoldag deleted the issue/730/vdb-notebook branch March 7, 2026 00:59


Development

Successfully merging this pull request may close these issues.

Demo how to create and query a vector database

4 participants