[AUTO-MERGE] Polished existing vector database notebook, added to common workflows #753
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #753   +/-   ##
=======================================
  Coverage   64.69%   64.69%
=======================================
  Files          61       61
  Lines        5894     5894
=======================================
  Hits         3813     3813
  Misses       2081     2081
```
aritraghsh09
left a comment
Looks good!
Note for the future for all our notebooks: Because we say we are an astronomy framework, we should really move away from CIFAR to something astronomy-related for all our notebooks, but we can do that later once we have basic working docs we are happy with!
Diff context: @@ -5,22 +5,22 @@ ("id": "f163b281")
Line #6: `# h.config["vector_db"]["qdrant"]["vector_size"] = 64  # must match the model's latent dimension`
Is the model's latent dimension something the user must know themselves, or is there some way to recover it from the model? If so, maybe it's good to mention this for users?
This is specific to the qdrant vector database. ChromaDB is more forgiving and detects the input size automatically. Qdrant requires it as an input configuration parameter. I'll add a note about it.
Also, the user would know what the latent dimension size is for a model that they created. It is also a configurable parameter for the HyraxAutoencoderV2 model. :)
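To illustrate the point above: since ChromaDB detects the embedding size automatically but Qdrant requires it up front, one way to recover the latent dimension programmatically is to encode a single sample and measure the output length. The `DummyEncoder` and `infer_vector_size` below are purely hypothetical stand-ins for a real model such as HyraxAutoencoderV2; Hyrax's actual API may differ.

```python
# Illustrative sketch only: DummyEncoder stands in for a real model
# (e.g. HyraxAutoencoderV2); it is not part of the Hyrax API.

class DummyEncoder:
    """Toy encoder with a fixed, configurable latent dimension."""

    def __init__(self, latent_dim: int = 64):
        self.latent_dim = latent_dim

    def encode(self, sample):
        # A real model would map the input to a latent vector; here we
        # just return a zero vector of the configured length.
        return [0.0] * self.latent_dim


def infer_vector_size(model, sample) -> int:
    """Recover the latent dimension by encoding one sample."""
    return len(model.encode(sample))


model = DummyEncoder(latent_dim=64)
vector_size = infer_vector_size(model, sample=[0.0, 1.0])
# Qdrant (unlike ChromaDB) needs this set explicitly, e.g.:
# h.config["vector_db"]["qdrant"]["vector_size"] = vector_size
```

This avoids a silent mismatch between the configured `vector_size` and what the model actually emits.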
Diff context: @@ -5,22 +5,22 @@ ("id": "f163b281")
Line #1: `conn = h.database_connection()`
Does this use the most recently timestamped directory? If so, is there some way to override the database used for the connection, like there is with save_to_database()? As with the latent space, maybe it's good to mention this for users if true?
Good question, yes it does, and yes, the user can specify a different database to connect to. I'll update the comments.
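The "newest timestamped directory by default, with an optional override" behavior described above can be sketched as a standalone helper. This is NOT Hyrax's implementation of `h.database_connection()`; `pick_results_dir` and the directory naming scheme are assumptions for illustration only.

```python
# Hedged sketch: not Hyrax's database_connection(), just an illustration
# of "pick the most recent timestamped run unless the user overrides it".
import tempfile
from pathlib import Path
from typing import Optional


def pick_results_dir(root: Path, override: Optional[str] = None) -> Path:
    """Return the override subdirectory if given, else the newest one."""
    if override is not None:
        return root / override
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    if not subdirs:
        raise FileNotFoundError(f"no results directories under {root}")
    # Timestamped names like YYYYMMDD-HHMMSS sort lexicographically in
    # chronological order, so max() picks the most recent run.
    return max(subdirs, key=lambda p: p.name)


root = Path(tempfile.mkdtemp())
(root / "20240101-120000").mkdir()
(root / "20240302-090000").mkdir()

latest = pick_results_dir(root)                      # newest run by default
pinned = pick_results_dir(root, "20240101-120000")   # explicit override
```

Exposing the override in the notebook, as discussed above, lets a user pin a connection to an older run instead of whatever ran last.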
SamSandwich07
left a comment
Left a couple of questions I'm wondering about. Also, would it be good to include a sentence or two about when a user might want to use ChromaDB vs. Qdrant? Or is that already covered elsewhere, or something the user is expected to know?
Pull request overview
This PR polishes the existing pre-executed vector database demo notebook and adds it to the documentation’s “Common workflows” index, addressing issue #730 (“Demo how to create and query a vector database”).
Changes:
- Updated the vector DB demo notebook narrative and restructured sections (backend selection, database creation, querying, visualization).
- Adjusted demo configuration (model name, epochs, and data_request shape) and streamlined some notebook outputs.
- Added the notebook to docs/common_workflows.rst to make it discoverable.
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/pre_executed/vector_db_demo.ipynb | Rewords and reorganizes the vector DB demo; updates config/code cells for model training, DB creation, queries, and visualization. |
| docs/common_workflows.rst | Adds the vector DB demo to the “Common workflows” toctree. |
```python
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "fields": ["image"],
            "primary_id_field": "object_id",
            "split_fraction": 1.0,
        },
    },
    "infer": {
        "data": {
            "dataset_class": "HyraxCifarDataset",
            "data_location": "./data",
            "fields": ["image", "object_id"],
            "primary_id_field": "object_id",
        },
    },
```
split_fraction is set for the train data_request group but not for the infer group while both share the same data_location ("./data"). Hyrax validates that if any group for a given data_location specifies split_fraction, then all groups for that data_location must specify it, and the fractions must sum to <= 1.0. As written, this configuration is expected to error at runtime when h.set_config("data_request", data_request) is called.
Suggested fix: remove split_fraction here, or switch fully to split_fraction-based partitioning by adding split_fraction to all relevant groups and ensuring the per-data_location fractions sum to <= 1.0.
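The validation rule described above can be sketched as a small standalone check. This `validate_split_fractions` helper is hypothetical, not Hyrax's actual validator; it only mirrors the stated rule: for each `data_location`, either no group sets `split_fraction` or all of them do, and the fractions must sum to at most 1.0.

```python
# Standalone sketch of the split_fraction rule described above;
# validate_split_fractions is NOT Hyrax's actual validation code.
from collections import defaultdict


def validate_split_fractions(data_request: dict) -> None:
    """Per data_location: all groups or none set split_fraction,
    and the fractions that are set must sum to <= 1.0."""
    by_location = defaultdict(list)
    for group in data_request.values():
        data = group.get("data", {})
        by_location[data.get("data_location")].append(data.get("split_fraction"))
    for location, fractions in by_location.items():
        set_fractions = [f for f in fractions if f is not None]
        if set_fractions and len(set_fractions) != len(fractions):
            raise ValueError(f"{location}: some groups set split_fraction, but not all")
        if sum(set_fractions) > 1.0:
            raise ValueError(f"{location}: split fractions sum to more than 1.0")


# Mirrors the config in the diff: "train" sets split_fraction, "infer"
# does not, and both share "./data" -- so this check raises.
bad = {
    "train": {"data": {"data_location": "./data", "split_fraction": 1.0}},
    "infer": {"data": {"data_location": "./data"}},
}
try:
    validate_split_fractions(bad)
    failed = False
except ValueError:
    failed = True
```

Either of the suggested fixes makes this check pass: drop `split_fraction` entirely, or set it on every group sharing the `data_location` with a total of at most 1.0.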
@aritraghsh09 The approach I've been taking (without talking about it) is to use CIFAR for the "common workflow" material, just because it's small and not specific to any one astronomy area. The science examples will then use real astro data. But to your point, if that seems wrong or unhelpful, we can definitely change it up.
Change Description
Closes #730
The vector database notebook was mostly ready to go; I just polished some of the wording. Also added it to the common workflows page.
FWIW, it's a pre-executed notebook, and we could probably export the inference results to a zipped file on Zenodo to save ourselves a bunch of cells at the beginning of the notebook. This could be an approach taken for a lot of notebooks...