#81 Venomx integration #96

iQuxLE · 2024-10-11T16:32:31Z

This is an ongoing PR to migrate to the venomx metadata format, as described in #81.

…double

caufieldjh · 2024-10-23T15:20:03Z

Currently seeing these errors on my system:

============================================================================================== short test summary info ===============================================================================================
FAILED tests/cli/test_store_cli.py::test_store_management - AssertionError: assert 1 == 0
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[all-MiniLM-L6-v2-False-model-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[None-False-model-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[all-MiniLM-L6-v2-False-id-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[None-False-id-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_fetch_all_memory_safe - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_the_embedding_function_variations[None-None-False] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_the_embedding_function_variations[one_collection-None-False] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
ERROR tests/store/test_duckdb_adapter.py::test_diversified_search - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
==================================================================== 8 failed, 62 passed, 96 skipped, 8426 warnings, 1 error in 82.86s (0:01:22) =====================================================================
py: exit 1 (86.02 seconds) /home/harry/curate-gpt> poetry run pytest pid=269016
  py: FAIL code 1 (86.49=setup[0.48]+cmd[86.02] seconds)
  evaluation failed :( (86.69 seconds)

caufieldjh · 2024-10-23T15:24:52Z

We looked into the WAL errors before: #85 (comment)

iQuxLE · 2024-10-23T22:45:34Z

@caufieldjh
I cannot reproduce the error. It might come from a difference in how Linux and Windows handle WAL files.

File cleanup and file locking probably behave differently between the two, so your machine may find a leftover WAL file and try to recover from it without the extensions loaded (currently the connection is established first and the extensions are loaded only afterwards).

DuckDB's documentation covers this scenario for HNSW indexes and recommends loading the VSS extension before any database recovery is attempted. I'm implementing this pre-loading step to ensure consistent behavior across platforms, especially for test scenarios where databases are frequently created and reset.
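A minimal sketch of the pre-loading order described above, assuming the Python `duckdb` package and network access to install the `vss` extension. The helper names `preload_statements` and `open_with_vss` and the alias `store` are hypothetical and are not taken from this PR. The ordering (load `vss` on an in-memory connection, then `ATTACH` the file) follows the workaround given in the DuckDB VSS documentation.

```python
def preload_statements(db_path: str, alias: str = "store") -> list[str]:
    """Build the SQL that registers the HNSW index type before the file is opened."""
    if not alias.isidentifier():
        raise ValueError(f"alias must be a plain identifier, got {alias!r}")
    # Double any single quotes so the path is safe inside a SQL string literal.
    quoted = db_path.replace("'", "''")
    return [
        "INSTALL vss",
        "LOAD vss",
        f"ATTACH '{quoted}' AS {alias}",
        f"USE {alias}",
    ]


def open_with_vss(db_path: str):
    """Open db_path only after vss is loaded, so WAL replay can bind HNSW indexes."""
    import duckdb  # third-party dependency, imported lazily

    con = duckdb.connect(":memory:")
    for statement in preload_statements(db_path):
        con.execute(statement)
    return con
```

Connecting directly with `duckdb.connect(db_path)` would trigger WAL replay before `LOAD vss` could run, which matches the failure mode in the test log above.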

iQuxLE · 2024-10-24T07:20:41Z

@caufieldjh
I made some changes. Can you check whether anything is different on your side?

caufieldjh · 2024-10-24T14:35:16Z

OK - now the WAL errors in test_duckdb_adapter.py are all gone, once I remove all pre-existing test outputs.
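Removing stale test outputs by hand can be scripted. The following is a sketch only: `remove_stale_duckdb_files` is a hypothetical helper, not part of the repository, and the default file name is taken from the path in the error log above.

```python
from pathlib import Path


def remove_stale_duckdb_files(output_dir: Path, db_name: str = "duckdbvss.db") -> list[Path]:
    """Delete a leftover database file and its WAL; return the paths that were removed."""
    removed = []
    for candidate in (output_dir / db_name, output_dir / f"{db_name}.wal"):
        if candidate.is_file():
            candidate.unlink()
            removed.append(candidate)
    return removed
```

Calling it on `tests/output` before the DuckDB tests run (for example from a pytest fixture) would give each run a clean start, at the cost of discarding any database left from a previous run.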

Strangely, I'm still getting the test_store_cli AssertionError:

FAILED tests/cli/test_store_cli.py::test_store_management - AssertionError: assert 1 == 0

stack trace:

____________________________________________________________________________________ test_store_management ____________________________________________________________________________________

runner = <click.testing.CliRunner object at 0x7f53374c1db0>

    def test_store_management(runner):
        result = runner.invoke(
            main, ["ontology", "index", ONT_DB, "-D", "chromadb", "-m", "all-MiniLM-L6-v2", "-c", "oai"]
        )
        assert result.exit_code == 0
        result = runner.invoke(main, ["ontology", "index", ONT_DB, "-c", "default"])
        assert result.exit_code == 0
        result = runner.invoke(main, ["search", "-c", "default", "nuclear membrane"])
        assert result.exit_code == 0
        assert "nuclear membrane" in result.output
        result = runner.invoke(main, ["search", "-c", "oai", "nuclear membrane"])
        assert result.exit_code == 0
        assert "nuclear membrane" in result.output
        result = runner.invoke(main, ["collections", "list"])
>       assert result.exit_code == 0
E       AssertionError: assert 1 == 0
E        +  where 1 = <Result KeyError('_venomx')>.exit_code

tests/cli/test_store_cli.py:21: AssertionError
-------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection oai: Collection oai does not exist.
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection oai: Collection oai does not exist.
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection default: Collection default does not exist.
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection default: Collection default does not exist.

caufieldjh · 2024-10-24T14:50:00Z

Or the more useful stack trace:

$ curategpt collections list
Traceback (most recent call last):
  File "/home/harry/curate-gpt/.venv/bin/curategpt", line 6, in <module>
    sys.exit(main())
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/curate-gpt/src/curategpt/cli.py", line 2063, in list_collections
    cm = db.collection_metadata(cn, include_derived=derived)
  File "/home/harry/curate-gpt/src/curategpt/store/chromadb_adapter.py", line 298, in collection_metadata
    cm = Metadata.deserialize_venomx_metadata_from_adapter(metadata_data, self.name)
  File "/home/harry/curate-gpt/src/curategpt/store/metadata.py", line 49, in deserialize_venomx_metadata_from_adapter
    venomx_json = metadata_dict.pop("_venomx")
KeyError: '_venomx'
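The same traceback is also available inside the test, because click's `CliRunner` stores the captured exception on the result object (`exit_code`, `exc_info`, `output`). A small sketch, where `format_cli_failure` is a hypothetical helper and not part of the test suite:

```python
import traceback


def format_cli_failure(result) -> str:
    """Return a readable report for a failed click.testing.Result, or "" on success."""
    if result.exit_code == 0:
        return ""
    if result.exc_info is None:
        # No exception was captured: fall back to the exit code and captured output.
        return f"exit code {result.exit_code}\n{result.output}"
    # exc_info is a (type, value, traceback) triple.
    return "".join(traceback.format_exception(*result.exc_info))
```

Used as `assert result.exit_code == 0, format_cli_failure(result)`, a failing run would report the `KeyError` and its origin instead of only `assert 1 == 0`.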

caufieldjh · 2024-10-24T14:56:53Z

Ah, I think this is a me problem: I had a bunch of existing chromadb collections made before this PR, so they don't have any of the venomx metadata. The curategpt collections list command attempts to read all local collections, even when it's called in the tests.
I added a condition to deserialize_venomx_metadata_from_adapter() to catch this, though even with that in place, existing collections will still yield errors like this:

ERROR:curategpt.store.chromadb_adapter:Failed to get object count: "Metadata" object has no field "object_count"
## Collection: vbo_new N=19762 meta={'hnsw_space': 'cosine', 'model': 'openai:', 'name': 'vbo_new', 'object_type': 'OntologyClass'} // venomx=None hnsw_space='cosine' object_type='OntologyClass'
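A sketch of the kind of guard described here, assuming `_venomx` is stored as a JSON string in the adapter metadata (the variable name `venomx_json` in the traceback suggests this, but it is an assumption). `split_venomx` is a hypothetical standalone function, not the code added in the PR.

```python
import json
from typing import Any, Optional


def split_venomx(metadata_dict: dict[str, Any]) -> tuple[Optional[dict], dict[str, Any]]:
    """Return (venomx payload or None, remaining adapter metadata)."""
    remaining = dict(metadata_dict)  # copy so the caller's dict is not mutated
    # pop with a default: collections created before the migration have no "_venomx" key
    raw = remaining.pop("_venomx", None)
    venomx = json.loads(raw) if isinstance(raw, str) else raw
    return venomx, remaining
```

For a legacy collection this yields `None` for the venomx part, consistent with the `venomx=None` shown in the output above; code further down still has to cope with the missing fields.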

caufieldjh

Will need to mention in the release notes and docs that this is a breaking change re: older ChromaDB collections (but I think duckdb collections will be OK)

iQuxLE · 2024-10-25T16:28:14Z

🥳 Any ontology indexed with the index_ontology_command now carries a venomx Metadata object.
It should be fairly easy to extend this object with further data, for example ModelInputMethod.
The easiest way to do that would be a general --venomx-config option that passes a predefined YAML file via the CLI.

All fields previously carried by the old metadata are also carried by venomx.

iQuxLE added 16 commits October 11, 2024 17:25

venomx migration in db_adapter.py + chromadb_adapter.py

5ea5ef8

update index_ontology_command

29e066d

update CollectionMetdata -> Metadata for venomx

a8fe1de

start updating chroma tests

701f05b

tests for debugging

5cccb97

linting

d9ad0ec

making old CollectionMetadata available

9d021d5

linting

31f9039

adjust cli to select correct field from venomx

cbfb0d7

finish chromadb_adapter.py venomx update + tests

1676367

fix object type - test_store_cli.py acts not expected

98887a9

duckdb_adapter update

ccf0d0b

change to composition to have cleaner model output and not get Index double

2cbf4bc

bugfixes and test completion: finished adapters

b024230

update in memory adapter

ea43595

update metadata.py to only use venomx

7451f80

iQuxLE requested a review from caufieldjh October 23, 2024 14:38

iQuxLE added 4 commits October 23, 2024 16:35

bugfix search in chroma to take default model if no model given

c684b57

bugfix

4783f12

bugfix

ff08dc4

bugfix

f8db565

iQuxLE added 2 commits October 24, 2024 08:11

potentially handle WAL related errors

b9aa15e

lint

0c756ed

Read _venomx only if it exists

4f991a3

caufieldjh approved these changes Oct 24, 2024

iQuxLE added 2 commits October 25, 2024 16:54

bugfix

1815c8e

Merge remote-tracking branch 'origin/venomx' into venomx

f02aff0

caufieldjh merged commit 767d43f into monarch-initiative:main Nov 4, 2024
3 checks passed
