Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#81 Venomx integration #96

Merged
merged 25 commits into from
Nov 4, 2024
Merged

#81 Venomx integration #96

merged 25 commits into from
Nov 4, 2024

Conversation

iQuxLE
Copy link
Member

@iQuxLE iQuxLE commented Oct 11, 2024

this is an ongoing PR to migrate to the venomx metadata format as described in #81

  • update db_adapter
  • migrate into chromadb_adapter
  • update chromadb_adapter_tests
  • update duckdb_adapter
  • update duckdb_adapter tests
  • update in_memory_adapter
  • update in memory_adapter_tests
  • check all other tests if needed

@iQuxLE iQuxLE requested a review from caufieldjh October 23, 2024 14:38
@caufieldjh
Copy link
Member

Currently seeing these errors on my system:

============================================================================================== short test summary info ===============================================================================================
FAILED tests/cli/test_store_cli.py::test_store_management - AssertionError: assert 1 == 0
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[all-MiniLM-L6-v2-False-model-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[None-False-model-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[all-MiniLM-L6-v2-False-id-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_store_variations[None-False-id-True] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_fetch_all_memory_safe - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_the_embedding_function_variations[None-None-False] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
FAILED tests/store/test_duckdb_adapter.py::test_the_embedding_function_variations[one_collection-None-False] - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
ERROR tests/store/test_duckdb_adapter.py::test_diversified_search - duckdb.duckdb.Error: Failure while replaying WAL file "/home/harry/curate-gpt/tests/output/duckdbvss.db.wal": Cannot bind index 'test_collection', unknown index type 'HNSW'. You need to load the extension that...
==================================================================== 8 failed, 62 passed, 96 skipped, 8426 warnings, 1 error in 82.86s (0:01:22) =====================================================================
py: exit 1 (86.02 seconds) /home/harry/curate-gpt> poetry run pytest pid=269016
  py: FAIL code 1 (86.49=setup[0.48]+cmd[86.02] seconds)
  evaluation failed :( (86.69 seconds)

@caufieldjh
Copy link
Member

We looked into the WAL errors before: #85 (comment)

@iQuxLE
Copy link
Member Author

iQuxLE commented Oct 23, 2024

@caufieldjh
I cannot reproduce the error. It might be a difference of how Linux and Windows are dealing with WAL files.

There must be different behaviours regarding file-clean ups/ file locks and your machine might see the WAL and wants to recover but does not have the extensions loaded (because by now the connection is first established and than the extensions are loaded)

DuckDB's documentation mentions this scenario with HNSW indexes, recommending to load the VSS extension before any database recovery attempts. I'm implementing this pre-loading step to ensure consistent behavior across platforms, especially for test scenarios where databases are frequently created and reset.

@iQuxLE
Copy link
Member Author

iQuxLE commented Oct 24, 2024

@caufieldjh
made some changes, can you check if there is a difference on your side?

@caufieldjh
Copy link
Member

OK - now the WAL errors in test_duckdb_adapter.py are all gone, once I remove all pre-existing test outputs.

Strangely, I'm still getting the test_store_cli AssertionError:

FAILED tests/cli/test_store_cli.py::test_store_management - AssertionError: assert 1 == 0

stack trace:

____________________________________________________________________________________ test_store_management ____________________________________________________________________________________

runner = <click.testing.CliRunner object at 0x7f53374c1db0>

    def test_store_management(runner):
        result = runner.invoke(
            main, ["ontology", "index", ONT_DB, "-D", "chromadb", "-m", "all-MiniLM-L6-v2", "-c", "oai"]
        )
        assert result.exit_code == 0
        result = runner.invoke(main, ["ontology", "index", ONT_DB, "-c", "default"])
        assert result.exit_code == 0
        result = runner.invoke(main, ["search", "-c", "default", "nuclear membrane"])
        assert result.exit_code == 0
        assert "nuclear membrane" in result.output
        result = runner.invoke(main, ["search", "-c", "oai", "nuclear membrane"])
        assert result.exit_code == 0
        assert "nuclear membrane" in result.output
        result = runner.invoke(main, ["collections", "list"])
>       assert result.exit_code == 0
E       AssertionError: assert 1 == 0
E        +  where 1 = <Result KeyError('_venomx')>.exit_code

tests/cli/test_store_cli.py:21: AssertionError
-------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection oai: Collection oai does not exist.
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection oai: Collection oai does not exist.
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection default: Collection default does not exist.
ERROR    curategpt.store.chromadb_adapter:chromadb_adapter.py:293 Failed to get collection default: Collection default does not exist.

@caufieldjh
Copy link
Member

Or the more useful stack trace:

$ curategpt collections list
Traceback (most recent call last):
  File "/home/harry/curate-gpt/.venv/bin/curategpt", line 6, in <module>
    sys.exit(main())
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/curate-gpt/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/curate-gpt/src/curategpt/cli.py", line 2063, in list_collections
    cm = db.collection_metadata(cn, include_derived=derived)
  File "/home/harry/curate-gpt/src/curategpt/store/chromadb_adapter.py", line 298, in collection_metadata
    cm = Metadata.deserialize_venomx_metadata_from_adapter(metadata_data, self.name)
  File "/home/harry/curate-gpt/src/curategpt/store/metadata.py", line 49, in deserialize_venomx_metadata_from_adapter
    venomx_json = metadata_dict.pop("_venomx")
KeyError: '_venomx'

@caufieldjh
Copy link
Member

Ah, I think this is a me problem: I had a bunch of existing chromadb collections made before this PR, so they don't have any of the venomx metadata. The curategpt collections list attempts to read all local collections, even when it's called in the tests.
I added a condition to deserialize_venomx_metadata_from_adapter() to catch this, though even with that in place, existing collections will still yield errors like this:

ERROR:curategpt.store.chromadb_adapter:Failed to get object count: "Metadata" object has no field "object_count"
## Collection: vbo_new N=19762 meta={'hnsw_space': 'cosine', 'model': 'openai:', 'name': 'vbo_new', 'object_type': 'OntologyClass'} // venomx=None hnsw_space='cosine' object_type='OntologyClass'

Copy link
Member

@caufieldjh caufieldjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to mention in the release notes and docs that this is a breaking change re: older ChromaDB collections (but I think duckdb collections will be OK)

@iQuxLE
Copy link
Member Author

iQuxLE commented Oct 25, 2024

🥳 Any ontology indexed with the index_ontology_command carries a venomx Metadata object now.
It should be fairly easy to update this object to carry more potential data as for example ModelInputMethod.
But the easiest would be to do this with a general --venomx-config command that would carry a predefined YAML via the CLI.

All fields that were prior carried by the old metadata are also carried with venomx.

@caufieldjh caufieldjh merged commit 767d43f into monarch-initiative:main Nov 4, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants