Updated ES search functions and authentication #22

Open
wants to merge 10 commits into
base: main

vitaliyok

Hi,

Here are the proposed changes to the ES search and authentication options:

  1. Renamed and refactored the original search function
  2. Added a function that allows users to reuse the scroll id if the search fails
  3. Added a function that uses sorting options, as recommended by ES
  4. Added support for the API key object (with encoded API key) as generated by ES
  5. Added some more options for users to view the index mapping and available indices in the search template

Collaborator

@mart-r mart-r left a comment

I see a few minor issues, mostly to do with removed methods that are still used in other parts of the project.

That said, the gist of it seems fine. I haven't tested it, but I'm sure it works if you've been using it at GSTT.

cogstack.py Outdated

def get_docs_generator(self, index: List, query: Dict, es_gen_size: int=800, request_timeout: Optional[int] = 300):
Collaborator

Author

The method has been removed but can be brought back.
All of the new methods return a Pandas DataFrame, whereas this function currently returns raw JSON, which is converted to a list of tuples. It also uses the ES _source object. In the new functions, I have excluded the _source object from the search results and return only "fields", as recommended by Elastic. The problem is that all fields are arrays, so their values need to be joined for the resulting DataFrame.
I think it would be possible to change the implementation to use the new methods and create tuples from the DataFrame, instead of bringing the old function back.
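
To illustrate the point about "fields" values arriving as arrays, here is a minimal sketch of flattening hits into rows and then into tuples. The helper names `flatten_hit` and `rows_to_tuples`, the separator, and the sample hit are hypothetical and not taken from cogstack.py; it uses only the standard library, and the resulting rows could be passed to `pd.DataFrame(rows)`.

```python
from typing import Any, Dict, List, Sequence, Tuple


def flatten_hit(hit: Dict[str, Any], sep: str = ", ") -> Dict[str, Any]:
    """Turn one search hit into a flat row (hypothetical helper).

    Each entry under "fields" is a list, so its values are joined
    into a single string per column.
    """
    row: Dict[str, Any] = {"_id": hit.get("_id")}
    for name, values in hit.get("fields", {}).items():
        row[name] = sep.join(str(v) for v in values)
    return row


def rows_to_tuples(
    rows: List[Dict[str, Any]], columns: Sequence[str]
) -> List[Tuple[Any, ...]]:
    """Build tuples in a fixed column order (the order is an assumption)."""
    return [tuple(row.get(col) for col in columns) for row in rows]


# Made-up hit, shaped like a response that requested "fields" without "_source".
sample_hits = [
    {"_id": "1", "fields": {"patient_id": ["p1"], "tags": ["a", "b"]}},
]
rows = [flatten_hit(h) for h in sample_hits]
tuples = rows_to_tuples(rows, ["_id", "patient_id", "tags"])
```

Joining with a separator loses the distinction between single and multi-valued fields, so a caller that needs the original lists would have to keep them unjoined.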

Collaborator

Yes, I think that's the right approach here.

I'm not saying that the methods I've tagged need to be reimplemented. All I'm trying to do is make sure that the code that uses them (i.e. in the other notebooks and/or scripts) gets updated alongside the changes to cogstack.py. That way, if someone uses the scripts we provide after this change, they don't error out because they are out of sync with the loaded module(s).

cogstack.py Outdated
df = pd.DataFrame(temp_results)
return df

def DataFrame(self, index: str, columns: Optional[List[str]] = None):
Collaborator

Author

This uses the eland DataFrame, which mirrors the Pandas DataFrame API, so it can be re-implemented without eland.

cogstack.py Outdated
api_key=api_key,
verify_certs=False,
timeout=timeout)
apiKey: Dict = None):
Collaborator

Generally, we want snake_case names for variables, so api_key would make more sense.

@vitaliyok
Author

It looks like the current implementation is still using the old CogStack text field: "body_analysed". It should probably be renamed to "document_Content" or not use any specific field names in the code here.

@mart-r
Collaborator

mart-r commented Jul 15, 2025

> It looks like the current implementation is still using the old CogStack text field: "body_analysed". It should probably be renamed to "document_Content" or not use any specific field names in the code here.

The expectation is generally that the user provides the correct fields they're interested in. I'm pretty sure body_analysed just serves as an example.

That said, if there's a more relevant, up-to-date example, we'd indeed be better off using that.

@mart-r
Collaborator

mart-r commented Aug 4, 2025

@vladd-bit could you take a look at this with regard to v8 in KCH?

@vladd-bit
Member

Feel free to merge this, we will eventually upgrade to ES9 anyway.

separate methods for API and basic.
@vitaliyok
Author

I have changed the logic for the different types of authentication and removed the mixed approach (API key and basic) from the constructor. There are now two instance methods, one for each auth type. This should make it more obvious to users which type to use. This was discussed a few days ago.

.gitignore Outdated
@@ -11,3 +11,6 @@ data/cogstack_search_results/

# Default environments
venv

# pythin cache folder
Collaborator

pythin haha

Collaborator

@mart-r mart-r left a comment

I think this change makes a lot of sense!
Separating the authentication is good.

However, I have a few concerns:

  1. There is no guarantee that the user of the class actually does the authentication
    • If they just init with hosts, they don't necessarily know that they need to call another method
    • They will just see everything fail once they start to use the other methods
    • I think it might be beneficial to have class methods that deal with the auth as well, something like:
      @classmethod
      def with_basic_auth(cls, hosts: List[str], username: Optional[str] = None, password: Optional[str] = None) -> 'CogStack':
          cs = cls(hosts)
          cs.use_basic_auth(username, password)
          return cs
      @classmethod
      def with_api_key_auth(cls, hosts: List[str], api_key: Optional[Dict] = None) -> 'CogStack':
          cs = cls(hosts)
          cs.use_api_key_auth(api_key)
          return cs
  2. There are still a few places within the repo where the old init is used. We can't just update the class and not the examples / scripts
  3. You've removed some customisability, such as the timeout option, which is now hard-coded to a default of 300. Perhaps we could (at the very least) define this as a class-level constant so someone could change it? I.e.:
    class CogStack(object):
        ES_TIMEOUT = 300
        # and just refer to `self.ES_TIMEOUT` in `__connect`
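
The suggestions in this list can be combined into one runnable sketch. Everything here is an assumption made for illustration: `CogStackSketch`, `require_auth` and `ShortTimeoutCogStack` are hypothetical names, and no Elasticsearch client is created, so it only shows the shape of the class-method factory, the class-level timeout, and an early failure when authentication was skipped.

```python
from typing import Dict, List, Optional


class CogStackSketch:
    """Illustrative stand-in for the CogStack class; it never opens a connection."""

    ES_TIMEOUT = 300  # class-level default that a subclass or caller can change

    def __init__(self, hosts: List[str]) -> None:
        self.hosts = hosts
        self._settings: Optional[Dict[str, object]] = None

    def use_basic_auth(self, username: str, password: str) -> None:
        # A real implementation would build the Elasticsearch client here.
        self._settings = {
            "basic_auth": (username, password),
            "timeout": self.ES_TIMEOUT,
        }

    @classmethod
    def with_basic_auth(
        cls, hosts: List[str], username: str, password: str
    ) -> "CogStackSketch":
        cs = cls(hosts)
        cs.use_basic_auth(username, password)
        return cs

    def require_auth(self) -> Dict[str, object]:
        # Fail early with a clear message instead of an obscure error later on.
        if self._settings is None:
            raise RuntimeError("Not authenticated: call use_basic_auth() first")
        return self._settings


class ShortTimeoutCogStack(CogStackSketch):
    ES_TIMEOUT = 60  # override the timeout without touching the constructor


cs = ShortTimeoutCogStack.with_basic_auth(["https://localhost:9200"], "user", "secret")
settings = cs.require_auth()
```

Because the factory uses `cls`, a subclass that overrides `ES_TIMEOUT` gets its own value without any extra constructor arguments.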

@vitaliyok
Author

I have added the class methods, etc.
Regarding the references to the old implementation in other scripts: I have not included the old functionality in this version. As we discussed earlier, it might be better to review these scripts in general and decide whether to replicate the functionality, rewrite the whole thing, or keep the old version for existing references and use the new one only to read data. Maybe the Notebook script could reference a specific version.

@vitaliyok
Author

For backward compatibility, there are now two versions of search. The original one also has a deprecation warning.
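
As a rough illustration of keeping a legacy entry point alive with a deprecation warning, a sketch using the standard `warnings` module might look like the following. The function names `search` and `search_v2` and their signatures are hypothetical and do not reflect the actual methods in this PR.

```python
import warnings
from typing import Any, Dict, List


def search_v2(index: str, query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical new search entry point; returns an empty result here."""
    return []


def search(index: str, query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hypothetical legacy entry point kept for backward compatibility."""
    warnings.warn(
        "search() is deprecated; use search_v2() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return search_v2(index, query)
```

Existing scripts keep working, while users who surface `DeprecationWarning` (for example under a test runner) are told which function to move to.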

@mart-r
Collaborator

mart-r commented Aug 13, 2025

@vitaliyok could you take a look at the typing issues raised by the CI?
I've copied it here:

cogstack2.py:150: error: Incompatible types in assignment (expression has type "Union[Any, str]", variable has type "Optional[dict[Any, Any]]")  [assignment]
cogstack2.py:152: error: Argument "api_key" to "__connect" of "CogStack" has incompatible type "Union[Any, str, tuple[Union[Any, str], dict[Any, Any]], None]"; expected "Union[str, tuple[str, str], None]"  [arg-type]
cogstack2.py:154: error: X | Y syntax for unions requires Python 3.10  [syntax]
cogstack2.py:200: error: X | Y syntax for unions requires Python 3.10  [syntax]
cogstack2.py:237: error: X | Y syntax for unions requires Python 3.10  [syntax]
cogstack2.py:263: error: Incompatible default for argument "include_fields" (default has type "None", argument has type "list[str]")  [assignment]
cogstack2.py:263: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
cogstack2.py:263: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
cogstack2.py:321: error: Unexpected keyword argument "request_timeout" for "count" of "Elasticsearch"  [call-arg]
cogstack2.py:337: error: X | Y syntax for unions requires Python 3.10  [syntax]
cogstack2.py:441: error: X | Y syntax for unions requires Python 3.10  [syntax]
cogstack2.py:445: error: X | Y syntax for unions requires Python 3.10  [syntax]
cogstack2.py:446: error: X | Y syntax for unions requires Python 3.10  [syntax]
/opt/hostedtoolcache/Python/3.9.23/x64/lib/python3.9/site-packages/elasticsearch/_sync/client/__init__.py:953: note: "count" of "Elasticsearch" defined here
cogstack2.py:403: error: Argument "fields" to "search" of "Elasticsearch" has incompatible type "Optional[list[str]]"; expected "Optional[Sequence[Mapping[str, Any]]]"  [arg-type]
cogstack2.py:511: error: Item "dict[Any, Any]" of "Union[dict[Any, Any], list[str]]" has no attribute "append"  [union-attr]
cogstack2.py:519: error: Argument "fields" to "search" of "Elasticsearch" has incompatible type "Optional[list[str]]"; expected "Optional[Sequence[Mapping[str, Any]]]"  [arg-type]

PS: The project is currently set to support Python 3.9 through 3.12, which is in line with what the medcat library supports.
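
The `X | Y` and implicit-Optional errors in the log above can typically be resolved on Python 3.9 with `typing.Optional` and `typing.Union`. A minimal sketch follows; `build_query` and its parameters are hypothetical and not taken from cogstack2.py.

```python
from typing import Dict, List, Optional, Union


def build_query(
    # Explicit Optional instead of `List[str] = None` (implicit Optional).
    include_fields: Optional[List[str]] = None,
    # Union[...] instead of `str | Dict[str, str] | None`, which needs 3.10.
    api_key: Union[str, Dict[str, str], None] = None,
) -> Dict[str, object]:
    """Hypothetical function showing annotations that type-check on 3.9."""
    return {"fields": include_fields or [], "has_key": api_key is not None}
```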

@vitaliyok
Author

OK. I have tested with Python 3.9. It seems to be working, but let's see what CI says this time.

@mart-r
Collaborator

mart-r commented Aug 14, 2025

The CI still has issues:

cogstack2.py:171: error: Argument "api_key" to "__connect" of "CogStack" has incompatible type "Any | str | tuple[Any | str | None, Any | str | None] | None"; expected "str | tuple[str, str] | None"  [arg-type]
cogstack2.py:191: error: Incompatible types in assignment (expression has type "Elasticsearch", variable has type "None")  [assignment]
cogstack2.py:196: error: "None" has no attribute "ping"  [attr-defined]
cogstack2.py:243: error: "None" has no attribute "indices"  [attr-defined]
cogstack2.py:286: error: "None" has no attribute "count"  [attr-defined]
cogstack2.py:350: error: Argument 1 to "scan" has incompatible type "None"; expected "Elasticsearch"  [arg-type]
cogstack2.py:358: error: "None" has no attribute "count"  [attr-defined]
cogstack2.py:365: error: Item "None" of "tqdm[dict[str, Any]] | None" has no attribute "bar_format"  [union-attr]
cogstack2.py:368: error: Item "None" of "tqdm[dict[str, Any]] | None" has no attribute "set_description"  [union-attr]
cogstack2.py:451: error: Incompatible types in assignment (expression has type "list[str] | None", variable has type "Sequence[Mapping[str, Any]]")  [assignment]
cogstack2.py:458: error: "None" has no attribute "search"  [attr-defined]
cogstack2.py:476: error: "None" has no attribute "scroll"  [attr-defined]
cogstack2.py:484: error: "None" has no attribute "clear_scroll"  [attr-defined]
cogstack2.py:491: error: "None" has no attribute "clear_scroll"  [attr-defined]
cogstack2.py:567: error: Incompatible types in assignment (expression has type "list[str] | None", variable has type "Sequence[Mapping[str, Any]]")  [assignment]
cogstack2.py:585: error: "None" has no attribute "search"  [attr-defined]
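
Errors such as `"None" has no attribute "ping"` commonly appear when an attribute is set to `None` in `__init__` without an `Optional[...]` annotation, so mypy infers its type as `None`. Whether that was the cause here is an assumption. A minimal sketch of one way to handle it, using a hypothetical `FakeClient` in place of the Elasticsearch client so that it runs offline:

```python
from typing import Optional


class FakeClient:
    """Stand-in for the Elasticsearch client (hypothetical)."""

    def ping(self) -> bool:
        return True


class Connector:
    def __init__(self) -> None:
        # Annotating as Optional lets mypy accept a later assignment of a client.
        self._client: Optional[FakeClient] = None

    def connect(self) -> None:
        self._client = FakeClient()

    @property
    def client(self) -> FakeClient:
        # Narrow Optional[FakeClient] to FakeClient in one place.
        if self._client is None:
            raise RuntimeError("connect() must be called before using the client")
        return self._client

    def is_alive(self) -> bool:
        return self.client.ping()
```

Routing every call through the `client` property keeps the `None` check in a single spot, which also gives users a clear message if they skipped the connection step.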

@vitaliyok
Author

Fixed CI build issues but will see if it works during the next build.

@mart-r
Collaborator

mart-r commented Aug 14, 2025

Still reporting one issue:

cogstack2.py:356: error: Incompatible types in assignment (expression has type "None", variable has type "tqdm[Any]")  [assignment]

@vitaliyok
Author

Fixed the type issue raised by CI. 🤞

Collaborator

@mart-r mart-r left a comment

Looks good to me! Thanks
