-
Notifications
You must be signed in to change notification settings - Fork 8
Updated ES search functions and authentication #22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few minor issues I see.
Mostly to do with the removed methods having been used in other parts of the project.
Though the gist of it seems to be fine. I haven't tested it out, but I'm sure it'll work if you've been using it at GSTT.
cogstack.py
Outdated
|
||
def get_docs_generator(self, index: List, query: Dict, es_gen_size: int=800, request_timeout: Optional[int] = 300): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bit that's using this:
https://github.com/CogStack/working_with_cogstack/blob/main/medcat/3_run_model/run_model.py#L73
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method has been removed as but can be brought back.
All of the new methods return a Pandas DataFrame but currently this function returns raw JSON which is converted to a list of tuples. It also uses ES _source object. In the new functions, I have excluded _source object from search results and only returning "fields", as recommended by Elastic. The problem is that all fields are arrays and values need to be joined in for the resulting DataFrame.
I think, it would be possible to change the implementation to use new methods but create tuples from DataFrame instead of brining the old function back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that's the right approach here.
I'm not saying that the methods I've tagged need to be reimplemented. All I'm trying to do is make sure that the code that uses them (i.e in the other notebooks and/or scripts) gets updated alongside the changes to cogstack.py
. I.e if someone uses the scripts we provide (after this change), they don't error out because the they are out of sync from the loaded module(s).
cogstack.py
Outdated
df = pd.DataFrame(temp_results) | ||
return df | ||
|
||
def DataFrame(self, index: str, columns: Optional[List[str]] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using eland DataFrame which is the same as Pandas DataFrame and can be re-implemented without eland.
cogstack.py
Outdated
api_key=api_key, | ||
verify_certs=False, | ||
timeout=timeout) | ||
apiKey: Dict = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generaly, we want snake_case
names for variables. So api_key
would make more sense.
It looks like the current implementation is still using the old CogStack text field: "body_analysed". It should probably be renamed to "document_Content" or not use any specific field names in the code here. |
The exepctation is generally that the user provides the correct fields they're interested in. I'm pretty sure With that said, if there's a more relevant, up to date example, we'd be better off using that indeed. |
Could you take a look regards to v8 in KCH @vladd-bit |
Feel free to merge this, we will eventually upgrade to ES9 anyway. |
separate methods for API and basic.
I have changed the logic for different types of authentication and removed the mixed approach (api key and basic) from constructor. Now there are two instance methods for corresponding auth type. This should make more obvious for users which type to use. This was discussed a few days ago. |
.gitignore
Outdated
@@ -11,3 +11,6 @@ data/cogstack_search_results/ | |||
|
|||
# Default environments | |||
venv | |||
|
|||
# pythin cache folder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pythin haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change makes a lot of sense!
Separating the authentication is good.
However, there's a few concerns I have:
- There is no guarantee that the user of the class actually does the authentication
- If they just init with hosts, they don't necessarily know that they need to call another method
- They will just see everything fail once they start to use the other methods
- I think it might be beneficial to have class methods that deal with the auth as well, something like:
@classmethod def with_basic_auth(cls, hosts: List[str], username: Optional[str] = None, password: Optional[str] = None) -> 'CogStack': cs = cls(hosts) cs.use_basic_auth(username, password) return cs @classmethod def with_api_key_auth(cls, hosts: List[str], api_key: Optional[Dict] = None) -> 'CogStack': cs = cls(hosts) cs.use_api_key_auth(api_key) return cs
- There's still a few places within the repo where the old init is used. We can't just update the class and not the examples / scripts
- https://github.com/vitaliyok/working_with_cogstack/blob/feature/access_cogstack/medcat/2_train_model/1_unsupervised_training/unsupervised_medcattraining.py#L27
- https://github.com/vitaliyok/working_with_cogstack/blob/feature/access_cogstack/medcat/3_run_model/run_model.py#L27
- https://github.com/vitaliyok/working_with_cogstack/blob/feature/access_cogstack/search/search_template.ipynb
- You've removed some customisability such as the
timeout
option. And now it's hard coded as default 300. Perhaps we could (at the very least) define this as a class level constant so someone could change it? I.eclass CogStack(object): ES_TIMEOUT = 300 # and just refer to `self.ES_TIMEOUT` in `__connect`
I have added the class methods, etc. |
…tack2 and added deprecation warning
For backward compatibility, there are now two version of search. The original one also has a deprecation warning. |
@vitaliyok could you take a look at the typing issues raised by the CI?
PS: The project is currently set to support python 3.9 through to 3.12. Which is in line with what the |
OK. I have tested with Python 3.9. Seems to be working but lets see what CI says this time. |
The CI still has issues:
|
Fixed CI build issues but will see if it works during the next build. |
Still reporting one issue:
|
Fixed the type issue raised by CI. 🤞 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks
Hi,
Here are proposed changes to ES search and authentications options: