Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: MAST query result cache support outline #1578

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

orionlee
Copy link
Contributor

Closes #1577 .

So far this is an outline of the proposed change. I'd like to use the PR to solicit feedback (and gauge the efforts required) before proceeding further.

@pep8speaks
Copy link

pep8speaks commented Oct 25, 2019

Hello @orionlee! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-05 00:33:35 UTC

@astropy-bot astropy-bot bot added the mast label Oct 25, 2019
@orionlee orionlee force-pushed the mast_query_result_cache_1577 branch from ed2de79 to e9ecbe5 Compare October 25, 2019 19:42
@orionlee
Copy link
Contributor Author

Proposed changes outline. @ceb8 @barentsen feedback welcome.

  1. mast.Observations.query_criteria_async() : add optional parameters cache=False, cache_opts=None
  2. The cache parameters propagate to a new function, MastClass._request_w_cache(), which wraps around existing MastClass._request() function and implements caching.
  3. Caching can be similarly exposed to other MAST query API functions.

Discussion Points:

  1. The actual caching mechanism: follow the base astropy.BaseQuery's pattern. See its BaseQuery._request() function:

    else:
    response = query.from_cache(self.cache_location)
    if not response:
    response = query.request(self._session,
    self.cache_location,
    stream=stream,
    auth=auth,
    verify=verify)
    to_cache(response, query.request_file(self.cache_location))
    self._last_query = query

  2. Caching policy (expiration, etc.): TBD. I don't know astropy API well enough
    yet, but it should reuse existing astropy convention when possible.

@codecov
Copy link

codecov bot commented Oct 25, 2019

Codecov Report

Merging #1578 (d949888) into main (df3b6fc) will decrease coverage by 6.08%.
The diff coverage is 100.00%.

❗ Current head d949888 differs from pull request most recent head 1aa7078. Consider uploading reports for the commit 1aa7078 to get more accurate results

@@            Coverage Diff             @@
##             main    #1578      +/-   ##
==========================================
- Coverage   69.18%   63.11%   -6.08%     
==========================================
  Files         304      133     -171     
  Lines       22529    17348    -5181     
==========================================
- Hits        15587    10949    -4638     
+ Misses       6942     6399     -543     
Impacted Files Coverage Δ
astroquery/mast/discovery_portal.py 69.65% <100.00%> (+3.14%) ⬆️
astroquery/mast/observations.py 76.01% <100.00%> (+0.59%) ⬆️
astroquery/query.py 58.97% <100.00%> (-3.43%) ⬇️

... and 245 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bsipocz bsipocz added this to the 0.4 milestone Oct 25, 2019
@bsipocz
Copy link
Member

bsipocz commented Oct 25, 2019

Caching policy (expiration, etc.): TBD. I don't know astropy API well enough
yet, but it should reuse existing astropy convention when possible.

we don't yet have a good solution to that.

astroquery/mast/core.py Outdated Show resolved Hide resolved
astroquery/mast/core.py Outdated Show resolved Hide resolved
@orionlee
Copy link
Contributor Author

orionlee commented Nov 5, 2019

Basic caching support (cache=True / False, on par with astroquery) is added to
mast.Observations.query_criteria_async() for review.

Please review before I proceed any further. It it's agreed, similar pattern can be applied to other MAST query functions.

Notes from my side:

  • a change is added to the generic AstroQuery.from_cache() function, to accommodate MAST response (which is a list of Response rather than a single one). It might be more desirable to refactor the caching logic so that it can be used independent of AstroQuery, so that MAST-specific behavior can be accommodated more easily without adding to base AstroQuery, or duplicating the caching logic.

@bsipocz
Copy link
Member

bsipocz commented Nov 6, 2019

I don't yet see why the tests are failing, it looks very unrelated and something I couldn't yet reproduce locally.

Also, I'm travelling this week, but will try to get back to this and do a reasonable review soon.

@orionlee
Copy link
Contributor Author

orionlee commented Nov 6, 2019

For failing tests: for now I assume that they are due to some transient issues with CI, and expect they will recover in the next push.

@barentsen
Copy link
Contributor

barentsen commented Nov 7, 2019

(Just wanted to subscribe and say I am super excited about this PR, and will be happy to help test.)

@ceb8
Copy link
Member

ceb8 commented Nov 7, 2019

I'm swamped through Monday, but should have time to test this later next week.

@bsipocz bsipocz modified the milestones: v0.4, v4 Jan 23, 2020
@christinahedges
Copy link

This implements a really useful feature. At the moment I frequently query MAST in order to get the same "search result" as part of my pipeline to iterate through a large list of objects, even though the files themselves are cached. Is there anyway I can help get this merged?

@ceb8
Copy link
Member

ceb8 commented Jan 29, 2020

@christinahedges I will try to make time next week to look at this PR.

I am also in general concerned with the lack of an expiration on the cache, which has bitten me before when for example a colleague and I queried Ned repeatedly, not understanding why we were getting different results and it turned out that one of us had made the same query a year prior (according to the date on the cache file) and were still accessing that very old search result. This has also come up within astroquery.mast with Tesscut which does use astroquery caching, resulting in users not finding new results (there are TESS releases ~monthly).

@christinahedges
Copy link

@ceb8 I think this is a really good point. In particular if users query for e.g. TESS data, there will potentially be new targets and files every month. For the most part, I want this functionality so that when I re-run my scripts 5 times a day, I'm not waiting for a result from MAST. If we had an expiration date of e.g. 7 days, that would totally meet my needs.

@ceb8
Copy link
Member

ceb8 commented Jan 29, 2020

@christinahedges Yeah, I think ideally the cache will have a default timeout on the short scale, the user will have the ability to set it to a user specified value, and there will always be a no-cache option.

@bsipocz
Copy link
Member

bsipocz commented Jan 29, 2020

@ceb8 - There is a cascade of issues for caching, so certainly there is a need to get the caching solved for many modules: #1511 #1531 and even a very old PR, that may or may not be suitable to be thawed up: #442

cc @keflavich @jwoillez

@ceb8
Copy link
Member

ceb8 commented Jan 29, 2020

@bsipocz I will try to look at the caching situation overall next week then, and gather what needs to be done and make a plan. 😃

@christinahedges
Copy link

@ceb8 and @bsipocz I really appreciate it! This will be such great functionality, and will hopefully stop me pinging MAST 100 times a day...! Let me know if I can do anything to help!

@ceb8
Copy link
Member

ceb8 commented Feb 5, 2020

@bsipocz @christinahedges Opened WIP PR #1634 to address the general caching situation. Will look at this PR next.

@christinahedges
Copy link

That is awesome thank you! I can't wait for this functionality! 🎉 🎉 🎉 🎉 🎉 🎉 🎉

@orionlee
Copy link
Contributor Author

per @ceb , this PR will wait until #1634 (cache expiration policy, etc.) completes.

@bsipocz bsipocz added the cache label May 16, 2022
@ceb8 ceb8 force-pushed the mast_query_result_cache_1577 branch from d1a8f66 to d949888 Compare September 27, 2022 16:44
@bsipocz
Copy link
Member

bsipocz commented Sep 27, 2022

@ceb8 - can you rebase this on top of #1634, so to see it works with the refactored machinery?

@bsipocz
Copy link
Member

bsipocz commented Sep 27, 2022

(then this PR will contain the changes from that one, etc, but we'll sort that out along the way).

Also, this PR is very much considered WIP/draft as part of testing the other one, so please everyone, no detailed code reviews until we sort out the big picture!

@bsipocz bsipocz changed the title MAST query result cache support outline WIP: MAST query result cache support outline Sep 27, 2022
@ceb8 ceb8 force-pushed the mast_query_result_cache_1577 branch from d949888 to 2d73cb9 Compare September 27, 2022 17:01
@bsipocz
Copy link
Member

bsipocz commented Sep 27, 2022

Hmm, I think something went bad during the rebase. did you try to do it interactively, and keep only the one commit from this branch, rather than the ones from main, too?

@ceb8 ceb8 force-pushed the mast_query_result_cache_1577 branch from 2d73cb9 to 441c246 Compare September 29, 2022 16:09
@ceb8
Copy link
Member

ceb8 commented Sep 29, 2022

@bsipocz Found some interesting things:

  • Only one small tweak (latest commit) required and it's caching as promised (yay)
  • Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.

@bsipocz
Copy link
Member

bsipocz commented Sep 29, 2022

Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.

can you elaborate on this with an example? There are problems with the MAST (and gemini) class naming already, I wonder whether this would be a non-issue if that one is solved or are unrelated.

@ceb8
Copy link
Member

ceb8 commented Sep 29, 2022

@bsipocz It's unrelated. Basically MAST hands off querying to a difference class depending on what interface is needed, and that class does the caching, rather than the top level class the user queried.

Example:

from astroquery.mast import Observations

obs_table = Observations.query_criteria(dataproduct_type=["image"],
                                        proposal_pi="Osten*",
                                        s_dec=[43.5,45.5], cache=True)

In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.

@bsipocz
Copy link
Member

bsipocz commented Sep 30, 2022

In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.

Yeap, this is pretty bad. At least the issue of having two Observations class is known and in the works to fix it with a very long deprecation, it still won't solve the main issue.

Would you think adding one or more layers help? Or maybe the full namespace would be the solution, as we have substructure in a few modules (though caching everything esa together, and everything ipac, and everything solarsystem, etc doesn't feel that wrong as having just one flat layer.

@bsipocz
Copy link
Member

bsipocz commented Sep 30, 2022

btw, this is exactly the issue that I hoped to be able to recover before merging the other PR.

@ceb8
Copy link
Member

ceb8 commented Sep 30, 2022

@bsipocz We could use the full namespace, but from a user perspective... I think it might make more sense to cache at one higher level i.e. all mast together, all esa together etc... There is no reason for example that a user should need to know which particular MAST interface they are accessing.

Also: an additional thing I just noticed is that I'm using a astroquery-wide conf class that I put in the top level __init__.py class. But each module also has a conf class, which is causes name conflicts. I could change the name of the upper level class, but I could also make the module level conf classes inherit from the higher up class. Thoughts? I feel like the latter option is cleaner and better but involves more changes to the codebase, and so has a higher potential of causing problems.

@bsipocz
Copy link
Member

bsipocz commented Sep 30, 2022

I think it might make more sense to cache at one higher level i.e. all mast together, all esa together etc... There is no reason for example that a user should need to know which particular MAST interface they are accessing.

yes, but mast is very much the odd one out, the whole structure of that module is totally different from the rest. The substructure I referred to in the previous comment is esa.hubble, esa.hsa, or ipac.irsa, ipac.ned.
Either case, I'm happy if the cache dir structure is just one layer deep, it will be an improvement from the current, absolutely flat one.

@ceb8
Copy link
Member

ceb8 commented Sep 30, 2022

@bsipocz I'm wondering if there is a way to use the module conf structure to facilitate this? So that individual modules can override the defaults when it makes sense.

@bsipocz
Copy link
Member

bsipocz commented Sep 30, 2022

I'm wondering if there is a way to use the module conf structure to facilitate this? So that individual modules can override the defaults when it makes sense.

yes, I suppose potentially that can work. Go the way which is the simplest implementation. I really think that a one layer is enough, especially if that's easier to do.

@ceb8
Copy link
Member

ceb8 commented Oct 3, 2022

@bsipocz So it turned out the mechanism for overriding the default behavior is already in place. The default cache location using the class property self.name which is set to self.__class__.__name__.split("Class")[0] so all a class has to do is overide that and the default cache location can be any subdirectory the developer wants. Because Mast uses mutiple classes for API access under the hood it's a little more complicated, but it can all be set up in the main Mast class in core, as demonstrated by my latest commit.

@ceb8 ceb8 force-pushed the mast_query_result_cache_1577 branch from d377e99 to 460fd24 Compare March 16, 2023 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MAST: support query result caching
7 participants