WIP: MAST query result cache support outline #1578

orionlee · 2019-10-25T19:41:09Z

Closes #1577 .

So far this is an outline of the proposed change. I'd like to use the PR to solicit feedback (and gauge the efforts required) before proceeding further.

pep8speaks · 2019-10-25T19:41:13Z

Hello @orionlee! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-05 00:33:35 UTC

orionlee · 2019-10-25T20:11:28Z

Proposed changes outline. @ceb8 @barentsen feedback welcome.

mast.Observations.query_criteria_async() : add optional parameters cache=False, cache_opts=None
The cache parameters propagate to a new function, MastClass._request_w_cache(), which wraps around existing MastClass._request() function and implements caching.
Caching can be similarly exposed to other MAST query API functions.

Discussion Points:

The actual caching mechanism: follow the base astropy.BaseQuery's pattern. See its BaseQuery._request() function:

astroquery/astroquery/query.py

Lines 239 to 248 in ebdf195

    
           else: 
        
               response = query.from_cache(self.cache_location) 
        
               if not response: 
        
                   response = query.request(self._session, 
        
                                            self.cache_location, 
        
                                            stream=stream, 
        
                                            auth=auth, 
        
                                            verify=verify) 
        
                   to_cache(response, query.request_file(self.cache_location)) 
        
           self._last_query = query

Caching policy (expiration, etc.): TBD. I don't know astropy API well enough
yet, but it should reuse existing astropy convention when possible.

codecov · 2019-10-25T20:23:44Z

Codecov Report

Merging #1578 (d949888) into main (df3b6fc) will decrease coverage by 6.08%.
The diff coverage is 100.00%.

❗ Current head d949888 differs from pull request most recent head 1aa7078. Consider uploading reports for the commit 1aa7078 to get more accurate results

@@            Coverage Diff             @@
##             main    #1578      +/-   ##
==========================================
- Coverage   69.18%   63.11%   -6.08%     
==========================================
  Files         304      133     -171     
  Lines       22529    17348    -5181     
==========================================
- Hits        15587    10949    -4638     
+ Misses       6942     6399     -543

Impacted Files	Coverage Δ
astroquery/mast/discovery_portal.py	`69.65% <100.00%> (+3.14%)`	⬆️
astroquery/mast/observations.py	`76.01% <100.00%> (+0.59%)`	⬆️
astroquery/query.py	`58.97% <100.00%> (-3.43%)`	⬇️

... and 245 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

bsipocz · 2019-10-25T21:44:19Z

Caching policy (expiration, etc.): TBD. I don't know astropy API well enough
yet, but it should reuse existing astropy convention when possible.

we don't yet have a good solution to that.

astroquery/mast/core.py

orionlee · 2019-11-05T00:46:26Z

Basic caching support (cache=True / False, on par with astroquery) is added to
mast.Observations.query_criteria_async() for review.

Please review before I proceed any further. It it's agreed, similar pattern can be applied to other MAST query functions.

Notes from my side:

a change is added to the generic AstroQuery.from_cache() function, to accommodate MAST response (which is a list of Response rather than a single one). It might be more desirable to refactor the caching logic so that it can be used independent of AstroQuery, so that MAST-specific behavior can be accommodated more easily without adding to base AstroQuery, or duplicating the caching logic.

bsipocz · 2019-11-06T22:09:45Z

I don't yet see why the tests are failing, it looks very unrelated and something I couldn't yet reproduce locally.

Also, I'm travelling this week, but will try to get back to this and do a reasonable review soon.

orionlee · 2019-11-06T23:57:09Z

For failing tests: for now I assume that they are due to some transient issues with CI, and expect they will recover in the next push.

barentsen · 2019-11-07T03:44:28Z

(Just wanted to subscribe and say I am super excited about this PR, and will be happy to help test.)

ceb8 · 2019-11-07T13:53:57Z

I'm swamped through Monday, but should have time to test this later next week.

christinahedges · 2020-01-29T18:36:24Z

This implements a really useful feature. At the moment I frequently query MAST in order to get the same "search result" as part of my pipeline to iterate through a large list of objects, even though the files themselves are cached. Is there anyway I can help get this merged?

ceb8 · 2020-01-29T19:08:03Z

@christinahedges I will try to make time next week to look at this PR.

I am also in general concerned with the lack of an expiration on the cache, which has bitten me before when for example a colleague and I queried Ned repeatedly, not understanding why we were getting different results and it turned out that one of us had made the same query a year prior (according to the date on the cache file) and were still accessing that very old search result. This has also come up within astroquery.mast with Tesscut which does use astroquery caching, resulting in users not finding new results (there are TESS releases ~monthly).

christinahedges · 2020-01-29T19:11:45Z

@ceb8 I think this is a really good point. In particular if users query for e.g. TESS data, there will potentially be new targets and files every month. For the most part, I want this functionality so that when I re-run my scripts 5 times a day, I'm not waiting for a result from MAST. If we had an expiration date of e.g. 7 days, that would totally meet my needs.

ceb8 · 2020-01-29T19:14:17Z

@christinahedges Yeah, I think ideally the cache will have a default timeout on the short scale, the user will have the ability to set it to a user specified value, and there will always be a no-cache option.

bsipocz · 2020-01-29T20:11:23Z

@ceb8 - There is a cascade of issues for caching, so certainly there is a need to get the caching solved for many modules: #1511 #1531 and even a very old PR, that may or may not be suitable to be thawed up: #442

cc @keflavich @jwoillez

ceb8 · 2020-01-29T21:32:33Z

@bsipocz I will try to look at the caching situation overall next week then, and gather what needs to be done and make a plan. 😃

christinahedges · 2020-01-29T21:39:01Z

@ceb8 and @bsipocz I really appreciate it! This will be such great functionality, and will hopefully stop me pinging MAST 100 times a day...! Let me know if I can do anything to help!

ceb8 · 2020-02-05T15:05:01Z

@bsipocz @christinahedges Opened WIP PR #1634 to address the general caching situation. Will look at this PR next.

christinahedges · 2020-02-05T18:07:23Z

That is awesome thank you! I can't wait for this functionality! 🎉 🎉 🎉 🎉 🎉 🎉 🎉

orionlee · 2020-08-26T00:58:25Z

per @ceb , this PR will wait until #1634 (cache expiration policy, etc.) completes.

bsipocz · 2022-09-27T16:56:09Z

@ceb8 - can you rebase this on top of #1634, so to see it works with the refactored machinery?

bsipocz · 2022-09-27T16:58:05Z

(then this PR will contain the changes from that one, etc, but we'll sort that out along the way).

Also, this PR is very much considered WIP/draft as part of testing the other one, so please everyone, no detailed code reviews until we sort out the big picture!

bsipocz · 2022-09-27T17:35:02Z

Hmm, I think something went bad during the rebase. did you try to do it interactively, and keep only the one commit from this branch, rather than the ones from main, too?

ceb8 · 2022-09-29T16:55:44Z

@bsipocz Found some interesting things:

Only one small tweak (latest commit) required and it's caching as promised (yay)
Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.

bsipocz · 2022-09-29T17:01:21Z

Because of the way we determine cache location and the way the MAST module is written, the cache directory is not the name of the top level class the user interacts with, which is definitely a problem. Not sure if this should be solved by the MAST module or if I need to tweak the caching infrastructure. I will explore.

can you elaborate on this with an example? There are problems with the MAST (and gemini) class naming already, I wonder whether this would be a non-issue if that one is solved or are unrelated.

ceb8 · 2022-09-29T17:08:26Z

@bsipocz It's unrelated. Basically MAST hands off querying to a difference class depending on what interface is needed, and that class does the caching, rather than the top level class the user queried.

Example:

from astroquery.mast import Observations

obs_table = Observations.query_criteria(dataproduct_type=["image"],
                                        proposal_pi="Osten*",
                                        s_dec=[43.5,45.5], cache=True)

In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.

bsipocz · 2022-09-30T02:56:23Z

In this example you would expect the cache location to be Observations.cache_location (.astropy/cache/astroquery/Observations) but it actually caches the file in .astropy/cache/astroquery/PortalAPI because the PortalAPI class is the one that handles this type of query.

Yeap, this is pretty bad. At least the issue of having two Observations class is known and in the works to fix it with a very long deprecation, it still won't solve the main issue.

Would you think adding one or more layers help? Or maybe the full namespace would be the solution, as we have substructure in a few modules (though caching everything esa together, and everything ipac, and everything solarsystem, etc doesn't feel that wrong as having just one flat layer.

bsipocz · 2022-09-30T02:57:09Z

btw, this is exactly the issue that I hoped to be able to recover before merging the other PR.

ceb8 · 2022-09-30T13:52:23Z

@bsipocz We could use the full namespace, but from a user perspective... I think it might make more sense to cache at one higher level i.e. all mast together, all esa together etc... There is no reason for example that a user should need to know which particular MAST interface they are accessing.

Also: an additional thing I just noticed is that I'm using a astroquery-wide conf class that I put in the top level __init__.py class. But each module also has a conf class, which is causes name conflicts. I could change the name of the upper level class, but I could also make the module level conf classes inherit from the higher up class. Thoughts? I feel like the latter option is cleaner and better but involves more changes to the codebase, and so has a higher potential of causing problems.

bsipocz · 2022-09-30T15:01:52Z

I think it might make more sense to cache at one higher level i.e. all mast together, all esa together etc... There is no reason for example that a user should need to know which particular MAST interface they are accessing.

yes, but mast is very much the odd one out, the whole structure of that module is totally different from the rest. The substructure I referred to in the previous comment is esa.hubble, esa.hsa, or ipac.irsa, ipac.ned.
Either case, I'm happy if the cache dir structure is just one layer deep, it will be an improvement from the current, absolutely flat one.

ceb8 · 2022-09-30T15:07:42Z

@bsipocz I'm wondering if there is a way to use the module conf structure to facilitate this? So that individual modules can override the defaults when it makes sense.

bsipocz · 2022-09-30T15:17:54Z

I'm wondering if there is a way to use the module conf structure to facilitate this? So that individual modules can override the defaults when it makes sense.

yes, I suppose potentially that can work. Go the way which is the simplest implementation. I really think that a one layer is enough, especially if that's easier to do.

ceb8 · 2022-10-03T16:57:18Z

@bsipocz So it turned out the mechanism for overriding the default behavior is already in place. The default cache location using the class property self.name which is set to self.__class__.__name__.split("Class")[0] so all a class has to do is overide that and the default cache location can be any subdirectory the developer wants. Because Mast uses mutiple classes for API access under the hood it's a little more complicated, but it can all be set up in the main Mast class in core, as demonstrated by my latest commit.

astropy-bot bot added the mast label Oct 25, 2019

orionlee force-pushed the mast_query_result_cache_1577 branch from ed2de79 to e9ecbe5 Compare October 25, 2019 19:42

bsipocz added this to the 0.4 milestone Oct 25, 2019

bsipocz reviewed Oct 25, 2019

View reviewed changes

astroquery/mast/core.py Outdated Show resolved Hide resolved

bsipocz reviewed Oct 25, 2019

View reviewed changes

astroquery/mast/core.py Outdated Show resolved Hide resolved

bsipocz modified the milestones: v0.4, v4 Jan 23, 2020

barentsen mentioned this pull request Feb 4, 2020

TESScut download slow when running a second time lightkurve/lightkurve#474

Closed

warrickball mentioned this pull request Mar 12, 2020

Use lightkurve's cache instead of maintaining our own grd349/PBjam#230

Closed

orionlee mentioned this pull request Aug 25, 2020

ENH: Cache refactoring #1634

Merged

bsipocz added the cache label May 16, 2022

ceb8 force-pushed the mast_query_result_cache_1577 branch from d1a8f66 to d949888 Compare September 27, 2022 16:44

bsipocz changed the title ~~MAST query result cache support outline~~ WIP: MAST query result cache support outline Sep 27, 2022

ceb8 force-pushed the mast_query_result_cache_1577 branch from d949888 to 2d73cb9 Compare September 27, 2022 17:01

ceb8 force-pushed the mast_query_result_cache_1577 branch from 2d73cb9 to 441c246 Compare September 29, 2022 16:09

jwoillez and others added 11 commits March 16, 2023 14:37

setup.py egg_info fix

70ddd89

additional caching functionality

c3c55a3

rethinking some of the caching

33840de

pep8

1e31d39

cleanup after rebase

5442b54

regularizing conf usage ande fixing suspend cache function

2589b8a

integrating better with astropy config framework

b55eb43

making new caching backwards compatible

7ab4992

MAST query result cache: Observations.query_criteria()

9cb1b84

adjsut for new cache work

23affe0

make path location get/set work properly

460fd24

ceb8 force-pushed the mast_query_result_cache_1577 branch from d377e99 to 460fd24 Compare March 16, 2023 15:42

making caching default and controllable with existing infrastructure

1aa7078

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: MAST query result cache support outline #1578

WIP: MAST query result cache support outline #1578

orionlee commented Oct 25, 2019

pep8speaks commented Oct 25, 2019 •

edited

Loading

orionlee commented Oct 25, 2019

codecov bot commented Oct 25, 2019 •

edited

Loading

bsipocz commented Oct 25, 2019

orionlee commented Nov 5, 2019

bsipocz commented Nov 6, 2019 •

edited

Loading

orionlee commented Nov 6, 2019

barentsen commented Nov 7, 2019 •

edited

Loading

ceb8 commented Nov 7, 2019

christinahedges commented Jan 29, 2020

ceb8 commented Jan 29, 2020

christinahedges commented Jan 29, 2020

ceb8 commented Jan 29, 2020

bsipocz commented Jan 29, 2020

ceb8 commented Jan 29, 2020

christinahedges commented Jan 29, 2020

ceb8 commented Feb 5, 2020

christinahedges commented Feb 5, 2020

orionlee commented Aug 26, 2020

bsipocz commented Sep 27, 2022

bsipocz commented Sep 27, 2022

bsipocz commented Sep 27, 2022

ceb8 commented Sep 29, 2022

bsipocz commented Sep 29, 2022

ceb8 commented Sep 29, 2022

bsipocz commented Sep 30, 2022

bsipocz commented Sep 30, 2022

ceb8 commented Sep 30, 2022 •

edited

Loading

bsipocz commented Sep 30, 2022

ceb8 commented Sep 30, 2022

bsipocz commented Sep 30, 2022

ceb8 commented Oct 3, 2022 •

edited

Loading

WIP: MAST query result cache support outline #1578

Are you sure you want to change the base?

WIP: MAST query result cache support outline #1578

Conversation

orionlee commented Oct 25, 2019

pep8speaks commented Oct 25, 2019 • edited Loading

Comment last updated at 2019-11-05 00:33:35 UTC

orionlee commented Oct 25, 2019

codecov bot commented Oct 25, 2019 • edited Loading

Codecov Report

bsipocz commented Oct 25, 2019

orionlee commented Nov 5, 2019

bsipocz commented Nov 6, 2019 • edited Loading

orionlee commented Nov 6, 2019

barentsen commented Nov 7, 2019 • edited Loading

ceb8 commented Nov 7, 2019

christinahedges commented Jan 29, 2020

ceb8 commented Jan 29, 2020

christinahedges commented Jan 29, 2020

ceb8 commented Jan 29, 2020

bsipocz commented Jan 29, 2020

ceb8 commented Jan 29, 2020

christinahedges commented Jan 29, 2020

ceb8 commented Feb 5, 2020

christinahedges commented Feb 5, 2020

orionlee commented Aug 26, 2020

bsipocz commented Sep 27, 2022

bsipocz commented Sep 27, 2022

bsipocz commented Sep 27, 2022

ceb8 commented Sep 29, 2022

bsipocz commented Sep 29, 2022

ceb8 commented Sep 29, 2022

bsipocz commented Sep 30, 2022

bsipocz commented Sep 30, 2022

ceb8 commented Sep 30, 2022 • edited Loading

bsipocz commented Sep 30, 2022

ceb8 commented Sep 30, 2022

bsipocz commented Sep 30, 2022

ceb8 commented Oct 3, 2022 • edited Loading

pep8speaks commented Oct 25, 2019 •

edited

Loading

codecov bot commented Oct 25, 2019 •

edited

Loading

bsipocz commented Nov 6, 2019 •

edited

Loading

barentsen commented Nov 7, 2019 •

edited

Loading

ceb8 commented Sep 30, 2022 •

edited

Loading

ceb8 commented Oct 3, 2022 •

edited

Loading