Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

maxAge should be stored in file instead of setTimeout, so cache is properly invalidated across several runs #289

Closed
artsiommiksiuk opened this issue May 22, 2024 · 9 comments · Fixed by #297

Comments

@artsiommiksiuk
Copy link

My use case and confusion. I'm using memoize-fs to acquire jwt tokens for tests. Tokens living 1 hours, and their acquisition takes about 20 seconds, so memoize-fs effectively cutting those 20 seconds of from every test run.

maxAge is a timer based cache invalidation, however, because I'm using it to cache results across runs, I need invalidation happen based on the age of the cached data, not when memoization function was created.

This is at least to say confusion point, as documentation doesn't say anything about it. And I'd say, it's makes sense to store records age along with the data and verify this age on pull from disk. In this case maxAge would behave consistently with expectations (expectation is age of the data, not the memoization function instance in a process).

@artsiommiksiuk artsiommiksiuk changed the title maxAge should be stored in file instead of setTimeout, so cache is popperly invalidated across several runs maxAge should be stored in file instead of setTimeout, so cache is properly invalidated across several runs May 23, 2024
@vincerubinetti
Copy link

Oof... I thought I had found a library that met my needs (memoize fns + ttl disk cache), but this makes the library kinda useless :/ I'd say you can't really claim that the library has a maxAge if it's not persistent.

@borisdiakur
Copy link
Owner

Hi @artsiommiksiuk, thanks for creating the issue. I think your use case makes sense and we should cover it. Though I think there is no need in "storing records age along with the data", because each cache file already includes meta data (including creation and modification time) which can be accessed using the file system API, specifically using fs.stat or fs.statSync.

That being said, I cannot say when I will have time to work on this (I'm currently quite busy and I do need some code free time). I don't think the feature is difficult to implement, it just needs to be implemented in a way that it doesn't break existing projects (where a maxAge stored in memory is good enough). I guess it won't. I don't see a reason why it should. If the tests remain green, we're good. If not, maybe we can use another option. If you like, feel free to contribute a PR. This would very likely speed up the development of this feature.

@vincerubinetti Sorry, but your comment is not helpful.

@vincerubinetti
Copy link

vincerubinetti commented Jun 13, 2024

@vincerubinetti Sorry, but your comment is not helpful.

Let me elaborate then. The first line of the readme is Node.js solution for memoizing/caching function results on the file system. That's what people are coming to this library for; otherwise there are several other more popular options (memoize, cacache, memoizee). The main reason to use the file system for caching is to persist (another keyword right at the start of the readme which does not typically mean "in-memory") something across runs. It's very odd that all except one (presumably) of the options/features use file system persistence. It's even more frustrating for that difference to not be documented.

First thing I did when finding this library was look for "disk" and "max age", and I saw both in the readme with no caveats. It's lucky I looked at the issues before I spent time installing, configuring, and debugging why it wasn't working.

So here is my recommendation:

# maxAge (**in-memory only**)

With `maxAge` option you can ensure that cache for given call is cleared after a predefined period of time (in milliseconds).

Note that the cache for this call is only cleared after `maxAge` milliseconds since its _first write call in the current Node process_.
If you restart your Node process, the expiry countdown will start again after the first write.

Looking at the code, I think that's correct. Not sure what happens when something has a maxAge and it already exists in the cache without having been written to in the current process yet.


For anyone looking for 1) persistent disk cache, 2) single cache file to not pollute file system/git diff, 3) function wrapper memoizer, and 4) per-function time-to-live, here is the simple solution I came up with since none of the available libraries met all those requirements:

code
import { createHash } from "crypto";
import { existsSync, readFileSync, unlinkSync, writeFileSync } from "fs";
import { log } from "@/util/log";

type Options = {
  /** ttl (in seconds) */
  maxAge?: number;
};

/** on-disk cache */
const cachePath = "./util/cache.json";

export const memoize =
  <Args extends unknown[], Return>(
    func: (...args: Args) => Promise<Return>,
    options: Options = {},
  ) =>
  async (...args: Args) => {
    /** set options defaults */
    options.maxAge ??= 24 * 60 * 60;

    /** get cache key unique to function and arguments */
    const key = func.name + func.toString() + JSON.stringify(args);
    /** make shorter unique cache key */
    const hash = createHash("md5").update(key).digest("hex");
    /** possible cache hit */
    const cached = cache[hash];

    /** current timestamp */
    const now = Date.now();
    /** is cached entry expired */
    const expired = now - (cached?.timestamp ?? 0) > options.maxAge * 1000;

    /** return value */
    let result: Return;

    /** if cached value valid */
    if (cached && !expired) {
      log(`Using memoized return value for ${func.name}()`, "secondary");
      /** use cached value */
      result = cached.data as Return;
    } else {
      /** run function */
      result = await func(...args);
      /** set cache */
      cache[hash] = { timestamp: now, data: result };
    }

    /** update cache */
    saveCache();
    return result;
  };

/** global in-memory cache */
let cache: Cache = {};

/** cache store */
type Cache = Record<
  string,
  {
    timestamp: number;
    data: unknown;
  }
>;

if (process.env.NOCACHE) {
  /** clear disk cache */
  unlinkSync(cachePath);
} else {
  /** load disk cache into memory cache */
  if (existsSync(cachePath))
    cache = JSON.parse(readFileSync(cachePath, "utf-8")) as Cache;
}

/** save memory cache to disk cache */
const saveCache = () => writeFileSync(cachePath, JSON.stringify(cache));

Then using it:

await memoize(someFunction)(arg1, arg2);

@artsiommiksiuk
Copy link
Author

Hi @artsiommiksiuk, thanks for creating the issue. I think your use case makes sense and we should cover it. Though I think there is no need in "storing records age along with the data", because each cache file already includes meta data (including creation and modification time) which can be accessed using the file system API, specifically using fs.stat or fs.statSync.

That being said, I cannot say when I will have time to work on this (I'm currently quite busy and I do need some code free time). I don't think the feature is difficult to implement, it just needs to be implemented in a way that it doesn't break existing projects (where a maxAge stored in memory is good enough). I guess it won't. I don't see a reason why it should. If the tests remain green, we're good. If not, maybe we can use another option. If you like, feel free to contribute a PR. This would very likely speed up the development of this feature.

@vincerubinetti Sorry, but your comment is not helpful.

Okay, then I'll add maxAgeFs param for this. Does it sound reasonable?

@borisdiakur
Copy link
Owner

I think replacing the current setTimeout logic with a fs.stat logic should be the first approach we should try. We would not need another option, we could still use the same test. This shouldn't be a breaking change. We could then update the docs to clarify that the maxAge option refers to the age of the cache file.

  1. We'd need to remove the block where the setTimeout is created.
  2. Before reading the cache file we would implement the fs.stat logic that checks the age of the cache against the maxAge option (if set). If the max age is not exceeded, proceed with reading the cache data. If it is exceeded, execute cacheAndReturn.

One thing we should pay attention to is the performance implication associated with this approach, since we would now need to access filesystem metadata. One way to reduce system calls and improve performance would be to use fs.open to get a file descriptor and then use fs.fstat and fs.read on the file descriptor.

Alternatively we could store a timestamp along with the data as @vincerubinetti has done it in his code example above (thanks for the input btw). But to me this feels like we are saving redundant data. Also, I wouldn't like to change the current data format because it would be a breaking change.

@artsiommiksiuk If you like you can give it a try.

@borisdiakur
Copy link
Owner

@artsiommiksiuk @vincerubinetti I have created a PR that includes my proposed changes and published a release candidate to npm. Please install via npm i [email protected] and let me know if it works for you.

@artsiommiksiuk
Copy link
Author

@borisdiakur, thank you! I'm still busy with a lot of other issues I need to solve, but I'm going to try as soon as I have time.

@artsiommiksiuk
Copy link
Author

@borisdiakur , I've launched it on my system and code base (without even removing old cache) and it worked. Thanks!

@borisdiakur
Copy link
Owner

Thanks a lot @artsiommiksiuk !
I've just published [email protected] (it's the same as the 4.0.0-rc-1).
Cheers. 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants