Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: default max_records config for testing #1333

Open
pnadolny13 opened this issue Jan 18, 2023 · 12 comments
Open

[Feature]: default max_records config for testing #1333

pnadolny13 opened this issue Jan 18, 2023 · 12 comments

Comments

@pnadolny13
Copy link
Contributor

pnadolny13 commented Jan 18, 2023

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

It would be really nice to have something like a max_records config that I can set for my taps in test/CI environments that limits the amount of records to sync (e.g. I want around 100 records). We usually recommend limiting the tap's start date to something like yesterday but that could still replicate an excessive amount of data in some cases. If you only need 100 records but syncing from yesterday returns 1M then its a bad pattern. Ideally I could just configure my tap to pull 100 record and once it knows its exceeded that then it just exits. I think the test feature does something similar, exiting after it sees its first successful record.

My example from Squared CI:

Set a DEFAULT_START_DATE for my extractor start dates in my CICD meltano environment to yesterday's date. For cloudwatch, google analytics, slack, etc. I'm pulling way more data than I need to do simple CI tests. It would allow me to save on time and cost.

@pnadolny13 pnadolny13 added kind/Feature New feature or request valuestream/SDK labels Jan 18, 2023
@WillDaSilva WillDaSilva moved this to Up Next in Office Hours Jan 18, 2023
@aaronsteers
Copy link
Contributor

@pnadolny13 - My only question here is if it should be 100 max records total or 100 max records per stream. I think there's a good case that per-stream is a better limiter, just to ensure your tests have representative data. Wdyt?

@pnadolny13
Copy link
Contributor Author

@aaronsteers yep youre right, per stream is really what I'm looking for. I want a limited amount of data in each stream.

@aaronsteers
Copy link
Contributor

@pnadolny13 - Thanks for confirming. Here's the internal mechanism I mentioned in office hours, which could be used to deliver this:

_MAX_RECORDS_LIMIT: int | None = None

@kgpayne
Copy link
Contributor

kgpayne commented Jan 25, 2023

@edgarrmondragon I think this may be the right mechanism for users specifying the _MAX_RECORDS_LIMIT, that we can also leverage in the new testing framework. This is only 1 of 2 parts of it though:

  1. Users specify a max_records_limit in config.json, optionally at the stream level (as @aaronsteers suggests).This config sets the Stream._MAX_RECORDS_LIMIT under the hood. Related issue on standardising per-stream config here.
  2. Developers utilise _MAX_RECORDS_LIMIT as part of implementing Stream.get_records(), to only fetch the specified number of records. For SQL this likely means using LIMIT x as part of compiled sql, as we do in the default SQLStream here. For API's it likely means ending pagination early, once the specified number of records have been retrieved (cutting down on requests, which is handy for APIs with rate limits). Issue to implement the latter in the SDK here.

WDYT?

@edgarrmondragon
Copy link
Collaborator

  1. This config sets the Stream._MAX_RECORDS_LIMIT under the hood

@kgpayne I'm not a fan of changing this attribute at runtime, it feels clunky and is prone to error as you've seen (if you re-instantiate the streams, the value is reset to the class default).

I'd prefer at-runtime control in the form of a parameter in Tap.sync_all(max_records=None) that cascades through the streams. It could still be set from config:

tap.sync_all(max_records=tap.config.get("max_records"))

wdyt?

@kgpayne
Copy link
Contributor

kgpayne commented Jan 25, 2023

@edgarrmondragon this makes sense to me 👍

tap.sync_all(max_records=tap.config.get("max_records"))

By "This config sets the Stream._MAX_RECORDS_LIMIT under the hood" I just meant the config would result in a value being available on the stream as an attribute, for use in get_records() 🙂 I am totally open to both improving how _MAX_RECORDS_LIMIT is handled/applied and potentially renaming it or placing it behind a public get_record_limit() method to smooth the user experience in get_records().

@aaronsteers aaronsteers moved this from Discussed to Up Next in Office Hours Jan 25, 2023
@aaronsteers aaronsteers moved this from Up Next to Discussed in Office Hours Jan 25, 2023
@aaronsteers aaronsteers moved this from Discussed to Up Next in Office Hours Jan 25, 2023
@aaronsteers aaronsteers moved this from Up Next to To Discuss in Office Hours Jan 25, 2023
@JulesHuisman
Copy link

JulesHuisman commented Jan 25, 2023

I was looking for a feature like this.

We try to limit the ingestion in our CI pipeline, we currently do this by altering the start_date (Like Pat mentioned). However, this doesn't work for taps that are not incremental.

It would be great to be able to configurate this. Either by using the tap config or using an environment variable:

environments:
- name: ci
  env:
    MAX_RECORDS: 1000

@kgpayne
Copy link
Contributor

kgpayne commented Jan 26, 2023

@aaronsteers I think this is higher priority than first assumed, as it also blocks adopting the new standard tests framework with SQL taps 😬

@pnadolny13
Copy link
Contributor Author

Theres more discussion around alternative implementations and details of this in #1366.

@edgarrmondragon
Copy link
Collaborator

Another user requested this: https://meltano.slack.com/archives/CMN8HELB0/p1685438218728059

@yaozhang09
Copy link

I would also love something like this for our test CI pipelines. Any updates to this issue?

@visch
Copy link
Contributor

visch commented Mar 13, 2024

Is this one already done?

if self.ABORT_AT_RECORD_COUNT is not None:

Made a seperate issue here for adding this configuration for users to control themselves here #2312

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

7 participants