Performance page update with parallelism pitfalls section #2240

Merged

sh-rp merged 7 commits into devel from docs/update-performance-page

Jan 28, 2025

Collaborator

sh-rp commented Jan 27, 2025

Description

This PR does the following:

Groups all extraction-related segments into an "Optimize extraction" section.
Slightly regroups the pipeline parallelization section and adds a new pitfalls section explaining why pipelines that run in parallel need unique staging buckets and staging datasets.


          update the performance page

d6f16e9

netlify bot commented Jan 27, 2025

✅ Deploy Preview for dlt-hub-docs ready!

Name	Link
🔨 Latest commit	`d2c9bc7`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/6798d5f767409c0008909333
😎 Deploy Preview	https://deploy-preview-2240--dlt-hub-docs.netlify.app

sh-rp linked an issue

that may be closed by this pull request

Multiple pipelines running at same time towards same dataset can cause loading to _dlt_pipeline_state to crash #2218

Closed

sh-rp marked this pull request as ready for review

January 27, 2025 11:59

sh-rp self-assigned this

AstrakhantsevaAA reviewed

docs/website/docs/reference/performance.md

@@ -6,7 +6,13 @@ keywords: [scaling, parallelism, finetuning]
 # Optimizing dlt
-## Yield pages instead of rows
+This page contains a collection of tips and tricks to optimize dlt pipelines for speed, scalability and memory footprint.

Contributor

AstrakhantsevaAA Jan 28, 2025

I would add a small reminder here that dlt works in three steps, and link to this page: https://dlthub.com/docs/reference/explainers/how-dlt-works

docs/website/docs/reference/performance.md Outdated

    
              Instead of using Python Requests directly, you can use the built-in [requests wrapper](../general-usage/http/requests) or [`RESTClient`](../general-usage/http/rest-client) for API calls. This will make your pipeline more resilient to intermittent network errors and other random glitches.

              2. If you are running pipelines in parallel against the same destination dataset and are using a staging destination, you should change the staging destination bucket subfolder to be unique for each pipeline or alternatively disable cleaning up the staging destination after each load for all pipelines: [how to prevent staging files truncation](../dlt-ecosystem/staging#how-to-prevent-staging-files-truncation) If you do not, files might be deleted by one pipeline that are still required to be loaded by another pipeline running in parallel.

Contributor

AstrakhantsevaAA Jan 28, 2025

can you rephrase it in a clearer way? something like:

2. If you're running multiple pipelines in parallel that write to the same destination dataset and use a staging area, make sure to do one of the following: 

    a. Assign a unique subfolder in the staging destination bucket for each pipeline, or  
    b. [Disable automatic cleanup of the staging area](../dlt-ecosystem/staging#how-to-prevent-staging-files-truncation) after each load for all pipelines.

   If you don’t do this, one pipeline might delete staging files that are still needed by another pipeline running at the same time.
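Option (a) could be sketched roughly as follows. This is only an illustration, not the wording or code from the docs page: the helper `staging_url_for`, the bucket URL, and the pipeline names are all hypothetical, and the commented-out `dlt.pipeline(...)` call assumes the staging destination is configured through `dlt.destinations.filesystem(bucket_url=...)`.

```python
# Hypothetical helper: derive a distinct staging subfolder per pipeline so that
# parallel pipelines never clean up each other's staging files.

BASE_STAGING_URL = "s3://my-staging-bucket/staging"  # assumed bucket, for illustration only


def staging_url_for(base_url: str, pipeline_name: str) -> str:
    """Append the pipeline name as a subfolder of the shared staging bucket."""
    return base_url.rstrip("/") + "/" + pipeline_name


# Hypothetical names of pipelines that run in parallel against the same dataset.
pipeline_names = ["orders_pipeline", "customers_pipeline"]
staging_urls = {name: staging_url_for(BASE_STAGING_URL, name) for name in pipeline_names}

# Assumed wiring into dlt (not executed here, shown for orientation only):
#
# import dlt
# pipeline = dlt.pipeline(
#     pipeline_name="orders_pipeline",
#     destination="redshift",
#     staging=dlt.destinations.filesystem(bucket_url=staging_urls["orders_pipeline"]),
#     dataset_name="shared_dataset",
# )
```

Deriving the subfolder from the pipeline name keeps the mapping predictable, as long as pipeline names are themselves unique across the parallel runs.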

sh-rp and others added 5 commits

January 28, 2025 13:28


          Update docs/website/docs/reference/performance.md

07913e0

Co-authored-by: Alena Astrakhantseva <[email protected]>


          Update docs/website/docs/reference/performance.md

67dd93b

Co-authored-by: Alena Astrakhantseva <[email protected]>


          small changes

4dfc408


          Merge remote-tracking branch 'origin/docs/update-performance-page' in…

095c5d1

…to docs/update-performance-page


          fix link

c55b9b9

AstrakhantsevaAA reviewed


          Update docs/website/docs/reference/performance.md

d2c9bc7

sh-rp merged commit 6cf8b74 into devel

49 checks passed
