Performance page update with parallelism pitfalls section #2240

Merged
sh-rp merged 7 commits into devel on Jan 28, 2025

Conversation

sh-rp
Collaborator

@sh-rp sh-rp commented Jan 27, 2025

Description

This PR does the following:

  • Group all extraction-related segments into an "optimize extraction" section.
  • Slightly regroup the pipeline parallelisation section and add a new pitfalls section with information about using unique staging buckets and staging datasets for parallel pipeline execution.


netlify bot commented Jan 27, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit d2c9bc7
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6798d5f767409c0008909333
😎 Deploy Preview https://deploy-preview-2240--dlt-hub-docs.netlify.app

@sh-rp sh-rp marked this pull request as ready for review January 27, 2025 11:59
@sh-rp sh-rp self-assigned this Jan 27, 2025
@@ -6,7 +6,13 @@ keywords: [scaling, parallelism, finetuning]

# Optimizing dlt

## Yield pages instead of rows
This page contains a collection of tips and tricks to optimize dlt pipelines for speed, scalability and memory footprint.

Contributor

I would add a short reminder here that dlt works in three steps, and link to this page: https://dlthub.com/docs/reference/explainers/how-dlt-works

docs/website/docs/reference/performance.md (outdated, resolved)

Instead of using Python Requests directly, you can use the built-in [requests wrapper](../general-usage/http/requests) or [`RESTClient`](../general-usage/http/rest-client) for API calls. This will make your pipeline more resilient to intermittent network errors and other random glitches.
2. If you are running pipelines in parallel against the same destination dataset and are using a staging destination, you should change the staging destination bucket subfolder to be unique for each pipeline or alternatively disable cleaning up the staging destination after each load for all pipelines: [how to prevent staging files truncation](../dlt-ecosystem/staging#how-to-prevent-staging-files-truncation) If you do not, files might be deleted by one pipeline that are still required to be loaded by another pipeline running in parallel.
Contributor

Can you rephrase this more clearly? Something like:

2. If you're running multiple pipelines in parallel that write to the same destination dataset and use a staging area, make sure to do one of the following: 

    a. Assign a unique subfolder in the staging destination bucket for each pipeline, or  
    b. [Disable automatic cleanup of the staging area](../dlt-ecosystem/staging#how-to-prevent-staging-files-truncation) after each load for all pipelines.

   If you don’t do this, one pipeline might delete staging files that are still needed by another pipeline running at the same time.
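
For illustration, option (a) could be sketched roughly as follows. This is only a sketch under assumptions: the helper names `staging_bucket_url` and `build_pipeline`, the bucket `s3://my-bucket/staging`, the pipeline names, and the `redshift` destination are all hypothetical and not taken from the docs page. The idea is simply to derive the staging subfolder from the pipeline name so that two pipelines never share a staging path.

```python
def staging_bucket_url(base_bucket_url: str, pipeline_name: str) -> str:
    """Append a per-pipeline subfolder to a shared staging bucket URL."""
    if not pipeline_name or "/" in pipeline_name:
        raise ValueError("pipeline_name must be a non-empty name without '/'")
    # Strip a trailing slash so we never produce a double slash in the path.
    return f"{base_bucket_url.rstrip('/')}/{pipeline_name}"


def build_pipeline(pipeline_name: str, base_bucket_url: str, dataset_name: str):
    """Create a pipeline whose staging area lives in its own subfolder."""
    # Imported lazily so the helper above can be used without dlt installed.
    import dlt

    return dlt.pipeline(
        pipeline_name=pipeline_name,
        # Assumption: a destination that supports a filesystem staging area.
        destination="redshift",
        staging=dlt.destinations.filesystem(
            bucket_url=staging_bucket_url(base_bucket_url, pipeline_name)
        ),
        # All pipelines may still write to the same destination dataset.
        dataset_name=dataset_name,
    )


if __name__ == "__main__":
    # Hypothetical pipeline names; each one gets its own staging subfolder.
    for name in ("orders_pipeline", "customers_pipeline"):
        print(staging_bucket_url("s3://my-bucket/staging", name))
```

Because each pipeline cleans up only its own subfolder, a cleanup run by one pipeline should not touch files that another pipeline still needs to load.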

docs/website/docs/reference/performance.md (outdated, resolved)
@sh-rp sh-rp merged commit 6cf8b74 into devel Jan 28, 2025
49 checks passed
Development

Successfully merging this pull request may close these issues.

Multiple pipelines running at same time towards same dataset can cause loading to _dlt_pipeline_state to crash
2 participants