feat: Add SqlStorageClient based on sqlalchemy v2+ #1339
Conversation
Pull Request Overview
This PR implements a new SQL-based storage client (SQLStorageClient) that provides persistent data storage using SQLAlchemy v2+ for datasets, key-value stores, and request queues.
Key changes:
- Adds SQLStorageClient with support for connection strings, pre-configured engines, or a default SQLite database
- Implements SQL-based clients for all three storage types with database schema management and transaction handling
- Updates storage model configurations to support SQLAlchemy ORM mapping with from_attributes=True
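For orientation, construction roughly looks like this. The connection_string keyword is taken from the test script later in this thread; with no arguments a local SQLite crawlee.db is created.

```python
from crawlee.storage_clients import SQLStorageClient

# No arguments: a default `crawlee.db` SQLite database is created in the storage directory.
default_client = SQLStorageClient()

# Explicit connection string, e.g. a local PostgreSQL instance via asyncpg.
pg_client = SQLStorageClient(
    connection_string='postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
)
```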
Reviewed Changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 1 comment.
File | Description
---|---
src/crawlee/storage_clients/_sql/ | New SQL storage implementation with database models, clients, and schema management
tests/unit/storage_clients/_sql/ | Comprehensive test suite for SQL storage functionality
tests/unit/storages/ | Updates to test fixtures to include SQL storage client testing
src/crawlee/storage_clients/models.py | Adds from_attributes=True to model configs for SQLAlchemy ORM compatibility
pyproject.toml | Adds new sql optional dependency group
src/crawlee/storage_clients/__init__.py | Adds conditional import for SQLStorageClient
Comments suppressed due to low confidence (1)
tests/unit/storages/test_request_queue.py:23
- The test fixture only tests 'sql' storage client, but the removed 'memory' and 'file_system' parameters suggest this may have unintentionally reduced test coverage. Consider including all storage client types to ensure comprehensive testing.
@pytest.fixture(params=['sql'])
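A sketch of the suggested fix, restoring the parameters the comment says were removed (the fixture body is simplified here; the real fixture presumably builds the corresponding client):

```python
import pytest


@pytest.fixture(params=['memory', 'file_system', 'sql'])
def storage_client_type(request: pytest.FixtureRequest) -> str:
    # Run every test once per storage client backend instead of only against 'sql'.
    return request.param
```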
When implementing, I opted out of
The storage client has been repeatedly tested with SQLite and a local PostgreSQL (a simple container installation without fine-tuning).

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import SQLStorageClient
from crawlee.storages import RequestQueue, KeyValueStore
from crawlee import service_locator
from crawlee import ConcurrencySettings

LOCAL_POSTGRE = None  # 'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres'
USE_STATE = True
KVS = True
DATASET = True
CRAWLERS = 1
REQUESTS = 10000
DROP_STORAGES = True


async def main() -> None:
    service_locator.set_storage_client(
        SQLStorageClient(
            connection_string=LOCAL_POSTGRE if LOCAL_POSTGRE else None,
        )
    )

    kvs = await KeyValueStore.open()
    queue_1 = await RequestQueue.open(name='test_queue_1')
    queue_2 = await RequestQueue.open(name='test_queue_2')
    queue_3 = await RequestQueue.open(name='test_queue_3')

    urls = [f'https://crawlee.dev/page/{i}' for i in range(REQUESTS)]
    await queue_1.add_requests(urls)
    await queue_2.add_requests(urls)
    await queue_3.add_requests(urls)

    crawler_1 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_1)
    crawler_2 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_2)
    crawler_3 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_3)

    # Define the default request handler, shared by all three crawlers.
    @crawler_1.router.default_handler
    @crawler_2.router.default_handler
    @crawler_3.router.default_handler
    async def request_handler(context: BasicCrawlingContext) -> None:
        if USE_STATE:
            # Use state to store data
            state_data = await context.use_state()
            state_data['a'] = context.request.url
        if KVS:
            # Use KeyValueStore to store data
            await kvs.set_value(context.request.url, {'url': context.request.url, 'title': 'Example Title'})
        if DATASET:
            await context.push_data({'url': context.request.url, 'title': 'Example Title'})

    crawlers = [crawler_1]
    if CRAWLERS > 1:
        crawlers.append(crawler_2)
    if CRAWLERS > 2:
        crawlers.append(crawler_3)

    # Run the crawlers
    data = await asyncio.gather(*[crawler.run() for crawler in crawlers])
    print(data)

    if DROP_STORAGES:
        # Drop all storages
        await queue_1.drop()
        await queue_2.drop()
        await queue_3.drop()
        await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())

Since BasicCrawler itself performs no HTTP requests, this exercises the storages under load without any real network traffic.
First part review. I will do RQ and tests in second part.
I have only minor comments. My main suggestion is to extract more of the code that is shared by all 3 clients. The clients are easier to understand once the reader can immediately tell which parts of the code are exactly the same in all of them and which parts are unique and specific to each client. It also makes the code easier to maintain.
The drawback would be that understanding just one class in isolation would be a little harder. But who wants to understand just one client?
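To illustrate the kind of extraction I mean, roughly something like this (all names are invented for the sketch): the session and transaction plumbing lives in one shared base, while each client keeps only its own queries.

```python
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker


class _SqlClientMixin:
    """Hypothetical shared base for the dataset, key-value store and request queue clients."""

    def __init__(self, session_maker: async_sessionmaker[AsyncSession]) -> None:
        self._session_maker = session_maker

    @asynccontextmanager
    async def _transaction(self) -> AsyncIterator[AsyncSession]:
        # One place for open-session / commit / rollback handling,
        # instead of repeating the same block in all three clients.
        async with self._session_maker() as session, session.begin():
            yield session
```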
flatten: list[str] | None = None,
view: str | None = None,
) -> DatasetItemsListPage:
    # Check for unsupported arguments and log a warning if found.
Is this unsupported just in this initial commit, or is there no plan to support them in the future?
I think this will complicate database queries quite a bit. I don't plan to support this. But we could reconsider this in the future.
Since SQLite now supports JSON operations, this is possible - https://sqlite.org/json1.html
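For reference, SQLite's JSON1 functions are reachable from SQLAlchemy through the generic func namespace, so a flatten-style projection could in principle be pushed into the query. A rough sketch with made-up table and column names:

```python
from sqlalchemy import column, func, select, table

# Hypothetical table/column names - the real models live in src/crawlee/storage_clients/_sql/.
dataset_items = table('dataset_items', column('data'))

# json_extract() comes from SQLite's JSON1 extension (https://sqlite.org/json1.html),
# so a field like 'title' can be projected out of the stored JSON directly in the query.
stmt = select(func.json_extract(dataset_items.c.data, '$.title').label('title'))
print(stmt)
```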
It's not strictly necessary to implement this on the database level, is it? I'm fine with leaving this unimplemented for a while though...
It would also be good to mention it in the docs and maybe show an example use.
I will continue with the review later. There are many ways to approach the RQ client implementation. I guess I have some different expectations in my mind (I am not saying those are correct :D ). Maybe we should define the expectations first, so that I do the review correctly based on that.
My initial expectations for the RQ client:
- Can be used on the Apify platform and outside of it as well
- Supports any persistence
- Supports parallel consumers/producers (the use case being speeding up crawlers on the Apify platform with multiprocessing to fully utilize the available resources -> for example, a Parsel-based actor could have multiple ParselCrawlers under the hood, all of them working on the same RQ, reducing costs by avoiding ApifyRQClient); see the sketch below
Most typical use cases:
- Crawlee outside of the Apify platform
- Crawlee on the Apify platform, but avoiding the expensive ApifyRQClient
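To make the parallel-consumer expectation concrete, the usage pattern I have in mind looks roughly like this. It is a sketch only: the crawlers could just as well live in separate processes, and whether the SQL-backed client already handles several consumers of one queue safely is exactly what the review should settle.

```python
import asyncio

from crawlee import service_locator
from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import SqlStorageClient
from crawlee.storages import RequestQueue


async def main() -> None:
    service_locator.set_storage_client(SqlStorageClient())

    # One shared queue; several crawler instances act as parallel consumers.
    queue = await RequestQueue.open(name='shared_queue')
    await queue.add_requests([f'https://crawlee.dev/page/{i}' for i in range(1000)])

    crawlers = [BasicCrawler(request_manager=queue) for _ in range(3)]

    for crawler in crawlers:
        @crawler.router.default_handler
        async def handler(context: BasicCrawlingContext) -> None:
            await context.push_data({'url': context.request.url})

    # All crawlers drain the same queue concurrently.
    await asyncio.gather(*(crawler.run() for crawler in crawlers))


if __name__ == '__main__':
    asyncio.run(main())
```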
(The PR title was changed from "SQLStorageClient based on sqlalchemy v2+" to "SqlStorageClient based on sqlalchemy v2+".)
@@ -0,0 +1,291 @@
from __future__ import annotations
Have you tried to open the crawlee.db in any SQLite viewer tool?
If I use https://sqliteonline.com/, I got this error:
SQLITE_CANTOPEN: sqlite3 result code 14: unable to open database file
And this error with this one https://inloop.github.io/sqlite-viewer/:
Error: no such table: null
Code:
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.storage_clients import SqlStorageClient


async def main() -> None:
    storage_client = SqlStorageClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
And this error with this one https://inloop.github.io/sqlite-viewer/:
This works for me, no errors.
But mostly I used the SQLite extension for VS Code.
@override
async def push_data(self, data: list[dict[str, Any]] | dict[str, Any]) -> None:
    """Add new items to the dataset."""
Could you please review all occurrences of @override methods and their docstrings, and decide whether a custom docstring is really needed? If not, the parent docstring will be used automatically.
For example, in this case, I don't think it makes sense to use """Add new items to the dataset.""" - the parent docstring is even more descriptive.
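For reference, inspect.getdoc (and most doc tooling) walks the MRO, so an overriding method without a docstring picks up the parent's. Simplified stand-in classes, with an illustrative parent docstring:

```python
import inspect
from typing import Any


class DatasetClient:
    """Simplified stand-in for the abstract parent class."""

    async def push_data(self, data: list[dict[str, Any]] | dict[str, Any]) -> None:
        """Push data to the dataset.

        The data is appended to the dataset as one or more new items.
        """


class SqlDatasetClient(DatasetClient):
    async def push_data(self, data: list[dict[str, Any]] | dict[str, Any]) -> None:
        ...  # No docstring needed here; the parent's is picked up by doc tooling.


# getdoc() walks the MRO, so the parent's more descriptive docstring is returned.
print(inspect.getdoc(SqlDatasetClient.push_data))
```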
@override
async def add_batch_of_requests(
    self,
    requests: Sequence[Request],
    *,
    forefront: bool = False,
) -> AddRequestsResponse:
    return await self._add_batch_of_requests_optimization(requests, forefront=forefront)
Why the additional internal method that is used only here?
Good question. Now it really doesn't make sense 🙂
Have you done any performance comparisons with the memory and file-system storage clients? If not, could you please run some? For example, you could run the Parsel crawler on crawlee.dev, enqueue all links, and store the URL + title to the dataset.
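A rough harness for such a comparison might look like this. The crawler setup mirrors the earlier snippet in this thread; the built-in client class names are assumed from the 'memory' / 'file_system' fixture params above, and the numbers would of course need a quiet machine and several runs to mean anything.

```python
import asyncio
import time

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient, SqlStorageClient


async def run_once(storage_client) -> float:
    crawler = ParselCrawler(storage_client=storage_client, max_requests_per_crawl=200)

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.push_data({
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        })
        await context.enqueue_links(strategy='same-domain')

    start = time.perf_counter()
    await crawler.run(['https://crawlee.dev/'])
    return time.perf_counter() - start


async def main() -> None:
    # Same crawl, three storage backends; print the wall-clock time for each.
    for client in (MemoryStorageClient(), FileSystemStorageClient(), SqlStorageClient()):
        elapsed = await run_once(client)
        print(f'{type(client).__name__}: {elapsed:.1f}s')


if __name__ == '__main__':
    asyncio.run(main())
```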
Description
Adds a new SQLStorageClient, which can accept a database connection string or a pre-configured AsyncEngine, or creates a default crawlee.db database in Configuration.storage_dir.
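The pre-configured engine path would then look roughly like this (the engine keyword name is an assumption; only the connection-string form appears in the examples above):

```python
from sqlalchemy.ext.asyncio import create_async_engine

from crawlee.storage_clients import SqlStorageClient

# Hypothetical: hand a pre-configured AsyncEngine to the storage client
# instead of a connection string or the default crawlee.db.
engine = create_async_engine(
    'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
    pool_size=10,
)
storage_client = SqlStorageClient(engine=engine)
```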
Issues