feat: Add SqlStorageClient based on sqlalchemy v2+ #1339
Conversation
Pull Request Overview
This PR implements a new SQL-based storage client (SQLStorageClient) that provides persistent data storage using SQLAlchemy v2+ for datasets, key-value stores, and request queues.
Key changes:
- Adds SQLStorageClient with support for connection strings, pre-configured engines, or a default SQLite database
- Implements SQL-based clients for all three storage types with database schema management and transaction handling
- Updates storage model configurations to support SQLAlchemy ORM mapping with from_attributes=True
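For orientation, construction roughly looks like this. The connection_string keyword is taken from the test script later in this thread; with no arguments a local SQLite crawlee.db is created.

```python
from crawlee.storage_clients import SQLStorageClient

# No arguments: a default `crawlee.db` SQLite database is created in the storage directory.
default_client = SQLStorageClient()

# Explicit connection string, e.g. a local PostgreSQL instance via asyncpg.
pg_client = SQLStorageClient(
    connection_string='postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
)
```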
Reviewed Changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 1 comment.
File | Description
---|---
src/crawlee/storage_clients/_sql/ | New SQL storage implementation with database models, clients, and schema management
tests/unit/storage_clients/_sql/ | Comprehensive test suite for SQL storage functionality
tests/unit/storages/ | Updates to test fixtures to include SQL storage client testing
src/crawlee/storage_clients/models.py | Adds from_attributes=True to model configs for SQLAlchemy ORM compatibility
pyproject.toml | Adds new sql optional dependency group
src/crawlee/storage_clients/__init__.py | Adds conditional import for SQLStorageClient
Comments suppressed due to low confidence (1)
tests/unit/storages/test_request_queue.py:23
- The test fixture only tests 'sql' storage client, but the removed 'memory' and 'file_system' parameters suggest this may have unintentionally reduced test coverage. Consider including all storage client types to ensure comprehensive testing.
@pytest.fixture(params=['sql'])
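A sketch of the suggested fix, restoring the parameters the comment says were removed (the fixture body is simplified here; the real fixture presumably builds the corresponding client):

```python
import pytest


@pytest.fixture(params=['memory', 'file_system', 'sql'])
def storage_client_type(request: pytest.FixtureRequest) -> str:
    # Run every test once per storage client backend instead of only against 'sql'.
    return request.param
```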
When implementing, I opted out of
The storage client has been repeatedly tested with SQLite and a local PostgreSQL (a simple container installation without fine-tuning).

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import SQLStorageClient
from crawlee.storages import RequestQueue, KeyValueStore
from crawlee import service_locator
from crawlee import ConcurrencySettings

LOCAL_POSTGRE = None  # 'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres'
USE_STATE = True
KVS = True
DATASET = True
CRAWLERS = 1
REQUESTS = 10000
DROP_STORAGES = True


async def main() -> None:
    service_locator.set_storage_client(
        SQLStorageClient(
            connection_string=LOCAL_POSTGRE if LOCAL_POSTGRE else None,
        )
    )

    kvs = await KeyValueStore.open()
    queue_1 = await RequestQueue.open(name='test_queue_1')
    queue_2 = await RequestQueue.open(name='test_queue_2')
    queue_3 = await RequestQueue.open(name='test_queue_3')

    urls = [f'https://crawlee.dev/page/{i}' for i in range(REQUESTS)]
    await queue_1.add_requests(urls)
    await queue_2.add_requests(urls)
    await queue_3.add_requests(urls)

    crawler_1 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_1)
    crawler_2 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_2)
    crawler_3 = BasicCrawler(concurrency_settings=ConcurrencySettings(desired_concurrency=50), request_manager=queue_3)

    # Define the default request handler, shared by all three crawlers.
    @crawler_1.router.default_handler
    @crawler_2.router.default_handler
    @crawler_3.router.default_handler
    async def request_handler(context: BasicCrawlingContext) -> None:
        if USE_STATE:
            # Use state to store data
            state_data = await context.use_state()
            state_data['a'] = context.request.url
        if KVS:
            # Use KeyValueStore to store data
            await kvs.set_value(context.request.url, {'url': context.request.url, 'title': 'Example Title'})
        if DATASET:
            await context.push_data({'url': context.request.url, 'title': 'Example Title'})

    crawlers = [crawler_1]
    if CRAWLERS > 1:
        crawlers.append(crawler_2)
    if CRAWLERS > 2:
        crawlers.append(crawler_3)

    # Run the crawlers
    data = await asyncio.gather(*[crawler.run() for crawler in crawlers])
    print(data)

    if DROP_STORAGES:
        # Drop all storages
        await queue_1.drop()
        await queue_2.drop()
        await queue_3.drop()
        await kvs.drop()


if __name__ == '__main__':
    asyncio.run(main())

Since BasicCrawler itself performs no HTTP requests, this exercises the storages under load without any real network traffic.
First part review. I will do RQ and tests in second part.
I have only minor comments. My main suggestion is to extract more of the code that is shared by all 3 clients. The clients are easier to understand once the reader can immediately tell which parts of the code are exactly the same in all of them and which parts are unique and specific to each client. It also makes the code easier to maintain.
The drawback would be that understanding just one class in isolation would be a little harder. But who wants to understand just one client?
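To illustrate the kind of extraction I mean, roughly something like this (all names are invented for the sketch): the session and transaction plumbing lives in one shared base, while each client keeps only its own queries.

```python
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker


class _SqlClientMixin:
    """Hypothetical shared base for the dataset, key-value store and request queue clients."""

    def __init__(self, session_maker: async_sessionmaker[AsyncSession]) -> None:
        self._session_maker = session_maker

    @asynccontextmanager
    async def _transaction(self) -> AsyncIterator[AsyncSession]:
        # One place for open-session / commit / rollback handling,
        # instead of repeating the same block in all three clients.
        async with self._session_maker() as session, session.begin():
            yield session
```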
flatten: list[str] | None = None,
view: str | None = None,
) -> DatasetItemsListPage:
    # Check for unsupported arguments and log a warning if found.
Is this unsupported just in this initial commit, or is there no plan to support them in the future?
I think this will complicate database queries quite a bit. I don't plan to support this. But we could reconsider this in the future.
Since SQLite now supports JSON operations, this is possible - https://sqlite.org/json1.html
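For reference, SQLite's JSON1 functions are reachable from SQLAlchemy through the generic func namespace, so a flatten-style projection could in principle be pushed into the query. A rough sketch with made-up table and column names:

```python
from sqlalchemy import column, func, select, table

# Hypothetical table/column names - the real models live in src/crawlee/storage_clients/_sql/.
dataset_items = table('dataset_items', column('data'))

# json_extract() comes from SQLite's JSON1 extension (https://sqlite.org/json1.html),
# so a field like 'title' can be projected out of the stored JSON directly in the query.
stmt = select(func.json_extract(dataset_items.c.data, '$.title').label('title'))
print(stmt)
```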
It's not strictly necessary to implement this on the database level, is it? I'm fine with leaving this unimplemented for a while though...
It would also be good to mention it in the docs and maybe show an example use.
I will continue with the review later. There are many ways to approach the RQ client implementation. I guess I have some different expectations in my mind (I am not saying those are correct :D ). Maybe we should define the expectations first, so that I do the review correctly based on that.
My initial expectations for the RQ client:
- Can be used on the Apify platform and outside of it as well
- Supports any persistence
- Supports parallel consumers/producers (the use case being speeding up crawlers on the Apify platform with multiprocessing to fully utilize the available resources -> for example, a Parsel-based actor could have multiple ParselCrawlers under the hood, all of them working on the same RQ, reducing costs by avoiding ApifyRQClient); see the sketch below
Most typical use cases:
- Crawlee outside of the Apify platform
- Crawlee on the Apify platform, but avoiding the expensive ApifyRQClient
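To make the parallel-consumer expectation concrete, the usage pattern I have in mind looks roughly like this. It is a sketch only: the crawlers could just as well live in separate processes, and whether the SQL-backed client already handles several consumers of one queue safely is exactly what the review should settle.

```python
import asyncio

from crawlee import service_locator
from crawlee.crawlers import BasicCrawler, BasicCrawlingContext
from crawlee.storage_clients import SqlStorageClient
from crawlee.storages import RequestQueue


async def main() -> None:
    service_locator.set_storage_client(SqlStorageClient())

    # One shared queue; several crawler instances act as parallel consumers.
    queue = await RequestQueue.open(name='shared_queue')
    await queue.add_requests([f'https://crawlee.dev/page/{i}' for i in range(1000)])

    crawlers = [BasicCrawler(request_manager=queue) for _ in range(3)]

    for crawler in crawlers:
        @crawler.router.default_handler
        async def handler(context: BasicCrawlingContext) -> None:
            await context.push_data({'url': context.request.url})

    # All crawlers drain the same queue concurrently.
    await asyncio.gather(*(crawler.run() for crawler in crawlers))


if __name__ == '__main__':
    asyncio.run(main())
```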
(The PR title was changed from "SQLStorageClient based on sqlalchemy v2+" to "SqlStorageClient based on sqlalchemy v2+".)
@@ -0,0 +1,291 @@
from __future__ import annotations
Have you tried to open the crawlee.db in any SQLite viewer tool?
If I use https://sqliteonline.com/, I got this error:
SQLITE_CANTOPEN: sqlite3 result code 14: unable to open database file
And this error with this one https://inloop.github.io/sqlite-viewer/:
Error: no such table: null
Code:
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.storage_clients import SqlStorageClient


async def main() -> None:
    storage_client = SqlStorageClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
And this error with this one https://inloop.github.io/sqlite-viewer/:
This works for me, no errors.
But mostly I used the SQLite extension for VS Code.
@override
async def push_data(self, data: list[dict[str, Any]] | dict[str, Any]) -> None:
    """Add new items to the dataset."""
Could you please review all occurrences of @override methods and their docstrings, and decide whether a custom docstring is really needed? If not, the parent docstring will be used automatically.
For example, in this case, I don't think it makes sense to use """Add new items to the dataset.""" - the parent docstring is even more descriptive.
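For reference, inspect.getdoc (and most doc tooling) walks the MRO, so an overriding method without a docstring picks up the parent's. Simplified stand-in classes, with an illustrative parent docstring:

```python
import inspect
from typing import Any


class DatasetClient:
    """Simplified stand-in for the abstract parent class."""

    async def push_data(self, data: list[dict[str, Any]] | dict[str, Any]) -> None:
        """Push data to the dataset.

        The data is appended to the dataset as one or more new items.
        """


class SqlDatasetClient(DatasetClient):
    async def push_data(self, data: list[dict[str, Any]] | dict[str, Any]) -> None:
        ...  # No docstring needed here; the parent's is picked up by doc tooling.


# getdoc() walks the MRO, so the parent's more descriptive docstring is returned.
print(inspect.getdoc(SqlDatasetClient.push_data))
```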
@override
async def add_batch_of_requests(
    self,
    requests: Sequence[Request],
    *,
    forefront: bool = False,
) -> AddRequestsResponse:
    return await self._add_batch_of_requests_optimization(requests, forefront=forefront)
Why the additional internal method that is used only here?
Good question. Now it really doesn't make sense 🙂
Have you done any performance comparisons with the memory and file-system storage clients? If not, could you please run some? For example, you could run the Parsel crawler on crawlee.dev, enqueue all links, and store the URL + title to the dataset.
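A rough harness for such a comparison might look like this. The crawler setup mirrors the earlier snippet in this thread; the built-in client class names are assumed from the 'memory' / 'file_system' fixture params above, and the numbers would of course need a quiet machine and several runs to mean anything.

```python
import asyncio
import time

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient, SqlStorageClient


async def run_once(storage_client) -> float:
    crawler = ParselCrawler(storage_client=storage_client, max_requests_per_crawl=200)

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        await context.push_data({
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        })
        await context.enqueue_links(strategy='same-domain')

    start = time.perf_counter()
    await crawler.run(['https://crawlee.dev/'])
    return time.perf_counter() - start


async def main() -> None:
    # Same crawl, three storage backends; print the wall-clock time for each.
    for client in (MemoryStorageClient(), FileSystemStorageClient(), SqlStorageClient()):
        elapsed = await run_once(client)
        print(f'{type(client).__name__}: {elapsed:.1f}s')


if __name__ == '__main__':
    asyncio.run(main())
```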
Description
Adds a new SQLStorageClient, which can accept a database connection string or a pre-configured AsyncEngine, or creates a default crawlee.db database in Configuration.storage_dir.
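The pre-configured engine path would then look roughly like this (the engine keyword name is an assumption; only the connection-string form appears in the examples above):

```python
from sqlalchemy.ext.asyncio import create_async_engine

from crawlee.storage_clients import SqlStorageClient

# Hypothetical: hand a pre-configured AsyncEngine to the storage client
# instead of a connection string or the default crawlee.db.
engine = create_async_engine(
    'postgresql+asyncpg://myuser:mypassword@localhost:5432/postgres',
    pool_size=10,
)
storage_client = SqlStorageClient(engine=engine)
```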
Issues