Worker crashes randomly with SQL exceptions #1108
Comments
Logs suggest the root cause is that the psycopg driver failed to obtain a connection and timed out.
The procrastinate worker gives up when the main coroutine throws an error, which is what happened here. Maybe the library could add more resiliency by retrying non-critical errors. In the meantime, you might be able to achieve some of that by wrapping the app run in a loop that catches errors and restarts the worker. That would prevent crashing the whole application.
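The restart loop suggested above could look roughly like the following. This is a minimal sketch, not part of procrastinate: `run_with_restart`, `max_restarts` and `delay` are hypothetical names, and the worker is passed in as a plain coroutine factory so the helper does not depend on any library internals.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def run_with_restart(
    run_worker: Callable[[], Awaitable[None]],
    *,
    max_restarts: int = 5,
    delay: float = 5.0,
) -> int:
    """Run `run_worker()` and start it again whenever it raises.

    Returns the number of restarts performed. Once `max_restarts` is
    exhausted the last error is re-raised, so a persistent failure
    (for example a missing migration) still surfaces instead of
    looping forever.
    """
    restarts = 0
    while True:
        try:
            await run_worker()
            return restarts  # the worker stopped cleanly
        except asyncio.CancelledError:
            raise  # shutdown was requested: do not restart
        except Exception:
            logger.exception(
                "Worker crashed (restart %d/%d)", restarts, max_restarts
            )
            if restarts >= max_restarts:
                raise
            restarts += 1
            await asyncio.sleep(delay)
```

With the FastAPI setup shown later in this thread, the worker task could then be created as `asyncio.create_task(run_with_restart(lambda: task_queue.run_worker_async(install_signal_handlers=False)))`. Whether restarting is appropriate depends on the error: a schema mismatch will fail again on every attempt, which is why the sketch caps the number of restarts.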
@onlyann thank you for your reply. How can I run a loop to catch these errors? The database connection is available, and I was able to validate other psycopg library connections, so I am not sure why single_worker runs into this error. It also looks like we need to improve the logs.
So I added the loop and retry mechanism, but it seems the following SQL error breaks the whole flow:
db | 2024-07-10 18:00:31.609 UTC [37] STATEMENT: SELECT id, status, task_name, priority, lock, queueing_lock, args, scheduled_at, queue_name, attempts
db | FROM procrastinate_fetch_job($1);
db | 2024-07-10 18:00:31.615 UTC [36] ERROR: canceling statement due to user request
db | 2024-07-10 18:00:31.615 UTC [36] CONTEXT: SQL statement "INSERT
db | INTO procrastinate_periodic_defers (task_name, periodic_id, defer_timestamp)
db | VALUES (_task_name, _periodic_id, _defer_timestamp)
db | ON CONFLICT DO NOTHING
db | RETURNING id"
db | PL/pgSQL function procrastinate_defer_periodic_job(character varying,character varying,character varying,character varying,character varying,bigint,jsonb) line 7 at SQL statement
db | 2024-07-10 18:00:31.615 UTC [36] STATEMENT: SELECT procrastinate_defer_periodic_job($1, $2, $3, $4, $5, $6, $7) AS id;
db | 2024-07-10 18:00:36.635 UTC [39] ERROR: column "priority" does not exist at character 31
db | 2024-07-10 18:00:36.635 UTC [39] STATEMENT: SELECT id, status, task_name, priority, lock, queueing_lock, args, scheduled_at, queue_name, attempts
db | FROM procrastinate_fetch_job($1);
db | 2024-07-10 18:00:36.647 UTC [38] ERROR: canceling statement due to user request
db | 2024-07-10 18:00:36.647 UTC [38] CONTEXT: SQL statement "INSERT
db | INTO procrastinate_periodic_defers (task_name, periodic_id, defer_timestamp)
db | VALUES (_task_name, _periodic_id, _defer_timestamp)
db | ON CONFLICT DO NOTHING
db | RETURNING id"
db | PL/pgSQL function procrastinate_defer_periodic_job(character varying,character varying,character varying,character varying,character varying,bigint,jsonb) line 7 at SQL statement
db | 2024-07-10 18:00:36.647 UTC [38] STATEMENT: SELECT procrastinate_defer_periodic_job($1, $2, $3, $4, $5, $6, $7) AS id;
Almost all of the errors I run into are caused by the above SQL exception.
What version of Procrastinate are you using? Have you recently upgraded?
@medihack I am using version 2.7.0, since I did not lock the package version. And yes, it's very frustrating that the whole worker crashes again and again because of a missing-column SQL exception. If there are schema changes, the package should detect them and give warnings, or there should be some kind of backward compatibility. Regardless, I think the worker should not crash on such small exceptions.
I am unsure whether this is related to the new 2.7.0 release, but the worker crashed repeatedly on a small SQL exception, and the error log produced by procrastinate is not helpful at all unless you also look at the Postgres error logs.
Error produced by procrastinate:
app | Main coroutine error, initiating remaining coroutines stop. Cause: ConnectorException('\n Database error.\n ')
app | single_worker error: ConnectorException('\n Database error.\n ')
app | NoneType: None
Error produced by Postgres:
Db | 2024-07-11 04:33:31.386 UTC [36] ERROR: Job was not found or not in "doing" status (job id: 75)
Db | 2024-07-11 04:33:31.386 UTC [36] CONTEXT: PL/pgSQL function procrastinate_retry_job(bigint,timestamp with time zone) line 12 at RAISE
Db | 2024-07-11 04:33:31.386 UTC [36] STATEMENT: SELECT procrastinate_retry_job($1, $2);
Why would the worker crash on such small issues?
Yes, it would be great to improve the experience on the migration aspect. In the meantime, I invite you to read the documentation on how migrations are handled for this library: https://procrastinate.readthedocs.io/en/stable/howto/production/migrations.html
The "Database error." text happens to be the default message of that exception. That said, it is possible there is an issue within the library that doesn't output enough details. What do you say @ewjoachim ?
It may look like a "small" issue, but there is not much the worker could do here.
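Until the library's own logging says more, the underlying driver error can often be recovered from the exception chain. The sketch below is generic Python and assumes, without having confirmed it here, that the connector raises its exception with `raise ... from ...` so the original database error is reachable through `__cause__`. `iter_causes` and `describe` are hypothetical helper names.

```python
from typing import Iterator, Optional


def iter_causes(exc: BaseException) -> Iterator[BaseException]:
    """Yield `exc`, then each exception it was raised from, outermost first."""
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        yield current
        # Prefer the explicit cause, fall back to the implicit context.
        current = current.__cause__ or current.__context__


def describe(exc: BaseException) -> str:
    """Render the whole chain on one line, for example in a log message."""
    return " <- ".join(
        f"{type(e).__name__}: {str(e).strip()}" for e in iter_causes(exc)
    )
```

Calling `describe(exc)` inside an `except Exception as exc:` block around the worker would then log the outer "Database error." together with whatever exception it wraps, if any.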
There is backwards compatibility, the other way around (or forward compatibility if you want, it depends how you see it): you can run older versions of the code with newer versions of the schema, which is required for a no-downtime migration path. Having both forward and backward compatibility (being able to run the old code with the new schema or the old schema with the new code) seems like a very complicated thing to do, I think, especially since we don't know how many versions people will skip and whether they'll actually end up running the migrations. I'd rather find a way to ensure you can't run procrastinate with an old schema.
Maybe we should also add a more obvious hint on the documentation's start page or the quickstart page that a manual migration step may be required after a package upgrade for non-Django users. And maybe also that the version should be locked to a minor version when using tools like poetry, pipenv, rye, ...
Thank you for the help, that makes sense. I think locking a version will help for now. I am still unclear on why the worker crashes every time on minor errors like the following, which are not related to the schema but to the state of a task:
Db | 2024-07-11 04:33:31.386 UTC [36] ERROR: Job was not found or not in "doing" status (job id: 75)
Db | 2024-07-11 04:33:31.386 UTC [36] CONTEXT: PL/pgSQL function procrastinate_retry_job(bigint,timestamp with time zone) line 12 at RAISE
Db | 2024-07-11 04:33:31.386 UTC [36] STATEMENT: SELECT procrastinate_retry_job($1, $2);
The above is a minor state issue that should not crash the whole worker.
Have you tried upgrading the package and applying the migrations? Does the problem still exist? What's the retry strategy? What does the task look like?
@medihack I have upgraded the package, which fixed the schema inconsistency error, but I still run into the following error regularly. Even restarting the worker doesn't help; the only way out is to clear the jobs that did not succeed. The worker crashes entirely on minor SQL exceptions. The following is the exception
This is how the worker is initialized in FastAPI:
import asyncio
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI
from procrastinate import App, PsycopgConnector, RetryStrategy

task_queue = App(connector=PsycopgConnector(
    conninfo=os.getenv("PROCRASTINATE_DB_URL")
))

@asynccontextmanager
async def lifespan(app: FastAPI):
    async with task_queue.open_async():
        # Run the worker in the background for the lifetime of the web app
        worker = asyncio.create_task(
            task_queue.run_worker_async(install_signal_handlers=False)
        )
        # Set to 100 to test the ungraceful shutdown
        # (`sleep` is a task defined elsewhere, not shown here)
        await sleep.defer_async(length=5)
        print("STARTUP")
        yield
        print("SHUTDOWN")
        worker.cancel()
        try:
            await asyncio.wait_for(worker, timeout=10)
        except asyncio.TimeoutError:
            print("Ungraceful shutdown")
        except asyncio.CancelledError:
            print("Graceful shutdown")

app = FastAPI(lifespan=lifespan)
Example jobs with retry strategy:
@task_queue.task(retry=RetryStrategy(
    max_attempts=5,
    wait=5,
    exponential_wait=5
))
async def process_file(file_id: str):
    ...

@task_queue.periodic(cron="*/10 * * * *")
@task_queue.task(queueing_lock="retry_stalled_jobs", pass_context=True)
async def retry_stalled_jobs(context, timestamp):
    ...
I wouldn't call it minor. It's something that should not happen. From looking at the worker code, the retry happens before the status of the job is changed and the job may be deleted. But I wonder if your
(Unfortunately, unable to reproduce so far. If anyone has additional info, please reopen!)
I am running into an issue where the worker crashes now and then and never restarts. I keep getting the following error on periodic runs:
single_worker error: ConnectorException('\n Database error.\n ')
When I looked into Postgres, we got the following exception at the same time
The following is my job setup
I am running the worker in async mode through a FastAPI asynccontextmanager
A few times we get the following error, which is not very descriptive either
Any idea what's going on?