-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Server container hanging [500] #1649
Comments
@Sticcia - sorry to hear this is happening. Have you deployed the server via Kubernetes or Docker Compose? |
We deployed the server using Docker Compose. It happens once a week on average, our previous deployment was on Digital Ocean and the issue persisted after the migration. I have enabled debug logging now, will come back with an updated log file next time this happens. Thanks for the support. |
Here is the error (GraphQLError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?) which probably triggered the last time server container was hanging. The full log is attached. |
Hi @saschabul Thanks for sending the logs. Knex is the module used to connect to the Postgres database. This error may indicate a number of things; it may be that it temporarily lost connection to the Postgres database, or that more connections were requested than there were available. It may also indicate other issues, such as the server not freeing up connections quickly enough. Do you know if this failure occurred during any particular operation, for example sending a particular model from a Speckle connector? Do you have a monitoring system in place to display the prometheus-format monitoring data that Speckle server generates? You may be able to see how the number of database connections change over time, and whether this correlates with higher demand on Speckle server. |
Hi @iainsproat No, we don't see any relation with actions taken by users. |
I've created a Pull Request with a change which will include a health check for Speckle Server when run via Docker Compose. This health check with repeatedly poll Speckle Server, and if it stops responding for approximately 30 seconds will automatically restart it. Could you try out the change in your environment and let me know if this works for you? https://github.com/specklesystems/speckle-server/pull/1651/files Unfortunately this doesn't address the root cause, but will help reduce the amount of downtime and need for manual intervention to restart it. |
I updated the docker compose configuration following your changes. version: '2.4'
services:
####
# Speckle Server dependencies
#######
postgres:
image: 'postgres:14.5-alpine'
restart: always
environment:
POSTGRES_DB: speckle
POSTGRES_USER: speckle
POSTGRES_PASSWORD: speckle
volumes:
- ./postgres-data:/var/lib/postgresql/data/
ports:
- '127.0.0.1:5432:5432'
redis:
image: 'redis:7.0-alpine'
restart: always
volumes:
- ./redis-data:/data
ports:
- '127.0.0.1:6379:6379'
minio:
image: 'minio/minio'
command: server /data --console-address ":9001"
restart: always
volumes:
- ./minio-data:/data
ports:
- '127.0.0.1:9000:9000'
- '127.0.0.1:9001:9001'
####
# Speckle Server
#######
speckle-frontend:
image: speckle/speckle-frontend:2.13.3
restart: always
ports:
- '127.0.0.1:8080:8080'
environment:
FILE_SIZE_LIMIT_MB: 100
speckle-server:
image: speckle/speckle-server:2.13.3
restart: always
healthcheck:
test:
- CMD
- node
- -e
- "require('node:http').request({headers: {'Content-Type': 'application/json'}, port:3000, hostname:'127.0.0.1', path:'/graphql?query={serverInfo{version}}', method: 'GET' }, (res) => { body = ''; res.on('data', (chunk) => {body += chunk;}); res.on('end', () => {process.exit(res.statusCode != 200 || body.toLowerCase().includes('error'));}); }).end();"
interval: 10s
timeout: 10s
retries: 3
start_period: 5s
ports:
- '127.0.0.1:3000:3000'
command: ['bash', '-c', '/wait && node bin/www']
environment:
CANONICAL_URL: "https://geo.stykka.com"
SESSION_SECRET: "***"
STRATEGY_LOCAL: 'true'
LOG_LEVEL: 'debug'
POSTGRES_URL: 'postgres'
POSTGRES_USER: 'speckle'
POSTGRES_PASSWORD: 'speckle'
POSTGRES_DB: 'speckle'
REDIS_URL: 'redis://redis'
WAIT_HOSTS: 'postgres:5432, redis:6379, minio:9000'
EMAIL: "true"
EMAIL_HOST: "smtp.sendgrid.net"
EMAIL_PORT: "465"
EMAIL_USERNAME: "apikey"
EMAIL_PASSWORD: "***"
EMAIL_FROM: "***"
EMAIL_SECURE: 'true'
S3_ENDPOINT: 'http://minio:9000'
S3_ACCESS_KEY: 'minioadmin'
S3_SECRET_KEY: 'minioadmin'
S3_BUCKET: 'speckle-server'
S3_CREATE_BUCKET: 'true'
S3_REGION: '' # optional, defaults to 'us-east-1'
FILE_SIZE_LIMIT_MB: 500
SENTRY_DSN: "***"
speckle-preview-service:
image: speckle/speckle-preview-service:2.13.3
restart: always
mem_limit: '1000m'
memswap_limit: '1000m'
command: ['bash', '-c', '/wait && node bin/www']
environment:
LOG_LEVEL: 'info'
PG_CONNECTION_STRING: 'postgres://speckle:speckle@postgres/speckle'
WAIT_HOSTS: 'postgres:5432'
speckle-webhook-service:
image: speckle/speckle-webhook-service:2.13.3
restart: always
command: ['bash', '-c', '/wait && node main.js']
environment:
LOG_LEVEL: 'info'
PG_CONNECTION_STRING: 'postgres://speckle:speckle@postgres/speckle'
WAIT_HOSTS: 'postgres:5432'
fileimport-service:
image: speckle/speckle-fileimport-service:2.13.3
restart: always
command: ['bash', '-c', '/wait && node src/daemon.js']
environment:
LOG_LEVEL: 'info'
PG_CONNECTION_STRING: 'postgres://speckle:speckle@postgres/speckle'
WAIT_HOSTS: 'postgres:5432'
SPECKLE_SERVER_URL: 'http://speckle-server:3000' |
@Sticcia - thanks for looking at it. After you updated the docker compose file version, did the server start and work as normal? |
@iainsproat - yes, the server is running as normal with the configuration above (I just now did a small edit). We will let you know if the server crashes again. Thanks for the support. |
@iainsproat - we had a crash again Friday with the same error pasted by @saschabul above. We set up an up-time check with the same settings you shared in the yaml configuration. The check notified us of the failure but the container did not automatically restart. Also, I tried to look into the logs with
|
Hi @iainsproat the issue is happening more often and we cannot trace it to any actions that we take ourselves. The Docker Compose configuration you shared with us does not restart the service automatically. We can set it up ourselves to do this but it won't solve the issue. Do you have any news or information regarding the issue? |
Hi @Sticcia - can you share the version of docker ( Can you also run Please also run Are there any other events reported by Docker around that time that might help debug this issue? |
Docker versions:
Docker inspect: {"Status":"unhealthy","FailingStreak":12,"Log":[{"Start":"2023-07-07T12:18:58.144622803Z","End":"2023-07-07T12:19:08.144699892Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:19:18.15169539Z","End":"2023-07-07T12:19:28.151803699Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:19:38.158376948Z","End":"2023-07-07T12:19:48.158506217Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:19:58.165514836Z","End":"2023-07-07T12:20:08.165713169Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"},{"Start":"2023-07-07T12:20:18.171790741Z","End":"2023-07-07T12:20:28.171900318Z","ExitCode":-1,"Output":"Health check exceeded timeout (10s)"}]} Docker events: Thanks for the help |
I can see in the events that it is returning an Docker's documentation states that:
So neither Despite this, it is reporting the container as The latest Docker version is |
Hi @Sticcia - I've made a change to the healthcheck to ensure it times out correctly and responds with the exit code expected by docker. The healthcheck command should now be:
Let me know how you get on with this change, and please provide the docker events output, docker inspect output, and logs if you encounter any further issues. |
@iainsproat - the server crashed again and didn't restart: Here are the logs you mentioned in the last comment. Docker inspect output:
Docker events output:
Sentry error: {"event_id":"13bbb381b042402e9d6b4dd03245d292","project":4505408537427968,"release":null,"dist":null,"platform":"node","message":"","datetime":"2023-07-12T10:47:58+00:00","tags":[["environment","production"],["handled","yes"],["kind","query"],["level","error"],["mechanism","generic"],["runtime","node v18.15.0"],["runtime.name","node"],["user","id:93d69b57eb759c2ef030eba5ee7186b4acdf0629e2762c3fdc36b84222f71807"],["server_name","0f7e8173e604"],["transaction","GET /graphql"],["url","http://127.0.0.1/graphql"]],"_metrics":{"bytes.ingested.event":5619,"bytes.stored.event":14234},"breadcrumbs":{"values":[{"timestamp":1688975303.689,"type":"http","category":"http","level":"info","data":{"method":"HEAD","status_code":200,"url":"http://minio:9000/speckle-server/"}},{"timestamp":1689158878.485,"type":"default","category":"query-path","level":"debug","message":"serverInfo"}]},"contexts":{"runtime":{"name":"node","version":"v18.15.0","type":"runtime"},"trace":{"trace_id":"7fe1728c5e1e4bc09d148704c179edf0","span_id":"bbe45e0d09007e38","op":"GQL query serverInfo","status":"unknown","type":"trace"}},"culprit":"GET /graphql","environment":"production","errors":[{"type":"js_no_source","url":"/speckle-server/node_modules/knex/lib/client.js"},{"type":"js_no_source","url":"/speckle-server/node_modules/knex/lib/execution/runner.js"},{"type":"js_no_source","url":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js"},{"type":"js_no_source","url":"/speckle-server/packages/server/dist/modules/core/services/generic.js"}],"exception":{"values":[{"type":"GraphQLError","value":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?","stacktrace":{"frames":[{"function":"Object.serverInfo","module":"server","filename":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","abs_path":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","lineno":7,"colno":20,"pre_context":["'use strict';","const { validateServerRole, validateScopes } = require('@/modules/shared');","const { updateServerInfo, getServerInfo, getPublicScopes, getPublicRoles } = require('../../services/generic');","module.exports = {"," Query: {"," async serverInfo() {"],"context_line":" return await getServerInfo();","post_context":[" }"," },"," ServerInfo: {"," async roles() {"," return await getPublicRoles();"," },"," async scopes() {"],"in_app":true},{"function":"getServerInfo","module":"generic","filename":"/speckle-server/packages/server/dist/modules/core/services/generic.js","abs_path":"/speckle-server/packages/server/dist/modules/core/services/generic.js","lineno":11,"colno":28,"pre_context":["const Scopes = () => knex('scopes');","const Info = () => knex('server_config');","module.exports = {"," /**"," * @returns {Promise<import('@/modules/core/helpers/types').ServerInfo>}"," */"," async getServerInfo() {"],"context_line":" const serverInfo = await Info().select('*').first();","post_context":[" serverInfo.version = process.env.SPECKLE_SERVER_VERSION || 'dev';"," serverInfo.canonicalUrl = process.env.CANONICAL_URL || 'localhost';"," return serverInfo;"," },"," async getAllScopes() {"," return await Scopes().select('*');"," },"],"in_app":true},{"function":"Runner.run","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":30,"colno":19,"pre_context":["\r"," // \"Run\" the target, calling \"toSQL\" on the builder, returning\r"," // an object or array of queries to run, each of which are run on\r"," // a single connection.\r"," async run() {\r"," const runner = this;\r"," try {\r"],"context_line":" const res = await this.ensureConnection(ensureConnectionCallback);\r","post_context":["\r"," // Fire a single \"end\" event on the builder when\r"," // all queries have successfully completed.\r"," runner.builder.emit('end');\r"," return res;\r","\r"," // If there are any \"error\" listeners, we fire an error event\r"],"in_app":false},{"function":"Runner.ensureConnection","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":287,"colno":28,"pre_context":["\r"," if (this.connection) {\r"," return cb(this, cbParams);\r"," }\r","\r"," let acquiredConnection;\r"," try {\r"],"context_line":" acquiredConnection = await this.client.acquireConnection();\r","post_context":[" } catch (error) {\r"," if (!(error instanceof KnexTimeoutError)) {\r"," return Promise.reject(error);\r"," }\r"," if (this.builder) {\r"," error.sql = this.builder.sql;\r"," error.bindings = this.builder.bindings;\r"],"in_app":false},{"function":"Client_PG.acquireConnection","module":"knex.lib:client","filename":"/speckle-server/node_modules/knex/lib/client.js","abs_path":"/speckle-server/node_modules/knex/lib/client.js","lineno":312,"colno":26,"pre_context":[" try {\r"," const connection = await this.pool.acquire().promise;\r"," debug('acquired connection from pool: %s', connection.__knexUid);\r"," return connection;\r"," } catch (error) {\r"," let convertedError = error;\r"," if (error instanceof TimeoutError) {\r"],"context_line":" convertedError = new KnexTimeoutError(\r","post_context":[" 'Knex: Timeout acquiring a connection. The pool is probably full. ' +\r"," 'Are you missing a .transacting(trx) call?'\r"," );\r"," }\r"," throw convertedError;\r"," }\r"," }\r"],"in_app":false}]},"raw_stacktrace":{"frames":[{"function":"Object.serverInfo","module":"server","filename":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","abs_path":"/speckle-server/packages/server/dist/modules/core/graph/resolvers/server.js","lineno":7,"colno":20,"pre_context":["'use strict';","const { validateServerRole, validateScopes } = require('@/modules/shared');","const { updateServerInfo, getServerInfo, getPublicScopes, getPublicRoles } = require('../../services/generic');","module.exports = {"," Query: {"," async serverInfo() {"],"context_line":" return await getServerInfo();","post_context":[" }"," },"," ServerInfo: {"," async roles() {"," return await getPublicRoles();"," },"," async scopes() {"],"in_app":true},{"function":"getServerInfo","module":"generic","filename":"/speckle-server/packages/server/dist/modules/core/services/generic.js","abs_path":"/speckle-server/packages/server/dist/modules/core/services/generic.js","lineno":11,"colno":28,"pre_context":["const Scopes = () => knex('scopes');","const Info = () => knex('server_config');","module.exports = {"," /**"," * @returns {Promise<import('@/modules/core/helpers/types').ServerInfo>}"," */"," async getServerInfo() {"],"context_line":" const serverInfo = await Info().select('*').first();","post_context":[" serverInfo.version = process.env.SPECKLE_SERVER_VERSION || 'dev';"," serverInfo.canonicalUrl = process.env.CANONICAL_URL || 'localhost';"," return serverInfo;"," },"," async getAllScopes() {"," return await Scopes().select('*');"," },"],"in_app":true},{"function":"Runner.run","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":30,"colno":19,"pre_context":["\r"," // \"Run\" the target, calling \"toSQL\" on the builder, returning\r"," // an object or array of queries to run, each of which are run on\r"," // a single connection.\r"," async run() {\r"," const runner = this;\r"," try {\r"],"context_line":" const res = await this.ensureConnection(ensureConnectionCallback);\r","post_context":["\r"," // Fire a single \"end\" event on the builder when\r"," // all queries have successfully completed.\r"," runner.builder.emit('end');\r"," return res;\r","\r"," // If there are any \"error\" listeners, we fire an error event\r"],"in_app":false},{"function":"Runner.ensureConnection","module":"knex.lib.execution:runner","filename":"/speckle-server/node_modules/knex/lib/execution/runner.js","abs_path":"/speckle-server/node_modules/knex/lib/execution/runner.js","lineno":287,"colno":28,"pre_context":["\r"," if (this.connection) {\r"," return cb(this, cbParams);\r"," }\r","\r"," let acquiredConnection;\r"," try {\r"],"context_line":" acquiredConnection = await this.client.acquireConnection();\r","post_context":[" } catch (error) {\r"," if (!(error instanceof KnexTimeoutError)) {\r"," return Promise.reject(error);\r"," }\r"," if (this.builder) {\r"," error.sql = this.builder.sql;\r"," error.bindings = this.builder.bindings;\r"],"in_app":false},{"function":"Client_PG.acquireConnection","module":"knex.lib:client","filename":"/speckle-server/node_modules/knex/lib/client.js","abs_path":"/speckle-server/node_modules/knex/lib/client.js","lineno":312,"colno":26,"pre_context":[" try {\r"," const connection = await this.pool.acquire().promise;\r"," debug('acquired connection from pool: %s', connection.__knexUid);\r"," return connection;\r"," } catch (error) {\r"," let convertedError = error;\r"," if (error instanceof TimeoutError) {\r"],"context_line":" convertedError = new KnexTimeoutError(\r","post_context":[" 'Knex: Timeout acquiring a connection. The pool is probably full. ' +\r"," 'Are you missing a .transacting(trx) call?'\r"," );\r"," }\r"," throw convertedError;\r"," }\r"," }\r"],"in_app":false}]},"mechanism":{"type":"generic","handled":true}}]},"extra":{"query":"{serverInfo{version}}","variables":"[undefined]"},"fingerprint":["{{ default }}"],"grouping_config":{"enhancements":"eJybzDRxc15qeXFJZU6qlZGBkbGugaGuoeEEAHJMCAM","id":"newstyle:2023-01-11"},"hashes":["417af1dfb2d7a62282628d53be75280d","398a1bbc2e7bcd6edbe5b994316b8afd"],"ingest_path":[{"version":"23.6.2","public_key":"XE7QiyuNlja9PZ7I9qJlwQotzecWrUIN91BAO7Q5R38"}],"key_id":"3236202","level":"error","location":"/speckle-server/packages/server/dist/modules/core/services/generic.js","logger":"","metadata":{"display_title_with_tree_label":false,"filename":"/speckle-server/packages/server/dist/modules/core/services/generic.js","function":"getServerInfo","type":"GraphQLError","value":"Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?"},"nodestore_insert":1689158879.262104,"processed_by_symbolicator":true,"received":1689158878.589285,"request":{"url":"http://127.0.0.1/graphql","method":"GET","query_string":[["query","{serverInfo{version}}"]],"headers":[["Connection","close"],["Content-Type","application/json"],["Host","127.0.0.1:3000"],["X-Request-Id","2c5a421e-dfce-45ec-9e16-7d5823587d96"]],"inferred_content_type":"application/json"},"sdk":{"name":"sentry.javascript.node","version":"6.19.7","integrations":["InboundFilters","FunctionToString","ContextLines","Console","OnUncaughtException","OnUnhandledRejection","LinkedErrors","Postgres","Http","Express"],"packages":[{"name":"npm:@sentry/node","version":"6.19.7"}]},"timestamp":1689158878.485,"title":"GraphQLError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(t...","transaction":"GET /graphql","type":"error","user":{"id":"93d69b57eb759c2ef030eba5ee7186b4acdf0629e2762c3fdc36b84222f71807"},"version":"7"} The only difference we noticed is that the Sentry error was triggered many times (untill we stopped and started it manually), we believe it was every health check (alternating between the container check on 127.0.0.1 and our monitoring check on the public URL). |
Thanks for providing this. After reading a little more about Docker Compose's behaviour, it seems that it will not restart unhealthy containers 😞 . (For security reasons I would avoid the 'autoheal' solution mentioned in that link, as it requires mounting the docker socket to that container, effectively giving it access to your host machine.) From your error message we're again seeing the same
If you are able to connect Prometheus or Grafana to the metrics endpoint ( As a possible solution to saturated connections, you could attempt to increase the number of available connections by adding the You could also provide redundancy by providing a one or more replicas of the speckle server container. If one container was to fail, or its database connections become temporarily saturated, then the other could continue to operate and prevent downtime. Please refer to Docker's documentation on replicas. |
Hi again @iainsproat I have set-up a Grafana server on a separate VM instance but I am having trouble configuring the Prometheus data source. |
Can you connect to your speckle server and query |
Hi @iainsproat, we have set-up Grafana and the Prometheus logging.
Also, the server usage was basically zero at the time of the crash (being a Sunday with most of us on vacation), so no relation to user actions was seen. |
Thanks @Sticcia |
@iainsproat - your understanding is correct but when I mentioned a health-check in the previous comment I meant a check made by us, not the Docker Compose health check, I wasn't very clear, sorry. We had already made the same fix as you just did. |
Grafana is helpful, as I think we see a dramatic increase in connections at around 14:30. But it's hard to tell if this is related or due to the restart, and its a consequence rather than the cause. Do you still have the data available in Grafana to plot the metric I'd like to establish more exact timeline; do we see the increase in connections before or after the restart? |
I was mistaken, the server crashed twice this weekend. Once at 10:45 where our uptime check got stuck in a restart loop (this is clearly visible in the graphs too, there are gaps during the restarts). And once at 14:20, where the health check was disabled and the restart was done manually at 17:00. Sorry for the confusion. Here is the data you asked for the crash at 14:20 (with no action taken in this time slot). Data from 14:18 - 14:35Speckle Server container logs: Grafana data export: The increase of connections is before the restart. |
Hi again @iainsproat We managed to replicate the crash fairly consistently (although not 100% of the time) by lowering the VM type (CPU and memory) and overloading the server with many open connections. We don't know if this was the same issue that happened sporadically before but the crash errors are the same. We will write some tests to load the server in a more controlled way and see if that affects the performance/crash frequency. |
@Sticcia - that's a great finding, it sounds as if it might be a resource management issue. You may wish to run the server in |
@iainsproat - thanks for the quick response, if I understand correctly I should edit the docker-compose configuration with something like speckle-server:
image: speckle/speckle-server:2.13.3
ports:
- '127.0.0.1:3000:3000'
- '127.0.0.1:3000:9229'
command: ['bash', '-c', '/wait && node --inspect=9229 bin/www'] and then I can run something like Chrome DevTools on geo.stykka.com:9229 to see the debug output? |
@Sticcia - yes, though I think the command can just be |
Any update on this? We are facing the very same issue as well. We have increased VM sizes to
and still face random crashes |
up @iainsproat |
Hi @Sticcia - did you manage to find time to investigate this further? Does it still occur if you use the latest version of speckle's server? |
Hi @iainsproat, |
Prerequisites
What package are you referring to?
Server (backend)
Describe the bug
Our Speckle Server is unexpectedly going in an unresponsive state.
Restarting the speckle-server Docker container will solve the issue but we can't understand what is causing it.
We have investigated the logs and only found some error messages from /graphql API requests (part of the log file attached below).
Following the unexpected crash all requests to the Speckle server return:
Error: failed with status code 500
.Expected behavior
Server running smoothly and all requests handled correctly
System Info
Google Cloud VM Instance:
Failure Logs
Last 500 lines of the speckle-server container logs
Affected Projects
Speckle Frontend UI and Javascript viewer are not loading, .NET requests are failing.
The text was updated successfully, but these errors were encountered: