I have been running prefect server for the last mo...
# ask-community
o
I have been running Prefect Server for the last month and my containers and services have stayed up and healthy. I am not sure what changed in the last few hours, but my agents can no longer connect to the server and the UI no longer renders. I am not sure how to debug this. From what I can see, the services are still up. The agents raise a Timeout exception, and the UI tries to load but doesn't render. I can shut down and restart all containers, but I would like to keep the metadata stored in the database. Can anyone help?
These are the apollo logs
(node:53) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see <https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode>). (rejection id: 1015)
Sending telemetry to Prefect Technologies, Inc.: {"source":"prefect_server","type":"heartbeat","payload":{"id":"10b9bf37-9aba-4de4-b89f-02cc42e36e75","prefect_server_version":"2021.11.09","api_version":"0.2.0"}}
(node:53) UnhandledPromiseRejectionWarning: FetchError: request to <https://sens-o-matic.prefect.io/> failed, reason: getaddrinfo EAI_AGAIN <http://sens-o-matic.prefect.io|sens-o-matic.prefect.io>
    at ClientRequest.<anonymous> (/apollo/node_modules/node-fetch/lib/index.js:1461:11)
    at ClientRequest.emit (events.js:315:20)
    at TLSSocket.socketErrorListener (_http_client.js:469:9)
    at TLSSocket.emit (events.js:315:20)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)
Hasura and Postgres return no logs; the other services return healthy logs.
I tried restarting the apollo container, still no luck.
GraphQL service healthy!
> @ serve /apollo
> node dist/index.js
Building schema...
Building schema complete!
Server ready at <http://0.0.0.0:4200> :rocket: (version: 2021.11.09)
Sending telemetry to Prefect Technologies, Inc.: {"source":"prefect_server","type":"startup","payload":{"id":"b678a0f0-e4e9-4071-bf5d-0f2e53528ee6","prefect_server_version":"2021.11.09","api_version":"0.2.0"}}
(node:29) UnhandledPromiseRejectionWarning: FetchError: request to <https://sens-o-matic.prefect.io/> failed, reason: getaddrinfo EAI_AGAIN <http://sens-o-matic.prefect.io|sens-o-matic.prefect.io>
    at ClientRequest.<anonymous> (/apollo/node_modules/node-fetch/lib/index.js:1461:11)
    at ClientRequest.emit (events.js:315:20)
    at TLSSocket.socketErrorListener (_http_client.js:469:9)
    at TLSSocket.emit (events.js:315:20)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:29) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see <https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode>). (rejection id: 1)
(node:29) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
a
How did you start your Server? Can you share the exact command? Did you persist the Postgres database volume? If so, you can recreate all components and run the DB migration. As for the reason why, it looks like sending the telemetry (anonymous usage statistics) to Prefect failed. You can disable it as shown here: https://docs.prefect.io/orchestration/server/telemetry.html
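The `getaddrinfo EAI_AGAIN` in the apollo logs is a name-resolution failure, so it can help to check DNS from the affected host or container before changing anything else. A minimal sketch using only the Python standard library; the function name `can_resolve` and the injectable `resolver` parameter are illustrative assumptions, not part of Prefect:

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return (True, [addresses]) if hostname resolves, else (False, reason)."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror as exc:
        # EAI_AGAIN is the "temporary failure in name resolution" code from the logs.
        temporary = exc.errno == getattr(socket, "EAI_AGAIN", None)
        kind = "temporary DNS failure" if temporary else "DNS failure"
        return False, f"{kind}: {exc}"
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({entry[4][0] for entry in results})
    return True, addresses
```

Calling `can_resolve("sens-o-matic.prefect.io")` on the host, and `can_resolve("postgres")` from inside one of the service containers, would show whether lookups fail in general or only on the Docker network.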
o
Thanks @Anna Geller, I started prefect server the vanilla way.
prefect server start.
I didn't persist the volumes when I started the server. If there's no way around it, I guess I'd have to lose the metadata but would make sure to persist the volume when I recreate. Is there an article I can follow to do this? Hope you feel better soon
k
The persistence can be found here
o
@Kevin Kho, thank you. I destroyed and recreated the containers and the server doesn't start. These are the logs. Any idea why, all of a sudden, I am unable to send the telemetry data? I have updated the config file to stop sending it.
k
I don’t know why you are not able to. How did you turn it off? The config.toml?
o
Yes
[telemetry]
    [server.telemetry]
        enabled = false
k
Let me ask someone here
This appears to be your error. Can you try the fixes? Are you on Windows?
o
I stopped, removed and recreated all containers but that doesn't seem to help. I am on a Linux (CentOS) machine, which I SSH into using VS Code Remote (not sure if this is relevant). I am working my way down the responses to see which other ones might be relevant.
k
What is your Prefect version?
o
0.15.10
k
I don’t think it’s related, but did you start with the --expose flag?
o
No, I didn't. I can give it a try. The last time I started the server and it worked, I didn't need the --expose flag. Edit: Tried it, the server didn't start
k
What was your traceback?
o
For the towel service
towel_1     | {"severity": "ERROR", "name": "prefect-server.ZombieKiller", "message": "Unexpected error: APIError('Unable to complete operation. An internal API error occurred.')", "exc_info": "Traceback (most recent call last):\n  File \"/prefect-server/src/prefect_server/utilities/exceptions.py\", line 87, in reraise_as_api_error\n    yield\n  File \"/prefect-server/src/prefect_server/utilities/graphql.py\", line 64, in execute\n    timeout=30,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1385, in post\n    timeout=timeout,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1148, in request\n    request, auth=auth, allow_redirects=allow_redirects, timeout=timeout,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1169, in send\n    request, auth=auth, timeout=timeout, allow_redirects=allow_redirects,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1196, in send_handling_redirects\n    request, auth=auth, timeout=timeout, history=history\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1232, in send_handling_auth\n    response = await self.send_single_request(request, timeout)\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1269, in send_single_request\n    timeout=timeout.as_dict(),\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_async/connection_pool.py\", line 153, in request\n    method, url, headers=headers, stream=stream, timeout=timeout\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py\", line 65, in request\n    self.socket = await self._open_socket(timeout)\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py\", line 86, in _open_socket\n    hostname, port, ssl_context, timeout\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_backends/auto.py\", line 38, in open_tcp_stream\n    return await self.backend.open_tcp_stream(hostname, 
port, ssl_context, timeout)\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_backends/asyncio.py\", line 234, in open_tcp_stream\n    stream_reader=stream_reader, stream_writer=stream_writer\n  File \"/usr/local/lib/python3.7/contextlib.py\", line 130, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_exceptions.py\", line 12, in map_exceptions\n    raise to_exc(exc) from None\nhttpcore._exceptions.ConnectError: [Errno 111] Connect call failed ('XXX.XX.0.3', 3000)\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/prefect-server/src/prefect_server/services/loop_service.py\", line 60, in run\n    await self.run_once()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 216, in run_once\n    await self.reap_zombie_task_runs()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 153, in reap_zombie_task_runs\n    apply_schema=False,\n  File \"/prefect-server/src/prefect_server/database/orm.py\", line 501, in get\n    as_box=not apply_schema,\n  File \"/prefect-server/src/prefect_server/database/hasura.py\", line 85, in execute\n    as_box=as_box,\n  File \"/prefect-server/src/prefect_server/utilities/graphql.py\", line 64, in execute\n    timeout=30,\n  File \"/usr/local/lib/python3.7/contextlib.py\", line 188, in __aexit__\n    await self.gen.athrow(typ, value, traceback)\n  File \"/prefect-server/src/prefect_server/utilities/exceptions.py\", line 93, in reraise_as_api_error\n    raise APIError() from exc\nprefect_server.utilities.exceptions.APIError: Unable to complete operation. An internal API error occurred."}
Hasura
{"type":"pg-client","timestamp":"2022-02-08T18:40:22.795+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(0)."}}
{"type":"pg-client","timestamp":"2022-02-08T18:40:22.795+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(1)."}}
{"type":"startup","timestamp":"2022-02-08T18:40:22.795+0000","level":"error","detail":{"kind":"catalog_migrate","info":{"internal":"could not connect to server: Connection refused\n\tIs the server running on host \"postgres\" (XXX.XX.0.X) and accepting\n\tTCP/IP connections on port 5432?\n","path":"$","error":"connection error","code":"postgres-error"}}}
{"internal":"could not connect to server: Connection refused\n\tIs the server running on host \"postgres\" (XXX.XX.0.X) and accepting\n\tTCP/IP connections on port 5432?\n","path":"$","error":"connection error","code":"postgres-error"}
{"type":"startup","timestamp":"2022-02-08T18:40:26.672+0000","level":"error","detail":{"kind":"catalog_migrate","info":{"internal":{"statement":"CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA public","prepared":false,"error":{"exec_status":"FatalError","hint":null,"message":"duplicate key value violates unique constraint \"pg_extension_name_index\"","status_code":"23505","description":"Key (extname)=(pgcrypto) already exists."},"arguments":[]},"path":"$","error":"pgcrypto extension is required, but it could not be created; encountered unknown postgres error","code":"postgres-error"}}}
{"internal":{"statement":"CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA public","prepared":false,"error":{"exec_status":"FatalError","hint":null,"message":"duplicate key value violates unique constraint \"pg_extension_name_index\"","status_code":"23505","description":"Key (extname)=(pgcrypto) already exists."},"arguments":[]},"path":"$","error":"pgcrypto extension is required, but it could not be created; encountered unknown postgres error","code":"postgres-error"}
Postgres
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok
Success. You can now start the database server using:
    pg_ctl -D /var/lib/postgresql/data -l logfile start
waiting for server to start....2022-02-08 18:40:22.751 UTC [46] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-08 18:40:22.792 UTC [47] LOG:  database system was shut down at 2022-02-08 18:40:22 UTC
2022-02-08 18:40:22.806 UTC [46] LOG:  database system is ready to accept connections
 done
server started
CREATE DATABASE
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
waiting for server to shut down....2022-02-08 18:40:24.281 UTC [46] LOG:  received fast shutdown request
2022-02-08 18:40:24.287 UTC [46] LOG:  aborting any active transactions
2022-02-08 18:40:24.294 UTC [46] LOG:  background worker "logical replication launcher" (PID 53) exited with exit code 1
2022-02-08 18:40:24.300 UTC [48] LOG:  shutting down
2022-02-08 18:40:24.327 UTC [46] LOG:  database system is shut down
 done
server stopped
PostgreSQL init process complete; ready for start up.
2022-02-08 18:40:24.422 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-02-08 18:40:24.423 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2022-02-08 18:40:24.434 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-08 18:40:24.516 UTC [74] LOG:  database system was shut down at 2022-02-08 18:40:24 UTC
2022-02-08 18:40:24.525 UTC [1] LOG:  database system is ready to accept connections
2022-02-08 18:40:26.962 UTC [82] ERROR:  duplicate key value violates unique constraint "pg_extension_name_index"
2022-02-08 18:40:26.962 UTC [82] DETAIL:  Key (extname)=(pgcrypto) already exists.
2022-02-08 18:40:26.962 UTC [82] STATEMENT:  CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA public
The other services are healthy
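Both the towel traceback (`Connect call failed ... 3000`) and the Hasura error (`Connection refused` on port 5432) are plain TCP connection failures, so a quick reachability check can separate "service not listening" from problems higher up the stack. A minimal sketch with the standard library; `port_is_open` is a hypothetical helper name:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (Errno 111 in the towel log), timeouts and DNS failures.
        return False
```

Run from a container on the same Docker network, `port_is_open("postgres", 5432)` tests the database. `port_is_open("hasura", 3000)` would test the endpoint towel failed to reach, assuming the service is reachable under the host name `hasura` (the name is inferred from the `hasura_1` container prefix, not confirmed in the thread).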
k
So that database error is normally from concurrent attempts to upgrade the database. But you are starting from scratch right?
o
Yes, I kill all services, make sure I have no containers running and then run prefect server start
k
I think a persistent database might have been started. Could we try starting it without persistence, just to test?
o
How do I check this? I can't see any containers. I also don't specify the --use-volume flag when I start the server, but I have volumes left over from all the times I have tried to start the server.
These volumes are not linked to any containers.
k
oh that should be good then because it’s starting something new. will ask the team again
o
Thank you
So I deleted all volumes, images, etc., rebooted the VM and ran prefect server start. The DB container is stuck on status: restarting. The logs now show:
find: '/var/lib/postgresql/data': Permission denied
chown: changing ownership of '/var/lib/postgresql/data': Permission denied
chmod: changing permissions of '/var/lib/postgresql/data': Permission denied
find: '/var/lib/postgresql/data': Permission denied
chown: changing ownership of '/var/lib/postgresql/data': Permission denied
chmod: changing permissions of '/var/lib/postgresql/data': Permission denied
find: '/var/lib/postgresql/data': Permission denied
chown: changing ownership of '/var/lib/postgresql/data': Permission denied
I have sudo access on my VM, so I am not sure why permission is denied. The traceback also looks slightly different:
graphql_1   | Error: (psycopg2.OperationalError) could not translate host name "postgres" to address: Temporary failure in name resolution
graphql_1   | 
graphql_1   | (Background on this error at: <https://sqlalche.me/e/14/e3q8>)
graphql_1   | 
graphql_1   | Could not upgrade the database!
apollo_1    | Checking GraphQL service at <http://graphql:4201/health> ...
tmp_graphql_1 exited with code 1
graphql_1   | 
graphql_1   | Running Alembic migrations...
apollo_1    | Checking GraphQL service at <http://graphql:4201/health> ...
apollo_1    | Checking GraphQL service at <http://graphql:4201/health> ...
hasura_1    | {"type":"pg-client","timestamp":"2022-02-08T20:38:01.019+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(0)."}}
graphql_1   | Running Alembic migrations...
hasura_1    | {"type":"pg-client","timestamp":"2022-02-08T20:38:21.143+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(1)."}}
hasura_1    | {"type":"startup","timestamp":"2022-02-08T20:38:21.143+0000","level":"error","detail":{"kind":"catalog_migrate","info":{"internal":"could not translate host name \"postgres\" to address: Temporary failure in name resolution\n","path":"$","error":"connection error","code":"postgres-error"}}}
hasura_1    | {"internal":"could not translate host name \"postgres\" to address: Temporary failure in name resolution\n","path":"$","error":"connection error","code":"postgres-error"
k
Looking at this
Do you have enough memory on the machine? I really don't know what is going on
Could also be something like this
There looks to be something weird with starting postgres we have to resolve first. A bunch of different stuff about it
o
Yeah, I have enough memory. I changed the permissions on the postgresql folder and now it's back to the initial errors.
Now the DB starts, but the logs from the Hasura and towel services are the same as I posted earlier
I am so confused, I have been able to run server for more than a month and nothing in my environment changed between yesterday and today
k
But the hasura logs seem to indicate it can’t connect to the db? Are you actually able to query from the database?
o
Yes I can
k
Chatted with a few team members and no one really knows what is up here. Can I see your most recent image versions?
o
I pulled them today
k
this all looks good. I thought this was related to the Hasura upgrade to 2.0, but these versions should be stable
o
I am stumped.
@Kevin Kho, I updated Prefect to 0.15.11 and tried to restart the server, but it still doesn't start. The logs look clean. I started Prefect Server on a different machine and the logs look the same.
towel
{"severity": "ERROR", "name": "prefect-server.ZombieKiller", "message": "Unexpected error: ValueError([{'extensions': {'path': '$.selectionSet.task_run', 'code': 'validation-failed'}, 'message': 'field \"task_run\" not found in type: \\'query_root\\''}])", "exc_info": "Traceback (most recent call last):\n  File \"/prefect-server/src/prefect_server/services/loop_service.py\", line 60, in run\n    await self.run_once()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 216, in run_once\n    await self.reap_zombie_task_runs()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 153, in reap_zombie_task_runs\n    apply_schema=False,\n  File \"/prefect-server/src/prefect_server/database/orm.py\", line 501, in get\n    as_box=not apply_schema,\n  File \"/prefect-server/src/prefect_server/database/hasura.py\", line 85, in execute\n    as_box=as_box,\n  File \"/prefect-server/src/prefect_server/utilities/graphql.py\", line 84, in execute\n    raise ValueError(result[\"errors\"])\nValueError: [{'extensions': {'path': '$.selectionSet.task_run', 'code': 'validation-failed'}, 'message': 'field \"task_run\" not found in type: \\'query_root\\''}]"}
{"severity": "INFO", "name": "prefect-server.Scheduler", "message": "Scheduled 0 flow runs."}
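The towel and Hasura services in this thread both emit one JSON object per log line, so error entries can be pulled out mechanically when comparing logs between two machines. A rough sketch under stated assumptions: `extract_errors` is a hypothetical helper, the field names (`severity`, `level`, `name`, `type`, `message`, `detail`) are taken from the log lines pasted above, and the optional `service_1 |` prefix follows the docker-compose output shown earlier.

```python
import json


def extract_errors(lines):
    """Yield (service, source, message) for each error-level JSON log line."""
    for raw in lines:
        raw = raw.strip()
        start = raw.find("{")
        if start == -1:
            continue  # plain-text line, e.g. "Running Alembic migrations..."
        # Anything before the JSON is treated as a compose prefix like "towel_1     | ".
        service = raw[:start].strip(" |\t") or "unknown"
        try:
            record = json.loads(raw[start:])
        except json.JSONDecodeError:
            continue  # truncated or otherwise unparsable line
        if not isinstance(record, dict):
            continue
        # towel uses "severity"; Hasura uses "level".
        level = str(record.get("severity") or record.get("level") or "").lower()
        if level != "error":
            continue
        message = record.get("message")
        if message is None:
            # Hasura nests its error details under "detail".
            message = json.dumps(record.get("detail"))
        yield service, record.get("name", record.get("type", "")), message
```

Feeding it the captured output of each machine and diffing the two result lists is one way to confirm whether the logs really are "the same".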
k
Will look more into this tomorrow
Logs are the same on a different machine? Is there any common network configuration between them?
What is your OS? Will try to replicate tomorrow
o
No there is not. The machine with the issues is my work machine and the other my personal machine. By the same, I mean in general, no errors. The OS of my work machine is CentOS 7
k
Wait, sorry, I am a bit confused. Server on the personal machine was working? Because these logs look good. And the work machine still doesn’t work, right?
o
Yes, the server was working on my personal machine but doesn't start on my work machine. I tried it on my personal machine so I could compare the logs, and they looked generally the same, nothing out of place. I had been running the server on my work machine for the last month with no issues before this