Prefect Community #ask-community

Ovo Ojameruaye

02/08/2022, 7:58 AM

I have been running prefect server for the last month and my containers and services have stayed up and healthy. I am not sure what changed in the last few hours, but my agents can no longer connect to the server and the UI no longer renders. I am not sure how to debug this. From what I can see, the services are still up. The agents raise a Timeout exception, and the UI tries to load but doesn't render. I can shut down and restart all containers, but I would like to keep the metadata stored in the database. Can anyone help?

Ovo Ojameruaye

02/08/2022, 8:38 AM

These are the apollo logs

(node:53) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see <https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode>). (rejection id: 1015)
Sending telemetry to Prefect Technologies, Inc.: {"source":"prefect_server","type":"heartbeat","payload":{"id":"10b9bf37-9aba-4de4-b89f-02cc42e36e75","prefect_server_version":"2021.11.09","api_version":"0.2.0"}}
(node:53) UnhandledPromiseRejectionWarning: FetchError: request to <https://sens-o-matic.prefect.io/> failed, reason: getaddrinfo EAI_AGAIN <http://sens-o-matic.prefect.io|sens-o-matic.prefect.io>
    at ClientRequest.<anonymous> (/apollo/node_modules/node-fetch/lib/index.js:1461:11)
    at ClientRequest.emit (events.js:315:20)
    at TLSSocket.socketErrorListener (_http_client.js:469:9)
    at TLSSocket.emit (events.js:315:20)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)

Ovo Ojameruaye

02/08/2022, 8:40 AM

No logs are returned from Hasura and Postgres; the other services return healthy logs.

Ovo Ojameruaye

02/08/2022, 8:44 AM

I tried to restart the apollo container, still no luck

GraphQL service healthy!
> @ serve /apollo
> node dist/index.js
Building schema...
Building schema complete!
Server ready at <http://0.0.0.0:4200> :rocket: (version: 2021.11.09)
Sending telemetry to Prefect Technologies, Inc.: {"source":"prefect_server","type":"startup","payload":{"id":"b678a0f0-e4e9-4071-bf5d-0f2e53528ee6","prefect_server_version":"2021.11.09","api_version":"0.2.0"}}
(node:29) UnhandledPromiseRejectionWarning: FetchError: request to <https://sens-o-matic.prefect.io/> failed, reason: getaddrinfo EAI_AGAIN <http://sens-o-matic.prefect.io|sens-o-matic.prefect.io>
    at ClientRequest.<anonymous> (/apollo/node_modules/node-fetch/lib/index.js:1461:11)
    at ClientRequest.emit (events.js:315:20)
    at TLSSocket.socketErrorListener (_http_client.js:469:9)
    at TLSSocket.emit (events.js:315:20)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:80:21)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:29) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see <https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode>). (rejection id: 1)
(node:29) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
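
A note on the log above: `getaddrinfo EAI_AGAIN` is a name-resolution error, meaning the Apollo container could not look up `sens-o-matic.prefect.io` at that moment. A minimal sketch for checking whether a hostname resolves from a given environment follows; the function name `can_resolve` is hypothetical and not part of Prefect.

```python
import socket


def can_resolve(hostname: str, port: int = 443) -> bool:
    """Return True if the system resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # gaierror covers EAI_AGAIN ("Temporary failure in name resolution")
        # as well as other resolver errors.
        return False
```

Calling `can_resolve("sens-o-matic.prefect.io")` from inside the affected container, and again from the host, may help show whether the problem is limited to DNS inside Docker. This is only a diagnostic aid and does not by itself explain why resolution stopped working.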

Anna Geller

02/08/2022, 9:14 AM

How did you start your Server? Can you share the exact command? Did you persist the Postgres database volume? If so, you can recreate all components and run the DB migration. As for the reason why, it looks like sending the telemetry (anonymous usage statistics) to Prefect failed. You can disable it as shown here: https://docs.prefect.io/orchestration/server/telemetry.html

Ovo Ojameruaye

02/08/2022, 2:39 PM

Thanks @Anna Geller, I started prefect server the vanilla way: `prefect server start`.

I didn't persist the volumes when I started the server. If there's no way around it, I guess I'd have to lose the metadata but would make sure to persist the volume when I recreate. Is there an article I can follow to do this? Hope you feel better soon

Kevin Kho

02/08/2022, 3:12 PM

The persistence can be found here

Ovo Ojameruaye

02/08/2022, 4:05 PM

@Kevin Kho, thank you. I destroyed and recreated the containers and the server doesn't start. These are the logs. Any idea why, all of a sudden, I am unable to send the telemetry data? I have updated the config file to stop sending it.

logs.txt

Kevin Kho

02/08/2022, 4:15 PM

I don’t know why you are not able to. How did you turn it off? The config.toml?

Ovo Ojameruaye

02/08/2022, 4:17 PM

Yes

[telemetry]
    [server.telemetry]
        enabled = false

Kevin Kho

02/08/2022, 4:21 PM

Let me ask someone here

Kevin Kho

02/08/2022, 6:05 PM

This appears to be your error. Can you try the fixes? Are you on Windows?

Ovo Ojameruaye

02/08/2022, 6:28 PM

I stopped, removed and recreated all containers, but that doesn't seem to help. I am on a Linux (CentOS) machine, which I SSH into using the VS Code remote server (not sure if this info is relevant). I am working my way down the responses to see which other ones might be relevant.

Kevin Kho

02/08/2022, 6:36 PM

What is your Prefect version?

Ovo Ojameruaye

02/08/2022, 6:37 PM

0.15.10

Kevin Kho

02/08/2022, 6:38 PM

I don’t think it’s related, but did you start with the `--expose` flag?

Ovo Ojameruaye

02/08/2022, 6:40 PM

No, I didn't. I can give it a try. The last time I started the server and it worked, I didn't need the --expose flag. Edit: Tried it, the server didn't start

Kevin Kho

02/08/2022, 6:46 PM

What was your traceback?

Ovo Ojameruaye

02/08/2022, 6:47 PM

For the towel service

towel_1     | {"severity": "ERROR", "name": "prefect-server.ZombieKiller", "message": "Unexpected error: APIError('Unable to complete operation. An internal API error occurred.')", "exc_info": "Traceback (most recent call last):\n  File \"/prefect-server/src/prefect_server/utilities/exceptions.py\", line 87, in reraise_as_api_error\n    yield\n  File \"/prefect-server/src/prefect_server/utilities/graphql.py\", line 64, in execute\n    timeout=30,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1385, in post\n    timeout=timeout,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1148, in request\n    request, auth=auth, allow_redirects=allow_redirects, timeout=timeout,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1169, in send\n    request, auth=auth, timeout=timeout, allow_redirects=allow_redirects,\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1196, in send_handling_redirects\n    request, auth=auth, timeout=timeout, history=history\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1232, in send_handling_auth\n    response = await self.send_single_request(request, timeout)\n  File \"/usr/local/lib/python3.7/site-packages/httpx/_client.py\", line 1269, in send_single_request\n    timeout=timeout.as_dict(),\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_async/connection_pool.py\", line 153, in request\n    method, url, headers=headers, stream=stream, timeout=timeout\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py\", line 65, in request\n    self.socket = await self._open_socket(timeout)\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_async/connection.py\", line 86, in _open_socket\n    hostname, port, ssl_context, timeout\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_backends/auto.py\", line 38, in open_tcp_stream\n    return await self.backend.open_tcp_stream(hostname, 
port, ssl_context, timeout)\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_backends/asyncio.py\", line 234, in open_tcp_stream\n    stream_reader=stream_reader, stream_writer=stream_writer\n  File \"/usr/local/lib/python3.7/contextlib.py\", line 130, in __exit__\n    self.gen.throw(type, value, traceback)\n  File \"/usr/local/lib/python3.7/site-packages/httpcore/_exceptions.py\", line 12, in map_exceptions\n    raise to_exc(exc) from None\nhttpcore._exceptions.ConnectError: [Errno 111] Connect call failed ('XXX.XX.0.3', 3000)\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/prefect-server/src/prefect_server/services/loop_service.py\", line 60, in run\n    await self.run_once()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 216, in run_once\n    await self.reap_zombie_task_runs()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 153, in reap_zombie_task_runs\n    apply_schema=False,\n  File \"/prefect-server/src/prefect_server/database/orm.py\", line 501, in get\n    as_box=not apply_schema,\n  File \"/prefect-server/src/prefect_server/database/hasura.py\", line 85, in execute\n    as_box=as_box,\n  File \"/prefect-server/src/prefect_server/utilities/graphql.py\", line 64, in execute\n    timeout=30,\n  File \"/usr/local/lib/python3.7/contextlib.py\", line 188, in __aexit__\n    await self.gen.athrow(typ, value, traceback)\n  File \"/prefect-server/src/prefect_server/utilities/exceptions.py\", line 93, in reraise_as_api_error\n    raise APIError() from exc\nprefect_server.utilities.exceptions.APIError: Unable to complete operation. An internal API error occurred."}
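
A note on the traceback above: it bottoms out in `[Errno 111] Connect call failed ('XXX.XX.0.3', 3000)`, reached through `prefect_server/database/hasura.py`, so the towel service could not open a TCP connection to that address and port. A minimal sketch for testing reachability of a host and port is below; `port_is_open` is a hypothetical helper, not a Prefect function.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (errno 111), timeouts and DNS failures
        # are all subclasses of OSError.
        return False
```

Run from a container on the same Docker network, something like `port_is_open("hasura", 3000)` would show whether the port from the traceback accepts connections. The hostname `hasura` is an assumption based on the `hasura_1` service prefix that appears later in this thread.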

Ovo Ojameruaye

02/08/2022, 6:47 PM

Hasura

{"type":"pg-client","timestamp":"2022-02-08T18:40:22.795+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(0)."}}
{"type":"pg-client","timestamp":"2022-02-08T18:40:22.795+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(1)."}}
{"type":"startup","timestamp":"2022-02-08T18:40:22.795+0000","level":"error","detail":{"kind":"catalog_migrate","info":{"internal":"could not connect to server: Connection refused\n\tIs the server running on host \"postgres\" (XXX.XX.0.X) and accepting\n\tTCP/IP connections on port 5432?\n","path":"$","error":"connection error","code":"postgres-error"}}}
{"internal":"could not connect to server: Connection refused\n\tIs the server running on host \"postgres\" (XXX.XX.0.X) and accepting\n\tTCP/IP connections on port 5432?\n","path":"$","error":"connection error","code":"postgres-error"}
{"type":"startup","timestamp":"2022-02-08T18:40:26.672+0000","level":"error","detail":{"kind":"catalog_migrate","info":{"internal":{"statement":"CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA public","prepared":false,"error":{"exec_status":"FatalError","hint":null,"message":"duplicate key value violates unique constraint \"pg_extension_name_index\"","status_code":"23505","description":"Key (extname)=(pgcrypto) already exists."},"arguments":[]},"path":"$","error":"pgcrypto extension is required, but it could not be created; encountered unknown postgres error","code":"postgres-error"}}}
{"internal":{"statement":"CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA public","prepared":false,"error":{"exec_status":"FatalError","hint":null,"message":"duplicate key value violates unique constraint \"pg_extension_name_index\"","status_code":"23505","description":"Key (extname)=(pgcrypto) already exists."},"arguments":[]},"path":"$","error":"pgcrypto extension is required, but it could not be created; encountered unknown postgres error","code":"postgres-error"}

Ovo Ojameruaye

02/08/2022, 6:48 PM

Postgres

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok
Success. You can now start the database server using:
    pg_ctl -D /var/lib/postgresql/data -l logfile start
waiting for server to start....2022-02-08 18:40:22.751 UTC [46] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-08 18:40:22.792 UTC [47] LOG:  database system was shut down at 2022-02-08 18:40:22 UTC
2022-02-08 18:40:22.806 UTC [46] LOG:  database system is ready to accept connections
 done
server started
CREATE DATABASE
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
waiting for server to shut down....2022-02-08 18:40:24.281 UTC [46] LOG:  received fast shutdown request
2022-02-08 18:40:24.287 UTC [46] LOG:  aborting any active transactions
2022-02-08 18:40:24.294 UTC [46] LOG:  background worker "logical replication launcher" (PID 53) exited with exit code 1
2022-02-08 18:40:24.300 UTC [48] LOG:  shutting down
2022-02-08 18:40:24.327 UTC [46] LOG:  database system is shut down
 done
server stopped
PostgreSQL init process complete; ready for start up.
2022-02-08 18:40:24.422 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-02-08 18:40:24.423 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2022-02-08 18:40:24.434 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-08 18:40:24.516 UTC [74] LOG:  database system was shut down at 2022-02-08 18:40:24 UTC
2022-02-08 18:40:24.525 UTC [1] LOG:  database system is ready to accept connections
2022-02-08 18:40:26.962 UTC [82] ERROR:  duplicate key value violates unique constraint "pg_extension_name_index"
2022-02-08 18:40:26.962 UTC [82] DETAIL:  Key (extname)=(pgcrypto) already exists.
2022-02-08 18:40:26.962 UTC [82] STATEMENT:  CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA public

Ovo Ojameruaye

02/08/2022, 6:48 PM

The other services are healthy

Kevin Kho

02/08/2022, 6:58 PM

So that database error is normally from concurrent attempts to upgrade the database. But you are starting from scratch right?

Ovo Ojameruaye

02/08/2022, 7:00 PM

Yes, I kill all services, make sure I have no containers running, and then run `prefect server start`.

Kevin Kho

02/08/2022, 7:08 PM

I think a persistent database might have been started. Could we try starting it without persistence, just as a test?

Ovo Ojameruaye

02/08/2022, 7:40 PM

How do I check this? I can't see any containers. I also don't specify the `--use-volume` flag when I start the server, but I have volumes left over from all the times I have tried to start the server.

Ovo Ojameruaye

02/08/2022, 7:40 PM

These volumes are not linked to any containers
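
For reference, Docker calls volumes that no container references "dangling", and the CLI can list them with `docker volume ls --filter dangling=true`. The sketch below wraps that call in Python; the function names are hypothetical, and it assumes the `docker` CLI is installed and on the `PATH`.

```python
import subprocess
from typing import List


def dangling_volumes_command() -> List[str]:
    """Build the Docker CLI call that lists volumes not used by any container."""
    return [
        "docker", "volume", "ls",
        "--filter", "dangling=true",
        "--format", "{{.Name}}",  # print only the volume names
    ]


def list_dangling_volumes() -> List[str]:
    """Run the command and return one volume name per list entry."""
    result = subprocess.run(
        dangling_volumes_command(),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if docker exits non-zero
    )
    return [name for name in result.stdout.splitlines() if name]
```

Calling `list_dangling_volumes()` returns only the leftover volume names. Stopped containers are hidden by a plain `docker ps`, so `docker ps -a` may also be worth checking before concluding that no container holds a volume.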

Kevin Kho

02/08/2022, 7:48 PM

Oh, that should be good then, because it’s starting something new. Will ask the team again.

Ovo Ojameruaye

02/08/2022, 7:54 PM

Thank you

Ovo Ojameruaye

02/08/2022, 8:46 PM

So I deleted all volumes, images, etc., rebooted the VM and ran `prefect server start`. The DB container is stuck on status: restarting. The logs now show:

find: '/var/lib/postgresql/data': Permission denied
chown: changing ownership of '/var/lib/postgresql/data': Permission denied
chmod: changing permissions of '/var/lib/postgresql/data': Permission denied
find: '/var/lib/postgresql/data': Permission denied
chown: changing ownership of '/var/lib/postgresql/data': Permission denied
chmod: changing permissions of '/var/lib/postgresql/data': Permission denied
find: '/var/lib/postgresql/data': Permission denied
chown: changing ownership of '/var/lib/postgresql/data': Permission denied

I have sudo access on my VM, so I'm not sure why. The traceback also looks slightly different:

graphql_1   | Error: (psycopg2.OperationalError) could not translate host name "postgres" to address: Temporary failure in name resolution
graphql_1   | 
graphql_1   | (Background on this error at: <https://sqlalche.me/e/14/e3q8>)
graphql_1   | 
graphql_1   | Could not upgrade the database!
apollo_1    | Checking GraphQL service at <http://graphql:4201/health> ...
tmp_graphql_1 exited with code 1
graphql_1   | 
graphql_1   | Running Alembic migrations...
apollo_1    | Checking GraphQL service at <http://graphql:4201/health> ...
apollo_1    | Checking GraphQL service at <http://graphql:4201/health> ...
hasura_1    | {"type":"pg-client","timestamp":"2022-02-08T20:38:01.019+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(0)."}}
graphql_1   | Running Alembic migrations...
hasura_1    | {"type":"pg-client","timestamp":"2022-02-08T20:38:21.143+0000","level":"warn","detail":{"message":"postgres connection failed, retrying(1)."}}
hasura_1    | {"type":"startup","timestamp":"2022-02-08T20:38:21.143+0000","level":"error","detail":{"kind":"catalog_migrate","info":{"internal":"could not translate host name \"postgres\" to address: Temporary failure in name resolution\n","path":"$","error":"connection error","code":"postgres-error"}}}
hasura_1    | {"internal":"could not translate host name \"postgres\" to address: Temporary failure in name resolution\n","path":"$","error":"connection error","code":"postgres-error"

Kevin Kho

02/08/2022, 9:01 PM

Looking at this

Kevin Kho

02/08/2022, 9:06 PM

Do you have enough memory on the machine? I really don't know what is going on.

Kevin Kho

02/08/2022, 9:07 PM

Could also be something like this

Kevin Kho

02/08/2022, 9:11 PM

There seems to be something wrong with starting Postgres that we have to resolve first. There is a bunch of different material about it.

Ovo Ojameruaye

02/08/2022, 9:25 PM

Yeah, I have enough memory. I changed the permissions on the postgresql folder, and now it's back to the initial errors.

Ovo Ojameruaye

02/08/2022, 9:26 PM

Now the DB starts, but the logs from the Hasura and towel services are the same as I posted earlier.

Ovo Ojameruaye

02/08/2022, 9:27 PM

I am so confused. I have been able to run the server for more than a month, and nothing in my environment changed between yesterday and today.

Kevin Kho

02/08/2022, 9:46 PM

But the hasura logs seem to indicate it can’t connect to the db? Are you actually able to query from the database?

Ovo Ojameruaye

02/08/2022, 10:31 PM

Yes I can

Kevin Kho

02/08/2022, 11:12 PM

Chatted with a few team members and no one really knows what is up here. Can I see your most recent image versions?

Ovo Ojameruaye

02/08/2022, 11:13 PM

I pulled them today

Kevin Kho

02/08/2022, 11:17 PM

This all looks good. I thought this was related to the Hasura upgrade to 2.0, but these versions should be stable.

Ovo Ojameruaye

02/08/2022, 11:19 PM

I am stumped.

Ovo Ojameruaye

02/09/2022, 7:57 AM

@Kevin Kho, I updated the prefect version to 0.15.11 and tried to restart the server, but the server still doesn't start. The logs look clean. I started prefect server on a different machine and the logs look the same. Towel logs:

{"severity": "ERROR", "name": "prefect-server.ZombieKiller", "message": "Unexpected error: ValueError([{'extensions': {'path': '$.selectionSet.task_run', 'code': 'validation-failed'}, 'message': 'field \"task_run\" not found in type: \\'query_root\\''}])", "exc_info": "Traceback (most recent call last):\n  File \"/prefect-server/src/prefect_server/services/loop_service.py\", line 60, in run\n    await self.run_once()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 216, in run_once\n    await self.reap_zombie_task_runs()\n  File \"/prefect-server/src/prefect_server/services/towel/zombie_killer.py\", line 153, in reap_zombie_task_runs\n    apply_schema=False,\n  File \"/prefect-server/src/prefect_server/database/orm.py\", line 501, in get\n    as_box=not apply_schema,\n  File \"/prefect-server/src/prefect_server/database/hasura.py\", line 85, in execute\n    as_box=as_box,\n  File \"/prefect-server/src/prefect_server/utilities/graphql.py\", line 84, in execute\n    raise ValueError(result[\"errors\"])\nValueError: [{'extensions': {'path': '$.selectionSet.task_run', 'code': 'validation-failed'}, 'message': 'field \"task_run\" not found in type: \\'query_root\\''}]"}
{"severity": "INFO", "name": "prefect-server.Scheduler", "message": "Scheduled 0 flow runs."}
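
The towel service writes one JSON object per log line, with `severity`, `name`, `message` and, for errors, `exc_info` fields, which makes long lines like the one above hard to scan. A minimal sketch for condensing such lines is below; `summarize_log_lines` is a hypothetical helper, and it assumes each record sits on a single line, optionally behind a docker-compose prefix such as `towel_1     | `.

```python
import json
from typing import Iterable, List, Tuple


def summarize_log_lines(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (severity, logger name, first line of message) per JSON log line."""
    summary = []
    for line in lines:
        # Skip any "service_1 | " prefix and keep only the JSON part.
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a complete JSON record, e.g. a wrapped line
        message_lines = str(record.get("message", "")).splitlines()
        summary.append((
            record.get("severity", "?"),
            record.get("name", "?"),
            message_lines[0] if message_lines else "",
        ))
    return summary
```

Feeding it the saved output of the towel container gives a short list of severities and messages, so ERROR records such as the `ZombieKiller` one stand out from routine INFO lines.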

traceback.txt

Kevin Kho

02/09/2022, 7:59 AM

Will look more into this tomorrow.

Kevin Kho

02/09/2022, 8:01 AM

Logs are the same on a different machine? Is there any common network configuration between them?

Kevin Kho

02/09/2022, 8:06 AM

What is your OS? Will try to replicate tomorrow.

Ovo Ojameruaye

02/09/2022, 8:07 AM

No, there is not. The machine with the issues is my work machine, and the other is my personal machine. By "the same", I mean that in general there are no errors. The OS of my work machine is CentOS 7.

Kevin Kho

02/09/2022, 8:18 AM

Wait, sorry, I am a bit confused. Server on the personal machine was working? Because these logs look good. And the work machine still doesn’t work, right?

Ovo Ojameruaye

02/09/2022, 4:19 PM

Yes, the server was working on my personal machine but doesn't start on my work machine. I tried on my personal machine so I could compare the logs, and they looked generally the same, with nothing out of place. I had been running the server on my work machine for the last month with no issues before this.
