# prefect-server
s
Hi! Coming in to work today, I found that our Prefect server seems to be broken, and I don't have the slightest clue as to why. We're running the server on Kubernetes with a PostgreSQL database. Logs that seem relevant are in the thread below.
Seems like the issue comes from Postgres:
2021-11-12 07:54:48.140 GMT [10288] ERROR:  relation "hdb_catalog.event_log" does not exist at character 15
2021-11-12 07:54:48.140 GMT [10288] STATEMENT:  
	      UPDATE hdb_catalog.event_log
	      SET locked = 't'
	      WHERE id IN ( SELECT l.id
	                    FROM hdb_catalog.event_log l
	                    WHERE l.delivered = 'f' and l.error = 'f' and l.locked = 'f'
	                          and (l.next_retry_at is NULL or l.next_retry_at <= now())
	                          and l.archived = 'f'
	                    ORDER BY created_at
	                    LIMIT $1
	                    FOR UPDATE SKIP LOCKED )
	      RETURNING id, schema_name, table_name, trigger_name, payload::json, tries, created_at
This log entry keeps repeating, but I have no idea where it comes from or what I could do to fix it.
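(Side note: hdb_catalog is the schema used by the Hasura component of Prefect Server. One way to check whether the event_log table actually exists is to query information_schema directly - a minimal sketch with psycopg2, where the connection string is a placeholder:)
```python
# Check whether Hasura's event_log table is present in the Prefect Server DB.
# The DSN below is a placeholder - point it at your PostgreSQL instance.
import psycopg2

conn = psycopg2.connect("postgresql://prefect:password@localhost:5432/prefect")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_schema, table_name
        FROM information_schema.tables
        WHERE table_schema = 'hdb_catalog' AND table_name = 'event_log'
        """
    )
    print(cur.fetchone())  # None means the table is missing
```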
a
Which Prefect version do you run on your Server and agent? And does this version match the Prefect version in the environment from which you register your flows? This may be relevant.
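(A quick way to compare is to print the installed version in each environment - the Server/agent image and the environment you register from; nothing Server-specific is assumed here:)
```python
# Print the installed Prefect version; run this in the Server/agent image
# and in the registration environment, then compare the two.
import prefect

print(prefect.__version__)
```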
s
The agent and Server seem to have this (weird) version number: 0.15.6+11.ga6988f64e, and the environment registering flows uses 0.15.3.
a
In that case it should be fine - in general, the Server version must be higher than or equal to the one used for flow registration. This "weird" version number typically happens when you use a custom editable version installed by cloning the GitHub repo and installing from a directory, e.g. when you use your own fork with custom modifications. Could it be that this causes some issue? How did you find out about the error - did a specific flow fail, or all of them? Are all components healthy, including the agent?
s
It seems like it happened after the prefect-job pod got OOMKilled by Kubernetes. Maybe that caused some issues with Postgres? Anyway, we restarted every component including Postgres, flushed the database, and recreated a tenant, and everything seems fine now. We didn't mind losing the flow history since we're still in a development phase. I'll keep you posted if the issue happens again - it might have been a blip.
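(For completeness, recreating a tenant can be done with a one-liner against the Server API - a minimal sketch, assuming the backend is set to server and the GraphQL endpoint is reachable; the name and slug values are placeholders:)
```python
# Create a tenant on a fresh Prefect Server database
# (assumes `prefect backend server` has already been set).
from prefect import Client

Client().create_tenant(name="default", slug="default")
```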
It actually feels kind of weird that the job pod got OOMKilled, since we use a DaskExecutor. The memory consumption of the master pod should be pretty constant, right?
a
Good job with the restart! By "master pod" do you mean the Dask scheduler pod? If so, you're correct - according to this response from Matthew Rocklin, a task should take up less than a kilobyte on the scheduler. Can you share how you define the DaskExecutor (for example, something like the sketch below)? I think there are several variables here that may play a role:
1. Whether you use a temporary or a long-running cluster
2. What the resource allocation for the Dask workers is
3. What the tasks are doing, etc.
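A temporary, per-flow-run cluster typically looks roughly like this - a minimal sketch assuming dask-kubernetes, where the worker pod spec file and worker count are placeholders:
```python
from prefect.executors import DaskExecutor

# Temporary Dask cluster created for each flow run via dask-kubernetes;
# the pod template path and worker count are illustrative placeholders.
flow.executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={
        "pod_template": "dask-worker-spec.yaml",
        "n_workers": 4,
    },
)
```
With a temporary cluster like this, the Dask scheduler usually runs inside the prefect-job pod itself, so large task results gathered back into the flow run are one possible source of a memory spike there.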
s
By "master" I meant the prefect-job pod that starts the Dask cluster. Looking at resources consumption, I see that its memory consumption skyrocketed at some point which led to it being killed. Unfortunately I don't have access to its logs nor to the Dask pods logs which makes it harder to debug.
a
Interesting - so in the Prefect UI, in the flow run logs, you can only see that the flow started running and the executor was created, but then it crashed due to OOM and you have no more logs on the flow run / task run page? Perhaps to avoid this in the future, you could configure an extra logging service for the entire namespace in which you run Prefect, so that if something goes wrong, you still have the infrastructure-specific logs. Thanks for sharing!
s
To be perfectly fair, I did not have any access to the UI this morning - it failed when trying to load the dashboard. What I saw was that the pod was killed by k8s due to OOM, but I could not access the logs for other reasons (we're trying to work on that on our end too). The flow was a quite long-running one; I estimate that it failed after 10-15 h, not on startup.
a
I see. Regarding the UI crashing, I can definitely recommend Prefect Cloud to mitigate this. Btw, did you know we've recently doubled the task runs on the free tier? It's now 20,000. And regarding the long-running job, you can add this env variable to your KubernetesRun - it can help a lot with long-running jobs that lose the flow's heartbeat due to OOM:
from prefect.run_configs import KubernetesRun
flow.run_config = KubernetesRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
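Since the pod itself is getting OOMKilled, it may also help to set explicit resource requests/limits on the flow-run job via the same run config - a sketch where the memory values are placeholders, not recommendations:
```python
from prefect.run_configs import KubernetesRun

# Explicit resource requests/limits for the flow-run job, plus the threaded
# heartbeat; the memory values here are illustrative only.
flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
    memory_request="1Gi",
    memory_limit="2Gi",
)
```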
s
As much as I'd love to switch to Prefect Cloud, the decision is unfortunately not in my hands right now 😅 Thanks for the tip, I'll use this and see if it improves things 🙂