Deepanshu Aggarwal

12/22/2022, 4:01 PM
Hi! I am self-hosting Prefect (Orion as well as the flows) in a Kubernetes cluster. I used https://github.com/PrefectHQ/prefect-helm to set up my Orion server and https://github.com/anna-geller/dataflow-ops-aws-eks/blob/main/.github/workflows/eks_prefect_agent.yml to deploy the agents. I'm using the Kubernetes Job block as the infrastructure block and the S3 block for storage. My flows have high memory usage compared to CPU usage (tending toward 1-2 cores and 12 GB of memory). I'm running 5 flows at a time with 2000-3000 batches of 50 tasks, and thousands of flow runs throughout the day. I have implemented task concurrency limits and flow concurrency limits on every work queue, yet I'm consistently experiencing flow run crashes. Besides concurrency limits, what options do I have to control the load, and are there any methods to control the resource usage of the jobs in my cluster?
:upvote: 1
✅ 1
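For reference, a minimal sketch of one way to cap the resources of those flow-run jobs through the KubernetesJob infrastructure block's `customizations` JSON patch; the image, block name, and resource figures below are assumed for illustration, not taken from this thread:
```python
# Hypothetical sketch: constraining flow-run pods by patching resource
# requests/limits into the KubernetesJob base job manifest.
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    image="my-registry/my-flow-image:latest",  # assumed image name
    customizations=[
        {
            "op": "add",
            "path": "/spec/template/spec/containers/0/resources",
            "value": {
                "requests": {"cpu": "1", "memory": "8Gi"},
                "limits": {"cpu": "2", "memory": "12Gi"},
            },
        }
    ],
)
k8s_job.save("high-memory-flows", overwrite=True)  # assumed block name
```
With limits like these in place, a pod that exceeds its memory budget gets OOM-killed on its own node instead of starving neighbouring flow runs.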

Anna Geller

12/22/2022, 4:07 PM
There are work queue concurrency limits for flow runs and task concurrency limits for tasks. You could also try Karpenter to scale the EKS cluster more easily.
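A rough sketch of creating a task concurrency limit programmatically, assuming a tag named "batch" applied to the tasks and that the Orion client is used as in recent 2.x releases; the same limit can also be set from the CLI or UI, and the work-queue limit is configured on the queue itself:
```python
# Hedged sketch: tasks carrying the "batch" tag (assumed name) share this limit.
import asyncio
from prefect.client import get_client

async def create_limit():
    async with get_client() as client:
        await client.create_concurrency_limit(tag="batch", concurrency_limit=50)

asyncio.run(create_limit())
```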

Deepanshu Aggarwal

12/22/2022, 4:07 PM
I've implemented both of those limits already.

Zanie

12/22/2022, 4:08 PM
What kind of crashes are you seeing?

Deepanshu Aggarwal

12/22/2022, 4:09 PM
Flow run infrastructure exited with non-zero status code -1.
It generates no other logs in the agent or the job pod. PS: it also leaves the tasks running, filling up the concurrency limit with orphaned tasks.

Zanie

12/22/2022, 4:12 PM
Hm. Do you have DEBUG level logs on?

Deepanshu Aggarwal

12/22/2022, 4:12 PM
PREFECT_LOGGING_LEVEL: INFO
PREFECT_LOGGING_SERVER_LEVEL: WARNING
PREFECT_LOGGING_SETTINGS_PATH: ${PREFECT_HOME}/logging.yml
PREFECT_LOGGING_EXTRA_LOGGERS: 
PREFECT_LOGGING_LOG_PRINTS: false
This is the current config. I can try changing it.
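For reference, one way to get DEBUG logs out of the flow-run pods themselves is to pass the setting through the infrastructure block's environment; a small sketch with an assumed block name:
```python
# Sketch: enabling DEBUG logging for flow runs by injecting the setting
# as an environment variable on the KubernetesJob infrastructure block.
from prefect.infrastructure import KubernetesJob

debug_job = KubernetesJob(
    env={"PREFECT_LOGGING_LEVEL": "DEBUG"},
)
debug_job.save("debug-k8s-job", overwrite=True)  # assumed block name
```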

Zanie

12/22/2022, 4:13 PM
I think -1 usually means that we could not monitor the job. Is job_watch_timeout_seconds set to None?

Deepanshu Aggarwal

12/22/2022, 4:15 PM
I haven't configured this myself, so it must be the default value.

Zanie

12/22/2022, 4:16 PM
The default value changed to fix this issue; what version are you on?

Deepanshu Aggarwal

12/22/2022, 4:17 PM
I'm using 2.7.1.
I'll check out these resources. Thank you!

Zanie

12/22/2022, 4:21 PM
Great! Let us know if you need anything else
πŸ‘ 1
❀️ 1
πŸ˜›anda-dancing: 1

Aram Karapetyan

12/22/2022, 4:56 PM
We have the same issue on Prefect 2.7.1: random crashes with Flow run infrastructure exited with non-zero status code -1.

Deepanshu Aggarwal

12/26/2022, 4:25 AM
Actually, I have noticed this error in two cases:
1. When we were not scheduling jobs to go to separate nodes and they all got scheduled on the same node, exhausting its resources and crashing. Adding pod affinity and anti-affinity helped with this.
2. When we run around 6000-7000 tasks in batches of 50, the UI starts to lag, the mini radar representation also lags and shows the run as crashed (in this case the job somehow keeps running in the background, using resources accordingly).
I need some assistance with the second case because this is unexpected behaviour: if it crashes, it should stop running in the background.
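A hedged sketch of the first fix, spreading flow-run pods across nodes with preferred pod anti-affinity via the same `customizations` patch mechanism; the label key/value and block name are assumed, and the patches assume they are applied to the default KubernetesJob base job manifest:
```python
# Sketch: preferred pod anti-affinity so flow-run pods avoid landing on the
# same node. Both patches target the default KubernetesJob base job manifest.
from prefect.infrastructure import KubernetesJob

spread_job = KubernetesJob(
    customizations=[
        {
            # label the pod template so the anti-affinity selector can match it
            "op": "add",
            "path": "/spec/template/metadata",
            "value": {"labels": {"app": "prefect-flow-run"}},  # assumed label
        },
        {
            "op": "add",
            "path": "/spec/template/spec/affinity",
            "value": {
                "podAntiAffinity": {
                    "preferredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "weight": 100,
                            "podAffinityTerm": {
                                "labelSelector": {
                                    "matchLabels": {"app": "prefect-flow-run"}
                                },
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }
                    ]
                }
            },
        },
    ],
)
spread_job.save("spread-flow-runs", overwrite=True)  # assumed block name
```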

Aram Karapetyan

12/26/2022, 6:35 AM
In our case it was none of these; the job_ and pod_ watch timeouts were the issue. We set them to a very large number and the problem was gone. As Michael mentioned above, the confusion is that 2.7.1 supposedly fixes the issue, but it does not. You have to set the values manually, otherwise after 60 seconds the majority of scheduled jobs randomly get marked as crashed.
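A minimal sketch of that workaround on the KubernetesJob block; the timeout values and block name here are illustrative, not recommendations from the thread:
```python
# Sketch: explicitly setting the watch timeouts so the agent does not mark
# long-starting or long-running jobs as crashed after the default window.
from prefect.infrastructure import KubernetesJob

patient_job = KubernetesJob(
    pod_watch_timeout_seconds=600,       # wait up to 10 minutes for the pod to start
    job_watch_timeout_seconds=6 * 3600,  # or None to watch the job indefinitely
)
patient_job.save("patient-k8s-job", overwrite=True)  # assumed block name
```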