y
Hello, I am running a self-hosted Prefect server and a Prefect worker (K8s). I am running a flow that requires a different image, which is specified on the flow's deployment. There is a random/transient "module not found" error (stack trace attached in 🧵) when the job is trying to start, but it is often fixed by a 2nd or 3rd manual retry at most. Roughly 50% of the time it succeeds without any retries. I have specified retries on my flows, but whenever this error appears, the flow is never retried. It looks to me like the pod for the flow is not provisioned with the right package for the flow run to happen. Any idea how to troubleshoot or proceed further? The failure rate is almost 10 out of 21 runs over the past 3 weeks (without manual retries).
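(For reference, flow-level retries are presumably configured like the sketch below; names are hypothetical. One likely reason they never fire here is that the error occurs while the flow source is being imported, i.e. before the decorated flow object, and hence its retry policy, even exists.)

from prefect import flow

# Flow-level retries: these only kick in after the flow object has been
# loaded and the run has started. "Flow could not be retrieved from
# deployment" happens earlier, during import of the flow source, so this
# retry policy never gets a chance to apply.
@flow(retries=2, retry_delay_seconds=60)
def transcription_flow():  # hypothetical flow name, matching the path in the trace
    ...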
Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/tmp/tmph1ld7n0hprefect/data-science-main/flows/transcription/transcription_flow.py", line 9, in <module>
    import whisper_transcribe_tasks as w
  File "/tmp/tmph1ld7n0hprefect/data-science-main/flows/transcription/whisper_transcribe_tasks.py", line 7, in <module>
    from faster_whisper import WhisperModel
ModuleNotFoundError: No module named 'faster_whisper'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/engine.py", line 422, in retrieve_flow_then_begin_flow_run
    else await load_flow_from_flow_run(flow_run, client=client)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/client/utilities.py", line 51, in with_injected_client
    return await fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/deployments/deployments.py", line 264, in load_flow_from_flow_run
    flow = await run_sync_in_worker_thread(load_flow_from_entrypoint, str(import_path))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/flows.py", line 1550, in load_flow_from_entrypoint
    flow = import_object(entrypoint)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/importtools.py", line 201, in import_object
    module = load_script_as_module(script_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/importtools.py", line 164, in load_script_as_module
    raise ScriptError(user_exc=exc, path=path) from exc
prefect.exceptions.ScriptError: Script at 'flows/transcription/transcription_flow.py' encountered an exception: ModuleNotFoundError("No module named 'faster_whisper'")
a
Sounds like you might have a cached version of the custom image? Often happens when :latest is used. Set image_pull_policy: Always, or use a tag/hash you know has the dependencies.
y
You mean this, right? It is already set in the K8s worker config; is there a way to specify it at the deployment level? I am also not using :latest.
a
Yes, that is the correct param. You can check the Events of a failed K8s job to verify whether the image was pulled or not. I was just guessing based on what you described; not a Prefect expert, just saw your question in passing.
Normal  Scheduled  7m3s  default-scheduler  Successfully assigned prefect/kappa5-drayan-r-mmfnw-lwzph to gke-ba3ba89f-8acf
Normal  Pulling    7m2s  kubelet            Pulling image "europe-west3-docker.pkg.dev/prefect-base/master:xyx"
Normal  Pulled     7m2s  kubelet            Successfully pulled image "europe-west3-docker.pkg.dev/prefect-base/master:xyx" in 429.609283ms (429.627533ms including waiting)
Normal  Created    7m2s  kubelet            Created container prefect-job
Normal  Started    7m2s  kubelet            Started container prefect-job
n
is there a way to specify it at the deployment?
Yep, you can override any of the job_variables present on a work pool at the deployment level in your prefect.yaml, like this example (where I'm overriding image, but you could just as well override image_pull_policy).
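(A minimal sketch of such a prefect.yaml override; the deployment and work pool names are hypothetical, the image and entrypoint are taken from the trace and events above.)

deployments:
  - name: transcription-flow          # hypothetical deployment name
    entrypoint: flows/transcription/transcription_flow.py:transcription_flow
    work_pool:
      name: k8s-pool                  # hypothetical work pool name
      job_variables:
        image: europe-west3-docker.pkg.dev/prefect-base/master:xyx
        image_pull_policy: Always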
y
I explicitly set image_pull_policy to Always but am still having the issue. One more thing I found while troubleshooting: I was not able to find the pod in my K8s cluster after it failed.
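(A sketch of commands for digging up traces of a deleted pod, assuming the prefect namespace from the events above. K8s events typically outlive the pod for about an hour, and the Job object may still exist if no TTL has reaped it.)

# Recent events in the namespace, oldest first; pod events (image pulls,
# OOM kills, evictions) usually persist ~1h after the pod is deleted.
kubectl get events -n prefect --sort-by='.lastTimestamp'

# The Job object may still be around even when its pod is gone.
kubectl get jobs -n prefect
kubectl describe job <job-name> -n prefect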
n
do you have a finished job TTL set on your work pool?
y
Yes, 5 days, but it should not be that; the failed pod could not be found even before I set the TTL.
m
Any resolution on this @Ying Ting Loo? I’m also seeing “flow could not be retrieved from deployment” along with a module not found error when deploying a custom image to Google Cloud Run
(But the image runs locally)
y
No, I have not been able to fix this. A hacky fix is to rerun it whenever this happens and hope that the pod is provisioned properly.
m
Ah mine isn’t intermittent so probably separate issues
y
Would like to add more information on this that I found after running it for longer; I have noticed a new pattern.

tldr: we have a self-hosted Prefect server, a prefect-worker, and a prefect-agent, all hosted on a Kubernetes cluster. For intensive-load flows, we schedule them on the Prefect worker. Versions are 2.14.21 for the server, 2.14-kubernetes for the prefect-worker, and 2.14.14 for the agent.

However, I noticed that the intensive job is not consistently submitted to the Kubernetes cluster: about 50% of the time it actually gets submitted to the prefect-agent instead (causing an OOM and restart in the agent), and it doesn't retry, because the flow just stays permanently Running in the UI once the agent restarts. The current workaround is to rerun it and check the logs to make sure it runs on the Kubernetes worker, but it is causing problems with the agent restarting and needing manual intervention. Anyone have any idea?
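(One thing worth ruling out, sketched below with hypothetical names: if the agent and the worker poll the same queue, a run can be picked up by either. Pinning the deployment to a pool that only the Kubernetes worker polls might look something like this in prefect.yaml.)

deployments:
  - name: heavy-transcription         # hypothetical deployment name
    work_pool:
      name: k8s-worker-pool           # polled only by the Kubernetes worker (hypothetical)
      work_queue_name: default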
p
Hi @Ying Ting Loo and others, greetings. We are looking to deploy the self-hosted server and worker on K8s. I have the following questions:
1. Do you have a complete deployment guide? If so, please share.
2. What security configuration settings need to be done, and are there any guidelines you followed?
3. Are you folks running it in prod, or still testing?
4. Do we have to do any hardening after the deployment?
Please do help. Appreciate your timely response.