y
Hello, I am running a self-hosted Prefect server and a Prefect worker (K8s). I am running a flow that requires a different image, which is specified on the flow's deployment. There is a random/transient "module not found" error (stack trace attached in 🧵) when the job is trying to start, but it is often fixed by a 2nd or 3rd manual retry at most. Roughly 50% of the time it succeeds without any retries. I have specified retries on my flows, but whenever this error appears, the flow is never retried. It looks to me like the pod for the flow is not provisioned with the right package for the flow run to happen. Any idea how to troubleshoot or proceed further? The failure rate is almost 10 out of 21 runs over the past 3 weeks (without manual retries).
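(For reference, flow-level retries are presumably configured like the sketch below; names are hypothetical. One likely reason they never fire here is that the error occurs while the flow source is being imported, i.e. before the decorated flow object, and hence its retry policy, even exists.)

from prefect import flow

# Flow-level retries: these only kick in after the flow object has been
# loaded and the run has started. "Flow could not be retrieved from
# deployment" happens earlier, during import of the flow source, so this
# retry policy never gets a chance to apply.
@flow(retries=2, retry_delay_seconds=60)
def transcription_flow():  # hypothetical flow name, matching the path in the trace
    ...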
Flow could not be retrieved from deployment.
Traceback (most recent call last):
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/tmp/tmph1ld7n0hprefect/data-science-main/flows/transcription/transcription_flow.py", line 9, in <module>
    import whisper_transcribe_tasks as w
  File "/tmp/tmph1ld7n0hprefect/data-science-main/flows/transcription/whisper_transcribe_tasks.py", line 7, in <module>
    from faster_whisper import WhisperModel
ModuleNotFoundError: No module named 'faster_whisper'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/engine.py", line 422, in retrieve_flow_then_begin_flow_run
    else await load_flow_from_flow_run(flow_run, client=client)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/client/utilities.py", line 51, in with_injected_client
    return await fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/deployments/deployments.py", line 264, in load_flow_from_flow_run
    flow = await run_sync_in_worker_thread(load_flow_from_entrypoint, str(import_path))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/flows.py", line 1550, in load_flow_from_entrypoint
    flow = import_object(entrypoint)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/importtools.py", line 201, in import_object
    module = load_script_as_module(script_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/importtools.py", line 164, in load_script_as_module
    raise ScriptError(user_exc=exc, path=path) from exc
prefect.exceptions.ScriptError: Script at 'flows/transcription/transcription_flow.py' encountered an exception: ModuleNotFoundError("No module named 'faster_whisper'")
a
Sounds like you might have a cached version of the custom image? Often happens when :latest is used. Set image_pull_policy: Always, or use a tag/hash you know has the dependencies.
y
You mean this, right? It is already set in the K8s worker config; is there a way to specify it at the deployment level? I am also not using :latest.
a
Yes, that is the correct param. You can check the Events of a failed K8s job to verify whether the image was pulled or not. I was just guessing based on what you described; not a Prefect expert, just saw your question in passing.
Normal  Scheduled  7m3s  default-scheduler  Successfully assigned prefect/kappa5-drayan-r-mmfnw-lwzph to gke-ba3ba89f-8acf
Normal  Pulling    7m2s  kubelet            Pulling image "europe-west3-docker.pkg.dev/prefect-base/master:xyx"
Normal  Pulled     7m2s  kubelet            Successfully pulled image "europe-west3-docker.pkg.dev/prefect-base/master:xyx" in 429.609283ms (429.627533ms including waiting)
Normal  Created    7m2s  kubelet            Created container prefect-job
Normal  Started    7m2s  kubelet            Started container prefect-job
n
is there a way to specify it at the deployment?
Yep, you can override any of the job_variables present on a work pool at the deployment level in your prefect.yaml, like this example (where I'm overriding image, but you could just as well override image_pull_policy).
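(A minimal sketch of such a prefect.yaml override; the deployment and work pool names are hypothetical, the image and entrypoint are taken from the trace and events above.)

deployments:
  - name: transcription-flow          # hypothetical deployment name
    entrypoint: flows/transcription/transcription_flow.py:transcription_flow
    work_pool:
      name: k8s-pool                  # hypothetical work pool name
      job_variables:
        image: europe-west3-docker.pkg.dev/prefect-base/master:xyx
        image_pull_policy: Always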
y
I explicitly set image_pull_policy to Always but am still having the issue. One more thing I found while troubleshooting: I was not able to find the pod in my K8s cluster after it failed.
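(A sketch of commands for digging up traces of a deleted pod, assuming the prefect namespace from the events above. K8s events typically outlive the pod for about an hour, and the Job object may still exist if no TTL has reaped it.)

# Recent events in the namespace, oldest first; pod events (image pulls,
# OOM kills, evictions) usually persist ~1h after the pod is deleted.
kubectl get events -n prefect --sort-by='.lastTimestamp'

# The Job object may still be around even when its pod is gone.
kubectl get jobs -n prefect
kubectl describe job <job-name> -n prefect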
n
do you have a finished job TTL set on your work pool?
y
Yes, 5 days, but it should not be that; the failed pod could not be found even before I set the TTL.
m
Any resolution on this @Ying Ting Loo? I’m also seeing “flow could not be retrieved from deployment” along with a module not found error when deploying a custom image to Google Cloud Run
(But the image runs locally)
y
No, I have not been able to fix this. A hacky fix is to rerun it whenever this happens and hope that the pod is provisioned properly.
m
Ah mine isn’t intermittent so probably separate issues
y
Would like to add more information on this that I found after running it for longer; I have noticed a new pattern.

tldr: we have a self-hosted Prefect server, a prefect-worker, and a prefect-agent, all hosted on a Kubernetes cluster. For intensive-load flows, we schedule them on the Prefect worker. Versions are 2.14.21 for the server, 2.14-kubernetes for the prefect-worker, and 2.14.14 for the agent.

However, I noticed that the intensive job is not consistently submitted to the Kubernetes cluster: about 50% of the time it actually gets submitted to the prefect-agent instead (causing an OOM and restart in the agent), and it doesn't retry, because the flow just stays permanently Running in the UI once the agent restarts. The current workaround is to rerun it and check the logs to make sure it runs on the Kubernetes worker, but it is causing problems with the agent restarting and needing manual intervention. Anyone have any idea?
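(One thing worth ruling out, sketched below with hypothetical names: if the agent and the worker poll the same queue, a run can be picked up by either. Pinning the deployment to a pool that only the Kubernetes worker polls might look something like this in prefect.yaml.)

deployments:
  - name: heavy-transcription         # hypothetical deployment name
    work_pool:
      name: k8s-worker-pool           # polled only by the Kubernetes worker (hypothetical)
      work_queue_name: default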
p
Hi @Ying Ting Loo and others, greetings. We are looking to deploy the self-hosted server and worker on K8s. I have the following questions:
1. Do you have a complete deployment guide? If so, please share.
2. What security configuration settings need to be done, and are there any guidelines you followed?
3. Are you folks running it in prod, or still testing?
4. Do we have to do any hardening after the deployment?
Please do help. Appreciate your timely response.