
Santiago Gonzalez

04/21/2023, 12:33 PM
Hi, I started a worker last night and this morning I found it had crashed because of this:
return fn(*args, **kwargs)
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/utilities/asyncutils.py", line 260, in coroutine_wrapper
    return call()
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/_internal/concurrency/calls.py", line 245, in __call__
    return self.result()
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/_internal/concurrency/calls.py", line 173, in result
    return self.future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/_internal/concurrency/calls.py", line 218, in _run_async
    result = await coro
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/cli/worker.py", line 142, in start
    started_event = await worker._emit_worker_started_event()
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/anyio/_backends/_asyncio.py", line 702, in _run_wrapped_task
    await coro
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/utilities/services.py", line 104, in critical_service_loop
    raise RuntimeError("Service exceeded error threshold.")
RuntimeError: Service exceeded error threshold.
An exception occurred.
And the thing is, no flow was running last night. Do you know what caused the worker to crash? Prefect version 2.10.5.

Will Raphaelson

04/21/2023, 3:03 PM
Hey @Santiago Gonzalez, thanks for raising this. I tried to repro and can't get that error to surface. It feels a bit transient / network-related, but that's just a hunch. After restarting your worker, has the issue persisted? And/or are there any additional repro or pod logs you could share? Thanks.

Santiago Gonzalez

04/21/2023, 3:18 PM
Yes, I got the same error again.
I started two workers this morning (on different EC2 instances) and got the same error on both.

Will Raphaelson

04/21/2023, 3:20 PM
Hmm, okay, sorry about that. Would you mind filing an issue in our GitHub repository with the relevant repro steps? https://github.com/PrefectHQ/prefect

Santiago Gonzalez

04/25/2023, 1:42 PM
Hey, this happens to me all the time. Can you please help me with this issue? I have to restart the worker every time. Actually, yesterday it crashed while a flow was running on it.

Will Raphaelson

04/25/2023, 2:08 PM
Thanks @Santiago Gonzalez, the first step to getting this resolved is filing a GitHub issue with the relevant information; that way I can get it over to the engineering team to investigate.
👍 1

UUBOY scy

04/27/2023, 7:15 AM
Hi, I have the same problem. I wrote a script to check the health of the agent and restart it if needed, but this is not a long-term solution.
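For anyone wanting a similar stopgap, here is a minimal sketch of that kind of health-check/restart script. Assumptions (not from the thread): Linux with `pgrep` available, the worker was launched with a command line containing "prefect worker start", and `my-pool` is a placeholder for the real work pool name.

```shell
#!/usr/bin/env bash
# Watchdog sketch: relaunch the Prefect worker if its process has died.

PATTERN="prefect worker start"

worker_running() {
  # pgrep -f matches the pattern against full process command lines
  pgrep -f "$1" > /dev/null
}

check_once() {
  if ! worker_running "$PATTERN"; then
    echo "worker not running; restarting"
    # "my-pool" is a placeholder; nohup keeps the worker alive after this script exits
    nohup prefect worker start --pool my-pool >> worker.log 2>&1 &
  fi
}

# Run check_once from cron (e.g. every minute) rather than looping forever:
#   * * * * * /home/ec2-user/worker_watchdog.sh
```

A process supervisor such as systemd with `Restart=on-failure` would be a more robust long-term alternative to cron polling.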

Will Raphaelson

04/27/2023, 5:27 PM
Thanks. Can one of you please file an issue in the Prefect GitHub repository linked above with your reproduction steps, so that I can assign an engineer to work on it? Thank you.

alex

04/27/2023, 5:40 PM
@Santiago Gonzalez We’ve seen HTTP2 sometimes be the source of crashes like this. Could you try disabling HTTP2 on your worker instance with
prefect config set PREFECT_API_ENABLE_HTTP2=False
? Hopefully, that increases the stability of your worker, but either way, it would still be valuable to open an issue so that we can investigate further.
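For what it's worth, the same setting can also be supplied as an environment variable in the shell that launches the worker; Prefect reads settings from the environment, and environment variables take precedence over values stored in the profile. A minimal sketch (the pool name is a placeholder):

```shell
# Environment-variable form of the same setting; Prefect picks up
# PREFECT_API_ENABLE_HTTP2 from the environment at startup.
export PREFECT_API_ENABLE_HTTP2=false
echo "PREFECT_API_ENABLE_HTTP2=$PREFECT_API_ENABLE_HTTP2"
# then restart the worker so it picks up the new value, e.g.:
#   prefect worker start --pool my-pool
```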
👍 1
👀 1

Santiago Gonzalez

04/28/2023, 1:37 AM
I just updated it. Tomorrow I will tell you how it went
It seems that it is working with that fix
🙌 1
In any case, I will open an issue today.
It is broken again
File "/home/sgonzalez/venv/lib/python3.7/site-packages/anyio/_backends/_asyncio.py", line 702, in _run_wrapped_task
    await coro
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/utilities/services.py", line 46, in critical_service_loop
    await workload()
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/workers/base.py", line 590, in sync_with_backend
    await self._update_local_work_pool_info()
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/workers/base.py", line 546, in _update_local_work_pool_info
    work_pool_name=self._work_pool_name
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/client/orchestration.py", line 2124, in read_work_pool
    response = await self._client.get(f"/work_pools/{work_pool_name}")
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/httpx/_client.py", line 1766, in get
    extensions=extensions,
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/httpx/_client.py", line 1533, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/client/base.py", line 278, in send
    response.raise_for_status()
  File "/home/sgonzalez/venv/lib/python3.7/site-packages/prefect/client/base.py", line 135, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'https://api.prefect.cloud/api/accounts/7bafcec2-0687-406d-a451-54d8dfab110e/workspaces/14c859a2-1303-4228-ac50-abb2dea6785f/work_pools/pool'
Response: {'exception_message': 'Internal Server Error'}
For more information check: https://httpstatuses.com/500
An exception occurred.

Emil Ordoñez

05/10/2023, 5:07 PM
Hello @Santiago Gonzalez, did you create the issue?
I was having a similar issue, but I'm running my agent on an ECS service. I tried the suggested solution of adding
PREFECT_API_ENABLE_HTTP2=False
to my environment variables, and that seems to work, but I want to keep an eye on the issue.
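For reference, in an ECS task definition that variable goes in the container's `environment` list; a minimal fragment (the surrounding container definition is elided):

```json
"environment": [
  { "name": "PREFECT_API_ENABLE_HTTP2", "value": "false" }
]
```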

Santiago Gonzalez

05/10/2023, 6:15 PM
No need. I just checked the workers (which run on EC2 instances) and they are still alive (on version 2.10.7).