# ask-community
p
Hi there! Back again. I've been building a fun little data ecosystem with Prefect as our primary gardener. We're approaching a tipping point of turning some of these tools into user-facing products. The biggest bugbear, though, is that the Prefect workers I run on our machines will occasionally (but reliably) crash once every couple of weeks, requiring a manual restart. I'm only running them manually in tmux windows, as opposed to as systemd services, just while I've been in the dev phase, but I wanted to get some insight into what might be going on here. I'll copy the error in the thread:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/anyio/_core/_sockets.py", line 189, in connect_tcp
    addr_obj = ip_address(remote_host)
  File "/usr/lib/python3.10/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: 'api.prefect.cloud' does not appear to be an IPv4 or IPv6 address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/httpcore/_exceptions.py", line 10, in map_exceptions
    yield
  File "/usr/local/lib/python3.10/dist-packages/httpcore/_backends/anyio.py", line 114, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_core/_sockets.py", line 192, in connect_tcp
    gai_res = await getaddrinfo(
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
```
This error recurs a handful of times before the worker process just gives up. It's a funny-looking error and I'm wondering if there's a fix out there, some configuration thing I might be missing, or if it's just standard operating procedure and I should just plan on having a systemd service prepared to restart the worker whenever this happens in our eventual production infrastructure (I've sketched the kind of unit file I mean below).
I'm still using prefect cloud for orchestration, but our use case demands that the actual processes being orchestrated run on our own hardware.
(Just running directly on an Ubuntu 22.04 machine)
Also, notably, the crash doesn't seem to be related to any specific deployment or flow being run; it typically happens at odd hours when nothing is running.
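For reference, the kind of unit I have in mind is roughly this. The pool name, user, and paths are placeholders rather than anything from our actual setup, so treat it as a sketch:
```
# /etc/systemd/system/prefect-worker.service  (hypothetical path and name)
[Unit]
Description=Prefect worker
After=network-online.target
Wants=network-online.target

[Service]
User=prefect                      # placeholder service account
WorkingDirectory=/opt/prefect     # placeholder working directory
# PREFECT_API_URL / PREFECT_API_KEY supplied via an environment file
EnvironmentFile=/etc/prefect/worker.env
ExecStart=/usr/local/bin/prefect worker start --pool my-work-pool
Restart=always                    # restart the worker whenever it exits
RestartSec=10                     # wait 10s between restart attempts

[Install]
WantedBy=multi-user.target
```
Then something like `sudo systemctl enable --now prefect-worker` and systemd would handle the restarts instead of me reattaching to tmux.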
c
Hey Paige! Every time I've seen name resolution errors, it's been something going wrong with local DNS / local networking; unfortunately I don't have a good idea for how to debug or prove such a thing (our issue backlog has a few references, but I couldn't find an obvious solution written down, e.g., https://github.com/PrefectHQ/prefect/issues/5812). Also, P.S. I love the botanical analogy!!
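One thing that might help catch it in the act: a tiny probe that attempts the same resolution the worker does and logs any failures, left running (or cron'd) on the worker box. Totally untested sketch, plain stdlib, with `api.prefect.cloud` taken from your traceback:
```python
import socket
import time
from datetime import datetime, timezone

HOST = "api.prefect.cloud"  # the host the worker fails to resolve
PORT = 443

def check_dns() -> None:
    """Attempt the same getaddrinfo call the worker makes and log the outcome."""
    stamp = datetime.now(timezone.utc).isoformat()
    try:
        socket.getaddrinfo(HOST, PORT)
        print(f"{stamp} OK", flush=True)
    except socket.gaierror as exc:
        # e.g. [Errno -3] Temporary failure in name resolution
        print(f"{stamp} FAIL: {exc}", flush=True)

if __name__ == "__main__":
    # Check once a minute; redirect output to a file and compare the
    # timestamps against when the worker crashes.
    while True:
        check_dns()
        time.sleep(60)
```
If the FAIL lines line up with the worker crashes, that points at local DNS / networking rather than anything Prefect-specific.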
p
Ok interesting! Thanks @Chris White! We are inside of a university system, so there might be a firewall thing going on like what's described in the issue you linked. The intermittent nature of my error looks different from anything I can see on the issue tracker, but it makes sense if there's some low-frequency telemetry hook on the worker that's getting blocked! I'll submit an issue next week, and look into the local DNS settings in the meantime!