
Stéphan Taljaard

02/01/2023, 2:27 PM
Hi. Before creating an issue on GitHub, I thought to first check in here. I used to run my agent on a local Windows PC and have now moved over to a Debian VM in GCP. My flows are the same, though I was using Python 3.10 on Windows and am now on 3.11 on Debian. I just switched my deployments' queue to my new agent. Those are the only differences. I'm running using a Process infra block. Now suddenly many of my flows are experiencing asyncio.exceptions.CancelledError / TimeoutError / httpx.ConnectTimeout. Anyone experienced that?
It's happening in tasks where I use OrionClient, e.g.
flow_run = await client.read_flow_run(flow_run_id)
It's sporadic; it happens to many, but not all, of my flow runs.
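As an illustration (not code from this thread), a minimal connectivity check against the API from the new VM could look like the sketch below, assuming the Prefect 2.7.x client location at prefect.client.orion and substituting a real flow run ID:

import asyncio

import httpx
from prefect.client.orion import get_client

FLOW_RUN_ID = "d0a4a7f6-35e9-405f-a9a7-66d2751313a1"  # placeholder; any existing flow run ID works

async def check_api(attempts: int = 10) -> None:
    # Repeatedly read the same flow run to see whether the timeouts are reproducible
    async with get_client() as client:
        for i in range(attempts):
            try:
                flow_run = await client.read_flow_run(FLOW_RUN_ID)
                state = flow_run.state.type if flow_run.state else None
                print(f"attempt {i}: ok, state={state}")
            except (httpx.ConnectTimeout, asyncio.TimeoutError) as exc:
                print(f"attempt {i}: {type(exc).__name__}: {exc}")
            await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(check_api())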

Christopher Boyd

02/01/2023, 4:02 PM
Can you elaborate? This is a very generic error, and without any logs, configuration, or details, there's no way of telling.
Perhaps some tracebacks for a specific failing flow would be helpful in determining what's occurring: how it's configured, how and where it's registered, what infra it's executing on, where the agent is running, and perhaps an agent log. Specifically, however, your entire infrastructure changed from local Windows to remote Debian in GCP, and you're receiving timeouts.

Stéphan Taljaard

02/01/2023, 4:07 PM
👍 We can move over to GitHub if you prefer?
Running Prefect 2.7.10, Python 3.11.1, on Debian, in GCP
User:
2a0672be-6e64-4081-bdfd-a4dfded5a802
Workspace:
b70a86f1-659f-408f-add7-9665c6bfa327
Example flow run:
d0a4a7f6-35e9-405f-a9a7-66d2751313a1
Deployment saved in a GCP bucket; it's successfully pulled when running the deployment's flow
Infra: stock-standard Process block
Default task runner; no `.submit`ed tasks, so all are run immediately/sequentially
Some tasks are async, but called without awaiting them inside the non-async flow function
The flow runs successfully for about 2 minutes; one of the last tasks reads the deployment name using the async OrionClient, which is where task run
a4194641-d636-4f79-8d27-ebb6d1bf5e70
crashed, with this trace (failure at 04:05:25 PM UTC+2)
Here's another similar one, but it failed instead of crashed, with a different error after I set
PREFECT_API_REQUEST_TIMEOUT
to 600 https://pastebin.com/DgTStCFP
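(Side note, not from this thread: the same timeout can also be raised temporarily from Python rather than profile-wide, assuming the Prefect 2.x settings API.)

from prefect.settings import PREFECT_API_REQUEST_TIMEOUT, temporary_settings

# Raise the client's API request timeout to 600 s only for the enclosed calls;
# `prefect config set PREFECT_API_REQUEST_TIMEOUT=600` would set it for the whole profile.
with temporary_settings(updates={PREFECT_API_REQUEST_TIMEOUT: 600}):
    ...  # run the flow / client calls that were timing out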
Example failing task:
from pathvalidate import sanitize_filename, sanitize_filepath  # assuming these sanitizers come from the pathvalidate library

import prefect.client.orion
from prefect import task
from prefect.context import get_run_context


@task
async def get_project_dir_name() -> str:
    # Look up the current flow run ID from the task run context
    task_run_context = get_run_context()
    flow_run_id = task_run_context.task_run.flow_run_id

    async with prefect.client.orion.get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
        flow_id = flow_run.flow_id
        flow = await client.read_flow(flow_id)
        flow_name = flow.name
        if deployment_id := flow_run.deployment_id:
            deployment = await client.read_deployment(deployment_id)
            deployment_name = deployment.name
        else:
            deployment_name = "Testing"
        dir_name = f"{flow_name}/{deployment_name}"

    return project_name_to_project_dir(dir_name, allow_dir=True)


def project_name_to_project_dir(project_name: str, allow_dir: bool = False) -> str:
    # This is the opposite of cli.register_flows.format_project_name
    fn = sanitize_filepath if allow_dir else sanitize_filename
    return fn(project_name.lower().replace(" ", "-"))
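Since the failures are sporadic connection timeouts, one possible mitigation (a sketch, not something applied in this thread) is task-level retries; retries and retry_delay_seconds are standard @task arguments in Prefect 2.x:

from prefect import task

@task(retries=3, retry_delay_seconds=30)  # retry transient httpx.ConnectTimeout / TimeoutError failures
async def get_project_dir_name() -> str:
    ...  # same body as above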

Christopher Boyd

02/01/2023, 5:23 PM
What time zone are you in? I've seen a couple of these (503s) on set_state in the last day

Stéphan Taljaard

02/01/2023, 5:27 PM
UTC+2
(SAST)
I think the set_states were around 4 AM SAST, if I remember correctly. I saw a few errors re. cancelling states, and thought it might be known already and due to be fixed by https://github.com/PrefectHQ/prefect/pull/8315. I had problems with my concurrency slots on the queue: flows were reported as cancelled, but every so often the agent would attempt to cancel them again, and they never freed up slots, so my flow runs were starting to pile up. That's why I fast-tracked moving from a local Windows server to my Linux VM in GCP. It's there where I got the timeouts, between 12:30 and 17:00-ish, after which I moved back to Windows. These are production workflows, so I can't afford too much downtime.
I tried to reproduce this today. I'll continue tomorrow, and if I get decent findings, I'll open a GH issue

Christopher Boyd

02/02/2023, 2:05 PM
Let us know. I believe the issue was transient and is being investigated server-side for your particular case

Stéphan Taljaard

02/02/2023, 2:27 PM
I'd rather check now instead of tomorrow. I managed to get some errors to show up on a test agent around 2 Feb 12:00 UTC+2. @Christopher Boyd Do you believe the issue was resolved after that? If so, I can try again to reproduce. Otherwise, I will check all my logs to try to determine if it was a recreation of what happened yesterday.

Christopher Boyd

02/02/2023, 2:28 PM
We haven't had any additional reports of this error (yesterday there were several), and we haven't seen any issues in traffic on these endpoints, so my tentative answer is that it's resolved, but that's really just based on what we've seen and investigated so far

Stéphan Taljaard

02/02/2023, 2:28 PM
Cool. For safety's sake, I'll first see if I can get errors to show up again, then take it from there.
Okay, looks like I managed to reproduce it.
Time slot: 16:32-16:38 +02:00
Example flow run
8a8e26b7-423e-46ff-8618-aae5f359f274
, task run
f3ee56c5-7e4e-4f41-8b07-73b919a0e0af
Here's the test flow I used to reproduce it, plus its logs. I transferred over some functions from my real flows to keep the steps roughly the same. Still, the test flow is a less complex workflow and uses less RAM, but has higher CPU usage than my actual flows from yesterday. It might seem like a CPU problem, but I don't think so, since my real ETL flows of yesterday had enough headroom left.
Looking at the search results, it seems this issue has also been occurring more frequently for others over the last while (sorry for the ping @Christopher Boyd)