
Stéphan Taljaard

02/01/2023, 2:27 PM
Hi. Before creating an issue on GitHub, I thought to first check in here. I used to run my agent on a local Windows PC and have now moved over to a Debian VM in GCP. My flows are the same, though I was using Python 3.10 on Windows and am now on 3.11 on Debian. I just switched my deployments' queue to my new agent. Those are the only differences. I'm running using a Process infra block. Now suddenly many of my flows are experiencing asyncio.exceptions.CancelledError / TimeoutError / httpx.ConnectTimeout. Anyone experienced that?
It's happening in tasks where I use OrionClient, e.g.
flow_run = await client.read_flow_run(flow_run_id)
It's sporadic; it happens to many, but not all, of my flow runs.
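As an illustration (not code from this thread), a minimal connectivity check against the API from the new VM could look like the sketch below, assuming the Prefect 2.7.x client location at prefect.client.orion and substituting a real flow run ID:

import asyncio

import httpx
from prefect.client.orion import get_client

FLOW_RUN_ID = "d0a4a7f6-35e9-405f-a9a7-66d2751313a1"  # placeholder; any existing flow run ID works

async def check_api(attempts: int = 10) -> None:
    # Repeatedly read the same flow run to see whether the timeouts are reproducible
    async with get_client() as client:
        for i in range(attempts):
            try:
                flow_run = await client.read_flow_run(FLOW_RUN_ID)
                state = flow_run.state.type if flow_run.state else None
                print(f"attempt {i}: ok, state={state}")
            except (httpx.ConnectTimeout, asyncio.TimeoutError) as exc:
                print(f"attempt {i}: {type(exc).__name__}: {exc}")
            await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(check_api())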

Christopher Boyd

02/01/2023, 4:02 PM
Can you elaborate? This is a very generic error, and without any logs, configuration, or details, there's no way of telling.
Perhaps some tracebacks for a specific failing flow would be helpful in determining what's occurring: how it's configured, how and where it's registered, what infra it's executing on, where the agent is running, and perhaps an agent log. Specifically, however, your entire infrastructure changed from local Windows to remote Debian in GCP, and you're receiving timeouts.

Stéphan Taljaard

02/01/2023, 4:07 PM
👍 We can move over to GitHub if you prefer?
Running Prefect 2.7.10, Python 3.11.1, on Debian, in GCP
User:
2a0672be-6e64-4081-bdfd-a4dfded5a802
Workspace:
b70a86f1-659f-408f-add7-9665c6bfa327
Example flow run:
d0a4a7f6-35e9-405f-a9a7-66d2751313a1
Deployment saved in a GCP bucket; it's successfully pulled when running the deployment's flow
Infra: stock-standard Process block
Default task runner; no `.submit`ed tasks, so all are run immediately/sequentially
Some tasks are async, but called without awaiting them inside the non-async flow function
The flow runs successfully for about 2 minutes; one of the last tasks reads the deployment name using the async OrionClient, which is where task run
a4194641-d636-4f79-8d27-ebb6d1bf5e70
crashed, with this trace (failure at 04:05:25 PM UTC+2)
Here's another similar one, but it failed instead of crashed, with a different error after I set
PREFECT_API_REQUEST_TIMEOUT
to 600 https://pastebin.com/DgTStCFP
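(Side note, not from this thread: the same timeout can also be raised temporarily from Python rather than profile-wide, assuming the Prefect 2.x settings API.)

from prefect.settings import PREFECT_API_REQUEST_TIMEOUT, temporary_settings

# Raise the client's API request timeout to 600 s only for the enclosed calls;
# `prefect config set PREFECT_API_REQUEST_TIMEOUT=600` would set it for the whole profile.
with temporary_settings(updates={PREFECT_API_REQUEST_TIMEOUT: 600}):
    ...  # run the flow / client calls that were timing out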
Example failing task:
from pathvalidate import sanitize_filename, sanitize_filepath  # assuming these sanitizers come from the pathvalidate library

import prefect.client.orion
from prefect import task
from prefect.context import get_run_context


@task
async def get_project_dir_name() -> str:
    # Look up the current flow run ID from the task run context
    task_run_context = get_run_context()
    flow_run_id = task_run_context.task_run.flow_run_id

    async with prefect.client.orion.get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
        flow_id = flow_run.flow_id
        flow = await client.read_flow(flow_id)
        flow_name = flow.name
        if deployment_id := flow_run.deployment_id:
            deployment = await client.read_deployment(deployment_id)
            deployment_name = deployment.name
        else:
            deployment_name = "Testing"
        dir_name = f"{flow_name}/{deployment_name}"

    return project_name_to_project_dir(dir_name, allow_dir=True)


def project_name_to_project_dir(project_name: str, allow_dir: bool = False) -> str:
    # This is the opposite of cli.register_flows.format_project_name
    fn = sanitize_filepath if allow_dir else sanitize_filename
    return fn(project_name.lower().replace(" ", "-"))
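Since the failures are sporadic connection timeouts, one possible mitigation (a sketch, not something applied in this thread) is task-level retries; retries and retry_delay_seconds are standard @task arguments in Prefect 2.x:

from prefect import task

@task(retries=3, retry_delay_seconds=30)  # retry transient httpx.ConnectTimeout / TimeoutError failures
async def get_project_dir_name() -> str:
    ...  # same body as above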

Christopher Boyd

02/01/2023, 5:23 PM
What time zone are you in? I've seen a couple of these (503s) on set_state in the last day

Stéphan Taljaard

02/01/2023, 5:27 PM
UTC+2
(SAST)
I think the set_states were around 4 AM SAST, if I remember correctly. I saw a few errors re. cancelling states, and thought it might be known already and due to be fixed by https://github.com/PrefectHQ/prefect/pull/8315. I had problems with my concurrency slots on the queue: flows were reported as cancelled, but every so often the agent would attempt to cancel them again, and they never freed up slots, so my flow runs were starting to pile up. That's why I fast-tracked moving from a local Windows server to my Linux VM in GCP. It's there where I got the timeouts, between 12:30 and 17:00-ish, after which I moved back to Windows. These are production workflows, so I can't afford too much downtime.
I tried to reproduce this today. I'll continue tomorrow, and if I get decent findings, I'll open a GH issue

Christopher Boyd

02/02/2023, 2:05 PM
Let us know. I believe the issue was transient and is being investigated server-side for your particular case

Stéphan Taljaard

02/02/2023, 2:27 PM
I'd rather check now instead of tomorrow. I managed to get some errors to show up on a test agent around 2 Feb 12:00 UTC+2. @Christopher Boyd Do you believe the issue was resolved after that? If so, I can try again to reproduce. Otherwise, I will check all my logs to try to determine if it was a recreation of what happened yesterday.

Christopher Boyd

02/02/2023, 2:28 PM
We haven't had any additional reports of this error (yesterday there were several), and we haven't seen any issues in traffic on these endpoints, so my tentative answer is that it's resolved, but that's really just based on what we've seen and investigated so far

Stéphan Taljaard

02/02/2023, 2:28 PM
Cool. For safety's sake, I'll first see if I can get errors to show up again, then take it from there.
Okay, looks like I managed to reproduce it.
Time slot: 16:32-16:38 +02:00
Example flow run
8a8e26b7-423e-46ff-8618-aae5f359f274
, task run
f3ee56c5-7e4e-4f41-8b07-73b919a0e0af
Here's the test flow I used to reproduce it, plus its logs. I transferred over some functions from my real flows to keep the steps roughly the same. Still, the test flow is a less complex workflow and uses less RAM, but has higher CPU usage than my actual flows from yesterday. It might seem like a CPU problem, but I don't think so, since my real ETL flows of yesterday had enough headroom left.
Looking at the search results, it seems this issue has also been occurring more frequently for others over the last while (sorry for the ping @Christopher Boyd)