
Tomas Moreno

05/12/2023, 1:31 PM
hey all, on prefect 1.0 my team uses CICD to register flows to different projects on a pr merge. we have
PREFECT__CLOUD__REQUEST_TIMEOUT=60
in our CICD env, but I see timeouts happening in 14s sometimes. anyone have any tips on what could be going on?
here's a stacktrace, definitely coming from the graphql
Traceback (most recent call last):
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 670, in <module>
    main()
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 664, in main
    create_proj_and_register_flows(flows, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 300, in create_proj_and_register_flows
    register_flow(flow, flow_file, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 409, in register_flow
    flow.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/core/flow.py", line 1727, in register
    registered_flow = client.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 1176, in register
    res = self.graphql(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]

Bianca Hoch

05/15/2023, 6:43 PM
Hey Tomas, it's possible that you have a large number of registered flows in your tenant, and that this is causing the timeouts. Are you by chance passing in a version group ID when registering the flow? After taking a look at the v1 docs, I saw that when registering a flow, "if no version group id is provided at registration, the platform checks if any other flows in the same project have the same name as the new flow." That lookup could be contributing to the timeout if no version group ID is used.
If you aren't already, maybe try passing in that ID at registration to see if it helps.

Tomas Moreno

05/15/2023, 7:59 PM
ooo this is great thank you @Bianca Hoch! the first time we register a flow it won't have any version group ID right? so I'd still pass in null for those?

Bianca Hoch

05/15/2023, 8:02 PM
I believe that is correct! ^
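
For anyone following along, here is a minimal sketch of what that could look like. It assumes you keep your own mapping of flow name to version group ID (`known_group_ids` is a hypothetical name, not something Prefect provides) and relies only on the `project_name` and `version_group_id` keyword arguments of `flow.register` in Prefect 1.x. A flow that has never been registered has no entry, so `None` is passed and the platform falls back to its name-based lookup.

```python
from typing import Any, Dict, Optional


def register_with_version_group(
    flow: Any,
    project_name: str,
    known_group_ids: Dict[str, str],
) -> Any:
    """Register a flow, passing its version group ID when one is known.

    `flow` is expected to behave like a Prefect 1.x Flow: it has a `name`
    attribute and a `register` method. `known_group_ids` is a hypothetical
    mapping that your own CI/CD tooling would have to maintain.
    """
    # First-time registrations have no entry, so this resolves to None.
    version_group_id: Optional[str] = known_group_ids.get(flow.name)
    return flow.register(
        project_name=project_name,
        version_group_id=version_group_id,
    )
```

Where the known IDs come from (a file in the repo, a query against the API before registering) is left open here; that part depends on your setup.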

Tomas Moreno

05/16/2023, 2:06 PM
implemented the version group IDs yesterday but still timing out =/
[2021-01-14 22:00:00.000] ERROR    --- [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]
Traceback (most recent call last):
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 681, in <module>
    main()
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 675, in main
    create_proj_and_register_flows(flows, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 302, in create_proj_and_register_flows
    register_flow(flow, flow_file, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 419, in register_flow
    flow.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/core/flow.py", line 1727, in register
    registered_flow = client.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 1176, in register
    res = self.graphql(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]

Bianca Hoch

05/16/2023, 2:14 PM
Hmm... I'll raise this with the team to see what the next steps for remedying this would look like.

Tomas Moreno

05/16/2023, 2:23 PM
sounds good! thank you for the support. I'm happy to send along any code or anything that might help. upgrading to v2 is our project for H2 this year

Bianca Hoch

05/16/2023, 7:02 PM
Hey Tomas, our engineering team has been putting in a few fixes to help out with this problem. Can you try the registration process again and let us know if the timeouts persist?

Tomas Moreno

05/16/2023, 7:09 PM
yeah for sure! give me a minute to kick some cicd
still getting some timeouts on our registration scripts. about half of them are completing successfully
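
Since the failures are intermittent, one stopgap is to retry the registration call when the API reports a timeout. The sketch below is an assumption about how that could be done, not something from the Prefect docs: `retry_on_timeout` is a hypothetical helper that re-runs a callable with exponential backoff whenever the raised exception's text contains `Operation timed out`, which is the message in the tracebacks above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_timeout(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying only when it fails with a timeout message."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # The GraphQL error payload ends up in the exception text.
            is_timeout = "Operation timed out" in str(exc)
            if not is_timeout or attempt == attempts:
                raise
            # Wait 2s, 4s, 8s, ... (with the defaults) before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")
```

In a registration script this would wrap the existing call, for example `retry_on_timeout(lambda: flow.register(project_name=...))`. One caveat: a request that timed out may still have completed on the server, so a retry could create an extra flow version. As far as I know, `flow.register` in Prefect 1.x accepts an `idempotency_key` argument intended for that case, but check the docs for your version before relying on it.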

Scott Aefsky

05/17/2023, 7:53 PM
@Bianca Hoch Just to add another voice, this has been happening fairly regularly for my team, as well. The stack trace is a bit different, though:
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.10/lib/python3.8/site-packages/prefect/cli/build_register.py", line 475, in build_and_register
    flow_id, flow_version, is_new = register_serialized_flow(
  File "/root/.pyenv/versions/3.8.10/lib/python3.8/site-packages/prefect/cli/build_register.py", line 399, in register_serialized_flow
    res = client.graphql(
  File "/root/.pyenv/versions/3.8.10/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]

Bianca Hoch

05/17/2023, 8:21 PM
Hey all, thanks for raising! Will share this information with the team.

Beizhen

05/19/2023, 12:44 PM
Hi @Bianca Hoch, we have experienced a similar issue: we have seen a large number of flows hanging in the RUNNING state in the UI while not actually running in the cluster. Here is the agent log:
[2023-05-19 11:56:17,686] ERROR - agent | Failed to query for ready flow runs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/agent.py", line 320, in _submit_deploy_flow_run_jobs
    flow_run_ids = self._get_ready_flow_runs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/agent.py", line 571, in _get_ready_flow_runs
    result = self.client.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['get_runs_in_queue'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]
The agent runs on prefect 1.2.0-python3.8. I could see from the log that we have had this problem since at least 2023-04-19. It would be great to get this resolved. Please let me know if I need to provide more info. Thanks.
An update: the error still occurs after updating the agent to use prefect 1.4.1-python3.8. Is there any news from the team? 🙂 @Bianca Hoch

Bianca Hoch

05/24/2023, 9:41 PM
Hi @Beizhen! I shared the error you sent with the team: that specific timeout is at the agent level and shouldn't affect the flow run states or the health of the agent. Is this error intermittent? How often do you see it pop up?
Also, my apologies to everyone for not circling back to this thread sooner.

Beizhen

05/25/2023, 10:01 AM
Hi @Bianca Hoch, thanks for coming back. It does not affect the flow run states, but I suspect it might be the reason why flows get stuck in the Running state in the UI after the pods have already been terminated in the cluster. This error pops up a couple of times every day, both before and after the update to 1.4.1-python3.8.
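
To put a number on "how often", the agent log can be tallied per day. This is a rough sketch that assumes the log lines look like the one pasted above (`[YYYY-MM-DD HH:MM:SS,mmm] ERROR - agent | Failed to query for ready flow runs`); `count_query_failures_per_day` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the agent error line shown earlier and captures the date part.
FAILURE_LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}) [\d:,]+\] ERROR - agent \| "
    r"Failed to query for ready flow runs"
)


def count_query_failures_per_day(lines: Iterable[str]) -> Counter:
    """Count 'Failed to query for ready flow runs' errors per calendar day."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILURE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the agent log file gives a per-day count that can be lined up against the times the stuck flow runs were started.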