# prefect-community
t
hey all, on prefect 1.0 my team uses CICD to register flows to different projects on a pr merge. we have
PREFECT__CLOUD__REQUEST_TIMEOUT=60
in our CICD env, but I see timeouts happening in 14s sometimes. anyone have any tips on what could be going on?
here's a stacktrace, definitely coming from the graphql
Traceback (most recent call last):
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 670, in <module>
    main()
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 664, in main
    create_proj_and_register_flows(flows, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 300, in create_proj_and_register_flows
    register_flow(flow, flow_file, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 409, in register_flow
    flow.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/core/flow.py", line 1727, in register
    registered_flow = client.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 1176, in register
    res = self.graphql(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]
b
Hey Tomas, it's possible that you have a large number of registered flows in your tenant, which could be causing the timeouts. Are you by chance passing in a version group ID when registering the flow? After taking a look at the v1 docs, I saw that when registering a flow, "if no version group id is provided at registration, the platform checks if any other flows in the same project have the same name as the new flow." That lookup could be contributing to the timeout if no version group ID is used.
If you aren't already, maybe try passing in that ID at registration to see if it helps.
t
ooo this is great thank you @Bianca Hoch! the first time we register a flow it won't have any version group ID right? so I'd still pass in null for those?
b
I believe that is correct! ^
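For anyone following along, here is a minimal sketch of the idea discussed above: pass `version_group_id` to Prefect 1.x `Flow.register` when one is already known, and leave it out on a flow's first registration. The helper name `build_register_kwargs`, the `KNOWN_GROUP_IDS` mapping, the flow names, and the project name `dwh` are all hypothetical, made up for illustration. The snippet only builds the keyword arguments, so it runs without Prefect installed.

```python
from typing import Any, Dict, Optional


def build_register_kwargs(
    project_name: str, version_group_id: Optional[str] = None
) -> Dict[str, Any]:
    """Build keyword arguments for a Prefect 1.x `Flow.register` call.

    The version group ID is only included when one is known, so a
    first-time registration falls back to the default (None).
    """
    kwargs: Dict[str, Any] = {"project_name": project_name}
    if version_group_id is not None:
        kwargs["version_group_id"] = version_group_id
    return kwargs


# Hypothetical mapping of flow name -> version group ID (for example, kept
# in the repo or looked up before registration). Values are placeholders.
KNOWN_GROUP_IDS = {"daily-load": "hypothetical-group-id"}

# A flow that has never been registered has no entry, so .get() returns None.
first_time = build_register_kwargs("dwh", KNOWN_GROUP_IDS.get("new-flow"))
repeat = build_register_kwargs("dwh", KNOWN_GROUP_IDS.get("daily-load"))

print(first_time)
print(repeat)
```

In a registration script this would presumably be used as `flow.register(**kwargs)`, with the ID stored somewhere after the first successful registration.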
t
implemented the version group IDs yesterday but still timing out =/
[2021-01-14 22:00:00.000] ERROR    --- [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]
Traceback (most recent call last):
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 681, in <module>
    main()
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 675, in main
    create_proj_and_register_flows(flows, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 302, in create_proj_and_register_flows
    register_flow(flow, flow_file, args)
  File "/home/runner/work/dwh/dwh/deploy/register_flows.py", line 419, in register_flow
    flow.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/core/flow.py", line 1727, in register
    registered_flow = client.register(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 1176, in register
    res = self.graphql(
  File "/home/runner/work/dwh/dwh/.venv/lib/python3.10/site-packages/prefect/client/client.py", line 570, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]
b
Hmm... I'll raise this with the team to see what the next steps would be for fixing it.
t
sounds good! thank you for the support. I'm happy to send along any code or anything that might help. upgrading to v2 is our project for H2 this year
b
Hey Tomas, our engineering team has been putting in a few fixes to help with this problem. Can you try the registration process again and let us know if the timeouts persist?
t
yeah for sure! give me a minute to kick some cicd
still getting some timeouts on our registration scripts. about half of them are completing successfully
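Since the timeouts here are intermittent (roughly half the registrations succeed), one possible stopgap on the CI side is to retry the call with exponential backoff. This is only a sketch under assumptions, not a fix for the underlying API timeout: `retry_on_timeout`, `is_timeout_error`, and `flaky_register` are hypothetical names, the timeout is detected by matching the `'Operation timed out'` message seen in the tracebacks above, and a stand-in function is used in place of a real `flow.register(...)` call so the example runs without Prefect.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def is_timeout_error(exc: Exception) -> bool:
    """Treat any error whose text contains the API timeout message as retryable."""
    return "Operation timed out" in str(exc)


def retry_on_timeout(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff when it fails with a timeout."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise non-timeout errors immediately, and give up on the last attempt.
            if not is_timeout_error(exc) or attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Stand-in for a registration call that times out twice, then succeeds.
calls = {"n": 0}


def flaky_register() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("[{'message': 'Operation timed out'}]")
    return "flow-id"


# Record the delays instead of actually sleeping, to keep the demo fast.
delays: List[float] = []
result = retry_on_timeout(flaky_register, sleep=delays.append)
print(result, delays)
```

One caveat: a request that times out may still have completed on the server, so a retry could register an extra flow version. Prefect 1.x `Flow.register` accepts an `idempotency_key` argument, which may help avoid that when retrying.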
s
@Bianca Hoch Just to add another voice, this has been happening fairly regularly for my team, as well. The stack trace is a bit different, though:
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.10/lib/python3.8/site-packages/prefect/cli/build_register.py", line 475, in build_and_register
    flow_id, flow_version, is_new = register_serialized_flow(
  File "/root/.pyenv/versions/3.8.10/lib/python3.8/site-packages/prefect/cli/build_register.py", line 399, in register_serialized_flow
    res = client.graphql(
  File "/root/.pyenv/versions/3.8.10/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['create_flow_from_compressed_string'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]
b
Hey all, thanks for raising! Will share this information with the team.
b
Hi @Bianca Hoch, we have experienced a similar issue: we have seen a large number of flows hanging in the Running state in the UI that are not actually running in the cluster. Here is the agent log:
[2023-05-19 11:56:17,686] ERROR - agent | Failed to query for ready flow runs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/agent.py", line 320, in _submit_deploy_flow_run_jobs
    flow_run_ids = self._get_ready_flow_runs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/agent.py", line 571, in _get_ready_flow_runs
    result = self.client.graphql(
  File "/usr/local/lib/python3.8/site-packages/prefect/client/client.py", line 473, in graphql
    raise ClientError(result["errors"])
prefect.exceptions.ClientError: [{'path': ['get_runs_in_queue'], 'message': 'Operation timed out', 'extensions': {'code': 'API_ERROR'}}]
The agent runs on prefect 1.2.0-python3.8. I can see from the logs that we have had this problem since at least 2023-04-19. It would be so nice to resolve this. Please let me know if I need to provide more info. Thanks.
An update: the error still occurs after updating the agent to prefect 1.4.1-python3.8. Is there any news from the team? 🙂 @Bianca Hoch
b
Hi @Beizhen! I shared the error you sent with the team: that specific timeout is at the agent level and shouldn't affect the flow run states or the health of the agent. Is this error intermittent? How often do you see it pop up?
Also my apologies for not circling back to this thread sooner everyone
b
Hi @Bianca Hoch, thanks for coming back. It does not affect the flow run states, but I suspect it might be the reason why flows get stuck in the Running state in the UI after their pods have already been terminated in the cluster. This error pops up a couple of times every day, both before and after the update to 1.4.1-python3.8.