    Bo

    4 months ago
    Hi, I am storing code in GitHub, and I have some scheduled flows erroring out semi-regularly due to a GitHub API timeout. Is there a way to increase the timeout limit and/or add an exponential backoff? It also doesn't seem like the flow tries to rerun. Thanks!
    ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: *redacted* (Caused by ConnectTimeoutError(, 'Connection to api.github.com timed out. (connect timeout=15)'))
    Kevin Kho

    4 months ago
    Is that from the Prefect client? You can set the environment variable:
    flow.run_config = KubernetesRun(..., env={"PREFECT__CLOUD__REQUEST_TIMEOUT": 60})
    Or you can edit that setting in the config.toml.
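
    A minimal sketch of that suggestion, assuming Prefect 1.x; the repo, path, and flow names below are placeholders, and Bo's logs suggest ECS, where ECSRun accepts an env dict the same way. Note that PREFECT__CLOUD__REQUEST_TIMEOUT governs the Prefect Client's calls to Prefect Cloud/Server, not calls to api.github.com:

        from prefect import Flow
        from prefect.run_configs import KubernetesRun
        from prefect.storage import GitHub

        # Placeholder repo/path; the flow code is pulled from GitHub at run time.
        storage = GitHub(repo="my-org/my-repo", path="flows/my_flow.py")

        # Raise the Prefect Client's request timeout to 60 seconds via the
        # run config's environment variables.
        run_config = KubernetesRun(env={"PREFECT__CLOUD__REQUEST_TIMEOUT": "60"})

        with Flow("my-flow", storage=storage, run_config=run_config) as flow:
            ...  # tasks go here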
    Bo

    4 months ago
    This is the agent trying to retrieve the flow code from GitHub. Isn't what you sent for the agent's API calls to Prefect Cloud/Server?
    Kevin Kho

    4 months ago
    Can you give me a longer traceback? I considered that, but I thought it might be the Prefect Client making this call; I’m not sure. That timeout setting is on the Prefect Client.
    Bo

    4 months ago
    11:00:38  INFO   agent             Submitted for execution: Task arn:aws:ecs:*****
    11:01:24  INFO   GitHub            Downloading flow from GitHub storage - repo: '****', path: '****.py'
    11:01:40  ERROR  execute flow-run  Failed to load and execute flow run: ConnectTimeout(MaxRetryError("HTTPSConnectionPool(host='api.github.com', port=443): Max retries exceeded with url: **** (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7ff417584a10>, 'Connection to api.github.com timed out. (connect timeout=15)'))"))
    Kevin Kho

    4 months ago
    Ah, that doesn’t help. One second, let me read the source.
    Yeah, you might be right that this is not the Prefect client. We use the github library under the hood, so I’m seeing if there is a way to increase that timeout.
    Bo

    4 months ago
    @Kevin Kho Do you know if there's any way I could retry on an error like this? I know how to retry tasks, but since this happens at the storage layer, I don't see how I can retry the flow.
    Kevin Kho

    4 months ago
    Hey Bo, I will be slow to respond due to PyCon today and tomorrow. I will leave a message with Anna and the team about this.
    So I didn’t find this the last time I looked, but the Github class used under the hood exposes a timeout. Maybe we can expose it so it can be increased. I also left messages with the team about that.
    Bo

    4 months ago
    Ahh, good catch, that would be excellent! There is also a retry parameter; that would be just as, if not more, important to expose.
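
    For reference, a minimal standalone sketch of the two PyGithub knobs being discussed; this is plain PyGithub usage with placeholder token/repo/path values, not the actual Prefect storage internals, and the retry/backoff numbers are illustrative only:

        from github import Github
        from urllib3.util.retry import Retry

        client = Github(
            "<access-token>",  # placeholder token
            # Per-request timeout; PyGithub's default of 15 seconds matches the
            # "connect timeout=15" in the error above.
            timeout=60,
            # retry accepts an int or a urllib3 Retry object, which also allows
            # exponential backoff between attempts.
            retry=Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504]),
        )

        # Roughly what the storage layer needs: fetch a file's contents from a repo.
        repo = client.get_repo("my-org/my-repo")           # placeholder repo
        flow_file = repo.get_contents("flows/my_flow.py")  # placeholder path
        print(len(flow_file.decoded_content), "bytes downloaded")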
    Kevin Kho

    4 months ago
    For now, you may even be able to edit your own installed version of Prefect to increase it. I don’t have a timeline, and I can’t say yet whether this will be an accepted change. Would you be interested in making a PR? It seems pretty doable.
    Anna Geller

    4 months ago
    Catching up here. I wonder whether we could approach it a bit differently. It seems that the problem is that, due to a transient error, your flow run sometimes fails to start because the flow code cannot be pulled from your GitHub storage within your ECS task, correct? Since you are on Prefect Cloud, you could leverage Automations to catch such issues and react to them. For instance, in the image below, you can see how to start a new run of the same flow (effectively a "flow-level restart") if your flow run fails to start within e.g. 120 seconds.
    Bo

    4 months ago
    Thanks! Going to try this out and test over the weekend. If that fails, I will issue a PR with the proposed changes.
    @Anna Geller Yes, that is the correct diagnosis of the problem. And by the way, I don't seem to have "does not start" as an option under Automations.
    Ohh, never mind, I see: I have to be on the Standard or Enterprise plan for that.