    Lukas N.

    1 year ago
    [✔️ solved] Mapped task issue with apollo (more in thread)
    We're running Prefect server version 0.13.19 in our own k8s cluster. Everything works fine, but from time to time some mapped tasks fail with what seems like an internal error in communication with Apollo. As you can see in the screenshot of the flow run, 5 instances of the mapped task are missing. In the logs I found the error posted below. I don't understand how 19 tasks work just fine while 5 of them fail with a client error. Unfortunately I don't have a reproducible example yet, as this error occurs rarely. Also, the GraphQL query looks OK to me. Does anyone have ideas to nudge me in the right direction before I dive deeper into the problem 🙏 ?
    Failed to retrieve task state with error: ClientError('400 Client Error: Bad Request for url: <http://prefect-apollo:4200/graphql>\n\nThe following error messages were provided by the GraphQL server:\n\n    GRAPHQL_VALIDATION_FAILED: Cannot query field "get_or_create_task_run_info" on\n        type "Mutation". Did you mean "get_or_create_task_run" or\n        "get_or_create_mapped_task_run_children"?\n\nThe GraphQL query was:\n\n    mutation {\n            get_or_create_task_run_info(input: { flow_run_id: "6817c28a-1bba-4caa-9728-e02606a74321", task_id: "9de63fe3-c41e-44f0-8139-ed307f3edb5c", map_index: 8 }) {\n                id\n                version\n                serialized_state\n        }\n    }\n\nThe passed variables were:\n\n    null\n',)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 360, in _send_request
        response.raise_for_status()
      File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: <http://prefect-apollo:4200/graphql>
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/prefect/engine/cloud/task_runner.py", line 193, in initialize_run
        map_index=map_index,
      File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 1380, in get_task_run_info
        result = self.graphql(mutation)  # type: Any
      File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 303, in graphql
        retry_on_api_error=retry_on_api_error,
      File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 219, in post
        retry_on_api_error=retry_on_api_error,
      File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 445, in _request
        session=session, method=method, url=url, params=params, headers=headers
      File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 373, in _send_request
        raise ClientError(f"{exc}\n{graphql_msg}") from exc
    prefect.utilities.exceptions.ClientError: 400 Client Error: Bad Request for url: <http://prefect-apollo:4200/graphql>
    
    The following error messages were provided by the GraphQL server:
    
        GRAPHQL_VALIDATION_FAILED: Cannot query field "get_or_create_task_run_info" on
            type "Mutation". Did you mean "get_or_create_task_run" or
            "get_or_create_mapped_task_run_children"?
    
    The GraphQL query was:
    
        mutation {
                get_or_create_task_run_info(input: { flow_run_id: "6817c28a-1bba-4caa-9728-e02606a74321", task_id: "9de63fe3-c41e-44f0-8139-ed307f3edb5c", map_index: 8 }) {
                    id
                    version
                    serialized_state
            }
        }
    
    The passed variables were:
    
        null

    Spencer

    1 year ago
    I'd say verify the server version and the runtime version? get_or_create_task_run_info was added very recently.
    Michael Adkins

    1 year ago
    Yeah this is a version mismatch error
    Lukas N.

    1 year ago
    Yeah, indeed it was a version mismatch error. We're running multiple replicas of Apollo. The requests go through a Service, which explains why only some of the requests were failing: 1 out of 3 Apollo pods was responding with 400. I guess we have a bug in our rolling Kubernetes update. Looking at the Apollo code a bit, I think this might have happened:
    • The updated Apollo pod starts and asks for the GraphQL schema, but that request is handled by an old instance of GraphQL.
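For anyone hitting the same symptom, one way to confirm that a single replica is serving a stale schema is to query each Apollo pod directly, bypassing the Service, and check whether the mutation exists. The sketch below is only an illustration, not something from the Prefect codebase: it uses the standard GraphQL introspection query, assumes introspection is enabled on the Apollo instances, and the pod URLs in the usage note are hypothetical placeholders.

```python
import json
from typing import Callable, List
from urllib import request

# Standard GraphQL introspection query listing every field on the Mutation type.
INTROSPECTION_QUERY = "{ __schema { mutationType { fields { name } } } }"


def mutation_names(payload: dict) -> List[str]:
    """Extract the mutation field names from an introspection response body."""
    fields = payload["data"]["__schema"]["mutationType"]["fields"]
    return [field["name"] for field in fields]


def post_graphql(url: str, query: str, timeout: float = 10.0) -> dict:
    """POST a GraphQL query using only the standard library; return decoded JSON."""
    body = json.dumps({"query": query}).encode("utf-8")
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def find_stale_endpoints(
    urls: List[str],
    required_mutation: str,
    fetch: Callable[[str, str], dict] = post_graphql,
) -> List[str]:
    """Return the endpoints whose schema does not expose `required_mutation`.

    `fetch` is injectable so the check can be exercised without a live cluster.
    """
    stale = []
    for url in urls:
        names = mutation_names(fetch(url, INTROSPECTION_QUERY))
        if required_mutation not in names:
            stale.append(url)
    return stale
```

Usage would look roughly like `find_stale_endpoints(["http://10.0.0.11:4200/graphql", "http://10.0.0.12:4200/graphql"], "get_or_create_task_run_info")`, with the per-pod addresses taken from the cluster (the ones shown here are made up). Any URL returned is a replica still serving the old schema.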
    Michael Adkins

    1 year ago
    I’d like to find a way to avoid this in the deployment but I’m not sure what the best way is yet!