Lukas N.

12/14/2020, 2:52 PM
[✔️ solved] Mapped task issue with apollo (more in thread)
We're running Prefect Server version 0.13.19 in our own k8s cluster. Everything mostly works, but from time to time some mapped tasks fail with what looks like an internal error in communication with Apollo. As you can see in the screenshot of the flow run, 5 instances of the mapped task are missing. I found the error posted below in the logs. I don't understand how 19 tasks can work just fine while 5 of them fail with a client error. Unfortunately I don't have a reproducible example yet, as the error occurs rarely. The GraphQL query also looks OK to me. Does anyone have ideas to nudge me in the right direction before I dive deeper into the problem? 🙏
Failed to retrieve task state with error: ClientError('400 Client Error: Bad Request for url: <http://prefect-apollo:4200/graphql>\n\nThe following error messages were provided by the GraphQL server:\n\n    GRAPHQL_VALIDATION_FAILED: Cannot query field "get_or_create_task_run_info" on\n        type "Mutation". Did you mean "get_or_create_task_run" or\n        "get_or_create_mapped_task_run_children"?\n\nThe GraphQL query was:\n\n    mutation {\n            get_or_create_task_run_info(input: { flow_run_id: "6817c28a-1bba-4caa-9728-e02606a74321", task_id: "9de63fe3-c41e-44f0-8139-ed307f3edb5c", map_index: 8 }) {\n                id\n                version\n                serialized_state\n        }\n    }\n\nThe passed variables were:\n\n    null\n',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 360, in _send_request
    response.raise_for_status()
  File "/usr/local/lib/python3.6/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: <http://prefect-apollo:4200/graphql>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/prefect/engine/cloud/task_runner.py", line 193, in initialize_run
    map_index=map_index,
  File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 1380, in get_task_run_info
    result = self.graphql(mutation)  # type: Any
  File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 303, in graphql
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 219, in post
    retry_on_api_error=retry_on_api_error,
  File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 445, in _request
    session=session, method=method, url=url, params=params, headers=headers
  File "/usr/local/lib/python3.6/site-packages/prefect/client/client.py", line 373, in _send_request
    raise ClientError(f"{exc}\n{graphql_msg}") from exc
prefect.utilities.exceptions.ClientError: 400 Client Error: Bad Request for url: <http://prefect-apollo:4200/graphql>

The following error messages were provided by the GraphQL server:

    GRAPHQL_VALIDATION_FAILED: Cannot query field "get_or_create_task_run_info" on
        type "Mutation". Did you mean "get_or_create_task_run" or
        "get_or_create_mapped_task_run_children"?

The GraphQL query was:

    mutation {
            get_or_create_task_run_info(input: { flow_run_id: "6817c28a-1bba-4caa-9728-e02606a74321", task_id: "9de63fe3-c41e-44f0-8139-ed307f3edb5c", map_index: 8 }) {
                id
                version
                serialized_state
        }
    }

The passed variables were:

    null

Spencer

12/14/2020, 3:32 PM
I'd say verify the server version and the runtime version? get_or_create_task_run_info was added very recently.
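A minimal sketch of that version check, assuming you collect both version strings yourself (for example, the value of prefect.__version__ inside the flow's runtime image and the tag of the server images in your deployment). The helper names and the "core-0.13.19" tag format are illustrative assumptions, not part of Prefect's API.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract up to three numeric components, e.g. 'core-0.13.19' -> (0, 13, 19)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def versions_match(runtime_version: str, server_version: str) -> bool:
    """Return True when runtime and server report the same numeric version."""
    return parse_version(runtime_version) == parse_version(server_version)


if __name__ == "__main__":
    # Hypothetical values; substitute the ones from your own cluster.
    runtime = "0.13.19"
    server = "core-0.13.18"
    if not versions_match(runtime, server):
        print(f"Version mismatch: runtime {runtime} vs server {server}")
```

A mismatch here only tells you where to look. The client can request mutations that an older server schema does not expose, which is what the GRAPHQL_VALIDATION_FAILED message above indicates.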

Zanie

12/14/2020, 5:44 PM
Yeah this is a version mismatch error

Lukas N.

12/14/2020, 6:33 PM
Yeah, indeed it was a version mismatch error. We're running multiple replicas of Apollo. The requests go through a service, which explains why only some of the requests were failing: 1 out of 3 Apollo pods was responding with 400. I guess we have a bug in our rolling Kubernetes update. Looking at the Apollo code a bit, I think this might have happened:
• The Apollo pod is updated -> it asks for the GraphQL schema, but that request is handled by an old instance of GraphQL
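One way to confirm which replica is serving an outdated schema is to send a standard GraphQL introspection query to each Apollo pod directly, bypassing the service, and check whether the mutation is listed. The sketch below assumes introspection is enabled on the Apollo server and that each pod is reachable at its own address; the pod URLs and helper names are hypothetical.

```python
import json
import urllib.request

# Standard GraphQL introspection: list the names of all mutation fields.
INTROSPECTION_QUERY = "{ __schema { mutationType { fields { name } } } }"


def mutation_names(introspection_response: dict) -> set:
    """Collect mutation field names from an introspection response."""
    fields = introspection_response["data"]["__schema"]["mutationType"]["fields"]
    return {field["name"] for field in fields}


def fetch_schema(url: str, timeout: float = 10.0) -> dict:
    """POST the introspection query to one GraphQL endpoint and decode the JSON."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def stale_replicas(urls, required_mutation, fetch=fetch_schema):
    """Return the endpoints whose schema lacks the required mutation."""
    return [
        url for url in urls if required_mutation not in mutation_names(fetch(url))
    ]


if __name__ == "__main__":
    # Hypothetical pod addresses; use the pod IPs from your cluster,
    # not the service URL, so each replica is queried individually.
    pods = [
        "http://10.0.0.11:4200/graphql",
        "http://10.0.0.12:4200/graphql",
        "http://10.0.0.13:4200/graphql",
    ]
    print(stale_replicas(pods, "get_or_create_task_run_info"))
```

Any endpoint this returns would be a candidate for a restart, so that it picks up the schema again once the GraphQL instances have all been updated.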

Zanie

12/14/2020, 8:17 PM
I’d like to find a way to avoid this in the deployment but I’m not sure what the best way is yet!