# prefect-community
k
Hello! Running into an issue I don't fully understand. I have a series of flows that have been running for a couple of months without incident. We're running all of our production on 1.2.0-python3.8 for the ECS Agent and for all ECSRun instances. Our flows use S3 storage. Nothing has been purposely or visibly changed in our Prefect installation: no new flow deployments and no changes to our AWS infrastructure. Our flows, which do little more than kick off AWS Batch jobs via the BatchSubmit task, are now consistently failing on the BatchSubmit invocation with the following error:
```
Unexpected error: AttributeError("'S3Result' object has no attribute 'upload_options'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 930, in get_task_run_state
    result = self.result.write(value, **formatting_kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/results/s3_result.py", line 89, in write
    ExtraArgs=self.upload_options,
AttributeError: 'S3Result' object has no attribute 'upload_options'
```
Attached is one of the flows that are failing. Does the task running code record execution state back to the storage bucket? Thanks in advance!
k
This looks like a version mismatch. `upload_options` was added to `S3Result` in 1.2.3, so you are pulling a later image somewhere. Are you using DaskExecutor?
I'd check the agent version, the Prefect version the flow was registered with, and the Prefect version in the execution container.
k
Ok. We are not using Dask. Our agent is 1.2.0, and the ECSRun image should be pinned to 1.2.0 based on my code. Everything was registered some time ago under 1.2.0.
Does the run config in my flow run look correct?
```json
{
  "cpu": null,
  "env": null,
  "type": "ECSRun",
  "image": "prefecthq/prefect:latest-python3.8",
  "labels": [],
  "memory": null,
  "__version__": "1.2.0",
  "task_role_arn": null,
  "run_task_kwargs": null,
  "task_definition": null,
  "execution_role_arn": null,
  "task_definition_arn": null,
  "task_definition_path": null
}
```
oh wait
is that image the problem?
k
Yeah, `latest` might pull 1.2.3.
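Because `latest` is a floating tag, one way to catch this earlier is to scan serialized run configs, like the one posted above, for images that don't pin a Prefect version. A minimal sketch follows. The helper names `pinned_prefect_version` and `check_run_config` and the tag regex are hypothetical; the regex only covers the `X.Y.Z` and `X.Y.Z-pythonX.Y` tag shapes seen in this thread. It also assumes that `__version__` in the serialized run config records the Prefect version used at registration.

```python
import json
import re

# Assumed tag scheme, based only on the tags seen in this thread:
# "<X.Y.Z>" optionally followed by "-python<X.Y>". Anything else, such as
# "latest" or "latest-python3.8", is treated as a floating tag.
PINNED_TAG = re.compile(r"^(\d+\.\d+\.\d+)(?:-python\d+\.\d+)?$")


def pinned_prefect_version(image):
    """Return the Prefect version an image tag pins, or None if it floats."""
    if not image:
        return None
    _, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No tag at all (the colon, if any, belongs to a registry port),
        # so Docker pulls ":latest".
        return None
    match = PINNED_TAG.match(tag)
    return match.group(1) if match else None


def check_run_config(run_config, expected_version):
    """List reasons a serialized run config might not run on expected_version."""
    problems = []
    registered_with = run_config.get("__version__")
    if registered_with != expected_version:
        problems.append(f"registered with {registered_with}, expected {expected_version}")
    image = run_config.get("image")
    pinned = pinned_prefect_version(image)
    if pinned is None:
        problems.append(f"image {image!r} does not pin a Prefect version")
    elif pinned != expected_version:
        problems.append(f"image pins {pinned}, expected {expected_version}")
    return problems


# The run config posted above, trimmed to the keys this check reads.
posted = json.loads(
    '{"type": "ECSRun", "image": "prefecthq/prefect:latest-python3.8", "__version__": "1.2.0"}'
)
print(check_run_config(posted, "1.2.0"))
# → ["image 'prefecthq/prefect:latest-python3.8' does not pin a Prefect version"]
```

The registration version matches here; only the floating image tag is flagged, which is consistent with the diagnosis in this thread.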
k
so the ECSRun declaration in my code is being ignored?
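```python
from prefect.run_configs import ECSRun

# Pin the execution image to the same Prefect version as the agent
ECSRUN_IMAGE = "prefecthq/prefect:1.2.0-python3.8"

if __name__ == "__main__":
    # `flow`, PROJECT_NAME and ENVIRONMENT are defined elsewhere in this file
    flow.run_config = ECSRun(image=ECSRUN_IMAGE)
    flow.register(project_name=PROJECT_NAME, labels=[ENVIRONMENT])
```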
That runs at registration, and the agent seems correct.
How can I force 1.2.0?
k
I doubt it's being ignored, but I can't immediately see what would override it to `latest`.
Can I see the agent definition?
k
```json
[
  {
    "name": "${var.environment}-prefect-ecs-agent",
    "image": "${var.docker_image}",
    "essential": true,
    "command": [
      "prefect",
      "agent",
      "ecs",
      "start",
      "--cluster",
      "${var.ecs_cluster_arn}",
      "--run-task-kwargs",
      "s3://sgmt-${var.environment}-prefect-flows/run_config_private.yaml",
      "--task-role-arn",
      "${aws_iam_role.prefect-task-role.arn}"
    ],
    "environment": [
      {
        "name": "PREFECT__CLOUD__AGENT__LABELS",
        "value": "[${join(",", [for l in var.agent_labels : format("'%s'",l)])}]"
      },
      {
        "name": "PREFECT__CLOUD__AGENT__LEVEL",
        "value": "DEBUG"
      },
      {
        "name": "PREFECT__CLOUD__API",
        "value": "https://api.prefect.io"
      },
      {
        "name": "OTHER_PREFECT__CONTEXT__SECRETS__AWS_CREDENTIALS",
        "value": "{\"ACCESS_KEY\":\"${var.access_key}\",\"SECRET_ACCESS_KEY\":\"${var.secret_access_key}\"}"
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "${aws_cloudwatch_log_group.log-group.name}",
        "awslogs-region": "${var.region}",
        "awslogs-stream-prefix": "ecs",
        "awslogs-create-group": "true"
      }
    },
    "secrets": [
      {
        "name": "PREFECT__CLOUD__API_KEY",
        "valueFrom": "${var.api_key}"
      }
    ]
  }
]
```
```hcl
prefect_docker_image="prefecthq/prefect:1.2.0-python3.8"
```
That's from our Terraform.
k
How did you register? Just `python myfile.py`?
Can you try pulling the flow and its run config from the GraphQL API to see whether the image really isn't being respected when you register?
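One way to run this check from a script is sketched below. It assumes Prefect 1.x's `Client.graphql`, which accepts a query string plus a `variables` dict, and assumes the Cloud 1.0 `flow` type exposes `core_version` and `run_config` fields (these are what Prefect 1.x's `FlowView` reads, but confirm them in the API before relying on this). The query shape and the helper names `summarize_flow` and `fetch_latest_flow` are illustrative, not something posted in this thread.

```python
# Hasura-style query against the Prefect Cloud 1.0 API. The `core_version`
# and `run_config` fields on `flow` are assumptions; verify them first.
LATEST_FLOW_QUERY = """
query LatestFlow($name: String) {
  flow(where: {name: {_eq: $name}}, order_by: {version: desc}, limit: 1) {
    id
    version
    core_version
    run_config
  }
}
"""


def summarize_flow(response):
    """Pick the version-related fields out of a GraphQL response mapping."""
    flows = response["data"]["flow"]
    if not flows:
        return None
    flow = flows[0]
    run_config = flow.get("run_config") or {}
    return {
        "flow_version": flow.get("version"),
        "registered_with": flow.get("core_version"),
        "image": run_config.get("image"),
    }


def fetch_latest_flow(name):
    """Run the query with Prefect 1.x's Client (needs Prefect and an API key)."""
    # Imported lazily so summarize_flow can be used without Prefect installed.
    from prefect import Client

    response = Client().graphql(LATEST_FLOW_QUERY, variables={"name": name})
    return summarize_flow(response)
```

If `registered_with` says 1.2.0 but `image` shows a `latest` tag, the image was set that way at registration rather than overridden later by the agent.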
k
Yep, that's how we register. We're going to redeploy everything and make sure we're where we need to be. How can a patch release introduce such a breaking change?
k
I'd need to ask around. I looked at the PR and am confused about why it breaks, since a default is set.
I don't think this is necessarily a breaking change, though. The storage here is pickle-based, so the flow is serialized with 1.2.0 and deserialized with 1.2.3. Serializing and deserializing with mismatched versions will frequently cause issues like this.
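The mechanism can be reproduced with the standard library alone, which may also explain why the default mentioned above doesn't help. Unpickling rebuilds an object by restoring its saved attribute dictionary without calling `__init__`, so an attribute that only a newer `__init__` sets is simply missing. Prefect 1.x pickles flows with cloudpickle, which, like `pickle`, stores an importable class such as `S3Result` as a reference to its import path rather than its code. In the sketch below, `S3ResultOld`, `S3ResultNew` and the module name `fake_s3_result` are hypothetical stand-ins, not Prefect's code, and it assumes the 1.2.3 default lives in `__init__`.

```python
import pickle
import sys
import types

# Hypothetical stand-in for prefect.engine.results.s3_result; registering it
# in sys.modules lets pickle resolve classes by name however this is run.
fake_module = types.ModuleType("fake_s3_result")
sys.modules["fake_s3_result"] = fake_module


class S3ResultOld:
    """Sketch of S3Result before upload_options existed (as in 1.2.0)."""

    def __init__(self, bucket):
        self.bucket = bucket


class S3ResultNew:
    """Sketch of S3Result after upload_options was added (as in 1.2.3)."""

    def __init__(self, bucket, upload_options=None):
        self.bucket = bucket
        # The default is only ever applied here, in __init__.
        self.upload_options = upload_options or {}

    def write(self, value):
        # Mirrors the failing line in the traceback: ExtraArgs=self.upload_options
        return {"Bucket": self.bucket, "Body": value, "ExtraArgs": self.upload_options}


# Give both sketches the same import path, the way S3Result keeps one path
# across Prefect releases.
for cls in (S3ResultOld, S3ResultNew):
    cls.__module__ = "fake_s3_result"
    cls.__name__ = cls.__qualname__ = "S3Result"

# "Register" under the old version: pickle stores the instance's __dict__
# plus a reference to fake_s3_result.S3Result, not the class code itself.
fake_module.S3Result = S3ResultOld
payload = pickle.dumps(S3ResultOld("my-bucket"))

# "Run" under the new version: unpickling restores __dict__ without calling
# __init__, so upload_options is never set.
fake_module.S3Result = S3ResultNew
restored = pickle.loads(payload)

error = None
try:
    restored.write(b"task output")
except AttributeError as exc:
    error = str(exc)

# Serializing with the same version that deserializes (re-registering) works.
fresh = pickle.loads(pickle.dumps(S3ResultNew("my-bucket")))
```

`error` ends up holding the same `'S3Result' object has no attribute 'upload_options'` message as the traceback, while `fresh` writes normally. Re-registering with a pinned image keeps the two versions from drifting apart again.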
k
I have historically had the understanding that pickling is tied to the Python major version. Once we redeploy, we'll make sure all that lines up. Just a little frustrating to track down when things work for several weeks and then kaboom!
thanks for the assist
k
Thanks for reporting! We do have a special serializer specifically for the S3Result. This would not have happened with other Result classes.