# prefect-server
g
It looks like the docker image `prefecthq/prefect:all_extras` is running Python 3.8. That seems to require all clients to move to 3.8, since cloudpickle might be sensitive to the version difference. Am I right about that? Is the Prefect team's position that everything needs to be on Python 3.8? I might not have RTFM'd enough...
I might be wrong, but I feel like when I started looking at Prefect, 3.7 was required, so I intentionally set everything up for 3.7. Before I go to the trouble of moving everything to 3.8, I figured I should ask.
j
Yeah, Python bumped the default pickle protocol in 3.8, so flows packaged with 3.8 will need to run on 3.8. In general, though, we recommend you always run flows in the same (or a very similar) environment to the one they were registered in.
g
Just FWIW, that is potentially really tough if we want flows to be created by a diverse team. On some platforms, upgrading your Python version is not a small task. In addition, infrastructure is laid down with specific versions, so bumping this piece may require a big change. Not sure it is the right move; it potentially causes me a s*** load of work.
j
Sorry, backtracking a bit. Flows packaged with older versions can run on the new version; flows packaged with newer versions cannot run on the old version.
g
I understand that 3.8 has been out a while...
j
So your old stuff should still work, but flows packaged with 3.8 cannot run on 3.7
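That asymmetry can be sketched with the stdlib `pickle` module alone, nothing Prefect-specific: 3.8 bumped `pickle.DEFAULT_PROTOCOL` from 3 to 5, and newer interpreters can always read older protocols, but not vice versa.

```python
import pickle

data = {"flow": "example", "tasks": [1, 2, 3]}

# Newer interpreters always read older protocols, so a payload
# written with protocol 3 (the 3.7 default) loads fine on 3.8:
payload = pickle.dumps(data, protocol=3)
assert pickle.loads(payload) == data

# Python 3.8 bumped the default from protocol 3 to protocol 5, so
# bytes written with the *default* on 3.8 contain opcodes that a
# 3.7 interpreter's pickle module does not understand.
print("default protocol:", pickle.DEFAULT_PROTOCOL)  # 3 on <=3.7, 5 on >=3.8
```

One workaround when a payload must cross versions is to pass an explicit older `protocol=` on the writing side, at the cost of the newer protocol's features.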
g
ok - I have a flow packaged on 3.7.8:
```
(rightsize-venv) MacBook-Pro:scripts gdesmarais$ python --version
Python 3.7.8
```
that, after the hoops of getting it deployed on Fargate, results in the following message on the AWS side:
```
[2020-07-21 13:56:16] INFO - prefect.S3 | Downloading datasciences/prefect_flows/dask_cloud_provider_test from celsius-temp-data
an integer is required (got type bytes)
Traceback (most recent call last):
  File "/usr/local/bin/prefect", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/execute.py", line 80, in cloud_flow
    raise exc
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/execute.py", line 69, in cloud_flow
    flow = storage.get_flow(storage.flows[flow_data.name])
  File "/usr/local/lib/python3.8/site-packages/prefect/environments/storage/s3.py", line 90, in get_flow
    return cloudpickle.loads(output)
TypeError: an integer is required (got type bytes)
```
From what I gather, that error message is indicative of a Python version mismatch with cloudpickle.
j
Hmmm, that's odd. Pickle is a backwards-compatible format; this might be one of cloudpickle's serializers that's failing, rather than the pickle protocol itself (which is what you'd see going from 3.8 -> 3.7).
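The distinction matters because plain pickle and cloudpickle serialize functions very differently. A rough illustration with just the stdlib (cloudpickle's behavior is described in the comments, not exercised):

```python
import pickle

# Plain pickle stores a function *by reference* (module + qualified
# name), so it refuses anything that can't be re-imported by name on
# the loading side -- lambdas, notebook-defined functions, etc.:
try:
    pickle.dumps(lambda x: x + 1)
except Exception as exc:
    print("plain pickle refuses lambdas:", exc)

# cloudpickle works around this by serializing the function's code
# object (bytecode) *by value*. CPython bytecode is not stable across
# 3.7 / 3.8, which is why cloudpickle.loads() can fail on a payload
# produced under a different interpreter version even though the
# pickle *protocol* itself is backwards-compatible.
```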
g
My 20 minute level informed opinion is that I agree.
it seems to be cloudpickle.
j
Hmmm. You might need to have users on 3.7 register flows that run in images with 3.7. I understand that at an organizational level that could be tricky, but Prefect flows really aren't designed to be registered and run in two disparate environments.
g
I am already creating some custom images, so I can certainly do that. I had created one already, but it was hanging after the flow executed, so I figured I should try using the hq one.
j
We have added support for script-based storage (some support is already out, and more is coming in this release, probably today/tomorrow). This lets you store the script instead of the pickle file, which will make this a bit easier, but I still recommend standardizing environments for the other benefits.
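For reference, a hedged sketch of what that looks like; the `stored_as_script` and `key` parameter names are taken from the Prefect ~0.13-era storage API and may differ in your version, and the bucket/key values are placeholders:

```python
from prefect.environments.storage import S3

# Store the flow's source file instead of a cloudpickle payload, so
# the executing image rebuilds the flow by running the script --
# sidestepping cross-version pickle issues entirely.
flow.storage = S3(
    bucket="my-flow-bucket",          # placeholder
    key="flows/my_flow.py",           # placeholder
    stored_as_script=True,
)
```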
g
So let me ask - this is the task that is pulling the flow from my storage, deserializing it, then executing it. What does that image need?
I have an image already built that pulls in prefect (among a lot of other stuff) and has this entry:
```
ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]
```
I'm not sure that is the right entrypoint, based on the hq image, or if it should be something else. I totally agree with standardizing the images for the flow-run processes.
j
The image needs:
• All the libraries your flow relies on (including `prefect`, ideally the same version as used locally when developing/registering; things may work if they aren't the same, but no guarantees).
• Any helper code not contained directly in the same file that created the flow.
• Ideally the same Python version (major.minor, e.g. 3.7 vs 3.8) as the flow was registered with. If using pickle-based storage, this is more a requirement than a suggestion.
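A cheap way to turn a version mismatch into an actionable error, rather than an opaque `cloudpickle.loads` failure, is to record the registering interpreter's version and check it at load time. `REGISTERED_WITH` and `check_python_version` below are hypothetical helpers, not Prefect API:

```python
import sys

# Hypothetical guard: capture major.minor at registration time and
# fail fast in the image, before any unpickling is attempted.
REGISTERED_WITH = sys.version_info[:2]  # e.g. (3, 7) at registration

def check_python_version(registered):
    running = sys.version_info[:2]
    if running != registered:
        raise RuntimeError(
            f"flow registered under Python {registered[0]}.{registered[1]} "
            f"but running under {running[0]}.{running[1]}; pickle-based "
            "storage needs matching versions"
        )

check_python_version(REGISTERED_WITH)  # same interpreter: no error
```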
g
Ok - got that.
(I'm using a bunch of dask and cuda)
j
I'm not sure what `prepare.sh` does, but using `tini` is a good option. Sorry, not sure I understand the question here.
g
np - not sure I asked it clearly.
My question might then be different. I had assumed my custom image for the task that deserializes and runs the flow was the problem - that I needed to use the hq one for that step, and my custom one for the actual task workers (dask).
I can go back and re-frame on what happens after the flow run is executed - I might have an issue with my dask cluster creation/configuration.
j
So the error you got above was failure to load the flow, before dask enters the picture at all. So that's an incompatibility between python/cloudpickle versions.
g
yea - agreed.
j
But more generally, we also recommend dependencies match throughout your dask cluster.
g
When I use my custom image instead of the hq one, the flow finishes executing but then hangs. I've got patches for the prefect code that insert a bunch of printlns, so I can extend those and see if I can isolate it further.
My first instinct was that my custom image didn't have something right in the dockerfile.
j
The flow executes, but the image doesn't return?
So prefect thinks the flow is done, but the image is still running?
g
Actually, it returns, with a status of 0.
And the UI doesn't show it as complete.
which makes me think there is some wrapper that is supposed to communicate back to the API to say 'done'
j
All of that is contained within the flow runner (the thing executing the flow), there's no other bit that happens afterward.
g
hrm...ok, let me dig/debug/print a whole crap ton of stuff.
I've been battling this for a while, and I feel like I'm really close.
j
Does prefect see that the flow started?
g
yes
j
That is odd. Hmmm.
If you can create a reproducible example, we'd be happy to take a look.
g
I have a fargate agent started from api with a bunch of config, it picks up the run, kicks off the first task...
that first task pulls the flow from storage, deserializes, executes, then nothing...except that the container exits with a 0
the UI doesn't show the run completed
(I kick off the flow from the ui)
I'm "pretty sure" my networking between the containers and the API is correct, but maybe I have to look at that. It is all in the same VPC with wide-open security groups (a network ACL protecting the VPC).
j
What `Environment` class are you using?
g
```
# cluster_kwargs, DEFAULT_REGION, metadata, and task_definition_kwargs
# are defined elsewhere in my setup
from prefect.engine.executors import DaskExecutor
from prefect.environments import FargateTaskEnvironment

executor = DaskExecutor(cluster_class='dask_cloudprovider.FargateCluster',
                        cluster_kwargs=cluster_kwargs)

flow.environment = FargateTaskEnvironment(
    executor=executor,
    region_name=DEFAULT_REGION,
    metadata=metadata,
    **task_definition_kwargs
)
```
I have been through so many iterations of this that there is almost certainly some cruft.
I'm going to disappear for a while - I will put my debugs in and see where that leads me. If you have any suggestions, my ears are wide open.
j
`FargateTaskEnvironment` kicks off an intermediate Fargate task that launches another Fargate task where the flow actually runs. Are you seeing both tasks start successfully, or just the first one?
g
I'm seeing the first one start, and I'm seeing the `run_task` call for the second one return successfully, but I'm not seeing a command or anything. I am going to switch over to a dask cloud provider environment and see where that takes me.
j
Hmmm, that is odd behavior. If you can provide a reproducible example that'd be useful for us to help debug.
g
I've definitely been trying to get a repro for this, but it invariably depends on my environment - a Fargate cluster, server in EC2, etc. I've got things running in my PyCharm debugger now and have isolated the bit of code that is suspect. I'll open a new convo if I am able to pose a concise, answerable question; right now it would be a pain in the ass to support me.
j
Ok, let us know if/when you figure something out. Thanks for looking into this!
g
I have a feeling I just don't have the magic recipe for my deployment model.