# ask-community
Kevin
Hi - I have a flow that retrieves a list of keys from S3. Each key represents a .zip archive that, when unzipped, contains 5 different CSV files. I am trying to map over those keys and write each file within the archive to Azure Storage. I am having issues understanding how to handle the zipped object, which at times should be unmapped. The working code I have right now only writes out the data associated with the last file in the list
Copy code
with Flow("s3-ingest-azure-load") as flow:
    s3_keys = ven_next_keys(prefix='prefix') # [list, of, keys]
    s3_obj = ven_next_dl.map(key=s3_keys, as_bytes=unmapped(True)) # [list, of, objs]
    zipped_object = convert_to_zip.map(s3_obj) #[objs, as, zip]
    files = create_list_of_files.map(zipped_object) # [[list],[of],[zipInfoObjects]]
    file_name = create_file_name.map(zip_file=flatten(files), upstream_tasks=[unmapped(zipped_object)]) # [list, of, filenames]
    file_data = extract_file_data.map(zip_archive=zipped_object, zip_file=flatten(files)) # [list, of, data] but currently only includes the data associated with the last file
    blob_name = azure_upload.map(data=file_data, blob_name=file_name, overwrite=unmapped(True))
Kevin Kho
Hey @Kevin, do you mean the last file is mapped N times or the mapped task only runs 1 time?
Kevin
The azure_upload.map() task only runs once...
Copy code
[2021-10-25 16:27:23+0000] INFO - prefect.TaskRunner | Task 'BlobStorageUploadOverwrite[0]': Finished task run for task with final state: 'Success'
[2021-10-25 16:27:23+0000] INFO - prefect.FlowRunner | Flow run SUCCESS: all reference tasks succeeded
Kevin Kho
Are all of these maps the same length?
Kevin
no... that's what i have had a tough time thinking through
i have mapped objects that represent the s3 objects (the zipped file - example.zip).
and then i have mapped objects that represent the files within the zipped file (example1.csv, example2.csv, etc.)
so it's a 1:many
i was trying to make it dynamic enough so that i don't have to assume that i will only download 1 s3 object/zip file per flow execution
i'm thinking maybe i should create two distinct flows. one for downloading the s3 object and the other for extracting the files out of that object
Kevin Kho
I’ll give this a shot in a bit
Kevin
Sounds good
Kevin Kho
I am a bit confused here:
file_data = extract_file_data.map(zip_archive=zipped_object, zip_file=flatten(files))
Are those lengths equal for the map?
Kevin
no... zipped_object = a list of zip archives downloaded from s3. and files is a list of files within a single one of those archives
so the length of zipped_object would equal the number of objects we download from s3
and the length of files would be equal to the number of files within one of those objects from s3
Kevin Kho
I think these need to be equal somehow. This is the line where things go wrong right?
Kevin
yes
i guess i could try copying the zipped object so the list is equal in length to the number of files within it
Kevin Kho
Could you try an intermediate task to reshape the inputs to make them equal? I’m working on a flow with similar structure. I just got kinda stuck here 😅
Kevin
do you think that would be better than creating two distinct flows?
one flow to download the s3 object and convert it to a ZipObject... the second flow would accept the ZipObject as a parameter
that way i could keep it unmapped as i work through the list of files within it
Kevin Kho
Yeah I think it can be done in one flow
Kevin
i ended up just creating equal mapping lengths - it worked!
funny workaround but that's okay
Kevin Kho
I think when you map, the calls need to be equal length because each of those pairwise elements is submitted as a task so you’ll run into problems if there is no pair.
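A minimal sketch of that reshape idea (assuming Prefect 1.x; the tasks below are hypothetical stand-ins, not the actual flow's tasks): repeat each archive once per member file so the flattened lists are equal length and line up pairwise.
Copy code
from prefect import Flow, task, flatten

# hypothetical stand-in tasks, just to show the shapes involved
@task
def list_archives():
    return ["archive_a.zip", "archive_b.zip"]            # one element per zip downloaded from s3

@task
def list_files(archive):
    return [f"{archive}/file{i}.csv" for i in range(5)]  # files inside a single archive

@task
def repeat_archive(archive, files):
    # pair the archive with each of its member files so the flattened
    # lists are equal length and line up pairwise in the next map
    return [archive] * len(files)

@task
def extract(archive, file):
    return f"data for {file} from {archive}"

with Flow("reshape-sketch") as flow:
    archives = list_archives()
    files = list_files.map(archives)                                   # [[5 files], [5 files]]
    repeated = repeat_archive.map(archive=archives, files=files)       # [[archive x5], [archive x5]]
    data = extract.map(archive=flatten(repeated), file=flatten(files)) # 10 pairwise mapped runs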
Kevin
Hey @Kevin Kho - revisiting this thread as I've run into an issue. I have a flow that works when I run it locally but runs into this error when I deploy and run it via Prefect Cloud on my Kubernetes cluster:
Copy code
Unexpected error: TypeError("can't pickle _thread.RLock objects")
the task this fails on receives the downloaded bytes from S3 and uses them to initialize and return a ZipFile object
having a really tough time troubleshooting/triaging
Copy code
from io import BytesIO
from zipfile import ZipFile
from prefect import task

@task
def convert_to_zip(s3_obj):
    zipped_file = ZipFile(BytesIO(s3_obj), 'r')
    return zipped_file
Kevin Kho
can you try manually using cloudpickle on the ZipFile object?
Copy code
import cloudpickle
cloudpickle.dumps(ZipFile)
Or the s3_obj
Are you using Dask also?
Kevin
i'm just using KubernetesRun
and yea - i can give that a shot
I have a feeling this is why you wanted me to try that:
Copy code
TypeError: cannot serialize '_io.BufferedReader' object
Kevin Kho
Uhhh… I was expecting the same _thread.RLock lol. But either way, task inputs and outputs have to be serializable by cloudpickle. There is a workaround though. You can store your Flow as a script instead of pickle and you won't need to serialize it.
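For reference, a minimal sketch of what script-based storage could look like (assuming Prefect 1.x and S3 storage; the bucket, key, and paths are placeholders, and the task shown is a stand-in for the real flow):
Copy code
from prefect import Flow, task
from prefect.storage import S3
from prefect.run_configs import KubernetesRun

@task
def say_hello():
    print("hello")  # stand-in task

with Flow("s3-ingest-azure-load") as flow:
    say_hello()

# store the flow as a script: the .py file is uploaded and re-imported at
# runtime, so the Flow object itself is never pickled
flow.storage = S3(
    bucket="my-flow-bucket",                            # placeholder
    key="flows/s3_ingest_azure_load.py",                # placeholder
    stored_as_script=True,
    local_script_path="flows/s3_ingest_azure_load.py",  # placeholder
)
flow.run_config = KubernetesRun()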