Aaron Y

06/12/2020, 5:57 PM
Has anyone worked with preprocessing large videos in a data pipeline?

Jenny

06/12/2020, 6:08 PM
Hi Aaron, and welcome to Prefect. I don't have any experience here, but I'm hoping others in the community will be able to help you.

Alexander Hirner

06/14/2020, 8:40 AM
Hi Aaron, we do. By preprocessing, do you mean transcoding, splitting, or basically any ffmpeg operation?

Aaron Y

06/14/2020, 6:01 PM
@Alexander Hirner that's correct. Right now, converting a 14 GB mp4 to frames, then into a webm, and uploading all that data to S3 is taking forever. Are there any tools with Prefect that could make this process easier?

Alexander Hirner

06/14/2020, 8:09 PM
Usually ffmpeg uses all cores for transcoding. Does that step max out all your cores? Is the mp4 file already on a bucket or distributed file system? And does uploading max out your connection bandwidth?

Aaron Y

06/14/2020, 8:56 PM
It's in Google Cloud Storage, so the current process is to download it to my local machine, run everything here, and then upload it to S3.

Alexander Hirner

06/15/2020, 9:57 AM
That sounds like a great basis for parallelizing. We start ffmpeg tasks directly on signed GCS HTTPS URLs and hence save one round trip. If download is a bottleneck, the tricky part would be to parallelize transcoding and keep it in the cloud. Either you seek into chunks deterministically, or you pre-chunk into 15-60 min files before upload. We do the latter to avoid any gaps and overlaps. If pre-chunked, each video I/O task could be a Prefect task.
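
A rough sketch of the pre-chunking idea, in case it helps. It only builds the ffmpeg command lines, so nothing here is our actual pipeline: the function names, the `chunk_%04d` naming, the 900 s default, and the VP9/Opus codec choice are all assumptions for illustration. `source` can be a local path or a signed HTTPS URL, since ffmpeg reads both.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List


def segment_command(source: str, out_dir: str, chunk_seconds: int = 900) -> List[str]:
    """Build an ffmpeg command that splits `source` into chunks without re-encoding.

    With stream copy, cuts land on keyframes, so chunk lengths are approximate.
    """
    pattern = f"{out_dir}/chunk_%04d.mp4"  # hypothetical naming scheme
    return [
        "ffmpeg", "-i", source,
        "-c", "copy", "-map", "0",          # copy all streams, no transcoding
        "-f", "segment",                    # ffmpeg's segment muxer
        "-segment_time", str(chunk_seconds),
        "-reset_timestamps", "1",           # each chunk starts at t=0
        pattern,
    ]


def transcode_command(chunk: str, out_dir: str) -> List[str]:
    """Build an ffmpeg command that transcodes one chunk to webm (VP9 + Opus assumed)."""
    stem = PurePosixPath(chunk).stem
    target = f"{out_dir}/{stem}.webm"
    return ["ffmpeg", "-i", chunk, "-c:v", "libvpx-vp9", "-c:a", "libopus", target]


def run(cmd: List[str]) -> None:
    """Execute one command and raise if ffmpeg exits with a non-zero status."""
    subprocess.run(cmd, check=True)
```

Each `run(transcode_command(...))` call is independent of the others, so it is a natural candidate for wrapping in a function decorated with Prefect's `task` and mapping over the list of chunk paths. The upload to S3 could be one more task per chunk.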