
Andrey Tatarinov

12/11/2020, 7:02 PM
Hi! So I have an external API with pagination: each call returns some results plus a link to the next page of results. The total results (all chunks combined) are huge and do not fit in memory. Ideally I would make the task that pulls that API a generator, so that I can start mapping its results before the pull is complete. Is it possible to achieve that in Prefect?

nicholas

12/11/2020, 7:12 PM
Hi @Andrey Tatarinov - you should check out task Looping; I think that will fit your use case nicely: you want to process your generator's results one segment at a time, but you don't know how many there will be and don't want to fetch them all upstream and hold them in memory.
👍 1
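A minimal sketch of that looping pattern with Prefect's LOOP signal (the 0.x API at the time of this thread); the response fields `items` and `next` are assumptions about the external API, not anything Prefect-specific:
```
import requests

import prefect
from prefect import task
from prefect.engine.signals import LOOP


@task
def fetch_all_pages(start_url: str):
    # On the first run task_loop_result is empty; on later runs it carries
    # the accumulated results and the link to the next page.
    loop_payload = prefect.context.get("task_loop_result", {})
    url = loop_payload.get("next_url", start_url)
    results = loop_payload.get("results", [])

    page = requests.get(url).json()
    results.extend(page["items"])   # assumed response shape
    next_url = page.get("next")     # assumed link to the next page

    if next_url:
        # Re-run this same task with the accumulated state instead of returning.
        raise LOOP(message=f"Fetching {next_url}",
                   result={"next_url": next_url, "results": results})

    # No more pages: the full list becomes this task's result.
    return results
```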

Andrey Tatarinov

12/11/2020, 7:13 PM
@nicholas so do I understand correctly that I can do task looping in a "generator" task and
.map
its results?

nicholas

12/11/2020, 7:17 PM
That's correct @Andrey Tatarinov - you could loop that generator task, adding to the results until you're satisfied, and then map over the results you've generated.
👍 1
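A short sketch of how the looped task above could feed a mapped task (the flow and task names are illustrative):
```
from prefect import Flow, task


@task
def process_item(item):
    # Per-item work; runs once for each element of the fetched list.
    return item


with Flow("paginated-api") as flow:
    items = fetch_all_pages("https://api.example.com/items")  # looped task from the sketch above
    processed = process_item.map(items)
```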

Andrey Tatarinov

12/11/2020, 7:17 PM
Thanks!
😄 1
@nicholas what if I have a Python generator that cannot survive interruption?

nicholas

12/11/2020, 7:29 PM
In other words, you can't pass it between tasks (or between iterations of a task)?

Andrey Tatarinov

12/11/2020, 7:30 PM
yes
I mean in this specific case
raise LOOP
works, but there are other cases that break
it would be nicer if Prefect could handle a
-> Generator
return type annotation, for example

nicholas

12/11/2020, 7:33 PM
In cases like that, you don't need to leave the task or pass the generator to another task; you can always do your looping Pythonically and collect the results as normal before passing them along to the next task. You lose a little visibility into where something might break down, but not much.
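A sketch of that, assuming the generator is an ordinary Python function (`parse_records` here is a stand-in for the real one):
```
from prefect import task


def parse_records(path):
    # Stand-in for the real generator (streaming an API, parsing a file, ...).
    with open(path) as f:
        for line in f:
            yield line.strip()


@task
def consume_records(path: str):
    # The generator never leaves the task, so it never needs to survive
    # serialization or a LOOP restart; just collect what downstream tasks need.
    return [record.upper() for record in parse_records(path)]
```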

Andrey Tatarinov

12/11/2020, 7:35 PM
collect the results
That works as long as it all fits in memory.
My other case is parsing an extremely huge XML file: all the data we need, loaded into memory, exceeds tens of gigabytes. It would be nice to be able to break it into chunks that are processed separately.
Huge XMLs are a common thing in the e-commerce world; a Google products feed or a Yandex Market XML can be really huge.

nicholas

12/11/2020, 10:18 PM
Sorry for the slow reply over here @Andrey Tatarinov - this is definitely doable within a task. One approach I've taken is to loop through the file in an upstream task (Python does a great job of reading files line by line, or however you need to) to generate a list of indexes. That way you know how much mapping you need to do, and you can pass that list downstream to your mapped task, which then reads the file at those indexes. There are other considerations I'm sure you're thinking of, but that would let the file be read into memory only in manageable pieces.
I think there are lots of ways you can operate on the file system instead of in memory, the same way you might in a non-prefect python application.
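A sketch of that index-then-map idea, assuming records can be located by a start tag; the `<offer` tag, file name, and per-record parsing are placeholders:
```
from prefect import Flow, task, unmapped


@task
def find_record_offsets(path: str, tag: bytes = b"<offer") -> list:
    # Scan the file once without loading it all, recording where each record starts.
    offsets = []
    with open(path, "rb") as f:
        pos = f.tell()
        for line in f:
            if tag in line:
                offsets.append(pos)
            pos += len(line)
    return offsets


@task
def parse_record_at(path: str, offset: int):
    # Each mapped run seeks straight to its record and reads only that chunk.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline()  # stand-in for real per-record XML parsing


with Flow("huge-xml-feed") as flow:
    offsets = find_record_offsets("feed.xml")
    records = parse_record_at.map(path=unmapped("feed.xml"), offset=offsets)
```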

Andrey Tatarinov

12/11/2020, 10:40 PM
@nicholas thanks, I think I understand the general approach now; I'll try it out

nicholas

12/11/2020, 10:41 PM
Let me know how it goes! Happy to help as you have more questions 🙂