
Andrey Tatarinov

12/11/2020, 7:02 PM
Hi! So I have an external API with pagination: each call returns some results plus a link to the next page of results. The total results (all chunks combined) are huge and do not fit in memory. Ideally I would make the task that pulls that API a generator, so that I can start mapping its results before the pull is complete. Is it possible to achieve that in Prefect?

nicholas

12/11/2020, 7:12 PM
Hi @Andrey Tatarinov - you should check out task Looping; I think that will fit your use case nicely: you want to process your generator's results one segment at a time, but you don't know how many there will be and don't want to fetch them all upstream and hold them in memory.
👍 1
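A minimal sketch of that looping pattern with Prefect's LOOP signal (the 0.x API at the time of this thread); the response fields `items` and `next` are assumptions about the external API, not anything Prefect-specific:
```
import requests

import prefect
from prefect import task
from prefect.engine.signals import LOOP


@task
def fetch_all_pages(start_url: str):
    # On the first run task_loop_result is empty; on later runs it carries
    # the accumulated results and the link to the next page.
    loop_payload = prefect.context.get("task_loop_result", {})
    url = loop_payload.get("next_url", start_url)
    results = loop_payload.get("results", [])

    page = requests.get(url).json()
    results.extend(page["items"])   # assumed response shape
    next_url = page.get("next")     # assumed link to the next page

    if next_url:
        # Re-run this same task with the accumulated state instead of returning.
        raise LOOP(message=f"Fetching {next_url}",
                   result={"next_url": next_url, "results": results})

    # No more pages: the full list becomes this task's result.
    return results
```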

Andrey Tatarinov

12/11/2020, 7:13 PM
@nicholas so do I understand correctly that I can do task looping in a "generator" task and
.map
its results?

nicholas

12/11/2020, 7:17 PM
That's correct @Andrey Tatarinov - you could loop that generator task, adding to the results until you're satisfied, and then map over the results you've generated.
👍 1
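A short sketch of how the looped task above could feed a mapped task (the flow and task names are illustrative):
```
from prefect import Flow, task


@task
def process_item(item):
    # Per-item work; runs once for each element of the fetched list.
    return item


with Flow("paginated-api") as flow:
    items = fetch_all_pages("https://api.example.com/items")  # looped task from the sketch above
    processed = process_item.map(items)
```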

Andrey Tatarinov

12/11/2020, 7:17 PM
Thanks!
😄 1
@nicholas what if I have a Python generator that cannot survive interruption?

nicholas

12/11/2020, 7:29 PM
In other words, you can't pass it between tasks (or between iterations of a task)?

Andrey Tatarinov

12/11/2020, 7:30 PM
yes
I mean in this specific case
raise LOOP
works, but there are other cases that break
it would be nicer if Prefect could handle a
-> Generator
return type annotation, for example

nicholas

12/11/2020, 7:33 PM
In cases like that, you don't need to leave the task or pass the generator to another task; you can always do your looping Pythonically and collect the results as normal before passing them along to the next task. You lose a little visibility into where something might break down, but not much.
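A sketch of that, assuming the generator is an ordinary Python function (`parse_records` here is a stand-in for the real one):
```
from prefect import task


def parse_records(path):
    # Stand-in for the real generator (streaming an API, parsing a file, ...).
    with open(path) as f:
        for line in f:
            yield line.strip()


@task
def consume_records(path: str):
    # The generator never leaves the task, so it never needs to survive
    # serialization or a LOOP restart; just collect what downstream tasks need.
    return [record.upper() for record in parse_records(path)]
```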

Andrey Tatarinov

12/11/2020, 7:35 PM
collect the results
That works as long as it all fits in memory.
My other case is parsing an extremely huge XML file: all the data we need, loaded into memory, exceeds tens of gigabytes. It would be nice to be able to break it into chunks that are processed separately.
Huge XMLs are a common thing in the e-commerce world; a Google products feed or a Yandex Market XML can be really huge.

nicholas

12/11/2020, 10:18 PM
Sorry for the slow reply over here @Andrey Tatarinov - this is definitely doable within a task. One approach I've taken is to loop through the file in an upstream task (Python does a great job of reading files line by line, or however you need to) to generate a list of indexes. That way you know how much mapping you need to do, and you can pass that list downstream to your mapped task, which then reads the file at those indexes. There are other considerations I'm sure you're thinking of, but that would let the file be read into memory only in manageable pieces.
I think there are lots of ways you can operate on the file system instead of in memory, the same way you might in a non-prefect python application.
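A sketch of that index-then-map idea, assuming records can be located by a start tag; the `<offer` tag, file name, and per-record parsing are placeholders:
```
from prefect import Flow, task, unmapped


@task
def find_record_offsets(path: str, tag: bytes = b"<offer") -> list:
    # Scan the file once without loading it all, recording where each record starts.
    offsets = []
    with open(path, "rb") as f:
        pos = f.tell()
        for line in f:
            if tag in line:
                offsets.append(pos)
            pos += len(line)
    return offsets


@task
def parse_record_at(path: str, offset: int):
    # Each mapped run seeks straight to its record and reads only that chunk.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline()  # stand-in for real per-record XML parsing


with Flow("huge-xml-feed") as flow:
    offsets = find_record_offsets("feed.xml")
    records = parse_record_at.map(path=unmapped("feed.xml"), offset=offsets)
```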

Andrey Tatarinov

12/11/2020, 10:40 PM
@nicholas thanks, I think I understand the general approach now; I'll try it out

nicholas

12/11/2020, 10:41 PM
Let me know how it goes! Happy to help as you have more questions 🙂