
Marc Lipoff

03/17/2022, 3:58 PM
I have an interesting situation. I'm pulling data from an API that returns only 100 items at a time (there are ~200k total). Each response gives me a "next url" to use in the next request, and that URL stays the same as I iterate (I assume it's some sort of stream). Once I have the results, I want to write them to a database. My non-Prefect way is to write a task that pulls from the API until there is no next_url (`while has_next_url: ...`). Is there a better "Prefect" way to do this? One downside of my basic approach is that if there's an error along the way (say, at record 199,999 of 200,000), I lose everything I had.
I should say that what would be even better is if I could write to the database once I have X records (e.g., 10,000).
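A minimal sketch of that plain-Python approach, assuming a hypothetical endpoint, a response with `items` and `next_url` keys, and a hypothetical `write_to_db` helper:

```python
import requests

next_url = "https://api.example.com/items"  # hypothetical endpoint
records = []
while next_url:
    payload = requests.get(next_url).json()
    records.extend(payload["items"])    # assumed response shape
    next_url = payload.get("next_url")  # falsy once the stream is exhausted

write_to_db(records)  # hypothetical helper; a failure mid-loop loses everything
```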

Sylvain Hazard

03/17/2022, 4:03 PM
Haven't been able to use them myself, but Prefect Loops look like what you want. From what I understand, a looping task adds new task runs until a certain condition is met. Depending on the specifics of your database, you could either wait until you have all the elements before pushing them, or push them as you go in every iteration of the loop.
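For reference, the basic mechanics of a Prefect 1.x looping task: raise the `LOOP` signal to schedule another run of the same task, and read the previous iteration's state back out of context. A minimal sketch:

```python
import prefect
from prefect import task
from prefect.engine.signals import LOOP

@task
def count_to(n):
    # Each iteration is a fresh task run; state from the previous
    # run arrives via the LOOP signal's `result` payload.
    i = prefect.context.get("task_loop_result", 0)
    if i < n:
        raise LOOP(message=f"iteration {i}", result=i + 1)
    return i
```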

Marc Lipoff

03/17/2022, 4:04 PM
That's what I was thinking... But how do I push them every iteration and keep the loop going?

Kevin Kho

03/17/2022, 4:06 PM
You can just push them inside the loop task and then pass the payload for the next run… unless I am missing something?
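A sketch of that idea, pulling and writing in one looping task and flushing every `BATCH_SIZE` records; the endpoint, the response keys, and the `write_batch` helper are all hypothetical:

```python
import prefect
import requests
from prefect import task, Flow
from prefect.engine.signals import LOOP

BATCH_SIZE = 10_000
START_URL = "https://api.example.com/items"  # hypothetical endpoint

def write_batch(records):
    """Hypothetical helper: bulk-insert a batch of records into the database."""
    ...

@task
def pull_and_load():
    # State is carried between iterations only via the LOOP result payload.
    state = prefect.context.get(
        "task_loop_result", {"next_url": START_URL, "buffer": []}
    )
    payload = requests.get(state["next_url"]).json()

    buffer = state["buffer"] + payload["items"]  # assumed response shape
    if len(buffer) >= BATCH_SIZE:
        write_batch(buffer)  # push every BATCH_SIZE records
        buffer = []

    if payload.get("next_url"):
        raise LOOP(result={"next_url": payload["next_url"], "buffer": buffer})

    if buffer:
        write_batch(buffer)  # flush the remainder after the final page

with Flow("api-to-db") as flow:
    pull_and_load()
```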

Sylvain Hazard

03/17/2022, 4:07 PM
Was gonna say the same thing. The issue with that is that you can't have the database operations encapsulated in a separate Prefect task.

Marc Lipoff

03/17/2022, 4:07 PM
The API read and the DB write are in the same task? I'd prefer for them to be separate tasks

Kevin Kho

03/17/2022, 4:08 PM
Ah, I would recommend they be in the same task, because you can't loop over 2 tasks; that violates the DAG. The alternative would be something like creating subflows, and that's pretty unwieldy for that many operations

Sylvain Hazard

03/17/2022, 4:08 PM
I'm not sure it's currently possible to trigger downstream task runs for each loop iteration.
If you want to handle a potential failure during the API read loop, you could write the results of each API call to a file and have the DB task run regardless of whether the API loop failed or not.
That way, if you only retrieved 60k items before the stream closed, you can still push those to your database
👍 1
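A sketch of that checkpoint pattern in Prefect 1.x: the fetch loop appends each page to a file, and the load task uses the `all_finished` trigger so it runs even if the loop dies partway through. The file path, endpoint, and response shape are hypothetical:

```python
import json

import prefect
import requests
from prefect import task, Flow
from prefect.engine.signals import LOOP
from prefect.triggers import all_finished

CHECKPOINT = "/tmp/api_records.jsonl"        # hypothetical checkpoint file
START_URL = "https://api.example.com/items"  # hypothetical endpoint

@task
def fetch_to_file():
    next_url = prefect.context.get("task_loop_result", START_URL)
    payload = requests.get(next_url).json()
    with open(CHECKPOINT, "a") as f:
        for record in payload["items"]:  # assumed response shape
            f.write(json.dumps(record) + "\n")
    if payload.get("next_url"):
        raise LOOP(result=payload["next_url"])

@task(trigger=all_finished)  # runs whether the fetch loop succeeded or failed
def load_file_to_db():
    with open(CHECKPOINT) as f:
        records = [json.loads(line) for line in f]
    # bulk-insert `records` into the database (hypothetical)

with Flow("api-to-db-checkpointed") as flow:
    load_file_to_db(upstream_tasks=[fetch_to_file()])
```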