
Stephane Boisson

02/17/2020, 10:13 PM
What would be the best practice/pattern for implementing a web crawler with Prefect? Using the LOOP signal to recurse until some kind of depth limit?

Jeremiah

02/17/2020, 10:16 PM
Without knowing the details, here is a rule of thumb for deciding how to iterate in Prefect: if the inputs are possibly unbounded, we recommend using LOOP. For example, if you’re walking each page of a news aggregator or comments section and don’t know how many there are in advance. If, on the other hand, the input is bounded and known/knowable, we recommend using a map over a list. For example, use a task to generate all the URLs you need to check, then map over them. And if you’re not sure, prefer map to loop.
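To make the two shapes concrete, here is a minimal sketch in plain Python. It deliberately does not use Prefect's API: in Prefect the first function would be a task that re-runs itself via the LOOP signal and the second would be a mapped task. `PAGES`, `walk_pages`, `check`, and `check_all` are hypothetical names made up for illustration.

```python
# Hypothetical paginated source: each page names the next one, or None at the end.
PAGES = {
    "page-1": {"items": ["a", "b"], "next": "page-2"},
    "page-2": {"items": ["c"], "next": "page-3"},
    "page-3": {"items": ["d", "e"], "next": None},
}


def walk_pages(start):
    """Loop style: the number of iterations is only discovered while running."""
    items, page = [], start
    while page is not None:
        items.extend(PAGES[page]["items"])
        page = PAGES[page]["next"]  # state carried into the next iteration
    return items


def check(url):
    """Placeholder for the real per-URL work."""
    return url.startswith("https://")


def check_all(urls):
    """Map style: the input list is known up front, one independent call per element."""
    return [check(url) for url in urls]


looped = walk_pages("page-1")
mapped = check_all(["https://example.com", "http://example.com"])
```

In the loop version nothing outside the function can tell in advance how many pages will be visited; in the map version every step is visible as soon as the list exists.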

Stephane Boisson

02/17/2020, 10:30 PM
Thanks for the answer. It sounds like it would be a combination of a loop to gather the URLs and then a map over the list of URLs. Is it possible to map over a mutable list?

Jeremiah

02/17/2020, 10:31 PM
I don’t want to push you too strongly to use map if it isn’t appropriate. The reason we generally prefer it is just that working with a flow is easier if each step is known, and looping implicitly hides steps from you (and is slightly more complex to code). However, if looping fits your use case, go for it.
As for a mutable list: your map won’t start until the upstream task finishes, so once you produce the list to map over, it is fixed as far as Prefect is concerned. This is why mapping is only appropriate if you can figure out the bounded input in advance.
In other words, once you produce the mapped input list and pass it to another task to be mapped over, there isn’t an opportunity to change it again.
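A rough sketch of that gather-then-map shape for a crawler, again in plain Python rather than Prefect's API. The link graph `LINKS` stands in for real HTTP fetches, and `gather_urls`, `process`, and the `tuple` snapshot are assumptions used only to illustrate that the mapped input is fixed once the upstream step has finished.

```python
# Hypothetical link graph standing in for real HTTP fetches.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/a"],
    "/c": ["/d"],
    "/d": [],
}


def gather_urls(root, max_depth):
    """Breadth-first walk that stops after max_depth hops; each URL appears once."""
    seen = [root]
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen


def process(url):
    """Placeholder for the real per-page work."""
    return url.upper()


urls = gather_urls("/", max_depth=2)
snapshot = tuple(urls)   # the map input is fixed from this point on
urls.append("/late")     # a later change to the original list...
results = [process(u) for u in snapshot]  # ...is never seen by the map step
```

With `max_depth=2` the walk stops before reaching `/d`, and the late append does not show up in `results`, which mirrors the point that there is no opportunity to change the list once it has been handed to the mapped step.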

Stephane Boisson

02/17/2020, 10:41 PM
Thanks for the clarification. Map looks more elegant to me too.