# prefect-server
d
is the major scaling problem for concurrent flows that the Prefect server needs to be available to receive checkpointing data and store it?
c
Hi DJ - big question! Let me try to answer it concisely without getting too in the weeds. Both Server and Cloud provide an API that drives the UI as well as all workflow operations (setting states, sending logs, updating configuration settings, releasing work at the right time, etc.). There is additionally a lot that happens behind the API, both when it's in use and when it's not. For example, in both Server and Cloud there is a "Zombie Killer" service that is constantly monitoring for Running tasks / flows that have stopped talking to the API. Cloud of course has more of these services and hooks than Server, but the idea is the same - providing a monitoring / insurance platform for your workflows.

Prefect Server specifically has a scaling limit because every request that hits the API requires a database query. This means that every time you open the UI, every time you send a log or a state, etc., you are talking directly to the database. Cloud, on the other hand, has much more caching + horizontal and vertical scaling built in, so it can essentially scale to infinity.
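The "Zombie Killer" idea described above can be sketched roughly as follows. This is a hedged illustration only, not Prefect's actual internals: the run dicts, field names, and `HEARTBEAT_TIMEOUT` threshold are all made up for the example. The point is simply that any Running run whose last heartbeat is older than some threshold gets failed.

```python
from datetime import datetime, timedelta

# Illustrative threshold - not a real Prefect setting
HEARTBEAT_TIMEOUT = timedelta(minutes=2)

def reap_zombies(runs, now):
    """Mark Running runs with stale heartbeats as Failed; return their ids."""
    zombies = []
    for run in runs:
        if run["state"] == "Running" and now - run["last_heartbeat"] > HEARTBEAT_TIMEOUT:
            run["state"] = "Failed"  # no heartbeat -> presumed dead
            zombies.append(run["id"])
    return zombies

now = datetime(2021, 1, 1, 12, 0)
runs = [
    {"id": "flow-run-1", "state": "Running", "last_heartbeat": now - timedelta(seconds=30)},
    {"id": "flow-run-2", "state": "Running", "last_heartbeat": now - timedelta(minutes=10)},
]
print(reap_zombies(runs, now))  # ['flow-run-2']
```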
d
Awesome, thanks for the quick and detailed reply. The daemon processes running on the server, as far as I can tell:
1. Scheduling clocks
2. Zombie Killer -> assuming the number of concurrent flows doesn't adversely affect performance for this guy significantly

And then the non-daemon requests that scale with the number of flows:
1. Receive states from running flows
2. Receive logs
and the scheduler is what is “releasing work”?
For Prefect Server, are logs stored in the Postgres server, or can we push those to S3?
c
Yea, there is also a Lazarus process that is not really affected by concurrent flow runs. Prefect Agents work on a polling model: they make an API request for work, and the response is received + logic is run that determines which flow runs should be run by the particular agent making the request. And logs are stored in Postgres for Server; you can move them to S3 for sure, but you'll have work to do if you want to see them in the UI (because the UI is querying the database for those logs). Also note that the API is not a passive receiver of states - whenever a state is set, other logic kicks into action (e.g., "if the flow run is finished, don't let task runs enter Running states").
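The polling model mentioned above can be sketched like this. Everything here is a stand-in, not real Prefect API: the `queue` list, the `labels` matching rule, and the function names are invented for illustration. The takeaway is that the agent pulls work it can satisfy rather than the server pushing work to it.

```python
def fetch_ready_runs(queue, agent_labels):
    """An agent only picks up runs whose labels it can satisfy (illustrative rule)."""
    return [r for r in queue if set(r["labels"]) <= set(agent_labels)]

def poll_once(queue, agent_labels):
    """One polling cycle: ask for work, claim it, return the claimed run ids."""
    ready = fetch_ready_runs(queue, agent_labels)
    for run in ready:
        queue.remove(run)  # in a real agent, hand off to the execution environment here
    return [r["id"] for r in ready]

queue = [
    {"id": "run-a", "labels": ["gpu"]},
    {"id": "run-b", "labels": []},
]
print(poll_once(queue, agent_labels=["docker"]))  # ['run-b']
```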
d
lazarus -> retry mechanism?
ok interesting to know that the Prefect agents are actually polling for work
so the server itself has a GraphQL interface that then interacts with Hasura to make modifications to the DB?
sorry, I'm trying to crash-acquaint myself with the project 😞
c
Lazarus identifies flow runs that haven't completed for some reason and reschedules them; yea, the whole system was designed so that metadata / orchestration happens 100% separately from the execution of the workflows. This is largely a security feature but also allows for diverse execution environments. (see https://medium.com/the-prefect-blog/the-prefect-hybrid-model-1b70c7fd296)
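The Lazarus behavior described above ("identify stuck flow runs and reschedule them") can be sketched as follows. This is an assumption-laden illustration, not Prefect source: the states checked, the `last_activity` field, and the `STUCK_AFTER` threshold are all invented for the example.

```python
from datetime import datetime, timedelta

# Illustrative threshold - not a real Prefect setting
STUCK_AFTER = timedelta(minutes=10)

def lazarus_pass(flow_runs, now):
    """Reschedule flow runs that claim to be in-flight but show no recent activity."""
    resurrected = []
    for run in flow_runs:
        stuck = (
            run["state"] in ("Submitted", "Running")
            and now - run["last_activity"] > STUCK_AFTER
        )
        if stuck:
            run["state"] = "Scheduled"  # back on the queue for an agent to pick up
            resurrected.append(run["id"])
    return resurrected

now = datetime(2021, 1, 1, 9, 0)
flow_runs = [
    {"id": "healthy", "state": "Running", "last_activity": now - timedelta(minutes=1)},
    {"id": "stuck", "state": "Submitted", "last_activity": now - timedelta(hours=1)},
]
print(lazarus_pass(flow_runs, now))  # ['stuck']
```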
d
do you guys have an architecture diagram or flow diagram for how these processes interact with each other? Understand if you don't, but it would definitely help me wrap my head around things
c
We do, but that's more on the sales side of the house, so I recommend emailing us at hello@prefect.io for a deeper dive
d
👍 thx will do