mot s

    2 years ago
    Hi Prefect Community, I just started going over the prefect docs, blogs, and videos. We are a big on-premise data company (in a highly regulated environment) trying to see if prefect is a good fit for us. We are a big pyspark shop on CDH (Cloudera's hadoop distro) and use ML (h2o). We're looking at prefect to help us with mainly 2 things: 1. document data and state dependencies across our pyspark code, and 2. orchestrate pyspark-based flows. Reading up on prefect, it seems like it can do both, but I'm a little lost on the architecture of how it would be deployed. Basic questions: would the prefect agent run on the edge (hadoop gateway servers) and the prefect backend server on a K8s cluster? How would flow orchestration work? Would the agent be kerberos-aware? Sorry for all the basic questions - I could not find info on deploying prefect in an on-prem setup with Spark as the execution engine.
    Jim Crist-Harif

    2 years ago
    Hi @mot s, prefect currently has no hadoop or spark integration, so to speak. There's nothing about prefect's architecture that would prevent it from running as part of your hadoop cluster, but that integration work hasn't been done yet.
    would prefect agent be running on the edge (hadoop gateway servers) and prefect backend server on K8s cluster
    You could run the backend and agents wherever it makes sense for you to do so. It sounds like you want your flows to run on the edge node, so you'd want to have the agent run there as well. If you're running prefect server, you just need to have it running somewhere where both the agent and the running flows can access its api.
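    To make that concrete, here's a hedged sketch of pointing an edge-node agent at a remotely hosted Prefect Server. It assumes a Prefect ~0.14/1.x-era install; the `prefect-server.internal` hostname is hypothetical, and exact key names can differ between versions, so check the config reference for yours:

```
# ~/.prefect/config.toml on the edge node (and anywhere flows run).
# Hypothetical hostname; these are the server host/port settings used
# when the backend is switched to server mode (`prefect backend server`).
backend = "server"

[server]
host = "http://prefect-server.internal"
port = 4200
```

    With this in place, an agent started on the gateway server and the flows it launches would both resolve the same server API.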
    How would flow orchestration work
    I suggest walking through the tutorial (https://docs.prefect.io/orchestration/tutorial/configure.html) and if you have further questions asking something more specific.
    Would agent be kerberos aware
    Prefect server has no authentication model, so there's no auth to deal with there. If you're running prefect cloud, we use our own authentication/authorization model and don't plug in to kerberos. That said, there are definitely ways to have a prefect flow run with a valid hadoop delegation token so you can kick off a yarn job on your cluster. And a yarn-backed agent is within scope for prefect (we just haven't written one yet).
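    To illustrate that last point: a flow task running on the edge node can simply shell out to `spark-submit`, relying on the Kerberos ticket cache of the agent process (e.g. from a prior `kinit -kt` against a keytab). A minimal sketch, in plain Python with no Prefect imports; all names and defaults here are illustrative, not a supported integration:

```python
import subprocess

def build_spark_submit(app, master="yarn", deploy_mode="cluster",
                       conf=None, app_args=()):
    """Assemble a spark-submit command line for a YARN-backed cluster."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd += list(app_args)
    return cmd

def run_job(app, **kwargs):
    """Kick off the job; inside a Prefect task the body would look the same.

    The child process inherits the Kerberos credential cache of the agent
    process, so spark-submit can obtain HDFS/YARN delegation tokens as it
    normally would (assumes a prior `kinit`, e.g. from a keytab).
    """
    cmd = build_spark_submit(app, **kwargs)
    result = subprocess.run(cmd, check=True)
    return result.returncode
```

    A task could then call `run_job("wordcount.py", conf={"spark.executor.memory": "4g"})`; a failed submit raises `CalledProcessError`, which the orchestration layer can translate into a failed task state.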
    mot s

    2 years ago
    Thank you, Jim. I will review these links.