best-practices-coordination-plane
  • k

    Kevin Kho

    04/26/2022, 6:14 PM
    set the channel description: Want to understand the best way to use Prefect? Looking for examples of the code patterns people are using? Look no further!
  • a

    Anna Geller

    04/26/2022, 6:29 PM
    Welcome, everyone! πŸ‘‹ Excited to chat about best practices in workflow orchestration, especially as we continue to iterate on Prefect 2.0 to be the best workflow orchestration engine everyone will love interacting with! :prefect2:
    πŸ™Œ 6
  • s

    Sash Stasyk

    04/27/2022, 1:42 PM
    Hey, is there a way to limit concurrency on a map call?
    k
    • 2
    • 5
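    One approach, as a minimal sketch assuming Prefect 1.x: cap mapped-task parallelism through the executor's worker count (task tags additionally enable server-side concurrency limits, a Prefect Cloud feature):
    from prefect import Flow, task
    from prefect.executors import LocalDaskExecutor

    @task(tags=["db"])  # tag-based concurrency limits apply in Prefect Cloud
    def process(x):
        return x * 2

    with Flow("capped-map") as flow:
        results = process.map(list(range(100)))

    # at most 4 mapped runs execute at once on this executor
    flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)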
  • c

    chicago-joe

    04/27/2022, 7:13 PM
    What is the difference between the customer success recipe for EKS deployment vs EKS on Fargate as shown here?
    a
    • 2
    • 3
  • a

    Alvaro DurΓ‘n Tovar

    04/29/2022, 9:19 AM
    hi! I would like to understand a bit more how people are calling the code for their flows. So far what we are doing is having one repo with many flows, plus a build script that creates a single Docker image, assigns that single Docker storage to all flows (basically for flow in flows: flow.storage = docker), and registers the flows. This lets us easily reuse the same image, dependencies, etc. Now I've started a new project integrating Feast and I want some flows to materialize the views. That means, I guess, having one flow per thing I want to materialize, then following the same pattern as above but in a different project. It starts to be a bit too much repetition (although it works well, tbh). Are other people doing something similar? How do you deal with distributing the code for your flows? I can only find approaches using Docker images, because the flows often have dependencies on other Python modules.
    βž• 1
    a
    • 2
    • 10
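    A minimal sketch of that shared-image pattern, assuming Prefect 1.x Docker storage; the registry URL, image name, and flows module are hypothetical:
    from prefect.storage import Docker
    from my_project.flows import flows  # hypothetical list of Flow objects

    storage = Docker(
        registry_url="my.registry.example.com",
        image_name="shared-flows",
        python_dependencies=["pandas", "feast"],
    )

    for flow in flows:
        storage.add_flow(flow)
        flow.storage = storage

    storage = storage.build()  # build and push the shared image once

    for flow in flows:
        flow.register(project_name="my-project", build=False)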
  • b

    Bernardo Galvao

    04/29/2022, 10:02 AM
    Hi, I would like to fish for a conceptual clarification and best practices around CI/CD in ML. It seems to me that there is a functional overlap between GitLab CI/CD and Prefect, and I have to conceptualize some sort of Continuous Integration and Continuous Delivery for machine learning, which I could put into Prefect dataflows.
    β€’ As I understand it, data is better passed between Prefect tasks.
      β—¦ This would make Prefect a better candidate for running data and model validation tests.
    β€’ GitLab CI/CD is designed to test code.
      β—¦ I am not sure if I should use it to run data and model validation tests.
      β—¦ I think it has its place in integrating and delivering Prefect code.
    I am slightly confused about whether:
    1. GitLab CI/CD would end up testing the same things as Prefect would at some point
    2. I can do without GitLab CI/CD
    It is not clear how to use one or the other specifically.
    a
    m
    • 3
    • 5
  • j

    John Jacoby

    05/03/2022, 11:48 PM
    Hi all, this is a great channel idea. I came here specifically to ask a best practices question and the first thing I saw was the announcement about this new channel! I'm wondering if anyone else has been thinking about the best practice for tasks that produce and/or consume file paths. Without going into too much detail, each input into my main flow comes with a unique ID. This ID is used by each task to construct file paths for the task's persistent outputs. The issue came when I realized that I needed the path constructed by one task in another task down the line. I can go edit the upstream task to return the needed path, but then I need to re-run all the mapped iterations of that lengthy task just to return a file path, which on its own is a very small and quick operation. I can think of a few ways to get around this and I'm wondering if one of them is considered standard or best practice:
    1. Don't bother passing the paths down the flow and just re-construct the required file paths in each individual task.
    2. Have one task at the start of the flow that constructs quick metadata that all the other tasks can use. That way, if I need a new file path, I just have to re-run this quick task.
    3. Same as 2, but write the paths and other metadata to a persistent file like a JSON instead of passing it down the flow. Other tasks can then read from this JSON.
    a
    • 2
    • 1
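    A minimal sketch of option 2, assuming Prefect 1.x; the directory layout and path names are hypothetical:
    from pathlib import Path
    from prefect import Flow, task

    @task
    def build_paths(input_id: str) -> dict:
        # quick, cheap task: centralizes every path derived from the ID
        root = Path("/data") / input_id
        return {"raw": root / "raw.dat", "processed": root / "processed.dat"}

    @task
    def heavy_step(paths: dict):
        ...  # reads paths["raw"], writes paths["processed"]

    with Flow("paths-upfront") as flow:
        paths = build_paths("id-001")
        heavy_step(paths)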
  • e

    Edmondo Porcu

    05/04/2022, 3:03 AM
    Hello everyone! I am trying to write a task that returns another task, like so:
    def my_task(param1, param2):
        return NewTaskSomething(param1, param2)

    and then in the flow...
    my_task_instance = my_task(param1, param2)
    my_task_instance(param3)
    However, this fails, saying that param1 and param2 are not specified. Maybe in reality the my_task function should not be decorated with the @task decorator?
    k
    • 2
    • 21
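    A minimal sketch of that last idea, assuming Prefect 1.x: make the factory a plain function (not a @task), so the inner task is instantiated at flow-build time; NewTaskSomething here is a stand-in subclass:
    from prefect import Flow, Parameter, Task

    class NewTaskSomething(Task):
        def __init__(self, param1, param2, **kwargs):
            self.param1, self.param2 = param1, param2
            super().__init__(**kwargs)

        def run(self, param3):
            return (self.param1, self.param2, param3)

    def make_task(param1, param2):
        # plain Python factory; deliberately NOT decorated with @task
        return NewTaskSomething(param1, param2)

    with Flow("factory") as flow:
        param3 = Parameter("param3")
        my_task_instance = make_task("a", "b")  # runs at build time
        my_task_instance(param3)                # adds the task to the flow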
  • e

    Edmondo Porcu

    05/04/2022, 5:49 PM
    Hello, it seems like the documentation for DatabricksSubmitRun is obsolete. It suggests passing the connection string as a string-encoded JSON using PrefectSecret, but the class only accepts a dict.
    k
    • 2
    • 49
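    A minimal sketch of the dict-based usage, assuming the Prefect 1.x task library; the secret name and job spec are hypothetical:
    from prefect import Flow
    from prefect.tasks.databricks import DatabricksSubmitRun
    from prefect.tasks.secrets import PrefectSecret

    submit = DatabricksSubmitRun(
        json={"existing_cluster_id": "0123-456789-abc123",
              "notebook_task": {"notebook_path": "/my/notebook"}}
    )

    with Flow("databricks") as flow:
        # the secret should resolve to a dict, e.g. {"host": ..., "token": ...}
        conn = PrefectSecret("DATABRICKS_CONNECTION_STRING")
        submit(databricks_conn_secret=conn)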
  • w

    William Jamir

    05/04/2022, 7:53 PM
    Hello πŸ™‚ I have one generic task that does some data processing; it accepts a few inputs that determine which customer I will act on, plus some configuration for that customer. Now I want to improve the visualization of these tasks: I want a pre-defined configuration for each customer, and a filter to visualize the tasks for a specific customer. My initial thought (and currently the way to go) is to create multiple flows, one for each customer, to have pre-defined inputs and a dashboard for visualization of these tasks. Is this the way to go? Is there any other option I could evaluate? I was hoping for a solution where I can use only one flow and have a dashboard with breakdowns.
    k
    • 2
    • 1
  • e

    Edmondo Porcu

    05/05/2022, 1:18 AM
    Still coming back to Databricks, and I guess to Tasks in general. Databricks has added Git support for jobs, and the current DatabricksSubmitMultitaskRun doesn't support it. I am in doubt among the possible approaches:
    β€’ Create a custom DatabricksSubmitMultitaskRun implementation
    β€’ Use the databricks CLI Python library to create a job and then just run it via Prefect
    β€’ Other?
    The real problem is that the Task does not allow dependency injection (i.e. the Databricks client is created within the run function, so it's not easy to override it). I guess the design of the Task is concerning in the sense that it is not extensible; one needs to rewrite it from scratch.
    k
    • 2
    • 2
  • r

    Ramzi A

    05/06/2022, 12:55 AM
    I figured out a way to use Prefect 1 with k8s that scales well; with the introduction of 2 I am having trouble figuring out how to run Kubernetes on AWS and how to scale it. My initial method was using Docker containers for Python dependencies for each flow, which worked well, along with having GitHub Actions auto-deploy my flows as I add them and pull Docker images from ECR. I don't think I can adopt the same format with 2.0. Has anyone successfully built out a CI/CD with k8s on Prefect 2.0?
    a
    • 2
    • 4
  • t

    Tyler Matteson

    05/06/2022, 7:27 PM
    Hi folks, I'm new here. I am looking to use Prefect as an orchestrator and primary development focus, but I am also interested in leveraging work that has already been done in Singer.io, and the Meltano runner specifically. I'm choosing Meltano because my team has a background in Python and Vue, not in Airbyte's stack of Java and React. The first level of understanding is how I might/should pass and retrieve data between Prefect and Meltano without a direct integration (perhaps with a call to subprocess). What should I know? What am I not asking? The tasks are a mix of polling, on-demand ETL, and scheduled ETL. I think most data professionals would describe the load as "not much", so efficiency is going to take a back seat to maintainability and ease of use.
    a
    • 2
    • 21
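    A minimal sketch of the subprocess route, assuming Prefect 1.x and a Meltano project available on the agent's machine; the project path, tap, and target are hypothetical:
    from prefect import Flow
    from prefect.tasks.shell import ShellTask

    # helper_script runs before each command; return_all captures every output line
    run_elt = ShellTask(helper_script="cd /opt/meltano-project", return_all=True)

    with Flow("meltano-elt") as flow:
        logs = run_elt(command="meltano elt tap-postgres target-snowflake")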
  • e

    Edmondo Porcu

    05/07/2022, 1:51 AM
    Quick question about Prefect. I had a with Flow('...') as flow: in the body of the script. However, since there are parameters that are set from environment variables in the main block, like so, that was causing an exception:
    if __name__ == '__main__':
        flow  = build_flow()
        flow.executor = LocalDaskExecutor()
        project_name = os.environ['PROJECT_NAME']
        spark_version = os.environ['SPARK_VERSION']
        github_repo = os.environ['GITHUB_REPO']
        git_ref = os.environ['GIT_REF']
        flow.run(
            project_name=project_name,
            spark_version=spark_version,
            github_repo=github_repo,
            git_ref=git_ref
        )
    and now I have wrapped my flow definition in a function. Is that a reasonable thing to do?
    :discourse: 1
    k
    a
    +3
    • 6
    • 75
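    For reference, a minimal sketch of that wrapped-flow pattern, assuming Prefect 1.x, with Parameters keeping the environment lookups in the main block:
    import os
    from prefect import Flow, Parameter
    from prefect.executors import LocalDaskExecutor

    def build_flow() -> Flow:
        with Flow("spark-job") as flow:
            project_name = Parameter("project_name")
            spark_version = Parameter("spark_version")
            # ... tasks using the parameters ...
        return flow

    if __name__ == "__main__":
        flow = build_flow()
        flow.executor = LocalDaskExecutor()
        flow.run(
            project_name=os.environ["PROJECT_NAME"],
            spark_version=os.environ["SPARK_VERSION"],
        )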
  • b

    Bernardo Galvao

    05/09/2022, 11:46 AM
    Hey, I am setting up Prefect in an on-prem production environment and I'm looking into setting up Storage. I would like to know if there is a recommended KV store Docker image that I should use for this purpose?
    πŸ‘‹ 1
    a
    d
    • 3
    • 23
  • l

    Linh Nguyen

    05/11/2022, 9:40 AM
    Hello there, we are taking our very first steps with Prefect πŸ™‚, with some flows registered and running. I wonder if you could share suggestions on what your Prefect repo looks like, especially when it involves flows of flows / a master flow. Big Thanks :thank-you:
    a
    k
    • 3
    • 3
  • j

    Jason White

    05/11/2022, 6:23 PM
    Hello, we are testing Prefect to coordinate ML pipelines, and we are really happy with the experience so far πŸ™‚. One area that we're still trying to understand fully is resource specification. We are exploring how to run tasks that have heterogeneous requirements (packages and compute) using Prefect. Our understanding is that we specify resources/run configs at the flow level, and that we could isolate resources/run configs by using a flow of flows. Is this correct, or is there a more straightforward way to specify this at the task level? For reference, we are mostly looking at utilizing Kubernetes at this point.
    z
    a
    • 3
    • 4
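    A minimal sketch of the flow-of-flows approach, assuming Prefect 1.x; the child flow names, project, and their per-flow run configs are hypothetical:
    from prefect import Flow
    from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

    with Flow("parent") as parent:
        # each registered child flow carries its own run config (image, CPU, memory)
        train = create_flow_run(flow_name="train-model", project_name="ml")
        wait_train = wait_for_flow_run(train, raise_final_state=True)
        score = create_flow_run(flow_name="score-batch", project_name="ml",
                                upstream_tasks=[wait_train])
        wait_for_flow_run(score, raise_final_state=True)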
  • c

    Chris Hatton

    05/11/2022, 9:40 PM
    Hey all ... I'm looking into Storage options for integrating with a CI/CD pipeline. At this point, we have:
    1. Agents running in AWS ECS Fargate under Local Storage
    2. GitHub Actions to build a new Docker image (with our flows on the image) and push/start it on AWS ECS
    As I look into automating the flow registration process, I've come across (and am interested in) both the GitHub and AWS S3 Storage classes. GitHub looks easier, but I'm not sure I want to rely on GitHub uptime for my flow execution. AWS S3 seems a lot more reliable (plus my flows are on AWS ECS, so if Amazon is having problems it's more likely to impact both). Any thoughts on GitHub vs AWS S3? What are you guys using?
    a
    • 2
    • 3
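    For reference, a minimal sketch of the S3 option, assuming Prefect 1.x; the bucket and project names are hypothetical:
    from prefect import Flow
    from prefect.storage import S3

    with Flow("cicd-registered") as flow:
        ...

    flow.storage = S3(bucket="my-prefect-flows")  # uploads the flow at registration
    flow.register(project_name="production")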
  • m

    Marco PΓ©rez

    05/12/2022, 12:10 AM
    let me know if this is not the best place to ask this, but I'm curious how I might go about dynamically adjusting parameters on retries. The use case is:
    β€’ a process takes the sum of a large number of values; if it overflows, I can retry the sum with a divisor factor
    β€’ for my use case I just need a number to compare to another, which would have the same divisor factor applied
    β€’ if it fails again, I can try a larger divisor factor, up to 3 times, and then fail after the nth retry
    a
    s
    • 3
    • 5
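    One workaround, as a minimal sketch: since Prefect 1.x does not directly support mutating Parameters between retries, loop over the divisor factors inside the task itself; the overflow check here is a hypothetical stand-in:
    import math
    from prefect import task

    @task
    def safe_sum(values):
        for divisor in (1, 10, 100):  # up to 3 attempts with growing divisors
            total = sum(v / divisor for v in values)
            if not math.isinf(total):
                # compare against the other number with the same divisor applied
                return total, divisor
        raise ValueError("sum overflowed at every divisor")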
  • y

    Yang Ruan

    05/13/2022, 9:52 PM
    Hello, can the same agent run flows that use different Docker images? Does that mean that the agent launches another container? Thanks!
    k
    • 2
    • 8
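    With a Prefect 1.x Docker agent, each flow run launches its own container using the image from the flow's run config; a minimal sketch (image names hypothetical):
    from prefect import Flow
    from prefect.run_configs import DockerRun

    with Flow("flow-a") as flow_a:
        ...
    with Flow("flow-b") as flow_b:
        ...

    # the same agent can serve both; it starts one container per flow run
    flow_a.run_config = DockerRun(image="registry.example.com/etl:latest")
    flow_b.run_config = DockerRun(image="registry.example.com/ml:latest")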
  • s

    Sander

    05/16/2022, 9:24 AM
    Hi, for Prefect 2.0b4, how can I add a JSON file as a parameter in a deployment?
    a
    • 2
    • 49
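    A minimal sketch against the 2.0b-era DeploymentSpec API (which has since changed); the file name and flow import are hypothetical:
    import json
    from prefect.deployments import DeploymentSpec
    from my_flows import my_flow  # hypothetical flow module

    with open("params.json") as f:
        params = json.load(f)  # load the file once, at deployment-definition time

    DeploymentSpec(flow=my_flow, name="with-json-params", parameters=params)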
  • d

    Davide at Evo

    05/16/2022, 2:26 PM
    Hi! I'm evaluating Prefect for our company and so far it's excellent, although (I'm using version 2) I'm stuck with k8s jobs not being able to pull images from my custom registry. Our deployment would take advantage of custom images at the flow level to support custom packages deployed by a CI. You can specify an image with KubernetesFlowRunner(namespace="prefect", image="my.registry:5001/something"), but in the Orion branch I cannot find any code related to the imagePullSecrets config. Thanks!!
    a
    • 2
    • 2
  • y

    Yang Ruan

    05/16/2022, 5:19 PM
    Hi, another question about steps: we'd like one task to unzip folders in GCS, and another task to process each file in the folder. Can that be done where the second task fans out (multiprocess)?
    k
    • 2
    • 4
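    A minimal sketch, assuming Prefect 1.x mapping; the GCS helpers are hypothetical placeholders for the real unzip/list logic:
    from prefect import Flow, task
    from prefect.executors import LocalDaskExecutor

    @task
    def unzip_to_gcs(archive_uri: str) -> list:
        ...  # unzip, upload the members, return their URIs
        return ["gs://bucket/a.csv", "gs://bucket/b.csv"]

    @task
    def process_file(uri: str):
        ...

    with Flow("fan-out") as flow:
        files = unzip_to_gcs("gs://bucket/archive.zip")
        process_file.map(files)  # one mapped run per file

    # a process-based executor gives true multiprocess fan-out
    flow.executor = LocalDaskExecutor(scheduler="processes")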
  • s

    StΓ©phan Taljaard

    05/17/2022, 3:03 PM
    Hi. I was wondering how everyone tackles Prefect 1.0 mapped SQL queries. Mapping like the example in the thread causes a slow-running flow, in that the database connection is re-established for each mapped run. How do you do this using a single connection?
    k
    • 2
    • 10
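    One common workaround, as a minimal sketch: run all the queries inside a single task over one connection instead of mapping a connect-per-query task; psycopg2 and the queries are hypothetical:
    import psycopg2
    from prefect import Flow, task

    @task
    def run_queries(queries: list) -> list:
        conn = psycopg2.connect("dbname=mydb")  # one connection for every query
        results = []
        with conn, conn.cursor() as cur:
            for q in queries:
                cur.execute(q)
                results.append(cur.fetchall())
        return results

    with Flow("single-connection") as flow:
        run_queries(["SELECT 1", "SELECT 2"])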
  • l

    Linh Nguyen

    05/18/2022, 7:38 AM
    Hi there, I am currently evaluating different ways to manage dependencies between pipelines and came across this great article. I think flows of flows / master flows are a great idea, but I wonder two things:
    β€’ Does Prefect still have a task similar to ExternalTaskSensor to let a master flow wait for another master flow? Or do we need to create another master flow on top of the previous flows?
    β€’ I would imagine these master flows will expand and become more complex. Also, child flows inside might overlap, e.g. one extract flow is needed in two master flows. What would you recommend regarding this? Thanks
    a
    • 2
    • 8
  • j

    John Kang

    05/18/2022, 3:25 PM
    Hopefully a simple question: I'm having issues setting up Google Cloud Storage as the remote storage for Prefect 2.0. Any tips on what I should do?
    Validation failed! Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or
    explicitly create credentials and re-run the application. For more information, please see
    <https://cloud.google.com/docs/authentication/getting-started>
    βœ… 1
    k
    a
    • 3
    • 10
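    A minimal sketch of the usual fix: point the Google client libraries at a service-account key before Prefect touches GCS; the key path is hypothetical:
    import os

    # must be set before the first Google Cloud client is created
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"C:\keys\service-account.json"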
  • j

    John Kang

    05/18/2022, 6:10 PM
    I have an issue where I have a deployment and can find it by running prefect deployment ls, but when I try prefect deployment inspect 'leonardo_dicapriflow/leonardo-deployment' it does not show up. Also, when I try to run the deployment locally it does not work either. FYI, my remote storage is through Google Cloud.
    (Capacity_venv) C:\Users\JKANG1\PycharmProjects\Manheim_Capacity\main_python_files\cockroachdb_write_after_etl>prefect deployment ls
    C:\ProgramData\Anaconda3\envs\Capacity_venv\lib\site-packages\pkg_resources\__init__.py:122: PkgResourcesDeprecationWarning: winpty is an invalid version and will not be supported in a future release
      warnings.warn(
                                        Deployments
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Name                                       β”‚ ID                                   β”‚
    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
    β”‚ leonardo_dicapriflow/leonardo-deployment   β”‚ 19aacccb-d89e-406e-bd1a-0ba4bf2dedb5 β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    (Capacity_venv) C:\Users\JKANG1\PycharmProjects\Manheim_Capacity\main_python_files\cockroachdb_write_after_etl>prefect deployment inspect 'leonardo_dicapriflow/leonardo-deployment'
    C:\ProgramData\Anaconda3\envs\Capacity_venv\lib\site-packages\pkg_resources\__init__.py:122: PkgResourcesDeprecationWarning: winpty is an invalid version and will not be supported in a future release
      warnings.warn(
    Deployment "'leonardo_dicapriflow/leonardo-deployment'" not found!
    βœ… 1
    a
    k
    • 3
    • 7
  • j

    jedi

    05/18/2022, 6:57 PM
    We would like to use Prefect Cloud for orchestrating jobs. If Prefect Cloud were compromised by a bad actor, would it be possible for them to infiltrate or exfiltrate arbitrary code/data to run on agents that are hosted on-prem behind a firewall?
    βœ… 1
    a
    • 2
    • 2
  • j

    jedi

    05/18/2022, 6:59 PM
    When will it be possible to install Prefect 2.0 via conda, without needing Docker or WSL?
    βœ… 1
    a
    • 2
    • 1
  • a

    Aaron Goebel

    05/18/2022, 8:08 PM
    I want to compile a workflow DAG from a set of primitive tasks on the fly. The idea is that the steps can be arbitrarily re-ordered and the final results compared; like a combinatorial way of generating flows from a set of task nodes. I'm wondering if anyone has already done this, or if there is anything you'd consider important prior to jumping in. I'm thinking that as long as the inputs/outputs (not considering side effects) are all of the same type, this should be doable on the fly in reaction to an API request.
    βœ… 1
    a
    • 2
    • 2
  β€’ a

    Anna Geller

    05/18/2022, 8:18 PM
    If you use Prefect 2.0 you don't need a DAG at all; tasks are determined dynamically at runtime. These two blog posts may help explain how it works and the reason for this: β€’ https://www.prefect.io/blog/announcing-prefect-orion/ β€’ https://www.prefect.io/blog/introducing-prefect-2-0/
    πŸ‘€ 1
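    A minimal sketch of the combinatorial idea under Prefect 2.0's dynamic runtime; the primitive tasks are hypothetical and share one input/output type, as the question assumes:
    from itertools import permutations
    from prefect import flow, task

    @task
    def add_one(x: int) -> int:
        return x + 1

    @task
    def double(x: int) -> int:
        return x * 2

    def make_pipeline(order):
        # build a distinct flow for one ordering of the primitive tasks
        @flow(name="-".join(t.name for t in order))
        def pipeline(x: int = 1):
            for t in order:
                x = t(x)  # each call is resolved dynamically at runtime
            return x
        return pipeline

    # run every ordering; compare the final results afterwards
    for order in permutations([add_one, double]):
        make_pipeline(order)(1)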