# ask-community
d
Hi all. I'm trying to deploy my ETL using the GitHub filesystem. I've pushed the Python script containing the flow to GitHub, created a GitHub block in Prefect, and I have this deployment code:
from prefect.deployments import Deployment
from prefect.filesystems import GitHub
from etl_web_to_gcs import etl

github_block = GitHub.load("github-block")

# Deployment Object
github_dep = Deployment.build_from_flow(
    flow=etl,
    name="github-flow",
    infrastructure=github_block
)

if __name__ == "__main__":
    github_dep.apply()
When I run the code above, I get the following error:
...
cls_init(__pydantic_self__, **data)
 File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Deployment
infrastructure
 Infrastructure block must have 'run-infrastructure' capabilities. (type=value_error)
Any idea how to fix this?
r
I was replying when you removed the post:)
the github_block is not infra but storage
d
Sorry for that
r
you will also need some sort of infrastructure so prefect knows how you want the flow to run
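e.g. something along these lines — a minimal sketch assuming you want to run the flow locally with a Process infrastructure (reusing the block name and flow from your snippet):

from prefect.deployments import Deployment
from prefect.filesystems import GitHub
from prefect.infrastructure import Process

from etl_web_to_gcs import etl

# GitHub is a *storage* block: it tells Prefect where to pull the flow code from
github_block = GitHub.load("github-block")

github_dep = Deployment.build_from_flow(
    flow=etl,
    name="github-flow",
    storage=github_block,       # where the code lives
    infrastructure=Process(),   # how the flow run is executed
)

if __name__ == "__main__":
    github_dep.apply()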
a
Hi @redsquare, sorry for the ping! I have a question that is somewhat related to this issue. We have our infrastructure set up in the cloud using ECS, and I usually develop and test things using a local `Process`. Yesterday we did the Prefect associate certificate training, and part of that was hosting code on GitHub (basically telling the flow where the code that needs to be executed lives). Our team only recently migrated from Prefect 1 to 2 and we're still trying to figure out how to work with Prefect 2 efficiently; a few flows that we have migrated from 1 to 2 are hosted on an S3 bucket, which is a headache for us because all of our code already lives on GitHub. Could you advise me on how I can configure Prefect 2, both in the cloud (using ECS) and for local testing/dev (using `Process`), to fetch the code from GitHub? The GitHub repos are of course private. Is it as simple as creating a GitHub block and passing credentials? Side question: during training the usage of prefect.yaml came up. How does that come into play in this context? Is it necessary? Do you write your own YAML file before deployment of a flow if required to connect to GitHub?
C.C. @Emil Christensen - we had a brief conversation about this yesterday.
e
Storage blocks (e.g. the `GitHub` block) and infrastructure blocks are intended for use with agents and are generally incompatible with workers and the `prefect.yaml` way of deploying flows. When using workers and `prefect.yaml`, the `prefect.deployments.steps.git_clone` pull step serves the same purpose as the `GitHub` storage block. Whether you have an agent or a worker, what happens is that at run time the flow will pull code from git; this happens regardless of the infrastructure. The caveat here is that I'm assuming you are running a deployment that gets executed by an agent or worker. If you're just running your flow directly (e.g. `python my_flow.py`), then no code gets downloaded.
Hopefully that clarifies to some extent. Basically you have two options:
1. If you're happy using the `prefect.yaml` file and the `prefect deploy` CLI command, then add a `prefect.deployments.steps.git_clone` step to your pull section (example; see the sketch below). Now every time a worker executes a flow run, it'll pull the code from GitHub.
2. Alternatively, if you want to define your deployments in Python, create a `GitHub` storage block and pass it to your deployment. Now, when an agent executes a flow run, it'll pull code from GitHub.
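For option 1, the pull section of `prefect.yaml` would look roughly like this (the repository URL, branch, and Secret block name here are placeholders, not something specific to your setup):

pull:
  - prefect.deployments.steps.git_clone:
      repository: https://github.com/your-org/your-repo.git
      branch: main
      access_token: "{{ prefect.blocks.secret.github-access-token }}"

Every worker-executed flow run will then clone that repo before running the flow.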
a
So imagine you have a script compatible with Prefect 2 on your github repo. If you wanted to deploy that piece of code and have it be runnable on the prefect cloud how would you approach this? How would you configure the authentication to github? I feel like I am mixing 2 different concepts/method up and confusing myself
e
^ does the message above help?
a
Haha yes sorry I was typing as you were already answering my question 😄
Thank you!
e
In either case, you would want to create a personal access token and assign that to your `GitHub` block or your `prefect.deployments.steps.git_clone` step.
(in both cases the argument name is `access_token`)
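e.g. for the block route, something like this — the repository URL and token are placeholders, and `reference` is the branch (or tag) to pull:

from prefect.filesystems import GitHub

github_block = GitHub(
    repository="https://github.com/your-org/your-repo.git",  # placeholder URL
    reference="main",                                         # branch or tag to pull
    access_token="<personal-access-token>",                   # PAT with read access to the repo
)
github_block.save("github-block", overwrite=True)

Then load it with GitHub.load("github-block") and pass it as storage= when building the deployment.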
a
I do like the usage of prefect.yaml; a couple of clarifying questions about it. Can you refresh my memory on how this file gets created (and later gets modified by the developer)? Or does the user create it based on their needs? Also, do you have a single prefect.yaml for the whole repo, or is it flow/script specific?
@Alyssa Harris
@Jordan Lessard
e
how this file gets created
You can initialize it with `prefect init`, optionally from a recipe. Also, if you run `prefect deploy` and deploy a flow, you can write out a new prefect.yaml file at the end of that.
Also do you have a single Prefect.yaml for the whole repo or is it flow/script specific?
Generally one per repo. Within it you can have multiple different deployments (usually one per flow).
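Roughly like this — the flow names and entrypoints below are made up for illustration:

deployments:
  - name: etl-web-to-gcs
    entrypoint: flows/etl_web_to_gcs.py:etl
    work_pool:
      name: test-pool
  - name: read-csv
    entrypoint: flows/read_csv.py:read_csv
    work_pool:
      name: test-pool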
a
Thank you so much Emil!
Hi @Emil Christensen; I hope you had a great weekend. Following your guidance I managed to deploy and test a very simple "hello world" flow using GitHub as its source code successfully. I ran into a couple of challenges when I wanted to deploy our real production flows and managed to solve all but one of them. Let me recap the steps that I have done. The code is pushed to a private repo and I'm working off of a feature branch. Using `prefect deploy` I provided the repo URL and the branch name, then I generated an access token and provided that as well. I selected `process` as our ECS is not fully fleshed out yet. Now, the flow uses 2 external packages, pandas and s3fs. They are imported in this order:
import pandas as pd
import s3fs
After deploying and starting the work pool with `prefect worker start --pool 'test-pool'`, I ran the flow using the UI. Now I faced an issue with the packages:
prefect.exceptions.ScriptError: Script at 'flows/read_csv.py' encountered an exception: ModuleNotFoundError("No module named 's3fs'")
Now I am confused because of a few questions:
1. Why did the flow execute locally and not on cloud?
2. How do I make sure that the Prefect process has access to the packages that the script requires?
3. And oddly enough, why did it fail on `s3fs` and not on `pandas`, even though the latter was imported first?
4. How can I remedy this?
Thank you so much for helping me out with this
C.C. @Nicole Garza
m
Hey Emil - I have been having a similar issue where my flow code does not seem to be cloned into the docker container. We are deploying our flows on ECS using a Python `Deployment.build_from_flow` deployment, so we are not using a prefect.yaml file to deploy. I have created a GitHub block with the credentials. Is there a port mapping or security group that may be required when the task is spinning up in ECS? Is there a way to validate, print, or pause the container so I can check whether the files are cloned into the task container? Any help would be greatly appreciated as I have spent a lot of time on this seemingly trivial implementation 🙂
I have also tried all sorts of paths and entrypoints to see if I could find the file, with no luck.
e
@Ali Mir assuming you've installed the packages, the most likely explanation is that two different virtualenvs are in use. What's the output of `which prefect`? Can you run the flow successfully with something like `python flow.py`? If so, what's the output of `which python`?
@Mitch when you're using `Deployment.build_from_flow`, are you passing the GitHub block you created? You should see a log in the agent that says something to the effect of "pulling files from repo ...". You could peek at the files by adding something like the following to your flow:
import os
print(f"Current dir is {os.path.abspath('.')}")
print(f"Files: {os.listdir('.')}")