# data-tricks-and-tips
j
Is there any real reason to manage dataframe-based ML pipelines with something like Kedro or Hamilton? I've just been using Prefect tasks.
d
I personally use Prefect tasks. A few usage notes for me:
• I always use `.submit` when invoking a task
• I never access a `DataFrame` in a flow, only in a task
• I generally compute statistics with `stats_df = df.describe().reset_index()` and upload that as an artifact via `stats_df.astype(str).to_dict("records")` in a task (the cast to `str` silences issues in handling non-primitive types; see the sketch after this list)
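To make that last bullet concrete, here's a minimal sketch, assuming a recent Prefect 2.x that ships `prefect.artifacts.create_table_artifact`; the task name and artifact key are placeholders:

```python
import pandas as pd
from prefect import task
from prefect.artifacts import create_table_artifact


@task
def report_stats(df: pd.DataFrame) -> None:
    # Summary statistics as a plain table, one row per statistic.
    stats_df = df.describe().reset_index()
    # Cast to str so non-primitive cells (e.g. Timestamps) don't
    # trip up artifact serialization.
    create_table_artifact(
        key="summary-stats",  # hypothetical key
        table=stats_df.astype(str).to_dict("records"),
        description="df.describe() for the current run",
    )
```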
Sticking to those rules has resulted in a very ergonomic way of manipulating `DataFrame` objects in Prefect; a full flow that follows them might look like the sketch below.
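A minimal end-to-end sketch of those rules, assuming Prefect 2.x; the task names, the CSV path, and the "training" step are hypothetical:

```python
import pandas as pd
from prefect import flow, task


@task
def load_data(path: str) -> pd.DataFrame:
    # All DataFrame access lives in tasks, never in the flow body.
    return pd.read_csv(path)


@task
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()


@task
def train(df: pd.DataFrame) -> float:
    # Placeholder for real model training.
    return df.shape[0] / 1000.0


@flow
def pipeline(path: str = "data/raw.csv"):  # hypothetical path
    # .submit returns a PrefectFuture; passing futures between tasks
    # lets Prefect resolve them and track the dependency graph.
    raw = load_data.submit(path)
    cleaned = clean_data.submit(raw)
    score = train.submit(cleaned)
    return score.result()


if __name__ == "__main__":
    pipeline()
```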
Also, one other cool thing: it's very easy to manage caching, which is helpful for long-lived flows that have failure points.
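A minimal caching sketch, assuming Prefect 2.x's `task_input_hash` with the `cache_key_fn`/`cache_expiration` task options; the task body is a placeholder:

```python
from datetime import timedelta

import pandas as pd
from prefect import task
from prefect.tasks import task_input_hash


@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(days=1))
def expensive_transform(path: str) -> pd.DataFrame:
    # If a downstream task fails and the flow is re-run, this result
    # is served from the cache (keyed on the inputs) instead of being
    # recomputed.
    return pd.read_csv(path).dropna()
```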