# ask-community
Blake List
Hi there, when caching data with a function that maps over rows in a dataframe, only the first row will be stored. Is there a way to cache all the mapped rows?
Kevin Kho
Hey @Blake List, I think this is because they use the same file path by default. You can add the map index like this:
@task(result=LocalResult(dir="./output/", location="{map_index}.txt"), checkpoint=True)
. In general, you can see these docs for templating names. Do you need observability on a per row level here? Seems like you can just save the whole DataFrame with a reduce step afterwards also?
Blake List
Hi Kevin, thank you so much! I don't particularly need to cache here, but I have several map functions chained together which operate on the rows so thought it might be nice to add it. Currently I am just caching the assemble_rows function afterwards.
Kevin Kho
I think if this is Pandas, apply would serve you better. Prefect charges per task also, so operations like this can be expensive. Do you really need observability for all of the chained operations?
Blake List
Can you explain how I can add a pandas apply within a task? Under what circumstances would Prefect charge? Currently I have Prefect running in a conda env on my company server, and we would ideally deploy a Prefect server and UI with Docker. I don't think I need observability for all of the chained operations, just the df before and after.
Kevin Kho
@task
def transform(df):
    df['col'] = df['col'].apply(lambda x: x+1)
    return df
It’s something like that. Mapping adds overhead to each row of your DataFrame, whereas here the overhead applies once to the whole DataFrame operation. We don’t charge for Server, actually, so you’re fine. We charge per successful task run, so map operations over DataFrames can blow up the task count; some people have complained about that, so I was just giving a heads up.
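To make the trade-off concrete, here is a small self-contained pandas sketch; the column name `col` follows the snippet above, and the DataFrame contents are made up for illustration. The `.apply` runs per element inside a single task run, instead of one task run per row as with mapping.

```python
import pandas as pd

# Illustrative data; only the column name 'col' comes from the thread.
df = pd.DataFrame({"col": [1, 2, 3]})

# One task-level operation: .apply iterates per element, but all
# inside a single function call rather than one task run per row.
df["col_apply"] = df["col"].apply(lambda x: x + 1)

# An equivalent vectorized form, which is usually faster than .apply:
df["col_vec"] = df["col"] + 1
```

If the per-row logic is simple arithmetic like this, the vectorized form is generally preferable to `.apply` as well.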
Blake List
Great thanks @Kevin Kho 😁