Thread
#prefect-community
    tash lai
    1 year ago
    Hey! There's a problem. Just as an example, say there's a website that has a list of tv shows, and a page for each tv show has links to information about every episode. I want to scrape all this info and save everything into a table (url, show_name, episode_name, episode_description)
    Tyler Wanner
    1 year ago
    Hi Tash. I can't be of much help atm but could you please move this into a thread? It is appreciated to avoid walls of text, especially for large code blocks
    tash lai
    1 year ago
    Ok, thanks Tyler. So the first thing that comes to mind is
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        episode_details = download_episode_page.map(url=flatten(episode_urls))
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
    Now we have all the data we need, but how do I save it to my database? This approach won't work, because after flattening, zip-like behaviour won't help much: there are fewer url and show_name values than episode_name and episode_desc values:
    save_to_database.map(url, show_name, episode_name, episode_desc)
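    The mismatch can be illustrated with hypothetical plain lists (not Prefect tasks, and all the values are made up): a zip-style pairing silently truncates to the shortest input.

```python
# Hypothetical per-show results: one entry per show.
urls = ["example.com/show-a", "example.com/show-b"]
show_names = ["Show A", "Show B"]

# Hypothetical per-episode results after flattening: one entry per episode.
episode_names = ["Pilot", "Episode 2", "Pilot"]
episode_descs = ["First episode", "Second episode", "Another pilot"]

# zip stops at the shortest sequence, so the third episode is dropped.
rows = list(zip(urls, show_names, episode_names, episode_descs))
```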
    Another approach is to annotate episode_urls with show_name and url before flattening:
    @task
    def annotate_episode_url(url, show_name, episode_url):
       return (url, show_name, episode_url)
    
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        annotated_episode_urls = annotate_episode_url.map(url, show_name, episode_urls)
        flattened = flatten(annotated_episode_urls)
        episode_show_url = flattened[0]
        episode_show_name = flattened[1]
        episode_url = flattened[2]
        episode_details = download_episode_page.map(url=episode_url)
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
        save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
    But this is overly verbose and inefficient. Is there some other way to do this?
    emre
    1 year ago
    First off, I think making save_to_database a mapped task is inefficient. You are forcing yourself to make a round trip to the db for each episode you have. Instead, consider collecting the output of your mapped tasks and making save_to_database a bulk insert. Anyway, as you said, annotating the show url and name onto the episode urls is the simplest way to maintain the relation between show and episode. If you can modify your save_to_database task, I would just make it accept the annotated tuple instead of episode_show_url and episode_show_name, and parse the tuple inside the task. That way, you can avoid the verbose flattened[x] steps. Something like:
    @task
    def annotate_episode_url(url, show_name, episode_url_list):
       return [(url, show_name, episode_url) for episode_url in episode_url_list]
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        annotated = annotate_episode_url.map(url, show_name, episode_urls)
        episode_details = download_episode_page.map(url=flatten(episode_urls)) # flatten(episode_urls) and annotated have the same order, this works
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
        save_to_database.map(flatten(annotated), episode_name, episode_desc)
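    The comment above relies on flatten preserving order. With a plain list-of-lists stand-in for Prefect's flatten (an assumption about its behaviour: it concatenates the mapped children in order), the alignment can be checked like this, using made-up URLs and names:

```python
def flatten(list_of_lists):
    # Plain-Python stand-in for prefect.flatten: concatenate sublists in order.
    return [item for sublist in list_of_lists for item in sublist]

# Hypothetical per-show episode URL lists and their annotated counterparts.
episode_urls = [["a/e1", "a/e2"], ["b/e1"]]
annotated = [
    [("url-a", "Show A", e) for e in episode_urls[0]],
    [("url-b", "Show B", e) for e in episode_urls[1]],
]

flat_urls = flatten(episode_urls)
flat_annotated = flatten(annotated)
# Each flattened URL lines up with the episode_url field of its annotation.
aligned = all(u == t[2] for u, t in zip(flat_urls, flat_annotated))
```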
    If you don't want to modify save_to_database, consider placing a collection task just before the database task to unzip the annotated tuples. Use nout to decrease clutter.
    @task
    def annotate_episode_url(url, show_name, episode_url_list):
       return [(url, show_name, episode_url) for episode_url in episode_url_list]
    
    @task(nout=2)
    def collect(annotated_tuple):
        return (
            [u for u, s, ep in annotated_tuple],
            [s for u, s, ep in annotated_tuple],
        )
    
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        annotated = annotate_episode_url.map(url, show_name, episode_urls)
        episode_details = download_episode_page.map(url=flatten(episode_urls)) # flatten(episode_urls) and annotated have the same order, this works
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
        episode_show_url, episode_show_name = collect(flatten(annotated))
        save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
    Apparently nout does not work if the task is mapped, so a collection task is necessary in that regard.
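    A bulk save_to_database along these lines might look like the following sketch: plain Python with sqlite3 as a stand-in database, where in the flow it would be a @task taking the collected lists, and the table name and schema are assumptions matching the columns from the original question.

```python
import sqlite3

def save_to_database(show_urls, show_names, episode_names, episode_descs):
    """Insert all episodes in one transaction instead of making one
    round trip per episode. All four lists are assumed to be aligned,
    one entry per episode."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "url TEXT, show_name TEXT, episode_name TEXT, episode_description TEXT)"
    )
    rows = list(zip(show_urls, show_names, episode_names, episode_descs))
    conn.executemany("INSERT INTO episodes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

# Made-up data: two episodes of Show A, one of Show B, saved in one call.
conn = save_to_database(
    ["url-a", "url-a", "url-b"],
    ["Show A", "Show A", "Show B"],
    ["Pilot", "Episode 2", "Pilot"],
    ["First episode", "Second episode", "Another pilot"],
)
count = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
```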
    tash lai
    1 year ago
    Thanks, @emre. As for save_to_database, I might want to have the results for each episode in my database as soon as it's downloaded, instead of waiting for the whole thing to complete.
    Kevin Kho
    1 year ago
    I think @emre’s suggestion is good because of efficiency. This is hard because of the imbalance in lengths of url and show_name versus episode_name. I don’t have any suggestions beyond what you have, but I’ll come back here if I come across something