Thread
#prefect-community
    tash lai
    1 year ago
    Hey! There's a problem. Just as an example, say there's a website that has a list of tv shows, and a page for each tv show has links to information about every episode. I want to scrape all this info and save everything into a table (url, show_name, episode_name, episode_description)
    Tyler Wanner
    1 year ago
    Hi Tash. I can't be of much help atm but could you please move this into a thread? It is appreciated to avoid walls of text, especially for large code blocks
    tash lai
    1 year ago
    Ok, thanks Tyler. So the first thing that comes to mind is
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        episode_details = download_episode_page.map(url=flatten(episode_urls))
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
    Now we have all the data we need, but how do I save it to my database? This approach won't work, because after flattening, zip-like behaviour won't help much: there are fewer url and show_name values than episode_name and episode_desc values:
    save_to_database.map(url, show_name, episode_name, episode_desc)
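    The mismatch can be illustrated with hypothetical plain lists (not Prefect tasks, and all the values are made up): a zip-style pairing silently truncates to the shortest input.

```python
# Hypothetical per-show results: one entry per show.
urls = ["example.com/show-a", "example.com/show-b"]
show_names = ["Show A", "Show B"]

# Hypothetical per-episode results after flattening: one entry per episode.
episode_names = ["Pilot", "Episode 2", "Pilot"]
episode_descs = ["First episode", "Second episode", "Another pilot"]

# zip stops at the shortest sequence, so the third episode is dropped.
rows = list(zip(urls, show_names, episode_names, episode_descs))
```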
    Another approach is to annotate episode_urls with show_name and url before flattening:
    @task
    def annotate_episode_url(url, show_name, episode_url):
       return (url, show_name, episode_url)
    
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        annotated_episode_urls = annotate_episode_url.map(url, show_name, episode_urls)
        flattened = flatten(annotated_episode_urls)
        episode_show_url = flattened[0]
        episode_show_name = flattened[1]
        episode_url = flattened[2]
        episode_details = download_episode_page.map(url=episode_url)
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
        save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
    But this is overly verbose and inefficient. Is there some other way to do this?
    emre
    1 year ago
    First off, I think making save_to_database a mapped task is inefficient. You are forcing yourself to make a round trip to the db for each episode you have. Instead, consider collecting the output of your mapped tasks and making save_to_database a bulk insert. Anyway, as you said, annotating the show url and name onto the episode urls is the simplest way to maintain the relation between show and episode. If you can modify your save_to_database task, I would just make it accept the annotated tuple instead of episode_show_url and episode_show_name, and parse the tuple inside the task. That way, you can avoid the verbose flattened[x] steps. Something like:
    @task
    def annotate_episode_url(url, show_name, episode_url_list):
       return [(url, show_name, episode_url) for episode_url in episode_url_list]
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        annotated = annotate_episode_url.map(url, show_name, episode_urls)
        episode_details = download_episode_page.map(url=flatten(episode_urls)) # flatten(episode_urls) and annotated have the same order, this works
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
        save_to_database.map(flatten(annotated), episode_name, episode_desc)
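    The comment above relies on flatten preserving order. With a plain list-of-lists stand-in for Prefect's flatten (an assumption about its behaviour: it concatenates the mapped children in order), the alignment can be checked like this, using made-up URLs and names:

```python
def flatten(list_of_lists):
    # Plain-Python stand-in for prefect.flatten: concatenate sublists in order.
    return [item for sublist in list_of_lists for item in sublist]

# Hypothetical per-show episode URL lists and their annotated counterparts.
episode_urls = [["a/e1", "a/e2"], ["b/e1"]]
annotated = [
    [("url-a", "Show A", e) for e in episode_urls[0]],
    [("url-b", "Show B", e) for e in episode_urls[1]],
]

flat_urls = flatten(episode_urls)
flat_annotated = flatten(annotated)
# Each flattened URL lines up with the episode_url field of its annotation.
aligned = all(u == t[2] for u, t in zip(flat_urls, flat_annotated))
```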
    If you don't want to modify save_to_database, consider placing a collection task just before the database task to unzip the annotated tuples. Use nout to decrease clutter.
    @task
    def annotate_episode_url(url, show_name, episode_url_list):
       return [(url, show_name, episode_url) for episode_url in episode_url_list]
    
    @task(nout=2)
    def collect(annotated_tuple):
        return (
            [u for u, s, ep in annotated_tuple],
            [s for u, s, ep in annotated_tuple],
        )
    
    with Flow('scrapeshows') as flow:
        url = get_shows_urls_from_front_page()
        show_details = download_show_page.map(url)
        show_name = extract_name.map(show_details)
        episode_urls = extract_episode_urls.map(show_details)
        annotated = annotate_episode_url.map(url, show_name, episode_urls)
        episode_details = download_episode_page.map(url=flatten(episode_urls)) # flatten(episode_urls) and annotated have the same order, this works
        episode_name = extract_episode_name.map(episode_details)
        episode_desc = extract_episode_desc.map(episode_details)
        episode_show_url, episode_show_name = collect(flatten(annotated))
        save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
    Apparently nout does not work if the task is mapped, so a collection task is necessary in that regard.
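    A bulk save_to_database along these lines might look like the following sketch: plain Python with sqlite3 as a stand-in database, where in the flow it would be a @task taking the collected lists, and the table name and schema are assumptions matching the columns from the original question.

```python
import sqlite3

def save_to_database(show_urls, show_names, episode_names, episode_descs):
    """Insert all episodes in one transaction instead of making one
    round trip per episode. All four lists are assumed to be aligned,
    one entry per episode."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "url TEXT, show_name TEXT, episode_name TEXT, episode_description TEXT)"
    )
    rows = list(zip(show_urls, show_names, episode_names, episode_descs))
    conn.executemany("INSERT INTO episodes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

# Made-up data: two episodes of Show A, one of Show B, saved in one call.
conn = save_to_database(
    ["url-a", "url-a", "url-b"],
    ["Show A", "Show A", "Show B"],
    ["Pilot", "Episode 2", "Pilot"],
    ["First episode", "Second episode", "Another pilot"],
)
count = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
```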
    tash lai
    1 year ago
    Thanks, @emre. As for save_to_database, I might want to have the results for each episode in my database as soon as it's downloaded, instead of waiting for the whole thing to complete.
    Kevin Kho
    1 year ago
    I think @emre’s suggestion is good because of efficiency. This is hard because of the imbalance in lengths of url and show_name versus episode_name. I don’t have any suggestions beyond what you have, but I’ll come back here if I come across something