# ask-community
Tash
Hey! There's a problem. Just as an example, say there's a website that has a list of TV shows, and the page for each TV show has links to information about every episode. I want to scrape all this info and save everything into a table (url, show_name, episode_name, episode_description).
Tyler
Hi Tash. I can't be of much help atm, but could you please move this into a thread? It's appreciated, to avoid walls of text, especially for large code blocks.
Tash
Ok, thanks Tyler. So the first thing that comes to mind is:
```python
with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    episode_details = download_episode_page.map(url=flatten(episode_urls))
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
```
Now we have all the data we need, but how do I save it to my database? This approach won't work: after flattening, there are fewer `url` and `show_name` results than `episode_name` and `episode_desc` results, so zip-like behaviour won't help much:

```python
save_to_database.map(url, show_name, episode_name, episode_desc)
```
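To make the length mismatch concrete, here is a toy plain-Python illustration (no Prefect; the show and episode names are made up): flattening yields one entry per episode, while `url` and `show_name` stay at one entry per show, so zip-like pairing both truncates and mismatches.

```python
# Toy data: one url per show, one list of episode urls per show.
urls = ["site/show-a", "site/show-b"]
episode_urls = [["a/ep1", "a/ep2", "a/ep3"], ["b/ep1"]]

# flatten() concatenates the per-show lists into one episode-level list.
flattened = [ep for eps in episode_urls for ep in eps]
print(len(urls))       # 2  (show-level)
print(len(flattened))  # 4  (episode-level)

# zip() stops at the shorter input, so pairing show-level values with
# episode-level values drops episodes and pairs the rest incorrectly.
print(list(zip(urls, flattened)))  # only 2 pairs, and the 2nd pairing is wrong
```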
Another approach is to annotate `episode_urls` with `show_name` and `url` before flattening:
```python
@task
def annotate_episode_url(url, show_name, episode_url):
    return (url, show_name, episode_url)

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated_episode_urls = annotate_episode_url.map(url, show_name, episode_urls)
    flattened = flatten(annotated_episode_urls)
    episode_show_url = flattened[0]
    episode_show_name = flattened[1]
    episode_url = flattened[2]
    episode_details = download_episode_page.map(url=episode_url)
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
```
But this is overly verbose and inefficient. Is there some other way to do this?
emre
First off, I think making `save_to_database` a mapped task is inefficient. You are forcing yourself to make a round trip to the db for each episode you have. Instead, consider collecting the output of your mapped tasks and making `save_to_database` a bulk insert. Anyway, as you said, annotating the episode urls with the show url and name is the simplest way to maintain the relation between show and episode. If you can modify your `save_to_database` task, I would just make it accept the annotated tuple instead of `episode_show_url` and `episode_show_name`, and parse the tuple inside the task. That way, you can avoid the verbose `flattened[x]` steps. Something like:
```python
@task
def annotate_episode_url(url, show_name, episode_url_list):
    return [(url, show_name, episode_url) for episode_url in episode_url_list]

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated = annotate_episode_url.map(url, show_name, episode_urls)
    episode_details = download_episode_page.map(url=flatten(episode_urls))  # flatten(episode_urls) and flatten(annotated) have the same order, so this works
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    save_to_database.map(flatten(annotated), episode_name, episode_desc)
```
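For the "parse the tuple inside the task" part, a minimal sketch of what the body of such a `save_to_database` could look like, written as a plain function (in the flow it would be wrapped with `@task`); the `episodes` table name and schema are assumptions for illustration:

```python
import sqlite3

# Sketch only: accepts one annotated (show_url, show_name, episode_url)
# tuple per episode and unpacks it inside the task body.
def save_to_database(annotated, episode_name, episode_desc, conn):
    show_url, show_name, _episode_url = annotated  # parse the tuple here
    conn.execute(
        "INSERT INTO episodes (url, show_name, episode_name, episode_description) "
        "VALUES (?, ?, ?, ?)",
        (show_url, show_name, episode_name, episode_desc),
    )

# Demo against an in-memory database with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes (url TEXT, show_name TEXT, "
    "episode_name TEXT, episode_description TEXT)"
)
save_to_database(("site/show-a", "Show A", "a/ep1"), "Pilot", "First episode.", conn)
print(conn.execute("SELECT * FROM episodes").fetchall())
# [('site/show-a', 'Show A', 'Pilot', 'First episode.')]
```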
If you don't want to modify `save_to_database`, consider placing a collection task just before the database task to unzip the annotated tuples. Use `nout` to decrease clutter.
```python
@task
def annotate_episode_url(url, show_name, episode_url_list):
    return [(url, show_name, episode_url) for episode_url in episode_url_list]

@task(nout=2)
def collect(annotated_tuple):
    return (
        [u for u, s, ep in annotated_tuple],
        [s for u, s, ep in annotated_tuple],
    )

with Flow('scrapeshows') as flow:
    url = get_shows_urls_from_front_page()
    show_details = download_show_page.map(url)
    show_name = extract_name.map(show_details)
    episode_urls = extract_episode_urls.map(show_details)
    annotated = annotate_episode_url.map(url, show_name, episode_urls)
    episode_details = download_episode_page.map(url=flatten(episode_urls))  # flatten(episode_urls) and flatten(annotated) have the same order, so this works
    episode_name = extract_episode_name.map(episode_details)
    episode_desc = extract_episode_desc.map(episode_details)
    episode_show_url, episode_show_name = collect(flatten(annotated))
    save_to_database.map(episode_show_url, episode_show_name, episode_name, episode_desc)
```
Apparently `nout` does not work if the task is mapped, so a collection task is necessary in that regard.
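What the `collect` task does is a plain unzip; here is a toy plain-Python illustration (made-up data) of turning the flattened list of annotated tuples into two parallel, episode-level columns:

```python
# One (show_url, show_name, episode_url) tuple per episode,
# i.e. the flattened output of annotate_episode_url.
annotated = [
    ("site/show-a", "Show A", "a/ep1"),
    ("site/show-a", "Show A", "a/ep2"),
    ("site/show-b", "Show B", "b/ep1"),
]

# Unzip the show-level fields so each list has one entry per episode.
episode_show_url = [u for u, s, ep in annotated]
episode_show_name = [s for u, s, ep in annotated]
print(episode_show_url)   # ['site/show-a', 'site/show-a', 'site/show-b']
print(episode_show_name)  # ['Show A', 'Show A', 'Show B']
```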
Tash
Thanks, @emre. As for `save_to_database`, I might want to have the results for each episode in my database as soon as it's downloaded, instead of waiting for the whole thing to complete.
k
I think @emre's suggestion is good for efficiency reasons. This is hard because of the imbalance in lengths of `url` and `show_name` versus `episode_name`. I don't have any suggestions beyond what you have, but I'll come back here if I come across something.