Issam Assafi
09/23/2021, 12:38 PM
Ruslan Aliev
09/23/2021, 12:46 PM
Issam Assafi
09/23/2021, 12:50 PM
page_path_pdf_pairs = get_pages_from_pdf(pdf_path)
texts_of_pages = get_text_from_page.map(page_path_pdf_pairs)
ocr_pages = join_ocr_pages(texts_of_pages)
but in the code above it's applied to only one PDF, and I have multiple PDFs ... any idea how I can handle this while gaining the parallelisation advantage?
emre
09/23/2021, 1:15 PM
Use flatten to, well, flatten your nested lists into a single list, which will then be mappable.
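from prefect import Flow, flatten

with Flow('x') as f:
    # one mapped run per PDF; each run returns a list of pages
    page_path_pdf_pairs = get_pages_from_pdf.map(pdf_paths)
    # flatten the list of lists so every page gets its own mapped run
    texts_of_pages = get_text_from_page.map(flatten(page_path_pdf_pairs))
    ocr_pages = join_ocr_pages(texts_of_pages)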
Be careful: by using flatten you lose the nested structure, so you might lose the information of which page belongs to which PDF. Tag your pages with their source PDF names wherever possible, so you can group the pages back.
Kevin Kho
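As a rough illustration of the tagging advice, here is a plain-Python sketch with no Prefect involved. The function bodies are hypothetical stand-ins for the tasks in the thread, the list comprehensions stand in for .map, and itertools.chain.from_iterable stands in for flatten. The only point is that each page carries its source PDF path, so the flat results can be grouped back afterwards.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical stand-ins for the tasks in the thread; real versions would
# split a PDF into pages and run OCR on each page.
def get_pages_from_pdf(pdf_path):
    # Return (pdf_path, page_number, page_ref) triples so that every page
    # carries its source PDF with it.
    return [(pdf_path, n, f"{pdf_path}#page{n}") for n in range(1, 3)]

def get_text_from_page(tagged_page):
    pdf_path, page_number, page_ref = tagged_page
    # Pass the tag through unchanged alongside the (fake) OCR text.
    return pdf_path, page_number, f"text of {page_ref}"

def join_ocr_pages(tagged_texts):
    # Group the flat results back by source PDF, pages in page order.
    grouped = defaultdict(list)
    for pdf_path, page_number, text in tagged_texts:
        grouped[pdf_path].append((page_number, text))
    return {pdf: [text for _, text in sorted(pages)]
            for pdf, pages in grouped.items()}

pdf_paths = ["a.pdf", "b.pdf"]
nested = [get_pages_from_pdf(p) for p in pdf_paths]  # one list of pages per PDF
flat = list(chain.from_iterable(nested))             # single list of tagged pages
texts = [get_text_from_page(page) for page in flat]
ocr_pages = join_ocr_pages(texts)
```

In a real flow the same tuples would simply be the return values of the mapped tasks, so the reduce step can rebuild the per-PDF grouping regardless of the order in which pages finish.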
Wieger Opmeer
09/23/2021, 2:36 PM
Wieger Opmeer
09/23/2021, 2:37 PM
Kevin Kho
Wieger Opmeer
09/23/2021, 2:47 PM
Wieger Opmeer
09/23/2021, 2:47 PM
Wieger Opmeer
09/23/2021, 2:48 PM
Wieger Opmeer
09/23/2021, 2:48 PM
Wieger Opmeer
09/23/2021, 2:49 PM
Wieger Opmeer
09/23/2021, 2:50 PM
Kevin Kho
Wieger Opmeer
09/23/2021, 2:52 PM