Sylvain Hazard
10/08/2021, 8:54 AM
batch_size = Parameter("batch_size", default=50)
codes = get_codes() # Gets a list of codes from crawling a website (~13 000 codes)
batches = batch_generator(codes, batch_size) # Generates batches in order to parallelize tasks
documents = get_documents.map(batches) # Loads a document for each code in the batch. Runs about 30s to 2 min for a batch of size 50
uploaded_documents = upload_documents(documents) # Uploads the loaded documents to a DB. Runs ~5 min to 20 min for a batch of size 50
parsed_documents = parse_documents.map(uploaded_documents) # Parses the uploaded documents using a home API. Runs about ~1 min for a batch of 50.
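As a rough plain-Python analogue of the flow above (not Prefect code), the sketch below shows the difference between a mapped call, which runs once per batch, and a direct call, which receives the whole list of batch results at once. The function bodies are hypothetical stand-ins; only the task names come from the snippet above.

```python
from typing import List, Sequence


def get_documents(batch: Sequence[str]) -> List[str]:
    # Hypothetical stand-in: "load" one document per code in the batch.
    return [f"doc-{code}" for code in batch]


def upload_documents(documents: List[List[str]]) -> List[List[str]]:
    # Called directly (not mapped), so it sees every batch result in one call.
    return documents


def parse_documents(batch_docs: List[str]) -> List[str]:
    # Hypothetical stand-in for the parsing call.
    return [doc.upper() for doc in batch_docs]


codes = [str(n) for n in range(7)]
batch_size = 3
batches = [codes[i:i + batch_size] for i in range(0, len(codes), batch_size)]

# Analogue of get_documents.map(batches): one call per batch.
documents = [get_documents(batch) for batch in batches]
# Analogue of upload_documents(documents): a single call over all batches.
uploaded_documents = upload_documents(documents)
# Analogue of parse_documents.map(uploaded_documents): one call per batch again.
parsed_documents = [parse_documents(docs) for docs in uploaded_documents]
```

In this analogue the upload step cannot start until every batch has been loaded, because it takes the full list as a single input.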
None input value, which makes them fail.
• The 4 tasks that were running at this time end up being killed by the Zombie Killer after a while.
• As we are using a Dask Executor, DFE allows a portion of downstream parsing tasks to succeed. The rest is divided between failed tasks (receiving None inputs as well) and TriggerFailed tasks.
None input and failed because of it.
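One way to make this failure mode easier to diagnose is to check for a missing input at the top of the downstream step, so that a None coming from a killed upstream run is reported explicitly instead of surfacing as an unrelated error. This is a plain-Python sketch under that assumption; `parse_batch` and `split_results` are hypothetical names and are not part of the flow discussed here.

```python
from typing import List, Optional, Sequence, Tuple


def parse_batch(batch: Optional[Sequence[str]]) -> List[str]:
    # Fail early with an explicit message when the upstream result is missing.
    if batch is None:
        raise ValueError("upstream task produced no result (input is None)")
    # Hypothetical stand-in for the real parsing work.
    return [doc.strip() for doc in batch]


def split_results(
    results: Sequence[Optional[Sequence[str]]],
) -> Tuple[List[Sequence[str]], List[int]]:
    # Separate usable upstream results from the positions that came back as None.
    usable = [r for r in results if r is not None]
    missing = [i for i, r in enumerate(results) if r is None]
    return usable, missing
```

Calling `split_results` on the upstream outputs before parsing gives the indices of the batches that would need to be rerun.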
51/311 parsing tasks succeeded, 225/311 failed and 35/311 TriggerFailed
Kevin Kho
batch_generator?
Sylvain Hazard
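10/08/2021, 2:21 PM
from itertools import zip_longest
from typing import Any, List
from prefect import task

@task(name="batch-generator")
def batch_generator(iterable, batch_size: int, padding_value: Any = None) -> List[Any]:
    # Group the input into tuples of batch_size, padding the last one with padding_value.
    batches = list(zip_longest(*[iter(iterable)] * batch_size, fillvalue=padding_value))
    # Drop None entries, which removes the padding when padding_value is None.
    batches_unpadded = [tuple(b for b in batch if b is not None) for batch in batches]
    logger.info(f"Generating {len(batches_unpadded)} batches with batch_size={batch_size}")  # logger is assumed to be defined elsewhere
    return batches_unpadded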
I did not write this and am not sure why we need to pad the batches, but it probably does not matter. Basically, it takes a long list and returns multiple slices of that list, each of size batch_size (the last one may be shorter).
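For comparison, the same batching can be written with plain slicing, which needs no padding and therefore no filtering step. This is a minimal sketch rather than the code used in the flow; `make_batches` is a hypothetical name, and it assumes the input is a sequence that supports slicing.

```python
from typing import Any, List, Sequence, Tuple


def make_batches(items: Sequence[Any], batch_size: int) -> List[Tuple[Any, ...]]:
    # Reject sizes that would make range() loop forever or raise.
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    # Take consecutive slices; the final slice is simply shorter if needed.
    return [
        tuple(items[start:start + batch_size])
        for start in range(0, len(items), batch_size)
    ]
```

Unlike the padded version, this keeps any None values that are genuinely part of the input, since nothing is filtered out afterwards.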
Kevin Kho
uploaded_documents = upload_documents(documents) # Uploads the loaded documents to a DB. Runs ~5 min to 20 min for a batch of size 50
a mapped operation?
Sylvain Hazard
10/08/2021, 4:26 PM
Kevin Kho
Sylvain Hazard
10/08/2021, 4:33 PM
Sylvain Hazard
10/08/2021, 5:12 PM
Kevin Kho