Hi Everyone I have been facing an issue with `S3Result` main Prefect Community #ask-community

Marwan Sarieddine

05/27/2020, 12:14 AM

Hi Everyone, I have been facing an issue with `S3Result` - mainly once I use `S3Result` my memory usage is doubled - so I took the time to look through the source code to test the memory usage locally and I believe there is an issue in the implementation - more specifically in the following line in `s3_result.py`:

binary_data = new.serialize_to_bytes(new.value)

A new object is being created here, requiring twice the memory allocation. At least this is what seems to me to be happening; please correct me if I am wrong. See the full `write` method below:

def write(self, value: Any, **kwargs: Any) -> Result:
        """
        Writes the result to a location in S3 and returns the resulting URI.

        Args:
            - value (Any): the value to write; will then be stored as the `value` attribute
                of the returned `Result` instance
            - **kwargs (optional): if provided, will be used to format the location template
                to determine the location to write to

        Returns:
            - Result: a new Result instance with the appropriately formatted S3 URI
        """

        new = self.format(**kwargs)
        new.value = value
        self.logger.debug("Starting to upload result to {}...".format(new.location))
        binary_data = new.serialize_to_bytes(new.value)

        stream = io.BytesIO(binary_data)

        ## upload
        from botocore.exceptions import ClientError

        try:
            self.client.upload_fileobj(stream, Bucket=self.bucket, Key=new.location)
        except ClientError as err:
            self.logger.error("Error uploading to S3: {}".format(err))
            raise err

        self.logger.debug("Finished uploading result to {}.".format(new.location))
        return new
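
One way to check the extra allocation locally is sketched below. This is only a rough illustration under stated assumptions: plain `pickle.dumps` stands in for `serialize_to_bytes` (the actual serializer may differ), and `serialized_overhead` is a hypothetical helper name, not part of Prefect.

```python
import pickle
import tracemalloc


def serialized_overhead(value):
    """Return (serialized_size, peak_traced_bytes) for pickling `value`.

    Hypothetical helper: pickle.dumps is an assumed stand-in for
    Result.serialize_to_bytes.
    """
    tracemalloc.start()
    binary_data = pickle.dumps(value)  # a second, full-size bytes object is built here
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(binary_data), peak


# The payload is created before tracing starts, so `peak` only counts
# memory allocated during serialization, on top of the original value.
payload = bytes(10_000_000)  # roughly 10 MB
size, peak = serialized_overhead(payload)
print("serialized size:", size, "peak extra allocation:", peak)
```

If the peak is at least as large as the serialized size, the serialized copy was held in memory alongside the original value, which would be consistent with the doubling described above.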

Chris White

05/27/2020, 12:25 AM

Hi @Marwan Sarieddine - yes you are correct; Prefect must convert the task return value into something that can be stored. Because Prefect imposes very little restriction on the types of data that can be passed around, converting the object to bytes is the most universal approach

Marwan Sarieddine

05/27/2020, 12:26 AM

Hi @Chris White - thank you for the quick response - so I guess the solution here would be for me to implement my own custom Result class, if I am so wary of memory usage?

Chris White

05/27/2020, 12:30 AM

yea that’s one option; we are also starting to explore exposing the serialization protocol to users directly (https://github.com/PrefectHQ/prefect/issues/2639). I’m curious, for your use case, how do you plan to get around making a copy?

Marwan Sarieddine

05/27/2020, 12:33 AM

haven’t explored “elegant” solutions yet - my hunch probably tells me that I would have to delete objects after “using” them - will get back to you on how I end up implementing this

Chris White

05/27/2020, 12:34 AM

gotcha yea I’d be curious! If there are any easy wins that we could implement that would help manage memory better I’m open to it

Avi A

05/27/2020, 7:19 AM

Exposing the serialization would give users great power, that’s cool.

Avi A

05/27/2020, 11:27 AM

@Chris White perhaps if `serialize_to_bytes` returns a byte stream it would not have to allocate the excessive memory?
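
The byte-stream idea could look roughly like the sketch below. It is not Prefect's implementation: `serialize_to_stream` and `upload` are hypothetical names, `pickle` is an assumed stand-in for the real serializer, and `client` is assumed to be a boto3 S3 client. Writing into a spooled temporary file may lower peak memory because the serialized output can spill to disk instead of being held as one `bytes` object, at the cost of disk I/O.

```python
import pickle
import tempfile


def serialize_to_stream(value, max_in_memory=1024 * 1024):
    """Pickle `value` into a file-like object and rewind it for reading.

    Hypothetical helper: data beyond `max_in_memory` bytes is spooled to
    disk rather than kept in RAM.
    """
    stream = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    pickle.dump(value, stream, protocol=pickle.HIGHEST_PROTOCOL)
    stream.seek(0)
    return stream


def upload(client, bucket, key, value):
    """Upload `value` via the stream; `client` is assumed to be a boto3 S3 client."""
    with serialize_to_stream(value) as stream:
        client.upload_fileobj(stream, Bucket=bucket, Key=key)


# Round-trip check that does not need S3 access.
data = {"rows": list(range(1000))}
with serialize_to_stream(data) as stream:
    restored = pickle.load(stream)
```

Whether this actually helps depends on the value being serialized and on how the serializer buffers its output, so it would need measuring on the real workload.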
