S3

dadosfera.services.s3.list_s3_objects

list_s3_objects(bucket_name, prefix, aws_access_key_id=None, aws_secret_access_key=None)

List objects in an S3 bucket with pagination support.

This function lists objects in an AWS S3 bucket under a specified prefix, handling pagination automatically. It filters out zero-byte objects and validates the response status for each page of results.

PARAMETER DESCRIPTION
bucket_name

Name of the S3 bucket. Example: "my-company-data-bucket"

TYPE: str

prefix

Prefix to filter objects in the bucket. Acts like a folder path in the S3 bucket. Example: "data/2024/01/" or "logs/"

TYPE: str

aws_access_key_id

AWS access key ID. If not provided, falls back to default credentials. Defaults to None.

TYPE: Optional[str] DEFAULT: None

aws_secret_access_key

AWS secret access key. If not provided, falls back to default credentials. Defaults to None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
List[Dict[str, Any]]

List of S3 object metadata dictionaries (each entry includes keys such as Key and Size); zero-byte objects are filtered out.

RAISES DESCRIPTION
Exception

When the S3 API returns a non-200 status code.

ClientError

When AWS API calls fail. Common cases:
  • Invalid credentials
  • Insufficient permissions
  • Bucket does not exist
  • Network issues

NoCredentialsError

When no AWS credentials are available and none are provided.
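
A minimal error-handling sketch for the failure modes above (assuming botocore is installed alongside boto3, which provides these exception classes):

from botocore.exceptions import ClientError, NoCredentialsError

try:
    objects = list_s3_objects("my-bucket", "data/2024/")
except NoCredentialsError:
    # No credentials passed in and none found in the default chain
    print("Configure AWS credentials or pass them explicitly")
except ClientError as err:
    # Invalid credentials, missing bucket, insufficient permissions, etc.
    print(f"S3 call failed: {err.response['Error']['Code']}")
except Exception as err:
    # Raised by list_s3_objects itself on a non-200 page status
    print(f"Unexpected response: {err}")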

Examples:

List all objects in a specific prefix:

>>> objects = list_s3_objects('my-bucket', 'data/2024/')
>>> for obj in objects:
...     print(f"Found {obj['Key']} of size {obj['Size']}")

Using explicit credentials:

>>> objects = list_s3_objects(
...     'my-bucket',
...     'logs/',
...     aws_access_key_id='AKIAXXXXXXXXXXXXXXXX',
...     aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
... )
Notes
  • Uses us-east-1 region by default
  • Automatically handles pagination of results
  • Filters out zero-byte objects (typically folder markers)
  • Uses boto3 session for AWS API calls
  • Validates HTTP status code for each page
Performance Considerations
  • For buckets with many objects, this function may make multiple API calls
  • Consider using prefix to narrow down results (see the sketch below)
  • Response time depends on number of objects and network conditions
  • Memory usage scales with number of non-zero-byte objects
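
As an illustration of the prefix guidance above, a minimal sketch; the month-by-month key layout ("data/2024/01/", …) is a hypothetical naming scheme, not something the function requires:

# Hypothetical key scheme: list one month at a time instead of the
# whole bucket, so each call scans a much smaller keyspace.
for month in range(1, 13):
    monthly = list_s3_objects("my-bucket", f"data/2024/{month:02d}/")
    print(f"2024-{month:02d}: {len(monthly)} objects")
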
See Also
  • AWS S3 ListObjects documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html
  • boto3 S3 documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
Source code in dadosfera/services/s3.py
def list_s3_objects(
    bucket_name: str,
    prefix: str,
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None
) -> List[Dict[str, Any]]:
    """List objects in an S3 bucket with pagination support.

    This function lists objects in an AWS S3 bucket under a specified prefix,
    handling pagination automatically. It filters out zero-byte objects and
    validates the response status for each page of results.

    Args:
        bucket_name (str): Name of the S3 bucket.
            Example: "my-company-data-bucket"

        prefix (str): Prefix to filter objects in the bucket.
            Acts like a folder path in the S3 bucket.
            Example: "data/2024/01/" or "logs/"

        aws_access_key_id (Optional[str], optional): AWS access key ID.
            If not provided, falls back to default credentials.
            Defaults to None.

        aws_secret_access_key (Optional[str], optional): AWS secret access key.
            If not provided, falls back to default credentials.
            Defaults to None.

    Returns:
        List[Dict[str, Any]]: List of S3 object metadata dictionaries.

    Raises:
        Exception: When the S3 API returns a non-200 status code.

        botocore.exceptions.ClientError: When AWS API calls fail.
            Common cases:
            - Invalid credentials
            - Insufficient permissions
            - Bucket does not exist
            - Network issues

        botocore.exceptions.NoCredentialsError: When no AWS credentials are available
            and none are provided.

    Examples:
        List all objects in a specific prefix:
        >>> objects = list_s3_objects('my-bucket', 'data/2024/')
        >>> for obj in objects:
        ...     print(f"Found {obj['Key']} of size {obj['Size']}")

        Using explicit credentials:
        >>> objects = list_s3_objects(
        ...     'my-bucket',
        ...     'logs/',
        ...     aws_access_key_id='AKIAXXXXXXXXXXXXXXXX',
        ...     aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
        ... )

    Notes:
        - Uses us-east-1 region by default
        - Automatically handles pagination of results
        - Filters out zero-byte objects (typically folder markers)
        - Uses boto3 session for AWS API calls
        - Validates HTTP status code for each page

    Performance Considerations:
        - For buckets with many objects, this function may make multiple API calls
        - Consider using prefix to narrow down results
        - Response time depends on number of objects and network conditions
        - Memory usage scales with number of non-zero-byte objects

    See Also:
        - AWS S3 ListObjects documentation:
          https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html
        - boto3 S3 documentation:
          https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
    """
    session = boto3.Session()
    client = session.client(
        "s3",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name="us-east-1"
    )

    paginator = client.get_paginator("list_objects")
    page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    s3_objects = []
    for page in page_iterator:
        # Validate that this page of results was returned successfully
        if page["ResponseMetadata"]["HTTPStatusCode"] != 200:
            raise Exception(
                "received a status code different from 200: "
                f"status_code: {page['ResponseMetadata']['HTTPStatusCode']}"
            )
        if "Contents" in page:
            for s3_object in page["Contents"]:
                # Skip zero-byte objects (typically folder markers)
                if s3_object["Size"] > 0:
                    s3_objects.append(s3_object)
    return s3_objects

dadosfera.services.s3.get_objects_from_s3

get_objects_from_s3(bucket_name, prefix)

Retrieve and decode objects from AWS S3 with automatic character encoding detection.

This function retrieves objects from an AWS S3 bucket, automatically detects their character encoding using chardet, and returns their decoded contents. It uses the list_s3_objects function to get object metadata before downloading each object individually.

PARAMETER DESCRIPTION
bucket_name

Name of the S3 bucket to search. Example: "my-company-data-bucket"

TYPE: str

prefix

Prefix (folder path) to filter objects in the bucket. Example: "data/2024/01/" or "logs/"

TYPE: str

RETURNS DESCRIPTION
List[Dict[str, str]]

List of dictionaries containing file information. Each dictionary contains:
  • file_content (str): Decoded content of the file
  • key (str): Full S3 key/path of the object
  • file_name (str): File name extracted from the key, without extension

Example:
[
    {
        'file_content': 'content of file1...',
        'key': 'data/2024/01/file1.txt',
        'file_name': 'file1'
    },
    ...
]

Returns an empty list if no objects are found.

RAISES DESCRIPTION
ClientError

When AWS API calls fail. Common cases:
  • Invalid credentials
  • Insufficient permissions
  • Bucket does not exist
  • Object does not exist
  • Network issues

UnicodeDecodeError

When file content cannot be decoded with the detected encoding.

NoCredentialsError

When no AWS credentials are available.

Example

>>> objects = get_objects_from_s3('my-bucket', 'data/2024/')
>>> for obj in objects:
...     print(f"File {obj['file_name']} content length: {len(obj['file_content'])}")
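
Encoding detection can fail, in which case the function returns the object's raw bytes unchanged; a defensive caller can normalise that case itself. A hedged sketch; the errors='replace' fallback is a caller-side choice, not part of this API:

objects = get_objects_from_s3('my-bucket', 'data/2024/')
for obj in objects:
    content = obj['file_content']
    if isinstance(content, bytes):
        # chardet found no encoding upstream; fall back to UTF-8,
        # replacing undecodable bytes rather than raising
        content = content.decode('utf-8', errors='replace')
    print(f"{obj['file_name']}: {len(content)} characters")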

Notes
  • Uses us-east-1 region by default
  • Uses chardet to detect file encoding
  • Logs operations at INFO and DEBUG levels
  • Requires list_s3_objects function
  • Returns empty list instead of None when no objects found
Dependencies
  • boto3: AWS SDK for Python
  • chardet: Character encoding detection
  • logging: For operation logging
  • list_s3_objects: Custom function for listing S3 objects
See Also
  • AWS S3 GetObject documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html
  • chardet documentation: https://chardet.readthedocs.io/en/latest/usage.html
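
For context on the detection step, a small standalone snippet showing what chardet.detect returns (the exact encoding and confidence reported may vary by input):

import chardet

sample = 'olá, mundo'.encode('latin-1')
result = chardet.detect(sample)
# result is a dict like {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
if result['encoding'] is not None:
    text = sample.decode(result['encoding'])
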
Source code in dadosfera/services/s3.py
def get_objects_from_s3(bucket_name: str, prefix: str) -> List[Dict[str, str]]:
    """Retrieve and decode objects from AWS S3 with automatic character encoding detection.

    This function retrieves objects from an AWS S3 bucket, automatically detects
    their character encoding using chardet, and returns their decoded contents.
    It uses the list_s3_objects function to get object metadata before downloading
    each object individually.

    Args:
        bucket_name (str): Name of the S3 bucket to search.
            Example: "my-company-data-bucket"

        prefix (str): Prefix (folder path) to filter objects in the bucket.
            Example: "data/2024/01/" or "logs/"

    Returns:
        List[Dict[str, str]]: List of dictionaries containing file information.
            Each dictionary contains:
            - file_content (str): Decoded content of the file
            - key (str): Full S3 key/path of the object
            - file_name (str): Extracted file name without extension
            Example: [
                {
                    'file_content': 'content of file1...',
                    'key': 'data/2024/01/file1.txt',
                    'file_name': 'file1'
                },
                ...
            ]
            Returns empty list if no objects are found.

    Raises:
        botocore.exceptions.ClientError: When AWS API calls fail.
            Common cases:
            - Invalid credentials
            - Insufficient permissions
            - Bucket does not exist
            - Object does not exist
            - Network issues

        UnicodeDecodeError: When file content cannot be decoded with the detected encoding.

        botocore.exceptions.NoCredentialsError: When no AWS credentials are available.

    Example:
        >>> objects = get_objects_from_s3('my-bucket', 'data/2024/')
        >>> for obj in objects:
        ...     print(f"File {obj['file_name']} content length: {len(obj['file_content'])}")

    Notes:
        - Uses us-east-1 region by default
        - Uses chardet to detect file encoding
        - Logs operations at INFO and DEBUG levels
        - Requires list_s3_objects function
        - Returns empty list instead of None when no objects found


    Dependencies:
        - boto3: AWS SDK for Python
        - chardet: Character encoding detection
        - logging: For operation logging
        - list_s3_objects: Custom function for listing S3 objects

    See Also:
        - AWS S3 GetObject documentation:
          https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html
        - chardet documentation:
          https://chardet.readthedocs.io/en/latest/usage.html
    """
    session = boto3.Session()
    client = session.client('s3', region_name='us-east-1')

    logger.info(f"Listing objects in bucket {bucket_name} for prefix {prefix}")
    objects_metadata = list_s3_objects(bucket_name=bucket_name, prefix=prefix)
    logger.info(f"Found {len(objects_metadata)} objects")

    objects = []
    for object_metadata in objects_metadata:
        response = client.get_object(Bucket=bucket_name, Key=object_metadata['Key'])
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:
            body = response['Body'].read()

            # Detect encoding from the first 10 KB of the body; if detection
            # fails (encoding is None), the content stays as raw bytes.
            encoding = chardet.detect(body[:10000])['encoding']
            if encoding is not None:
                body = body.decode(encoding)

            logger.debug(f"Detected encoding {encoding} for file {object_metadata['Key']}")
            objects.append({
                'file_content': body,
                'key': object_metadata['Key'],
                'file_name': object_metadata['Key'].split('/')[-1].split('.')[0]
            })

    return objects

dadosfera.services.s3.get_s3_bucket_size

get_s3_bucket_size(bucket_name, prefix='', aws_access_key_id=None, aws_secret_access_key=None)

Calculate the total size, in bytes, of all objects in an S3 bucket or under a prefix.

PARAMETER DESCRIPTION
bucket_name

Name of the S3 bucket to search. Example: "my-company-data-bucket"

TYPE: str

prefix

Prefix (folder path) to filter objects in the bucket. Example: "data/2024/01/" or "logs/"

TYPE: str DEFAULT: ''

aws_access_key_id

AWS access key ID. If not provided, falls back to default credentials. Defaults to None.

TYPE: Optional[str] DEFAULT: None

aws_secret_access_key

AWS secret access key. If not provided, falls back to default credentials. Defaults to None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
total_size

Total size, in bytes, of all objects under the given bucket and prefix.

TYPE: int

RAISES DESCRIPTION
ClientError

When AWS API calls fail. Common cases:
  • Invalid credentials
  • Insufficient permissions
  • Bucket does not exist
  • Object does not exist
  • Network issues

Example

>>> total_size = get_s3_bucket_size('my-bucket', 'data/2024/')
>>> print(f"Total size: {total_size} bytes")

Notes
  • Uses us-east-1 region by default
Dependencies
  • boto3: AWS SDK for Python
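
Since the return value is a raw byte count, a small helper for human-readable output can be useful; a hedged sketch (the binary-unit ladder is a formatting choice, not part of this module):

def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary units."""
    size = float(num_bytes)
    for unit in ('B', 'KiB', 'MiB', 'GiB', 'TiB'):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} PiB"

print(human_readable(get_s3_bucket_size('my-bucket', 'data/')))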
Source code in dadosfera/services/s3.py
def get_s3_bucket_size(
    bucket_name: str,
    prefix: str = "",
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
) -> int:
    """Calculates the size of the S3 bucket or prefix in bytes.


    Args:
        bucket_name (str): Name of the S3 bucket to search.
            Example: "my-company-data-bucket"

        prefix (str): Prefix (folder path) to filter objects in the bucket.
            Example: "data/2024/01/" or "logs/"
        aws_access_key_id (Optional[str], optional): AWS access key ID.
            If not provided, falls back to default credentials.
            Defaults to None.

        aws_secret_access_key (Optional[str], optional): AWS secret access key.
            If not provided, falls back to default credentials.
            Defaults to None.

    Returns:
        total_size (int): Total size, in bytes, of all objects under the
            given bucket and prefix.

    Raises:
        botocore.exceptions.ClientError: When AWS API calls fail.
            Common cases:
            - Invalid credentials
            - Insufficient permissions
            - Bucket does not exist
            - Object does not exist
            - Network issues


    Example:
        >>> total_size = get_s3_bucket_size('my-bucket', 'data/2024/')
        >>> print(f"Total size: {total_size} bytes")

    Notes:
        - Uses us-east-1 region by default

    Dependencies:
        - boto3: AWS SDK for Python

    """
    session = boto3.Session()
    client = session.client(
        "s3",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name="us-east-1"
    )
    paginator = client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)

    total_size = 0
    for page in pages:
        for obj in page.get("Contents", []):
            total_size += obj["Size"]

    return total_size