S3 (Simple Storage Service)#
Amazon S3 is an object storage service that lets you store and retrieve any amount of data from anywhere on the internet. Unlike a file system (which organises data in a folder hierarchy) or a block storage volume (which behaves like a hard disk), S3 stores data as discrete objects — each one a self-contained bundle of data, metadata, and a unique key. S3 exists because applications need a place to persist files that is durable, infinitely scalable, and decoupled from any single server. It underpins a huge slice of AWS architectures: application assets, data lakes, backups, static sites, event pipelines, and more. 🔗
Buckets, Objects, Keys, and Regions#
Every object in S3 lives inside a bucket. A bucket is a named container that belongs to one AWS region. You choose the region at creation time and it never changes, which matters for latency, compliance, and replication. Bucket names must be globally unique across all AWS accounts — if someone already owns my-app-assets, you cannot use it.
Inside a bucket, each object is identified by its key — essentially the object’s full path within the bucket (e.g., images/profile/user-42.jpg). S3 has no real folder hierarchy; the slashes in a key are just characters that the console renders as folders for convenience. The combination of bucket name + key + (optionally) version ID uniquely identifies every object in S3. 🔗
A single object can be up to 5 TB in size. Each object also carries metadata — system metadata (content type, size, ETag) and optional user-defined key-value pairs you attach at upload time.
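As a minimal sketch (assuming boto3; the bucket, key, and metadata values are illustrative), attaching both kinds of metadata at upload time looks like this:

```python
put_kwargs = {
    "Bucket": "my-app-assets",
    "Key": "images/profile/user-42.jpg",
    "ContentType": "image/jpeg",             # system metadata (stored as Content-Type)
    "Metadata": {"uploaded-by": "user-42"},  # user-defined key-value pairs
}
# With real credentials:
# s3 = boto3.client("s3", region_name="eu-west-1")
# s3.put_object(Body=jpeg_bytes, **put_kwargs)
```

User-defined metadata is returned as `x-amz-meta-*` headers on every subsequent GET of the object.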
Storage Classes#
S3 offers several storage classes tuned for different access patterns and cost profiles. Choosing the right class is one of the most cost-effective decisions you can make. 🔗
- S3 Standard — The default. High durability (11 nines), high availability, low latency. Use for frequently accessed data.
- S3 Standard-IA (Infrequent Access) — Same durability and low latency as Standard but cheaper storage cost, with a per-retrieval fee. Use for data accessed less than once a month (e.g., backups, disaster recovery files).
- S3 One Zone-IA — Like Standard-IA but stored in a single AZ. Cheaper still, but you lose resilience to AZ failure. Acceptable for reproducible data (e.g., thumbnail images that can be regenerated).
- S3 Glacier Instant Retrieval — Archive tier with millisecond retrieval. Designed for data accessed roughly once a quarter but needing immediate access when requested (e.g., medical images, media archives).
- S3 Glacier Flexible Retrieval — Archive tier with retrieval times of minutes to hours (Expedited, Standard, Bulk). Lower cost than Glacier Instant. Use when retrieval speed is not time-critical.
- S3 Glacier Deep Archive — The lowest-cost tier. Retrieval takes 12–48 hours. Designed for long-term regulatory retention (e.g., 7-year compliance archives).
- S3 Intelligent-Tiering — Monitors access patterns and automatically moves objects between access tiers (Frequent, Infrequent, Archive Instant, Archive, Deep Archive). Small monthly monitoring fee per object. Ideal when access patterns are unknown or unpredictable.
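The storage class is chosen per object at upload time (or changed later by a lifecycle rule). A sketch, assuming boto3 with illustrative names:

```python
# Valid StorageClass values include STANDARD, STANDARD_IA, ONEZONE_IA,
# GLACIER_IR, GLACIER, DEEP_ARCHIVE, and INTELLIGENT_TIERING.
upload_kwargs = {
    "Bucket": "my-bucket",
    "Key": "backups/db-2025-01-01.dump",
    "StorageClass": "STANDARD_IA",  # cheaper storage, per-retrieval fee
}
# s3.put_object(Body=dump_bytes, **upload_kwargs)
```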
Lifecycle Policies#
Rather than manually managing storage classes, you can define lifecycle policies — rules that S3 evaluates daily to automatically transition or expire objects. 🔗
A typical rule might say: “After 30 days, transition objects under the logs/ prefix to Standard-IA. After 90 days, transition to Glacier Flexible Retrieval. After 365 days, delete them.” This keeps storage costs low without any application-side logic. Transitions can only move objects down the cost hierarchy (you cannot transition from Glacier back to Standard via a lifecycle rule; that requires a restore operation).
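The rule described above maps directly onto a lifecycle configuration. A sketch (assuming boto3; bucket and rule names are illustrative):

```python
# Transition logs/ objects to cheaper classes over time, then expire them.
lifecycle_config = {
    "Rules": [{
        "ID": "age-out-logs",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},  # Glacier Flexible Retrieval
        ],
        "Expiration": {"Days": 365},  # delete after one year
    }],
}
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
```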
Versioning and MFA Delete#
When versioning is enabled on a bucket, S3 retains every version of every object instead of overwriting it. Each upload creates a new version with a unique version ID. A DELETE request on a versioned object does not erase it — it places a delete marker on top, making the object appear deleted while all prior versions remain intact and recoverable. 🔗
Versioning is a prerequisite for S3 Replication and Object Lock.
MFA Delete adds a second layer of protection: in addition to valid AWS credentials, permanently deleting a version or disabling versioning requires a valid MFA token. This protects against both accidental deletion and compromised credentials. MFA Delete can only be enabled by the bucket owner using the root account via the CLI. 🔗
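Versioning and MFA Delete are both set through the same API call. A sketch (assuming boto3 with root credentials; the bucket name and MFA device ARN are hypothetical):

```python
# Configuration for put_bucket_versioning.
versioning_config = {
    "Status": "Enabled",      # start retaining every object version
    "MFADelete": "Enabled",   # root-only, and only via the CLI/API, not the console
}
# When MFADelete is set, the call must include the device ARN and current code:
# s3.put_bucket_versioning(
#     Bucket="my-bucket",
#     VersioningConfiguration=versioning_config,
#     MFA="arn:aws:iam::111122223333:mfa/root-device 123456",
# )
```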
Static Website Hosting#
S3 can serve a bucket’s contents as a static website — HTML, CSS, JavaScript, images — directly over HTTP, without any web server. You enable the feature on the bucket, specify an index document (e.g., index.html) and optionally an error document, and S3 exposes a public endpoint like http://my-bucket.s3-website.eu-west-1.amazonaws.com. 🔗
A common pattern is to front this endpoint with CloudFront for HTTPS, global caching, and a custom domain. For a React or Vue single-page app, this is a completely serverless hosting solution with no EC2 required.
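Enabling the feature amounts to one configuration object. A sketch, assuming boto3 and illustrative document names:

```python
website_config = {
    "IndexDocument": {"Suffix": "index.html"},  # served for directory-style requests
    "ErrorDocument": {"Key": "error.html"},     # served on 4xx errors
}
# s3.put_bucket_website(Bucket="my-bucket", WebsiteConfiguration=website_config)
```

The objects themselves must also be publicly readable (via a bucket policy) for the website endpoint to serve them.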
S3 Replication (CRR and SRR)#
S3 Replication automatically copies objects from a source bucket to a destination bucket asynchronously in the background. Versioning must be enabled on both buckets. 🔗
- Cross-Region Replication (CRR) — Source and destination are in different AWS regions. Used for disaster recovery, latency reduction for geographically distributed users, or regulatory requirements to keep copies in specific regions.
- Same-Region Replication (SRR) — Source and destination are in the same region. Used for log aggregation across accounts, or maintaining a live copy in a separate account for compliance.
Replication copies only new objects written after the rule is created. Existing objects are not replicated automatically (you need S3 Batch Replication for that). Delete markers are not replicated by default, but this can be enabled explicitly.
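A replication rule bundles the source filter, the destination bucket, and an IAM role that S3 assumes to do the copying. A sketch (assuming boto3; the role ARN and bucket names are hypothetical):

```python
# Replicate everything under logs/ to a bucket in another region.
replication_config = {
    "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
    "Rules": [{
        "ID": "replicate-logs",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {"Prefix": "logs/"},
        "Destination": {"Bucket": "arn:aws:s3:::my-bucket-replica"},
        "DeleteMarkerReplication": {"Status": "Disabled"},  # enable explicitly if needed
    }],
}
# Versioning must already be enabled on both buckets:
# s3.put_bucket_replication(
#     Bucket="my-bucket", ReplicationConfiguration=replication_config)
```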
Pre-signed URLs (Downloads and Uploads)#
By default, S3 objects in a private bucket are inaccessible to the public. A pre-signed URL grants time-limited access to a specific object without requiring the requester to have AWS credentials. The URL is signed with the credentials of the IAM identity that generated it and carries an expiry timestamp — anyone holding the URL can use it, with that identity's permissions, until it expires. 🔗
Pre-signed URLs are most commonly associated with downloads (GET), but they also support uploads (PUT). This is a frequent exam trap: you can generate a pre-signed URL that lets an end user upload a file directly to S3 from their browser, bypassing your backend server entirely. The backend generates the URL, hands it to the client, and the client uploads directly to S3. This offloads upload bandwidth and processing from your application servers.
# Generate a pre-signed URL for a PUT (upload) — valid for 1 hour
import boto3

s3 = boto3.client('s3', region_name='eu-west-1')
url = s3.generate_presigned_url(
    ClientMethod='put_object',
    Params={'Bucket': 'my-bucket', 'Key': 'uploads/user-file.pdf'},
    ExpiresIn=3600
)
S3 Access Points#
Access Points are named network endpoints attached to a bucket, each with its own access policy. Instead of a single, increasingly complex bucket policy managing access for many teams or applications, you create one Access Point per use case (e.g., one for the analytics team, one for the ingest pipeline) and attach a focused policy to each. 🔗
Access Points can also be scoped to a VPC, ensuring that only traffic originating from within that VPC can use the endpoint — never traversing the public internet.
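Once created, an Access Point is addressed by its ARN (or alias) anywhere a bucket name is expected. A sketch with a hypothetical account ID and names:

```python
# Requests routed through the Access Point are evaluated against its own
# policy, in addition to the underlying bucket's policy.
access_point_arn = "arn:aws:s3:eu-west-1:111122223333:accesspoint/analytics-ap"
# s3.get_object(Bucket=access_point_arn, Key="reports/q1.csv")
```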
Event Notifications#
S3 can emit event notifications when things happen to objects in a bucket: uploads (s3:ObjectCreated), deletions (s3:ObjectRemoved), restore completions from Glacier, and more. These notifications can be delivered to three direct destinations (and, separately, to Amazon EventBridge for more advanced filtering and routing): 🔗
- Lambda — Process the object immediately (e.g., resize an uploaded image, parse a CSV, index content).
- SQS — Queue the event for reliable downstream processing.
- SNS — Fan out the event to multiple subscribers simultaneously.
A typical real-world pattern: a user uploads a video to S3 → S3 sends an ObjectCreated notification to an SQS queue → a Lambda function polls the queue and kicks off a MediaConvert transcoding job.
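The first hop of that pattern is a notification configuration on the bucket. A sketch (assuming boto3; the queue ARN is hypothetical):

```python
# Send an event to SQS whenever an .mp4 object is created.
notification_config = {
    "QueueConfigurations": [{
        "QueueArn": "arn:aws:sqs:eu-west-1:111122223333:video-uploads",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {"Key": {"FilterRules": [
            {"Name": "suffix", "Value": ".mp4"},  # only fire for video files
        ]}},
    }],
}
# The queue's access policy must allow S3 to send messages before this succeeds:
# s3.put_bucket_notification_configuration(
#     Bucket="my-bucket", NotificationConfiguration=notification_config)
```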
Object Lock and Glacier Vault Lock#
Object Lock enforces a Write Once, Read Many (WORM) model — objects cannot be overwritten or deleted for a defined retention period. This is used for regulatory compliance (financial records, audit logs) where data immutability must be provable. 🔗
Object Lock operates in two modes:
- Governance mode — Users with special IAM permissions can override or remove the lock. Useful for testing and internal controls.
- Compliance mode — Nobody — not even the root account — can delete or alter the object during the retention period. Strongest protection.
You can also set a Legal Hold on individual objects independently of any retention period, preventing deletion until the hold is explicitly removed.
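Retention and legal holds are applied per object. A sketch (assuming boto3; bucket, key, and the retention date are illustrative):

```python
from datetime import datetime, timezone

# In COMPLIANCE mode, nobody (including root) can delete before this date.
retention = {
    "Mode": "COMPLIANCE",  # or "GOVERNANCE"
    "RetainUntilDate": datetime(2032, 1, 1, tzinfo=timezone.utc),
}
# s3.put_object_retention(Bucket="my-bucket", Key="audit/2025.log",
#                         Retention=retention)

# A legal hold is independent of retention and stays until explicitly removed:
# s3.put_object_legal_hold(Bucket="my-bucket", Key="audit/2025.log",
#                          LegalHold={"Status": "ON"})
```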
Glacier Vault Lock is a similar concept but applies to Glacier Vaults specifically. Once a policy is locked, it becomes immutable and cannot be changed. 🔗
Encryption#
S3 offers multiple encryption options for data at rest. Understanding which key-management model each uses is important for the exam. 🔗
- SSE-S3 (Server-Side Encryption with S3-Managed Keys) — S3 handles everything. Keys are managed by AWS entirely behind the scenes. No configuration required; enabled by default on all new buckets. Header: x-amz-server-side-encryption: AES256.
- SSE-KMS (Server-Side Encryption with AWS KMS) — Objects are encrypted using a KMS Customer Master Key (CMK). You get an audit trail in CloudTrail (every decryption is logged), and you control key rotation and access policies. Header: x-amz-server-side-encryption: aws:kms. Important caveat: heavy S3 traffic with SSE-KMS can hit KMS API rate limits, since every GET calls KMS to decrypt.
- DSSE-KMS (Dual-layer Server-Side Encryption with KMS) — Two independent layers of KMS encryption. Designed for workloads with strict regulatory requirements. 🔗
- SSE-C (Server-Side Encryption with Customer-Provided Keys) — You supply the encryption key with every request; S3 performs the encryption but does not store your key. Requires HTTPS. Useful when you need full key custody.
- Client-Side Encryption — You encrypt the data before sending it to S3. S3 stores ciphertext and has no knowledge of your keys or plaintext. Maximum control, maximum responsibility.
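Server-side encryption is requested per upload via headers, which boto3 exposes as parameters. A sketch (assuming boto3; the bucket, key, and KMS key ID are hypothetical):

```python
# Upload parameters for an SSE-KMS-encrypted object.
sse_kwargs = {
    "Bucket": "my-bucket",
    "Key": "reports/payroll.csv",
    "ServerSideEncryption": "aws:kms",  # maps to x-amz-server-side-encryption
    "SSEKMSKeyId": "1234abcd-12ab-34cd-56ef-1234567890ab",  # omit for the AWS-managed key
}
# s3.put_object(Body=csv_bytes, **sse_kwargs)
```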
For data in transit, S3 supports HTTPS (TLS) and you can enforce it via a bucket policy that denies requests where aws:SecureTransport is false.
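The enforce-TLS policy mentioned above is a standard deny statement. A sketch (the bucket name is illustrative):

```python
import json

# Deny any request that arrives over plain HTTP.
enforce_tls_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::my-bucket",     # bucket-level actions
            "arn:aws:s3:::my-bucket/*",   # object-level actions
        ],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
# s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(enforce_tls_policy))
```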
Bucket Policies vs ACLs#
Access to S3 resources is controlled primarily through bucket policies — JSON-based IAM-style policies attached directly to the bucket. They can grant or deny access to principals (users, roles, other accounts, anonymous users) and can reference conditions like IP address, VPC, or whether MFA was used. 🔗
Access Control Lists (ACLs) are a legacy mechanism that predates bucket policies. They are coarser-grained (canned ACLs like public-read, private) and are now generally discouraged. AWS recommends disabling ACLs (using Bucket Owner Enforced mode) and relying exclusively on bucket policies and IAM policies. The one context where ACLs still see use is granting access to a specific AWS account that you don’t control — though cross-account bucket policies cover this more cleanly.
A simple rule for the exam: prefer bucket policies for bucket-level access control; use IAM policies for controlling what your own users and roles can do.
CORS Configuration#
CORS (Cross-Origin Resource Sharing) becomes relevant when a web application hosted on one domain (e.g., app.example.com) makes a JavaScript request directly to an S3 bucket on a different domain (e.g., mybucket.s3.amazonaws.com). Browsers block these cross-origin requests by default. 🔗
You configure CORS on the S3 bucket by specifying which origins are allowed, which HTTP methods are permitted, and which headers are exposed. This is required for browser-based uploads via pre-signed URLs or any JavaScript that fetches assets from a different S3 bucket origin than the page itself.
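A CORS configuration covering the pre-signed-upload case above might look like this (a sketch, assuming boto3; the origin is illustrative):

```python
cors_config = {
    "CORSRules": [{
        "AllowedOrigins": ["https://app.example.com"],  # the page's origin
        "AllowedMethods": ["GET", "PUT"],               # PUT for pre-signed uploads
        "AllowedHeaders": ["*"],
        "ExposeHeaders": ["ETag"],    # let browser JS read the ETag response header
        "MaxAgeSeconds": 3000,        # cache the preflight response
    }],
}
# s3.put_bucket_cors(Bucket="my-bucket", CORSConfiguration=cors_config)
```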
S3 Select and Glacier Select#
Normally, to find a record in a large CSV or JSON file stored in S3, you’d download the entire object and filter it in your application. S3 Select lets you push a SQL-like query directly to S3, which filters and returns only the relevant subset of data — dramatically reducing transfer costs and latency. 🔗
Glacier Select provides the same capability for objects stored in Glacier, running queries against archived data without a full restore.
Both work on CSV, JSON, and Parquet formats, optionally compressed with GZIP or BZIP2. For large-scale analytics you’d use Athena, but S3 Select is lighter-weight and requires no infrastructure setup.
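As a sketch of how an S3 Select call is consumed (the client is passed in and all names and the query are illustrative), the response arrives as an event stream whose Records events carry the matching bytes:

```python
def select_csv(s3, bucket: str, key: str, expression: str) -> bytes:
    """Run a SQL expression against a CSV object; return the matching rows."""
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=expression,   # e.g. "SELECT * FROM s3object s WHERE s.country = 'IE'"
        ExpressionType="SQL",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    # Concatenate only the Records payloads; Stats/End events carry no data.
    return b"".join(
        event["Records"]["Payload"]
        for event in resp["Payload"]
        if "Records" in event
    )
```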
Performance#
S3 can handle very high request rates, but knowing the right tools to push it further matters at scale. 🔗
Multipart Upload splits a large object into parts that are uploaded independently and in parallel, then S3 assembles them. AWS recommends multipart upload for objects over 100 MB and requires it for objects over 5 GB. It also provides resilience: if one part fails, only that part needs to be retried. 🔗
Transfer Acceleration routes uploads through AWS’s globally distributed CloudFront edge locations using the optimised AWS backbone network instead of the public internet. A user in Tokyo uploading to a bucket in us-east-1 will see significantly faster throughput with Transfer Acceleration enabled. There is an additional per-GB cost. 🔗
Byte-Range Fetches allow you to download specific byte ranges of an object in parallel — analogous to multipart upload but for downloads. You split the target object into ranges, issue concurrent GetObject requests for each range, and reconstruct the file client-side. This maximises throughput for large downloads and also lets you retrieve just the header of a file (e.g., the first few bytes of a Parquet file) without fetching the whole object.
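The range-splitting step above is plain arithmetic over the object size; each computed range then becomes the Range parameter of one GetObject call. A sketch (the bucket and key in the comment are illustrative):

```python
def byte_ranges(size: int, chunk: int) -> list[str]:
    """Split an object of `size` bytes into HTTP Range header values."""
    return [
        f"bytes={start}-{min(start + chunk, size) - 1}"
        for start in range(0, size, chunk)
    ]

# Each range is fetched concurrently, then the parts are concatenated in order:
# part = s3.get_object(Bucket="my-bucket", Key="big.parquet", Range=rng)["Body"].read()
```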