November 6, 2022
Almost everyone who’s used Amazon Web Services (AWS) has used Amazon Simple Storage Service (S3). In the years since its release in 2006, S3 has become essential to thousands of companies for file storage. While using S3 in simple ways is easy, at a larger scale it involves a lot of subtleties and potentially costly mistakes, especially when your data or team are scaling up.
Here are the most important things about AWS S3 that will help you avoid costly mistakes. We’ve assembled these tips and best practices to help your team make the most of your cloud storage. While these tips are focused on performance, optimization and cost savings, we have further reading if you’re looking for the top six Amazon S3 metrics to monitor.
Getting data into and out of AWS S3 takes time. If you’re moving data on a frequent basis, there’s a good chance you can speed it up. Cutting down the time you spend uploading and downloading files can be remarkably valuable in indirect ways — for example, if your team saves 10 minutes every time you deploy a staging build, you are improving engineering productivity significantly.
S3 is highly scalable, so in principle, with a big enough pipe or enough instances, you can get arbitrarily high throughput. A good example is S3DistCp, which uses many workers and instances. But almost always you’re hit with one of two bottlenecks: the bandwidth of the connection between your source and S3, or the level of concurrency used for your requests.
Improve S3 latency by paying attention to regions and connectivity
The first takeaway from this is that regions and connectivity matter: each region may have different latencies, for example. If you’re moving data within AWS, such as from an EBS volume on an EC2 instance into a bucket, you’re better off if your EC2 instance and S3 bucket are in the same region.
If your servers are in a major data center but not in Amazon EC2, you might consider using AWS Direct Connect ports to get significantly higher bandwidth (you pay per port). Alternatively, you can use S3 Transfer Acceleration to get data into AWS faster simply by changing your API endpoints. You have to pay for that too, the equivalent of 1-2 months of storage cost for the data transfer in either direction. For distributing content quickly to users worldwide, remember you can use Amazon CloudFront or another CDN with S3 as its origin.
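As a rough illustration of what "changing your API endpoints" means, the sketch below builds the regular and the accelerated hostname for a bucket. The function name and the example bucket are hypothetical, and the bucket-name check is a simplified approximation of the rule that accelerated buckets must have DNS-compliant names without periods. Acceleration also has to be enabled on the bucket itself before the accelerated endpoint will work.

```python
import re

# Simplified check: lowercase letters, digits, and hyphens only (no periods),
# 3 to 63 characters, starting and ending with a letter or digit.
_ACCEL_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def s3_endpoint(bucket: str, region: str, accelerate: bool = False) -> str:
    """Return the virtual-hosted-style HTTPS endpoint for a bucket."""
    if accelerate:
        if not _ACCEL_BUCKET_RE.match(bucket):
            raise ValueError(f"bucket {bucket!r} cannot use Transfer Acceleration")
        # The accelerated endpoint is not tied to a region name.
        return f"https://{bucket}.s3-accelerate.amazonaws.com"
    return f"https://{bucket}.s3.{region}.amazonaws.com"


if __name__ == "__main__":
    print(s3_endpoint("example-builds", "us-west-2"))
    print(s3_endpoint("example-builds", "us-west-2", accelerate=True))
```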
Improve S3 performance by using higher bandwidth networks
Secondly, instance types matter. If you’re using EC2 servers, some instance types have higher bandwidth network connectivity than others. You can see this if you sort by “Network Performance” on the excellent ec2instances.info list.
Use concurrency to improve AWS S3 latency and performance
Thirdly, and critically if you are dealing with lots of items, concurrency matters. Each S3 operation is an API request with significant latency — tens to hundreds of milliseconds, which adds up to pretty much forever if you have millions of objects and try to work with them one at a time. So what determines your overall throughput in moving many objects is the concurrency level of the transfer: How many worker threads (connections) on one instance and how many instances are used.
Many common AWS S3 libraries (including the widely used s3cmd) do not by default make many connections at once to transfer data. Both s4cmd and AWS’s own CLI do make concurrent connections and are much faster for many files or for large transfers, since multipart uploads allow a single large object to be sent as parallel parts.
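If you are writing your own transfer code, the same idea can be sketched with a thread pool. This is a minimal sketch, not how any of the tools above are implemented: transfer_all and its parameters are hypothetical names, and the per-object work is passed in as a callable so the pattern is independent of any particular S3 library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def transfer_all(
    keys: Iterable[str],
    transfer_one: Callable[[str], None],
    max_workers: int = 32,
) -> Dict[str, Optional[BaseException]]:
    """Run transfer_one(key) for every key on a pool of worker threads.

    Returns a mapping of key -> None on success, or the exception raised.
    """
    results: Dict[str, Optional[BaseException]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; each worker thread handles one request at a time.
        futures = {pool.submit(transfer_one, key): key for key in keys}
        for future in as_completed(futures):
            # exception() returns None when the call succeeded.
            results[futures[future]] = future.exception()
    return results
```

With boto3, for example, transfer_one could wrap a call such as client.upload_file(local_path, bucket, key). Threads suit this workload because each request spends most of its time waiting on the network, so the right worker count is something to measure rather than assume.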
Another approach is to use EMR, with Hadoop parallelizing the work across a cluster. For multipart syncs or uploads on a higher-bandwidth network, a reasonable part size is 25–50MB. You can also list objects much faster if you traverse a folder hierarchy or other prefix hierarchy in parallel.
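To see how a part size in that range plays out, here is a small planning sketch. It relies on S3's documented multipart limits (at most 10,000 parts, and at least 5 MiB for every part except the last); plan_parts and its 32 MiB default are hypothetical choices for illustration.

```python
import math

MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # S3 minimum for every part except the last
MAX_PARTS = 10_000       # S3 maximum number of parts in one upload


def plan_parts(object_size: int, preferred_part_size: int = 32 * MIB):
    """Return (part_size, part_count) in bytes for a multipart upload."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size if the preferred size would need too many parts.
    part_size = max(part_size, math.ceil(object_size / MAX_PARTS))
    part_count = math.ceil(object_size / part_size)
    return part_size, part_count


if __name__ == "__main__":
    size, count = plan_parts(1024 * MIB)
    print(f"1 GiB object: {count} parts of {size // MIB} MiB")
```

A 1 GiB object at 32 MiB per part gives 32 parts that can be uploaded in parallel. Only very large objects (hundreds of gigabytes and up) force a larger part size than the preferred one.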
Finally, if you really have a ton of data to move in batches, just ship it: AWS’s Snow family of devices lets you physically ship storage to and from AWS.
Before you put something in AWS S3, there are several things to think about. One of the most important is a simple question:
When and how should this object be deleted?
Remember, large data will probably expire — that is, the cost of paying Amazon to store it in its current form will become higher than the expected value it offers your business. You might re-process or aggregate data from long ago, but it’s unlikely you want raw unprocessed logs or builds or archives forever.
At the time you are saving a piece of data, it may seem like you can just decide later. Most files are put in S3 by a regular process via a server, a data pipeline, a script, or even repeated human processes — but you’ve got to think through what’s going to happen to that data over time.
In our experience, most AWS S3 users don’t consider lifecycle up front, which means mixing files that have short lifecycles together with ones that have longer ones. By doing this you incur significant technical debt around data organization (or increasing costs to Amazon).
Once you know the answers, you’ll find managed lifecycles and AWS S3 object tagging are your friends. In particular, you want to delete or archive based on object tags, so it’s wise to tag your objects appropriately so that it is easier to apply lifecycle policies. It is important to mention that S3 allows a maximum of 10 tags per object, with tag keys of up to 128 Unicode characters and tag values of up to 256 Unicode characters.
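Because tags are usually attached by scripts and pipelines, it can help to check them against these limits before uploading. The sketch below assumes the documented limits of 10 tags per object, 128 Unicode characters per key, and 256 per value. The function name is hypothetical, and Python's len() is used as an approximation of how S3 counts characters.

```python
MAX_TAGS_PER_OBJECT = 10
MAX_KEY_LENGTH = 128
MAX_VALUE_LENGTH = 256


def validate_object_tags(tags: dict) -> list:
    """Return a list of problems; an empty list means the tags fit the limits."""
    problems = []
    if len(tags) > MAX_TAGS_PER_OBJECT:
        problems.append(f"{len(tags)} tags exceeds the limit of {MAX_TAGS_PER_OBJECT}")
    for key, value in tags.items():
        if not 1 <= len(key) <= MAX_KEY_LENGTH:
            problems.append(f"tag key {key[:20]!r} must be 1-{MAX_KEY_LENGTH} characters")
        if len(value) > MAX_VALUE_LENGTH:
            problems.append(f"value for tag {key[:20]!r} exceeds {MAX_VALUE_LENGTH} characters")
    return problems


if __name__ == "__main__":
    # Hypothetical tags describing lifecycle and sensitivity.
    print(validate_object_tags({"lifecycle": "raw", "sensitivity": "internal"}))
```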
You’ll also want to consider compression schemes. Large data that isn’t already compressed should almost certainly be compressed, since S3 bandwidth and cost constraints generally make it worthwhile. (Also consider what tools will read it. EMR supports specific formats like gzip, bzip2, and LZO, so it helps to pick a compatible convention.)
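As a small sketch of one such convention, the helper below gzips a payload and appends ".gz" to the key so downstream tools can recognize the format. The function name and the key layout are hypothetical; the actual upload call is left out.

```python
import gzip


def compress_for_upload(key: str, data: bytes, level: int = 6):
    """Gzip data and return (new_key, compressed_bytes) ready to upload."""
    compressed = gzip.compress(data, compresslevel=level)
    return f"{key}.gz", compressed


if __name__ == "__main__":
    # Repetitive text such as logs usually compresses very well.
    raw = b"GET /index.html 200\n" * 1000
    new_key, body = compress_for_upload("logs/2022/11/06/access.log", raw)
    print(new_key, len(raw), "->", len(body), "bytes")
```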
When and how is the AWS S3 object modified?
As with many engineering problems, prefer immutability when possible — design so objects are never modified, but only created and later deleted. However, sometimes mutability is necessary. If S3 is your sole copy of mutable log data, you should consider some sort of backup — or locate the data in an AWS S3 bucket with versioning enabled.
If all this seems like it’s a headache and hard to document, it’s a good sign no one on the team understands it. By the time you scale to terabytes or petabytes of data and dozens of engineers, it’ll be more painful to sort out.
This best practice is possibly the most important one here. Before you put something into Amazon S3, ask yourself the following questions:
Are there people who should not be able to modify this data?
Are there people who should not be able to read this data?
How are the latter access rules likely to change in the future?
Should the data be encrypted? (And if so, where and how will we manage the encryption keys?)
Are there specific compliance requirements?
Some data is completely non-sensitive and can be shared with any employee. For these scenarios the answers are easy: Just put it into S3 without any special encryption or access policies. However, every business has sensitive data; it’s just a matter of which data, and how sensitive it is. Determine whether the answers to any of these questions are “yes.”
The compliance question can also be confusing. Ask yourself the following:
Does the data you’re storing contain financial, PII, cardholder, or patient information?
Do you have PCI, HIPAA, SOX, GDPR or EU Safe Harbor compliance requirements?
Do you have customer data with restrictive agreements in place? For example, are you promising customers that their data is encrypted at rest and in transit?
Minimally, you’ll probably want to store data with different needs in separate S3 buckets, regions, and/or AWS accounts, and set up documented processes around encryption and access control for that data.
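One piece of such a documented process might be a bucket policy that refuses any request not made over TLS, which supports an "encrypted in transit" promise. The sketch below generates that policy as JSON using the standard aws:SecureTransport condition key; the function name and bucket name are hypothetical, and a real policy would usually carry additional statements for access control.

```python
import json


def tls_only_bucket_policy(bucket: str) -> str:
    """Return a bucket policy (JSON text) that denies non-HTTPS requests."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(tls_only_bucket_policy("example-sensitive-data"))
```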
It’s not fun digging through all this when all you want to do is save a little bit of data, but trust us, it’ll save in the long run to think about it early.
Newcomers to S3 are often surprised to learn that request throughput on S3 depends on key names, because S3 partitions and scales by key prefix. Older guidance warned that similar prefixes became a bottleneck at more than about 100 requests per second; since 2018, AWS documents support for at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix. If you need higher volumes than that, it is still worth considering naming schemes that spread keys across many prefixes, like alphanumeric or hex hash codes in the first 6 to 8 characters, to avoid internal hot spots within S3 infrastructure.
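A hashed naming scheme of the kind described here can be sketched in a few lines. This is one possible convention, not an S3 feature: the function names are hypothetical, and SHA-256 is an arbitrary choice of hash (any well-distributed hash would do).

```python
import hashlib


def hashed_key(natural_key: str, prefix_len: int = 8) -> str:
    """Prepend a short hex hash so keys spread evenly across prefixes."""
    if not 1 <= prefix_len <= 64:
        raise ValueError("prefix_len must be between 1 and 64")
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


def natural_key(stored_key: str) -> str:
    """Recover the original key from one produced by hashed_key."""
    return stored_key.split("/", 1)[1]


if __name__ == "__main__":
    print(hashed_key("logs/2022/11/06/app-server-1.log"))
```

The prefix is derived from the key itself, so any client can recompute where an object lives without a lookup table. The cost is that keys no longer sort or group by their natural hierarchy.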
If you’ve thought through your lifecycles, you probably want to tag objects so you can automatically delete or transition objects based on tags, for example setting a policy like “archive everything with object tag raw to Amazon S3 Glacier after three months.”
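A policy like that one can be written as a lifecycle rule with a tag filter. The sketch below builds the rule as a plain dictionary in the shape the S3 lifecycle API expects. The function name and the tag key "datatype" are hypothetical, and "three months" is approximated as 90 days.

```python
def archive_by_tag_rule(tag_key: str, tag_value: str, days: int = 90) -> dict:
    """Build a lifecycle rule that moves tagged objects to Glacier after `days`."""
    if days < 0:
        raise ValueError("days must not be negative")
    return {
        "ID": f"archive-{tag_key}-{tag_value}",
        "Status": "Enabled",
        # Only objects carrying this exact tag are affected.
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }


if __name__ == "__main__":
    lifecycle_configuration = {"Rules": [archive_by_tag_rule("datatype", "raw")]}
    print(lifecycle_configuration)
```

With boto3, a dictionary of this shape could be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration. Applying a configuration replaces the bucket's existing one, so all rules for a bucket need to be sent together.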
There’s no magic bullet here, other than to decide upfront which you care about more for each type of data: Easy-to-manage policies or high-volume random-access operations?
A related consideration for how you organize your data is that it’s extremely slow to crawl through millions of objects without parallelism. Say you want to tally up your usage on an S3 bucket with ten million objects. Well, if you don’t have any idea of the structure of the data, good luck! If you have reasonable tagging, or if you have uniformly distributed hashes with a known alphabet, it’s also possible to parallelize.
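When keys start with a uniformly distributed hex hash, the tally can be split into one listing job per leading character. The sketch below shows that pattern under stated assumptions: tally_bucket is a hypothetical name, and list_prefix stands in for whatever function lists (key, size) pairs under a prefix in your environment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Tuple

HEX_ALPHABET = "0123456789abcdef"


def tally_bucket(
    list_prefix: Callable[[str], Iterable[Tuple[str, int]]],
    alphabet: str = HEX_ALPHABET,
    max_workers: int = 16,
) -> Tuple[int, int]:
    """Return (object_count, total_bytes), listing one prefix per worker."""

    def tally_one(prefix: str) -> Tuple[int, int]:
        count = size = 0
        for _key, nbytes in list_prefix(prefix):
            count += 1
            size += nbytes
        return count, size

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each character of the alphabet becomes an independent listing job.
        partials = list(pool.map(tally_one, alphabet))
    return sum(c for c, _ in partials), sum(s for _, s in partials)
```

With boto3, list_prefix could wrap the list_objects_v2 paginator with Prefix set to the given character and yield each object's Key and Size. Using two-character prefixes instead gives 256 jobs for finer-grained parallelism.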
S3’s standard storage class offers very high durability (it advertises 99.999999999% durability, or “eleven 9s”), high availability, low latency access, and relatively cheap access cost.
There are three ways you can store data with lower cost per gigabyte:
S3’s Reduced Redundancy Storage (RRS) has lower durability (99.99%, so just four nines). That is, you should expect to lose a small fraction of your objects over time. For some datasets where data has value in a statistical way (losing say half a percent of your objects isn’t a big deal), this is a reasonable trade-off. Note that AWS now recommends against RRS, since S3 Standard has become more cost-effective.
S3’s Infrequent Access (IA) storage class (also called S3 Standard-IA) lets you get cheaper storage in exchange for more expensive access. This is great for archives like logs you already processed but might want to look at later.
S3 Glacier Deep Archive gives you much cheaper storage with much slower and more expensive access. It is intended for deep archive usage.
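To put the durability figures in perspective, the arithmetic below treats a durability of N nines as an expected annual loss of one object in every 10^N. This is a simplification of how the design targets are stated, and the function name is hypothetical.

```python
def expected_annual_loss(object_count: int, nines: int) -> float:
    """Expected objects lost per year at a durability of `nines` nines."""
    # e.g. 4 nines = 99.99% durability = one object in 10**4 lost per year.
    return object_count / 10 ** nines


if __name__ == "__main__":
    objects = 10_000_000
    print("four nines:  ", expected_annual_loss(objects, 4), "objects per year")
    print("eleven nines:", expected_annual_loss(objects, 11), "objects per year")
```

For ten million objects, four nines works out to about a thousand objects lost per year, while eleven nines works out to roughly one object every ten thousand years.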
A common policy that saves money is to set up managed lifecycles that migrate Standard storage to IA and then from IA to Glacier.
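A tiered policy like this might look as follows when expressed as a lifecycle configuration. This is a sketch under assumptions: the function name, the prefix, and the default day counts are hypothetical. The check that objects stay in Standard for at least 30 days before moving to IA reflects a documented S3 constraint, while the 30-day gap before Glacier is a conservative choice made for this sketch.

```python
from typing import Optional


def tiered_lifecycle_config(
    prefix: str,
    ia_after_days: int = 30,
    glacier_after_days: int = 90,
    expire_after_days: Optional[int] = None,
) -> dict:
    """Build a Standard -> Standard-IA -> Glacier lifecycle configuration."""
    if ia_after_days < 30:
        raise ValueError("objects must stay in Standard at least 30 days before IA")
    if glacier_after_days < ia_after_days + 30:
        raise ValueError("keep objects in IA at least 30 days before Glacier")
    rule = {
        "ID": f"tiering-{prefix.strip('/') or 'all'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        if expire_after_days <= glacier_after_days:
            raise ValueError("expiration must come after the last transition")
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


if __name__ == "__main__":
    print(tiered_lifecycle_config("logs/", expire_after_days=365))
```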
One of the most common oversights is to organize data in a way that causes business risks or costs later. You might initially assume data should be stored according to the type of data, the product, or by team, but often that’s not enough.
It’s usually best to organize your data into different buckets and paths at the highest level not on what the data is itself, but rather by considering these axes:
Sensitivity: Who can and cannot access it? (E.g. is it helpful for all engineers or only a few admins?)
Compliance: What are necessary controls and processes? (E.g. is it PII?)
Lifecycle: How will it be expired or archived? (E.g. is it verbose logs only needed for a month, or important financial data?)
Realm: Is it for internal or external use? For development, testing, staging, production?
Visibility: Do I need to track usage for this category of data exactly?
We’ve already discussed the first three. The concept of a realm is just that you often want to partition things in terms of process: For example, to make sure no one puts test data into a production location. It’s best to assign buckets and prefixes by realm up front.
The final point is a technical one: If you want to track usage, AWS offers easy usage reporting at the bucket level. If you put millions of objects in one S3 bucket, tallying usage by prefix or other means can be cumbersome at best. Consider using individual buckets wherever you want to track significant S3 usage, or use a log analytics solution like Sumo Logic to analyze your S3 logs.
This is pretty simple, but it comes up a lot. Don’t hard-code S3 locations in your code. This is tying your code to deployment details, which is almost guaranteed to hurt you later. You might want to deploy multiple production or staging environments. Or you might want to migrate all of one kind of data to a new location, or audit which pieces of code access certain data.
Decouple code and S3 locations. This will also help with test releases, or unit or integration tests so they use different buckets, paths, or mocked S3 services. Set up some sort of configuration file or service, and read S3 locations like buckets and prefixes from that.
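One way to set this up is sketched below: locations live in a JSON document keyed by environment, and code asks for them by logical name. Everything here is an assumption for illustration, including the S3Location class, the config layout, and the bucket names. The optional endpoint_url field is where a test environment could point at a mock or S3-compatible server.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class S3Location:
    bucket: str
    prefix: str = ""
    endpoint_url: Optional[str] = None  # set for mocks or S3-compatible stores

    def key_for(self, name: str) -> str:
        """Return the full object key for a file name under this location."""
        return f"{self.prefix.rstrip('/')}/{name}" if self.prefix else name


def load_locations(config_text: str, environment: str) -> Dict[str, S3Location]:
    """Parse the JSON config and return the locations for one environment."""
    config = json.loads(config_text)
    return {name: S3Location(**fields) for name, fields in config[environment].items()}


if __name__ == "__main__":
    CONFIG = """
    {
      "production": {"builds": {"bucket": "example-prod-builds", "prefix": "builds/"}},
      "test": {"builds": {"bucket": "example-test", "endpoint_url": "http://localhost:9000"}}
    }
    """
    builds = load_locations(CONFIG, "production")["builds"]
    print(builds.bucket, builds.key_for("app-1.2.3.tar.gz"))
```

Application code then only refers to the logical name "builds", so moving data to a new bucket or running against a mock becomes a configuration change rather than a code change.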
There are many services that are (more or less) compatible with S3 APIs. This is helpful both for testing and migration to local storage. Commonly used tools for small test deployments are S3Proxy (Java) and s3mock (Java), which can make it far easier and faster to test S3-dependent code in isolation. More full-featured object storage servers with S3 compatibility include MinIO (in Go), Ceph (C++/Terra), and Riak CS (Erlang).
Many large enterprises have private cloud needs and deploy AWS-compatible cloud components, including layers corresponding to AWS S3, in their own private clouds, using Eucalyptus and OpenStack. These are not quick and easy to set up but are mature open-source private cloud systems.
One tool that’s been around a long time is s3fs, the FUSE filesystem that lets you mount S3 as a regular filesystem in Linux and macOS. Disappointingly, it turns out this is often more of a novelty than a good idea, as S3 doesn’t offer all the right features to make it a robust filesystem. Appending to a file requires rewriting the whole file, which cripples performance; there is no atomic rename of directories or mutual exclusion on opening files; and there are a few other issues.
That said, there are some other solutions that use a different object format and allow filesystem-like access. Riofs (C) and Goofys (Go) are more recent implementations that are generally improvements on s3fs. S3QL is a Python implementation that offers data de-duplication, snap-shotting, and encryption. It only supports one client at a time, however. A commercial solution that offers lots of filesystem features and concurrent clients is ObjectiveFS.
Another use case is filesystem backups to S3. The standard approach is to use EBS volumes and use snapshots for incremental backups, but this does not fit every use case. Open source backup and sync tools that can be used in conjunction with S3 include zbackup (deduplicating backups, inspired by rsync, in C++), restic (deduplicating backups, in Go), borg (deduplicating backups, in Python), and rclone (data syncing to cloud).
Consider that S3 may not be the optimal choice for your use case. As discussed, Glacier and the cheaper S3 storage classes are a good fit when low storage cost matters most. EBS and EFS can be much more suitable for random-access data but cost 3 to 10 times more per gigabyte (see the table above).
Traditionally, EBS (with regular snapshots) is the option of choice if you need a filesystem abstraction in AWS. Remember EBS has a high failure rate compared to S3 (0.1-0.2% per year), so you need to use regular snapshots. In general, an EBS volume can only be attached to one instance at a time. However, EFS, AWS’s network file service, is another option that allows up to thousands of EC2 instances to connect to the same drive concurrently, if you can afford it.
Of course, if you’re willing to store data outside AWS, the directly competitive cloud options include Google Cloud Storage, Azure Blob Storage, Rackspace Cloud Files, EMC Atmos, and Backblaze B2. (Note: Backblaze has a different architecture that offloads some work to the client, and is significantly cheaper.)
With Sumo Logic, you can finally get a 360-degree view of all of your AWS S3 data. Leveraging these powerful monitoring tools you can index, search, and perform deeper and more comprehensive analysis of performance and access/audit log data. Learn more about AWS monitoring with Sumo Logic.