December 01, 2023

The Frugal Architect Explained

After combing through the words of the brilliant Werner Vogels regarding cost-aware architecture, I felt I wanted to elaborate on his concise and precise words. Here the word product is interchangeable with service, component, module, application, system, etc.

Law I: Make Cost a Non-functional Requirement.

Cost is more than a number. It’s a mantra. Cost savings cannot be an afterthought; they must be designed in and accounted for from the beginning. The nuance of this first law is that cost is not usually mentioned alongside the other requirements. The discussion of cost usually happens only after things are designed and deployed. I’ve heard projects asked how much they will cost, but I’ve never thought of making cost an actual requirement. In other words, it is not enough to be cost conscious. You need to go a step further and actually establish how much, or how little, the product in question may cost to operate. This means we design around the cost not exceeding that value, and we then weigh it against the performance, scalability, and resilience we expect from the product.
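
One way to make this concrete is to express the budget as something the build can check. Below is a minimal sketch in Python, where every price and traffic figure is an assumption invented for illustration; the point is only that the cost ceiling lives next to the other requirements and fails loudly when the design outgrows it.

```python
# A minimal sketch of treating cost as a testable requirement.
# All prices and traffic figures below are illustrative assumptions,
# not real quotes; substitute your own provider's pricing.

MONTHLY_COST_BUDGET_USD = 500.00          # the non-functional requirement

EXPECTED_REQUESTS_PER_MONTH = 20_000_000  # assumed traffic
COST_PER_MILLION_REQUESTS_USD = 3.50      # assumed blended compute cost
STORAGE_GB = 200                          # assumed data footprint
COST_PER_GB_MONTH_USD = 0.10              # assumed storage price


def estimated_monthly_cost() -> float:
    """Rough cost model for the product under expected load."""
    compute = (EXPECTED_REQUESTS_PER_MONTH / 1_000_000) * COST_PER_MILLION_REQUESTS_USD
    storage = STORAGE_GB * COST_PER_GB_MONTH_USD
    return compute + storage


def test_cost_requirement():
    """Fails the build if the design no longer fits the cost budget."""
    assert estimated_monthly_cost() <= MONTHLY_COST_BUDGET_USD
```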

Law II: Systems that Last Align Cost to Business.

The emphasis here is that there must be a clear and defined relationship between cost and business. That means that as the business grows your costs may grow. That is to be expected, and the two should stay aligned. Cost is not expected to be fixed in all circumstances (unless it is). It is usually understood that costs will rise as different dimensions of usage increase.

The word performance is very vague when we are dealing with the specifics of a system. If we are talking about throughput, even that is too vague. Throughput must be discussed within the context of a volume of requests at one time. That means when we are talking about “how long it takes to process a request”, we need to factor in how many requests we are processing at a given time. The other important factor is how much “stuff” we have in our data stores that might affect how long things take. If our request involves checking against the correlation of other existing transactions, the sheer volume of existing transactions will heavily influence the duration of the call. Likewise, whether we expect up to 100 simultaneous calls or 100,000 simultaneous calls will greatly affect the request duration. All of this is relevant when you are considering cost. In today’s “serverless” world, the ability to scale out without much (or any) provisioned capacity can take the burden of upfront costs away. However, the solution you use for data storage will deeply affect your ability to scale.
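
To see how volume, duration, and data size compound into cost, here is a rough sketch of a pay-per-use cost model. The per-request and GB-second rates are illustrative placeholders in the spirit of typical serverless pricing, not a quote for any provider.

```python
# A rough sketch of how request volume and duration drive serverless cost.
# The rates below are illustrative placeholders, not actual pricing.

PRICE_PER_MILLION_REQUESTS_USD = 0.20
PRICE_PER_GB_SECOND_USD = 0.0000167


def monthly_compute_cost(requests: int, avg_duration_ms: float, memory_gb: float) -> float:
    """Cost scales with volume, with how long each call runs, and with memory."""
    request_charge = (requests / 1_000_000) * PRICE_PER_MILLION_REQUESTS_USD
    gb_seconds = requests * (avg_duration_ms / 1000) * memory_gb
    return request_charge + gb_seconds * PRICE_PER_GB_SECOND_USD


# The same request volume costs very different amounts once the data store
# grows and each call takes longer to correlate against existing transactions.
print(monthly_compute_cost(10_000_000, avg_duration_ms=120, memory_gb=0.5))  # small store
print(monthly_compute_cost(10_000_000, avg_duration_ms=900, memory_gb=0.5))  # large store
```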

The business model might necessitate a certain SLA for requests. Ideally, as your company and business grow, the demand on resources and the complexity will increase while the cost per unit decreases. Obviously the overall cost will rise, but as you hit new spending thresholds you should qualify for volume discounts on computing services.

The engineering cost is another that needs to be factored in. As the complexity of a product grows, the development and maintenance costs rise with it. A simple system is cheap to develop; a complex one can be costly. Sometimes focusing on cutting computing costs ends up costing a great deal more in personnel and isn’t worth the expenditure. In my opinion nothing beats clean, simple code. If you can hit your SLAs, don’t be greedy thinking that you will save big with pennies here and there. The sheer cost of engineering and maintaining a complex system may not be worth it.

Law III: Architecting is a Series of Trade-offs.

It is well known that there are no perfect systems or solutions. Every system is a balancing act among a myriad of factors. Cost comes in two flavors in the computing world: cost is money and cost is time. In reality they are one, because computing time ends up being all about money. If you are dealing with an algorithm, say for compression or machine learning, your approach often trades the quality of the output for time. Let’s look at compression; it’s very simple. Today you can have a song encoded with a very high quality lossless codec that is very large in size. Here you are preferring quality over size. Or you can have the humble MP3, which is decent quality and notably small with respect to file size. When MP3s came out in the 1990s, internet connections were often dial-up and notoriously slow. The MP3 was a game changer because the sound quality was good and the file was easy to deliver.

Today, when everyone has a high speed internet connection, we don’t really need MP3s anymore. In architecture we will often make decisions like the MP3 for well-intended reasons, only to soon find that the reason is no longer applicable. A new technology comes out, or the nature of the request rate or the data schema changes in such a way that the architectural choices become stale and outdated. You will always need to choose between several factors when designing a solution, and that rationale may not stand the test of time. With that said, the cost should adjust with those factors. The trade-offs for using MP3 over a lossless format are the time to transfer and the storage size. It needs to be understood what you are paying for and why. As times change and your music streaming service updates, it might charge slightly more to accommodate the higher operating costs of storing and streaming the lossless files.
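
To put the MP3 trade-off in cost terms, here is a back-of-the-envelope sketch. The catalog size, file sizes, stream counts, and unit prices are all assumptions invented for illustration; the point is how directly the format choice shows up in the monthly bill.

```python
# A back-of-the-envelope sketch of the MP3 vs. lossless trade-off in cost terms.
# Every figure below is an assumption for illustration only.

SONGS = 1_000_000
MP3_MB = 8                          # assumed average size of an MP3 track
LOSSLESS_MB = 40                    # assumed average size of a lossless track
STORAGE_USD_PER_GB_MONTH = 0.023    # illustrative object-storage price
TRANSFER_USD_PER_GB = 0.09          # illustrative egress price
STREAMS_PER_SONG_PER_MONTH = 50     # assumed listening volume


def monthly_cost(track_mb: float) -> float:
    """Storage for the catalog plus transfer for every stream."""
    catalog_gb = SONGS * track_mb / 1024
    streamed_gb = SONGS * STREAMS_PER_SONG_PER_MONTH * track_mb / 1024
    return catalog_gb * STORAGE_USD_PER_GB_MONTH + streamed_gb * TRANSFER_USD_PER_GB


print(f"MP3:      ${monthly_cost(MP3_MB):,.0f}/month")
print(f"Lossless: ${monthly_cost(LOSSLESS_MB):,.0f}/month")
```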

Law IV: Unobserved Systems Lead to Unknown Costs.

No matter how much planning goes into a system, there will always be unforeseen costs. The traffic is higher than expected, the runtime is longer than expected, the file size is larger than expected. Without visibility into the system it is impossible to know how on target you are with respect to expected costs. Let us not forget the costs of management and personnel. Amazon is well known for letting you tag almost every resource out there so that you may effectively and easily know your costs. Knowing the costs of resources without their proper context, however, is more or less useless. If your system cost $1,000 last month and $1,500 the month prior, understanding the rationale for that $500 difference is key. Seeing the correlation between your resource and personnel costs and the business is key. Let’s say you have metrics showing how many incident tickets you have for a given component. Being able to detect that the cheaper $1,000 month was really due to downtime from several critical failures would greatly inform your cost analysis. The more you know and the more dots you can connect, the more you can nail down waste and over- or under-utilization.
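
As a sketch of what tagging buys you, the snippet below pulls a month of spend per component from AWS Cost Explorer, assuming resources carry a "component" cost-allocation tag (the tag key and dates are assumptions for illustration). Those numbers can then be lined up against incident tickets and downtime rather than read in isolation.

```python
# A sketch of pulling cost per component from AWS Cost Explorer, assuming
# resources have been tagged with a "component" cost-allocation tag.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-11-01", "End": "2023-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "component"}],
)

# Print spend per tagged component so the numbers can be correlated with
# incident tickets, downtime, and other business context.
for period in response["ResultsByTime"]:
    for group in period["Groups"]:
        tag_value = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"{tag_value}: ${amount:,.2f}")
```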

Law V: Cost Aware Architectures Implement Cost Controls.

Beyond merely monitoring components, the ability to easily tune and configure the cost versus power of a given component becomes a surefire way to save money. Having components separated into tiers of priority is an easy way to isolate which components may be tweaked over time for cost savings. A component in a lower priority tier may be reduced in power when its need is lower: you don’t need 100 nodes to process data if you are only using 2 at the time. The question of startup time comes into play. What if a customer suddenly needs to run a large payload that will overwhelm our pool of nodes? Do we always run 100 nodes to accommodate those occasions? You can be sure that AWS thinks about these questions. We all know that AWS Lambda is not actually infinitely scalable, despite us using the term. AWS must spend a lot of money and time coming up with proper projections of how many nodes to dedicate to handle bursts or surges in usage.
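
A minimal sketch of that kind of control for a lower-priority tier: size the worker pool from the current backlog instead of always running the maximum. The limits and per-node throughput are assumptions for illustration.

```python
# A minimal sketch of a cost control for a lower-priority tier: size the
# worker pool from current demand instead of always running the maximum.
# The limits and throughput figure are illustrative assumptions.

MIN_NODES = 2
MAX_NODES = 100
JOBS_PER_NODE = 50   # assumed throughput of a single node


def desired_nodes(queued_jobs: int) -> int:
    """Scale with the backlog, bounded so we neither pay for idle capacity
    nor starve a sudden large payload."""
    needed = -(-queued_jobs // JOBS_PER_NODE)   # ceiling division
    return max(MIN_NODES, min(MAX_NODES, needed))


print(desired_nodes(0))       # -> 2, the cheap steady state
print(desired_nodes(12_000))  # -> 100, capped for the occasional surge
```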

The idea that costs and cost-cutting must be justified via business impact is key. Cost-cutting can go too far when you find yourself without the number of nodes needed to serve a request within a given SLA. This is a juggling act that must allow for risk. Consider a backup process that runs on a schedule: running the process is integral, but running it hourly versus twice daily may yield drastic cost savings, and the hourly cadence may not be integral to the business needs.
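
Put numbers on that (the per-run cost below is an assumption for illustration) and the cadence decision becomes easy to discuss with the business:

```python
# A quick sketch of the hourly vs. twice-daily backup trade-off.
# The cost per run is an assumed figure for illustration.
COST_PER_BACKUP_RUN_USD = 4.00

hourly = 24 * 30 * COST_PER_BACKUP_RUN_USD       # ~720 runs per month
twice_daily = 2 * 30 * COST_PER_BACKUP_RUN_USD   # 60 runs per month
print(f"hourly: ${hourly:,.0f}/month, twice daily: ${twice_daily:,.0f}/month")
```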

Tuning a product at a fine-grained level lets you easily tweak the parameters that affect cost. Everyone knows a key to good software design is to parameterize your configuration as much as possible, enabling changes without requiring a rebuild. To be able to tune your product in real time, see the effects, and persist that configuration is a game changer. Being able to do that easily, without rebuilding the product, opens up your application to modern possibilities, especially today when you might be able to use an ML/AI model to tweak parameters and find the sweet spot where performance and cost are in harmony.
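
As one possible shape of this, here is a sketch where the knobs live outside the build artifact, in AWS Systems Manager Parameter Store, so they can be adjusted and persisted without a redeploy. The parameter names and defaults are hypothetical.

```python
# A sketch of runtime-tunable parameters stored in AWS Systems Manager
# Parameter Store rather than baked into the build. Parameter names and
# default values are hypothetical.
import boto3

ssm = boto3.client("ssm")


def get_tunable(name: str, default: str) -> str:
    """Fetch a tunable from Parameter Store, falling back to a safe default."""
    try:
        return ssm.get_parameter(Name=name)["Parameter"]["Value"]
    except ssm.exceptions.ParameterNotFound:
        return default


# Knobs that trade cost against performance, read at startup or on a timer
# instead of requiring a rebuild to change.
batch_size = int(get_tunable("/myapp/worker/batch-size", "100"))
worker_memory_mb = int(get_tunable("/myapp/worker/memory-mb", "512"))
```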

Law VI: Cost Optimization is Incremental.

If you are trying to optimize your costs and doing it properly, it doesn’t happen in one day. Even the best designed systems may be improved over time as factors change or actual utilization differs from what was expected. This law is built on Laws IV and V: you cannot cut costs without proper metrics, and the process of reducing costs starts with tunable or configurable components whose settings affect performance and cost. Taking property x and blindly turning it from 10 down to 1 is not only ill-advised but wrong. The proper way of tweaking parameters is with a dummy environment where you can run controlled tests with varied parameters and evaluate the performance under those new settings. As mentioned earlier, there are now automated ways of doing this, and ML/AI solutions may help find the harmony point in these adjustments. When tweaking things manually, don’t expect to go from spending $1,000 to $50 without breaking things. You may never be able to get under a certain cost, and that may simply be the reality. Rewrites of certain components might be necessary beyond tweaking the known configurable options. Breaking components into smaller, bite-sized pieces that further determine whether a given function even needs to run under the circumstances is another small way to shave milliseconds from task execution.
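
A sketch of what those controlled tests might look like: sweep one parameter in the dummy environment, measure latency, estimate the cost impact, and only then pick the cheapest setting that still meets the SLA. The harness, rate, and SLA value are hypothetical.

```python
# A sketch of incremental tuning: sweep a single parameter in a dummy
# environment, measure latency, and estimate the cost impact before
# touching production. run_workload() is a hypothetical test harness.
import time

COST_PER_COMPUTE_SECOND_USD = 0.00005   # assumed rate for illustration
LATENCY_SLA_SECONDS = 2.0               # assumed SLA for illustration


def run_workload(batch_size: int) -> float:
    """Placeholder for replaying a representative workload against the
    dummy environment with this batch size; returns elapsed seconds."""
    start = time.perf_counter()
    # ... invoke the component under test here ...
    return time.perf_counter() - start


results = []
for batch_size in (10, 50, 100, 250, 500):
    elapsed = run_workload(batch_size)
    results.append((batch_size, elapsed, elapsed * COST_PER_COMPUTE_SECOND_USD))

# Keep only settings that meet the SLA, then choose the cheapest one.
candidates = [r for r in results if r[1] <= LATENCY_SLA_SECONDS]
best = min(candidates, key=lambda r: r[2]) if candidates else None
print(best)
```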

Law VII: Unchallenged Success Leads to Assumptions.

Variety is the spice of life. With technology, the “tried and true” spirit all too often leads to complacency and a lack of growth. As Werner concludes, the idea of saying “we are a Java shop” is often the cause of a lack of innovation. Cost optimization stems both from cutting down and trimming the fat and from making the core product more powerful and robust with less. Let me give an example. Java is notorious for being slow on cold starts. That means you need to wait through enough usage for the application to be properly loaded and for the runtime to tune itself for optimizations. A recent feature enables you to warm up the application and save it in that state, so you can restart it already nice, toasty, and warm. This dramatically changes the ability to use Java in a serverless capacity: it takes a technology that was more or less unusable in serverless ecosystems, one that would otherwise have required complete rewrites to take advantage of the platform, and makes it a reality. The status quo is the enemy of innovation. More innovation yields better performance, and ultimately better cost savings. When you see the lifecycle of R&D leading to innovation and cost savings that go back into R&D, it’s a beautiful thing.