By Mohsiur Rahman, HeavyWater Developer
We focused on converting our services into AWS Lambda functions and our workflows into Step Functions.
Straight after graduation, I joined a startup called Heavywater Inc and had no idea what was in store for me. The company is focused on using artificial intelligence virtual assistants to enable business process outsourcing, and the infrastructure is built completely on Amazon Web Services (AWS).
When I started with the company, the development team was launching the initial product to our customers. Over the next three months, I was thrown onto different projects involving AWS Lambda, EC2, Simple Workflow (SWF), Redshift, and QuickSight.
Being exposed to such a wide variety of projects accelerated my understanding of AWS infrastructure and helped me recognize that our implementation of SWF needed to be fixed.
The cost of our problems
A major component of the product involved processing batch files, and our orchestration infrastructure was built using SWF and EC2 instances.
This approach made sense at the time of our initial release, but our architecture had some drawbacks. The batch processing jobs controlled by SWF were being executed and monitored 24×7 — and relied on the same EC2 instances used by our microservices.
Even with most of our EC2 instances sized as t2.micro, our AWS bills kept increasing. In the span of just 4 months, our monthly bill increased from $10K to $30K with over 1,000 EC2 instances running.
Even with all the spend, throughput was still an issue, with an average processing rate of only 4,000 files every 24 hours. To make matters worse, SWF would fail repeatedly because of a nondeterministic issue.
At first, we thought the failures were related to our codebase. An internal investigation could not find an issue, so we opened a case with the AWS support team. After two months of jumping through hoops to resolve the issue, the AWS support team finally recommended that we consider Step Functions.
AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows.
The path forward
Converting our system to Step Functions was going to be a massive overhaul of our infrastructure and could take months of effort. There was a lot of debate amongst the team about whether this approach would be feasible, and whether the new architecture would reduce costs.
To get started, our first step was converting all of our microservices into Lambda functions. While this was a fairly easy task, it took about 1–2 months to complete given the number of services. At the same time, we decided to convert one of our smallest workflows into a Step Function.
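To give a sense of what converting a small workflow looks like, here is a minimal sketch of a two-step state machine definition in the Amazon States Language, built as a Python dictionary. The state names, the Lambda ARNs, and the retry settings are all hypothetical placeholders and not taken from our actual workflows.

```python
import json

# Hypothetical Lambda ARNs; replace with the functions your services were converted into.
VALIDATE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:validate-batch"
PROCESS_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-batch"


def build_definition(validate_arn, process_arn):
    """Return an Amazon States Language definition for a two-step workflow."""
    return {
        "Comment": "Sketch of a small batch workflow (illustrative only)",
        "StartAt": "ValidateBatch",
        "States": {
            "ValidateBatch": {
                "Type": "Task",
                "Resource": validate_arn,
                "Next": "ProcessBatch",
            },
            "ProcessBatch": {
                "Type": "Task",
                "Resource": process_arn,
                # Retry values are assumptions chosen for illustration.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 5,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            },
        },
    }


# Serialized form, as it would be supplied when creating the state machine.
definition_json = json.dumps(build_definition(VALIDATE_ARN, PROCESS_ARN), indent=2)
```

Each Task state points at one Lambda function, so a service that has already been converted to Lambda can be dropped into a workflow without further changes.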
Issues we encountered
As expected with any big change, our team encountered a lot of issues designing and implementing our new infrastructure. One of our biggest pitfalls was converting some of our workflows while still maintaining our SWF stacks. This approach resulted in some of the SWF stacks becoming bottlenecks — since many of our workflows had to be blocked until the next workflow was completed as well.
Another issue we encountered was designing an infrastructure in Step Functions with asynchronous execution in mind. After reviewing design options for a couple of weeks, the team developed a smart solution that is a bit of a hack, and it has helped to increase our throughput.
Instead of using the decider in SWF to handle workflow transitions, we built our own decider using Step Functions. Whenever a workflow completed, it would invoke the starter Lambda of our Step Function with parameters indicating which workflow needed to be triggered next.
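A starter Lambda of this kind might look roughly like the sketch below. It is not our production code: the event fields (`next_workflow`, `payload`), the workflow names, and the ARNs are assumptions made for illustration. The only AWS call used is the boto3 Step Functions client's `start_execution`.

```python
import json
import uuid

# Hypothetical mapping from workflow name to state machine ARN (placeholder values).
STATE_MACHINES = {
    "extract": "arn:aws:states:us-east-1:123456789012:stateMachine:extract",
    "classify": "arn:aws:states:us-east-1:123456789012:stateMachine:classify",
}


def start_next_workflow(event, sfn_client):
    """Start the state machine named in event["next_workflow"].

    Returns the ARN of the new execution.
    """
    name = event["next_workflow"]
    if name not in STATE_MACHINES:
        raise ValueError(f"unknown workflow: {name}")
    response = sfn_client.start_execution(
        stateMachineArn=STATE_MACHINES[name],
        # Execution names must be unique per state machine, hence the UUID suffix.
        name=f"{name}-{uuid.uuid4()}",
        input=json.dumps(event.get("payload", {})),
    )
    return response["executionArn"]


def handler(event, context):
    """Lambda entry point. boto3 is imported here so the helper above
    can be exercised locally with a stand-in client."""
    import boto3

    return start_next_workflow(event, boto3.client("stepfunctions"))
```

A workflow's final state would then invoke this function with an event such as `{"next_workflow": "classify", "payload": {...}}`, so that each workflow hands off to the next one without a long-running decider process.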
The benefits of serverless
As the weeks rolled by, the team focused on converting our services into Lambda and our workflows into Step Functions. The results were evident immediately:
- Costs started on a downwards trajectory.
- Human resources devoted to batch processing dropped from 24 hours to 16 hours and continued to decrease.
- The number of EC2 instances decreased to 211 instances.
- The number of errors in SWF decreased, although they never disappeared.
By the end of November we had converted 80% of our workflows to Step Functions. All that remains is the conversion of two workflows and the addition of a new state that calls our web services.
After a $30K invoice in September, our AWS bill for the month of December is projected to be less than $4,000. Just as important, the new approach is saving our developers hours per day with minimal monitoring — in fact, I’m writing this article while processing.
The biggest impact is that we’re now designing our future services in a completely microservice-oriented manner. As a recent college graduate, it’s been a valuable experience learning how to design modern architectures and use new tools from AWS that reduce cost and increase productivity.