In this post, I talk about my experience designing and implementing a serverless, event-driven payments pipeline on AWS.
Backstory
The fine folks at IFF wanted to improve their existing payment flows. Like many organizations that started their online payments journey by using an off-the-shelf payment gateway for everything related to the payment lifecycle, IFF did the same. In their earliest iteration, the stack was glued together with a handful of Lambdas for server-side functionality, which, to be fair, served them well for a long time.
There was a need to handle customer data in a dedicated location for various reasons, and to handle pre- and post-payment logic effectively. This logic could include persistence, sending emails and logging, among other use cases.
An important constraint for IFF was that it is largely a volunteer-led organization, at least on the tech side. This translates to different volunteers contributing from different tooling and language backgrounds. If tomorrow we were to onboard a new volunteer to help us build new functionality, they should ideally be free to choose their own tooling without having to touch the core infrastructure of the application.
I was responsible for coming up with a cost-effective solution that could be implemented natively on AWS without any performance or reliability issues.
Architecture
The first major realization that helped me a lot was identifying the synchronous and asynchronous functionality across the entire lifecycle of the application. A majority of the logic related to the payment lifecycle was async in nature. Previously, every functionality was handled by Lambdas proxied by API Gateway, which meant that adding any functionality could affect downstream systems in one way or another. In other words, there was a lot of tight coupling in the system. The synchronous functionality was separated out, and its corresponding APIs were offloaded to Cloudflare Workers-based REST APIs (which I plan on describing in a dedicated blog post). The rest of this post deals only with handling async events.
After careful consideration, the next logical step was to introduce AWS EventBridge, a serverless service that helps build loosely coupled event-driven applications. As I previously had some experience using NATS, Kafka and Benthos, mapping the logical REST APIs as events was relatively familiar. In order for external clients to send events to EventBridge, API Gateway's AWS service integration was used. This allows events to be proxied to EventBridge without an intermediate compute service (e.g. Lambda), which is really cool IMO!
EventBridge events have a predefined structure, with fields such as source and detail, that can be used to configure the API Gateway integration. Incoming client events were mapped to different routes, and each route was then handled according to the rules defined on the EventBridge bus. For instance, api.example.org/event1 was handled differently from api.example.org/event2, with events routed to their respective bus rules based on their paths.
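To make this concrete, here is a minimal CDK sketch (in TypeScript) of how such a direct API Gateway to EventBridge integration can be wired up. The bus name, the "frontend" source and the /event1 and /event2 routes are illustrative placeholders, not our actual configuration.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as apigw from 'aws-cdk-lib/aws-apigateway';
import * as events from 'aws-cdk-lib/aws-events';
import * as iam from 'aws-cdk-lib/aws-iam';

export class EventIngestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const bus = new events.EventBus(this, 'PaymentsBus', { eventBusName: 'payments-bus' });

    // Role that API Gateway assumes to call PutEvents directly (no Lambda in between).
    const apiRole = new iam.Role(this, 'ApiGatewayEventsRole', {
      assumedBy: new iam.ServicePrincipal('apigateway.amazonaws.com'),
    });
    bus.grantPutEventsTo(apiRole);

    const api = new apigw.RestApi(this, 'EventsApi');

    // Each path becomes its own detail-type so that bus rules can match on it later.
    for (const route of ['event1', 'event2']) {
      const integration = new apigw.AwsIntegration({
        service: 'events',
        action: 'PutEvents',
        integrationHttpMethod: 'POST',
        options: {
          credentialsRole: apiRole,
          requestParameters: {
            'integration.request.header.X-Amz-Target': "'AWSEvents.PutEvents'",
            'integration.request.header.Content-Type': "'application/x-amz-json-1.1'",
          },
          // Wrap the raw client body into the PutEvents envelope.
          requestTemplates: {
            'application/json': `{
              "Entries": [{
                "Source": "frontend",
                "DetailType": "${route}",
                "Detail": "$util.escapeJavaScript($input.body)",
                "EventBusName": "${bus.eventBusName}"
              }]
            }`,
          },
          integrationResponses: [{ statusCode: '200' }],
        },
      });

      api.root.addResource(route).addMethod('POST', integration, {
        methodResponses: [{ statusCode: '200' }],
      });
    }
  }
}
```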
An EventBridge bus allows us to configure rules that match events. Each rule has one or more targets that are triggered for every matching event. These targets include native AWS services like Lambda, DynamoDB and Step Functions, as well as external HTTP API destinations. Again, these targets can be triggered without any additional compute services.
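As a rough illustration, extending the stack sketched above, a rule and one of its targets can be declared like this (the rule name, pattern and handler asset path are assumptions):

```typescript
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// A rule that matches every "event1" put on the bus by the API above.
const event1Rule = new events.Rule(this, 'Event1Rule', {
  eventBus: bus,
  eventPattern: { source: ['frontend'], detailType: ['event1'] },
});

// One of its targets: a Lambda in the compute layer, invoked by EventBridge itself.
const event1Handler = new lambda.Function(this, 'Event1Handler', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/event1'), // hypothetical handler directory
});
event1Rule.addTarget(new targets.LambdaFunction(event1Handler));
```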
The most important step was building the logical execution layer, or the compute layer. This is where the event payload is transformed and processed according to business requirements. Most of the services used in this layer were Lambdas. However, for some use cases involving complex logic, AWS Step Functions made more sense, and they are also a supported target for EventBridge rules.
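On the compute side, an EventBridge-triggered Lambda only sees the event envelope and its detail. A minimal TypeScript handler sketch, with a made-up detail shape, might look like this:

```typescript
import { EventBridgeHandler } from 'aws-lambda';

// Hypothetical payload for "event1"; the real fields differ.
interface PaymentEventDetail {
  referenceId: string;
  amount: number;
  email?: string;
}

// Compute-layer Lambda: transform and process the detail according to business rules.
export const handler: EventBridgeHandler<'event1', PaymentEventDetail, void> = async (event) => {
  const { referenceId, amount } = event.detail;

  // e.g. persist a record, send an email, call a downstream API...
  console.log(`processing ${event['detail-type']} for ${referenceId}, amount=${amount}`);
};
```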
The final step was building a persistence layer. This was relatively straightforward, as services like CloudWatch and DynamoDB are already supported targets for EventBridge rules. Since some data passing through the bus was of higher importance, such events were sent to a CloudWatch Logs destination. This simply means that events are persisted even before any transformation or processing of the data. This approach is known as fail-fast, storage-first.
A fail-fast, storage-first data ingestion architecture is a design approach that prioritizes the speed and reliability of data storage over other considerations. It is typically used in high-throughput ingestion pipelines, where it is critical that incoming data is quickly and durably stored.
In a fail-fast, storage-first architecture, incoming data is written to storage immediately, rather than being processed or transformed in any way. This enables quick capture and storage of incoming data, even if it arrives in a format that is not immediately usable.
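Continuing the CDK sketch, the storage-first leg is just another rule whose target is a CloudWatch log group; what counts as an "important" event in the pattern below is an assumption:

```typescript
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as logs from 'aws-cdk-lib/aws-logs';

// Raw copy of selected events, written before any compute-layer processing runs.
const rawEventLog = new logs.LogGroup(this, 'RawEventLog', {
  retention: logs.RetentionDays.SIX_MONTHS, // keep persistence time-bound
});

new events.Rule(this, 'StoreFirstRule', {
  eventBus: bus,
  eventPattern: { detailType: ['event1'] }, // adjust to whatever is considered important
  targets: [new targets.CloudWatchLogGroup(rawEventLog)],
});
```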
This approach has turned out to be highly beneficial. There were instances of Lambda errors or downstream third-party downtime that caused the logic layer to fail. But since we added a parallel persistence layer, we were able to reconcile the failed event logic manually and rectify the errors. This also means we are highly available from the client's (producer's) perspective. Another feature that can be used to achieve this is EventBridge archives, which allow archiving and replaying events whenever required.
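For completeness, a hedged sketch of what enabling an archive on the same bus could look like in CDK; the pattern and retention window are placeholders:

```typescript
import { Duration } from 'aws-cdk-lib';

// Archive matching events so they can be replayed onto the bus later if needed.
bus.archive('PaymentsArchive', {
  archiveName: 'payments-archive',
  description: 'Raw copy of frontend events for replay',
  eventPattern: { source: ['frontend'] }, // choose which events are worth archiving
  retention: Duration.days(365),
});
```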
Extending to other use cases
Initially, the architecture was designed for client events from the frontend website. However, while revisiting our requirements, I realized that the same architecture could be extended to other use cases, one of them being ingesting webhooks.
We at IFF have been long-time users of Ghost CMS. We rely heavily on it as our primary headless CMS for everything posted on the website. Ghost emits some useful events, such as mutations to blog posts, among others. We wanted to keep track of such events, partly as an audit log (this was before Ghost even had a built-in post history feature) but also for transparency and archiving purposes. In other words, a timestamped record of everything that changes on the website, both internally (such as adding or removing a staff member on the CMS) and externally (such as making edits to an already published post), was persisted for various use cases.
Additionally, since we were already using a payment gateway (Razorpay) as part of our payments pipeline, we also built logic around ingesting payment-related events emitted by the gateway's webhooks. However, as our endpoint is public and requires no auth, there was a need to secure the endpoints and avoid spam. A simple solution would be to introduce a Lambda (for auth) before or after hitting our EventBridge bus, but definitely before committing to persistence. To keep things generic and simple, however, events were consumed on the assumption that they were not secure, and a corresponding API call was made to verify the authenticity of each webhook (thin events).
Now I hear you saying, 'why can't you make use of the signature header which is provided as part of the payload itself?', which is a fair argument. However, during my experimentation with API Gateway, I was unable to pass the headers from the webhooks as an optional part of the event payload to EventBridge. Additionally, there isn't a consensus on how to build secure webhooks: some service providers include the signature as part of the body, some don't, and even the header names used for signatures often differ. If I configure the API Gateway proxy to forward the header "Example-X-Signature" as part of the event payload and it is missing or does not match any existing header, the event fails to be forwarded (please do let me know if there's a workaround for this!).
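For illustration, here is a rough sketch of the thin-event idea for a Razorpay payment webhook: the handler trusts nothing in the payload except the entity id, and fetches the canonical record from the provider's API. The payload shape, environment variable names and endpoint are assumptions based on Razorpay's public documentation, not our exact code.

```typescript
import { EventBridgeHandler } from 'aws-lambda';

// Minimal slice of a Razorpay webhook body; only the payment id is used.
interface RazorpayWebhookDetail {
  payload?: { payment?: { entity?: { id?: string } } };
}

export const handler: EventBridgeHandler<'razorpay.webhook', RazorpayWebhookDetail, void> = async (event) => {
  const paymentId = event.detail.payload?.payment?.entity?.id;
  if (!paymentId) return; // malformed or spam payload: drop it

  // Verify by fetching the payment directly from Razorpay (Basic auth with API keys).
  const auth = Buffer.from(
    `${process.env.RAZORPAY_KEY_ID}:${process.env.RAZORPAY_KEY_SECRET}`,
  ).toString('base64');
  const res = await fetch(`https://api.razorpay.com/v1/payments/${paymentId}`, {
    headers: { Authorization: `Basic ${auth}` },
  });
  if (!res.ok) return; // the id does not exist upstream, so treat the webhook as noise

  const payment = await res.json();
  // ...continue with the verified, provider-sourced record instead of the webhook body
};
```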
Therefore, by capturing the webhooks from the payment gateway, we were essentially using our pipeline as a serverless change data capture system. Change data capture (CDC) events are notifications generated whenever a change is made to a specific data source. These events can be used to trigger downstream processes or actions, such as updating a cache or index, replicating data to another system, or generating alerts or notifications.
CDC events are typically used in data integration and data management scenarios, where it is important to keep multiple systems or applications in sync with each other. For example, an organization may use CDC events to replicate data from a transactional database to a data warehouse, or to update a search index in real time whenever data is added or modified. Thus, in addition to capturing events from our CMS, we were able to ingest and essentially build a replica of payment-related data from the payment gateway.
Conclusion
From a privacy standpoint, steps were taken to ensure that persistence was time-bound: an expiration was set for each log group and for DynamoDB (using TTL) wherever required. This architecture also allowed us to significantly reduce the amount of PII (personally identifiable information) and other sensitive data stored on the payment gateway, as this was now offloaded to downstream systems and could be fetched using only reference IDs (KSUIDs, to be specific).
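In CDK terms this amounts to a retention setting on each log group (as in the storage-first sketch earlier) and a TTL attribute on the DynamoDB tables; the table and attribute names below are illustrative:

```typescript
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

// DynamoDB items carry an epoch-seconds TTL attribute set by the compute layer.
new dynamodb.Table(this, 'PaymentsTable', {
  partitionKey: { name: 'referenceId', type: dynamodb.AttributeType.STRING }, // KSUID reference
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  timeToLiveAttribute: 'expiresAt',
});
```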
For an organization supported entirely by donations, cost optimization was a very important metric. As the entire stack is serverless, we have yet to pay a single cent toward our application invoice! Using CDK to build and deploy the application helped immensely with development, reproducibility and long-term maintainability. As with any serverless application, there are other benefits that come out of the box, such as scalability, reliability and sustainability.
As of today, adding new functionality mostly just translates to creating a single Lambda function and adding it as a target to an existing EventBridge rule.
My biggest takeaway from implementing this architecture wasn’t the excitement of using AWS EventBridge or serverless technology, but the realization that code itself is a liability, not an asset.
Relying on such serverless solutions definitely makes maintenance easier, but at what cost?