ACME operates a Master Trust Pension. An ACME pension consists of units of Assets held on behalf of the member. These units, and any associated cash, are recorded in an internal ledger. The actual units of each asset are aggregated into a single holding held by an external Asset Custodian on behalf of ACME.
Asset units are bought and sold on behalf of members. Buying and selling assets takes several days: trades are only placed periodically and then take several days to settle.
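To make the relationship between the internal ledger and the custodian holding concrete, here is a minimal sketch of aggregating member units per asset and comparing the totals with what the custodian reports. `LedgerEntry`, `Break`, `Aggregate` and `Reconcile` are hypothetical names, and storing units as `int64` counts of some smallest unit is an assumption made here to avoid floating-point rounding; the text does not specify a representation.

```go
package main

import (
	"fmt"
	"sort"
)

// LedgerEntry is a hypothetical internal ledger row: the units of one asset
// held for one member. Units are integer counts of an assumed smallest unit.
type LedgerEntry struct {
	MemberID string
	Asset    string
	Units    int64
}

// Break describes an asset whose ledger total differs from the custodian holding.
type Break struct {
	Asset     string
	Ledger    int64
	Custodian int64
}

// Aggregate sums member units per asset. The result is the figure that is
// expected to match the single holding at the custodian.
func Aggregate(entries []LedgerEntry) map[string]int64 {
	totals := make(map[string]int64)
	for _, e := range entries {
		totals[e.Asset] += e.Units
	}
	return totals
}

// Reconcile compares ledger totals with custodian holdings and returns every
// mismatch, sorted by asset name. Assets missing on either side count as zero.
func Reconcile(entries []LedgerEntry, custodian map[string]int64) []Break {
	ledger := Aggregate(entries)
	assets := make(map[string]struct{})
	for a := range ledger {
		assets[a] = struct{}{}
	}
	for a := range custodian {
		assets[a] = struct{}{}
	}
	var breaks []Break
	for a := range assets {
		if ledger[a] != custodian[a] {
			breaks = append(breaks, Break{Asset: a, Ledger: ledger[a], Custodian: custodian[a]})
		}
	}
	sort.Slice(breaks, func(i, j int) bool { return breaks[i].Asset < breaks[j].Asset })
	return breaks
}

func main() {
	entries := []LedgerEntry{
		{MemberID: "m1", Asset: "EQUITY", Units: 100},
		{MemberID: "m2", Asset: "EQUITY", Units: 50},
		{MemberID: "m1", Asset: "BOND", Units: 30},
	}
	custodian := map[string]int64{"EQUITY": 150, "BOND": 25}
	for _, b := range Reconcile(entries, custodian) {
		fmt.Printf("break on %s: ledger=%d custodian=%d\n", b.Asset, b.Ledger, b.Custodian)
	}
}
```

In this toy data only the bond holding disagrees, so a single break is reported. A real reconciliation would also have to account for trades that are placed but not yet settled.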
* Re-balancing - the process of adjusting the number of units held for each of a number of assets in the asset Internal Account in a Pot to ensure the mix is correct for the chosen risk profile.
* Asset custodian - a body that holds asset units on behalf of ACME and trades them when instructed.
Consider the description above in an event driven platform with information stored in a database. How would you design such a system and what challenges would you expect to find? The solution should consider the following:
* ACME is regulated by 3 financial authorities and the data we hold must be 100% accurate
* Assets may be bought and sold
* There are some automated processes, e.g. re-balancing, that generate large numbers of buy and sell trades as the prices of assets move.
* There are some processes that must process large amounts of data, e.g. reconciliation
* This is not a real-time trading environment: prices are always closing prices, and external trades are only placed with asset custodians on a daily basis.
Assuming the platform is implemented in Go, what patterns would you use to enable large numbers (c. 200-300) of processes to be produced in a consistent and repeatable manner by multiple teams?
* Three processes are mentioned in the original text: buying assets, rebalancing and reconciliation. **I assume that the following is true for the majority of the processes in the system:**
* They need to process multiple records in batches
* They are not real-time, but need to finish within a specified amount of time, i.e. throughput is important
* On the one hand, if multiple teams work on different types of processes independently, we might expect the momentary system load to stay close to the average, i.e. spikes cancel each other out. However, I am going to assume that this is not the case, especially if the business processes are tied to a country-specific wall-clock.
* New types of processes will appear in the future - we cannot design a system that is hard-tied to only the known process types.
* Once something has entered the system from a third party, e.g. an order to buy or sell, it needs to be persisted durably. **We can only reply to the initial call from the third party once we can guarantee durability.**
* **I assume each part of the system can be at different phases of processing data, but as long as the third parties view a consistent state of transactions it is allowed.**
* Is the granularity here per-customer ?
* I think the text mentions that all buy and sell transactions are batched up to be executed together.
* Is it important for these batches to be externally observable ?
* **Can a customer or a regulator demand proof that a transaction was processed at roughly the same time as others, in the same batch ?**
* The system should be built in a way that allows for continuous fast iteration, and the feedback loop should not get longer as the system ages.
* Measure and minimise impact on any external third party. Prefer impact on internal teams to impact on customers.
* **Could we set up canary/fake flows in production, e.g. fake customers selling and buying?** This helps a lot with incident response, as well as with understanding which part of the system is struggling and whether the problem lies with third parties or with our own system.
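One way to read the assumption that new process types will keep appearing is that every process should plug into the platform through the same small contract, which also helps multiple teams build processes consistently. The sketch below is illustrative only: `Process`, `Batch`, `Registry` and `NewFuncProcess` are hypothetical names, not existing ACME code, and a production version would likely also pass a `context.Context` into `Run` for deadlines and cancellation.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Batch is a hypothetical unit of work: a group of records processed together.
type Batch struct {
	ID      string
	Records []string
}

// Process is a hypothetical contract that every business process implements,
// so the platform never needs to know the concrete process types.
type Process interface {
	Name() string
	Run(b Batch) error
}

// funcProcess adapts a plain function to the Process interface.
type funcProcess struct {
	name string
	fn   func(Batch) error
}

func (p funcProcess) Name() string      { return p.name }
func (p funcProcess) Run(b Batch) error { return p.fn(b) }

// NewFuncProcess wraps fn as a named Process.
func NewFuncProcess(name string, fn func(Batch) error) Process {
	return funcProcess{name: name, fn: fn}
}

// ErrUnknownProcess is returned when no process is registered under a name.
var ErrUnknownProcess = errors.New("unknown process")

// Registry maps process names to implementations.
type Registry struct {
	procs map[string]Process
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry { return &Registry{procs: make(map[string]Process)} }

// Register adds p and rejects duplicate names, so that two teams cannot
// silently overwrite each other's process.
func (r *Registry) Register(p Process) error {
	if _, exists := r.procs[p.Name()]; exists {
		return fmt.Errorf("process %q already registered", p.Name())
	}
	r.procs[p.Name()] = p
	return nil
}

// Run dispatches a batch to the named process.
func (r *Registry) Run(name string, b Batch) error {
	p, ok := r.procs[name]
	if !ok {
		return fmt.Errorf("%w: %s", ErrUnknownProcess, name)
	}
	return p.Run(b)
}

// Names returns the registered process names in sorted order.
func (r *Registry) Names() []string {
	names := make([]string, 0, len(r.procs))
	for n := range r.procs {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	r := NewRegistry()
	for _, name := range []string{"rebalance", "reconcile"} {
		name := name // copy for the closure (needed before Go 1.22)
		err := r.Register(NewFuncProcess(name, func(b Batch) error {
			fmt.Printf("%s: processed %d records of batch %s\n", name, len(b.Records), b.ID)
			return nil
		}))
		if err != nil {
			panic(err)
		}
	}
	fmt.Println(r.Names())
	_ = r.Run("rebalance", Batch{ID: "b1", Records: []string{"pot-1", "pot-2"}})
	if err := r.Run("buy", Batch{}); errors.Is(err, ErrUnknownProcess) {
		fmt.Println("error:", err)
	}
}
```

Keeping the contract this narrow means that cross-cutting concerns such as metrics, retries or canary flows could be added once around `Registry.Run` rather than in each process.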
Proposed is a system architecture template, with a lot of optional components to be chosen from depending on the actual needs of the system. The proposed shape allows for spikes in load, as the main compute components are horizontally scalable and are isolated by queues that act as buffers. One interesting thing to note is that this system shape encourages accepting as much as possible from external third parties and then processing it locally when possible. It thus discourages backpressure, which can be detrimental in some cases, especially during disaster scenarios.
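The idea of compute components isolated by a buffering queue can be sketched in-process with a buffered channel and a pool of goroutines. This is only an analogy under stated assumptions: `Trade`, `RunWorkers` and `TotalUnits` are hypothetical names, and the channel stands in for a durable queue. Unlike a durable queue, a full channel blocks the producer, which is exactly the backpressure the template tries to avoid at the system boundary.

```go
package main

import (
	"fmt"
	"sync"
)

// Trade is a hypothetical message placed on the queue.
type Trade struct {
	ID    int
	Units int
}

// RunWorkers starts n goroutines that drain queue and call handle for each
// trade. It returns once the queue has been closed and fully drained.
func RunWorkers(n int, queue <-chan Trade, handle func(Trade)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue {
				handle(t)
			}
		}()
	}
	wg.Wait()
}

// TotalUnits pushes a spike of trades through a buffered queue and sums the
// units using the given number of workers.
func TotalUnits(trades []Trade, bufSize, workers int) int {
	queue := make(chan Trade, bufSize)

	// Producer: accepts the whole spike and closes the queue when done.
	go func() {
		defer close(queue)
		for _, t := range trades {
			queue <- t
		}
	}()

	var mu sync.Mutex
	total := 0
	RunWorkers(workers, queue, func(t Trade) {
		mu.Lock()
		total += t.Units
		mu.Unlock()
	})
	return total
}

func main() {
	trades := make([]Trade, 0, 100)
	for i := 1; i <= 100; i++ {
		trades = append(trades, Trade{ID: i, Units: i})
	}
	fmt.Println("total units:", TotalUnits(trades, 16, 4))
}
```

Scaling horizontally corresponds to raising `workers` here; the result does not depend on the worker count because the only shared state is guarded by a mutex.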
Implementation notes:
* Start small: this template aims to inspire and provide an overview of possibilities; it does not mean we should build everything in it on day one.
* A good place to start would be:
* One process, not too big, but representative
* external HTTP trigger
* one lambda to handle the HTTP call and put a message on the bus
* a small number of lambdas to handle steps of processing (re-using lambda, not introducing new deployment concepts)
* one lambda to handle informing external party of our intent (e.g. buy trades)
* one lambda to handle informing the original caller about progress/state of processing - effectively forwarding events and transforming into correct format (HTTP calls ? queue ?)
* just one data store for everything, with synchronous DB calls - we need to get the semantics of event emission vs. DB writes right, e.g. via the outbox pattern?
* From there we could try and experiment with other things mentioned in the template
* Automated testing is extremely important, especially whenever we have a distributed system.
* Idempotency matters, both externally and internally. Never assume exactly-once delivery of a message; test extensively with scenarios that have a non-zero delivery error rate.
* Avoid synchronous communication with the actors requesting changes - e.g. if using HTTP - don't make clients wait for us to do any processing before we can issue an HTTP response. Store the intent and process later rather than processing in place.
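The notes above on storing the intent first, the outbox pattern and idempotency can be sketched together in memory. Everything here is a simplified assumption rather than a definitive implementation: `Store`, `Consumer`, `Event` and `LostAckPublisher` are hypothetical names, a mutex stands in for a database transaction, and a real relay would typically run as a separate poller against an outbox table.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Event is a hypothetical domain event; ID doubles as the idempotency key.
type Event struct {
	ID      string
	Payload string
}

// Store stands in for the single database. Intents and the outbox are updated
// under one lock, mimicking a single database transaction.
type Store struct {
	mu      sync.Mutex
	intents map[string]string
	outbox  []Event
}

// NewStore returns an empty store.
func NewStore() *Store { return &Store{intents: make(map[string]string)} }

// SaveIntent records the intent and its outgoing event atomically. A repeated
// call with the same ID is a no-op, so client retries are safe. It reports
// whether the intent was newly stored.
func (s *Store) SaveIntent(id, payload string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, seen := s.intents[id]; seen {
		return false
	}
	s.intents[id] = payload
	s.outbox = append(s.outbox, Event{ID: id, Payload: payload})
	return true
}

// Pending reports how many events still wait to be published.
func (s *Store) Pending() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.outbox)
}

// Relay tries to publish every pending event. An event leaves the outbox only
// after publish succeeds, so a failure leads to redelivery (at-least-once).
func (s *Store) Relay(publish func(Event) error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	remaining := s.outbox[:0]
	for _, e := range s.outbox {
		if err := publish(e); err != nil {
			remaining = append(remaining, e)
		}
	}
	s.outbox = remaining
}

// Consumer applies each event at most once by remembering processed IDs.
// It is not safe for concurrent use; this sketch calls it from one goroutine.
type Consumer struct {
	seen    map[string]bool
	Applied int
}

// NewConsumer returns a consumer that has seen no events.
func NewConsumer() *Consumer { return &Consumer{seen: make(map[string]bool)} }

// Handle applies e unless an event with the same ID was already applied.
func (c *Consumer) Handle(e Event) {
	if c.seen[e.ID] {
		return
	}
	c.seen[e.ID] = true
	c.Applied++
}

// LostAckPublisher simulates a bus that delivers every event to c but reports
// failure for the first `failures` calls, as if the acknowledgement was lost.
func LostAckPublisher(c *Consumer, failures int) func(Event) error {
	return func(e Event) error {
		c.Handle(e)
		if failures > 0 {
			failures--
			return errors.New("ack lost")
		}
		return nil
	}
}

func main() {
	s, c := NewStore(), NewConsumer()
	s.SaveIntent("buy-1", "buy 10 units")
	s.SaveIntent("buy-1", "buy 10 units") // client retry: ignored
	publish := LostAckPublisher(c, 1)
	s.Relay(publish) // delivered, but the ack is lost: event stays in the outbox
	s.Relay(publish) // redelivered: the consumer sees a duplicate and skips it
	fmt.Println("pending:", s.Pending(), "applied:", c.Applied)
}
```

The lost-acknowledgement publisher is the kind of non-zero delivery error scenario worth covering in automated tests: the event is delivered twice, yet the consumer applies it once.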
To validate the system template and choose specific components from it, let's analyse the processes we know of and see how easy it is to implement them.