ACME operates a Master Trust Pension. An ACME pension consists of units of assets held on behalf of the member. These units, and any associated cash, are recorded in an internal ledger. The actual units of each asset are aggregated into a single holding held by an external Asset Custodian on behalf of ACME.
Asset units are bought and sold on behalf of members. Trades are only placed periodically and take several days to settle.
* Re-balancing - the process of adjusting the number of units held for each of a number of assets in a Pot's Internal Account, to ensure the mix is correct for the chosen risk profile.
* Asset custodian - a body that holds asset units on behalf of ACME and trades them when instructed.
Consider the description above in an event-driven platform with information stored in a database. How would you design such a system, and what challenges would you expect to find? The solution should consider the following:
* ACME is regulated by 3 financial authorities and the data we hold must be 100% accurate
* Assets may be bought and sold
* There are some automated processes, e.g. re-balancing, that generate large numbers of buy and sell trades as the prices of assets move.
* There are some processes that must process large amounts of data, e.g. reconciliation
* This is not a real-time trading environment: prices are always closing prices, and external trades are only placed with asset custodians on a daily basis.
Assuming the platform is implemented in Go, what patterns would you use to enable large numbers (c. 200-300) of processes to be produced in a consistent and repeatable manner by multiple teams?
* Once something has entered the system from a third party, e.g. an order to buy or sell, it needs to be persisted durably. **We can only reply to the initial call from the third party once we can guarantee durability.** (A sketch of this rule follows below.)
* **I assume each part of the system can be at a different phase of processing data; as long as third parties see a consistent view of transactions, that is allowed.**
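A minimal sketch of the durability rule above, assuming a SQL-backed store and a hypothetical `orders` table (the table name, columns, and request shape are all illustrative): the handler acknowledges the third party only after the insert has committed.

```go
package ingest

import (
	"database/sql"
	"encoding/json"
	"net/http"
)

// OrderRequest is an illustrative shape for an incoming buy/sell order.
type OrderRequest struct {
	MemberID string `json:"member_id"`
	AssetID  string `json:"asset_id"`
	Units    int64  `json:"units"` // integral units; no floats in a regulated ledger
	Side     string `json:"side"`  // "buy" or "sell"
}

// AcceptOrder persists an incoming order durably before acknowledging.
// The 200 response is itself the durability guarantee to the caller.
func AcceptOrder(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req OrderRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		// Single-statement insert: committed, and therefore durable,
		// once ExecContext returns without error. Downstream components
		// pick the row up asynchronously, so they may lag behind, as
		// long as third parties keep seeing a consistent view.
		_, err := db.ExecContext(r.Context(),
			`INSERT INTO orders (member_id, asset_id, units, side, status)
			 VALUES ($1, $2, $3, $4, 'accepted')`,
			req.MemberID, req.AssetID, req.Units, req.Side)
		if err != nil {
			// Do NOT acknowledge; the third party must retry.
			http.Error(w, "could not persist order", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK) // reply only after the durable write
	}
}
```

An idempotency key supplied by the caller would additionally make retries after a lost acknowledgement safe to deduplicate.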
* Is the granularity here per-customer?
* I think the text mentions that all buy and sell transactions are batched up to be executed together.
* Is it important for these batches to be externally observable?
* **Can a customer or a regulator demand proof that a transaction was processed at roughly the same time as others, in the same batch?** (One possible shape for this is sketched below.)
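If batch membership does need to be provable, one possible shape (all names illustrative, not from the original text) is to stamp every trade with the identifier of the daily batch it was placed in:

```go
package trades

import "time"

// Batch groups all trades placed together with the asset custodian.
// PlacedAt is the shared "processed at" timestamp for the whole batch.
type Batch struct {
	ID       string
	PlacedAt time.Time
}

// Trade links each member-level buy/sell to the batch it settled in.
type Trade struct {
	ID      string
	OrderID string
	BatchID string // assigned when the daily batch is cut
	Side    string // "buy" or "sell"
	Units   int64
}
```

Answering a regulator is then a lookup of all trades sharing a `BatchID`, cited together with that batch's `PlacedAt`.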
* The original text mentions various business processes, some of them reading a lot of data. **I assume the load on the system can be "spiky".**
* Is it worth doing capacity planning, autoscaling, or a mixture of both?
* **I assume it is worth presenting a solution that can scale automatically.**
* Additional criteria
* Long term velocity & staff retention
* The system should be built in a way that allows for continuous fast iteration, and the feedback loop should not lengthen as the system ages.
* Focus on good mapping between isolated problem areas and teams
* **make sure team structure reflects desired architecture (inverse Conway maneuver)**
* Allow for autonomy, mastery and purpose to be maximised in each team
* **encourage continuous learning via coaching, lunch and learns, etc.**
* Testing
* Continuous fake transactions flowing through a non-prod environment, with metrics gathered (a generator for this is sketched after the `Build trust` notes below)
* Encourage starting with a problem statement and only once everyone agrees proceed to discuss solutions. Apply at different scales.
* **encourage problem and solution separation at project/programme definition level**
* **encourage programmers to start with a task definition and high-level tests, and only then proceed to the solution/implementation**
* Build trust in the system/external branding
* It needs to work.
* Observability & alerting
* Measure and minimise impact on any external third party. Prefer impact on internal teams to impact on customers.
* **Could we set up canary/fake flows on production? E.g. fake customers selling and buying.** This helps a lot with incident response, as well as with understanding which part of the system is struggling and whether the cause lies with third parties or with our own system (see the sketch below).
* See `testing` above
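One hedged sketch of such a generator, assuming trades carry a `Synthetic` flag so the ledger, reconciliation, and reporting can exclude them. Pointed at non-prod it gives the continuous fake transactions from the `Testing` notes; pointed at production with flagged fake customers it doubles as the canary flow. The type and function names are illustrative.

```go
package canary

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"time"
)

// SyntheticTrade is a fake order tagged so every downstream component
// (ledger, reconciliation, reporting) can recognise and exclude it.
type SyntheticTrade struct {
	MemberID  string
	AssetID   string
	Units     int64
	Synthetic bool // always true for generated traffic
}

// RunSyntheticFlow emits one fake trade per tick and records how long
// the platform takes to acknowledge it, feeding latency metrics and
// alerting hooks.
func RunSyntheticFlow(ctx context.Context, submit func(SyntheticTrade) error) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			t := SyntheticTrade{
				MemberID:  fmt.Sprintf("fake-member-%03d", rand.Intn(100)),
				AssetID:   "TEST-ASSET",
				Units:     int64(rand.Intn(1000) + 1),
				Synthetic: true,
			}
			start := time.Now()
			if err := submit(t); err != nil {
				log.Printf("synthetic trade failed: %v", err) // alert hook
				continue
			}
			log.Printf("synthetic trade acknowledged in %s", time.Since(start))
		}
	}
}
```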
## Proposals
Proposed is a system architecture template with many optional components, to be chosen depending on the actual needs of the system. The proposed shape tolerates spikes in load, as the main compute components are horizontally scalable and are isolated by queues that act as buffers. One interesting thing to note is that this shape encourages accepting as much as possible from external third parties and then processing it locally where possible. It thus discourages backpressure, which can be detrimental in some cases, especially during disaster scenarios.
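As a concrete sketch of that shape, and one answer to the earlier question about producing c. 200-300 processes consistently across teams: each process could be an instance of a shared consumer skeleton like the one below, where only the `Handler` differs per service. The channel stands in for whatever queue technology is actually chosen, and all names are illustrative.

```go
package worker

import (
	"context"
	"log"
	"sync"
)

// Message is whatever unit of work arrives off the queue.
type Message struct {
	ID      string
	Payload []byte
}

// Handler is the only thing an individual team writes; consumption,
// concurrency, and shutdown are shared across all services.
type Handler func(ctx context.Context, m Message) error

// RunConsumer drains a queue with a fixed pool of workers. The queue
// itself absorbs spikes; adding consumer instances scales horizontally.
func RunConsumer(ctx context.Context, queue <-chan Message, workers int, h Handler) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case m, ok := <-queue:
					if !ok {
						return
					}
					if err := h(ctx, m); err != nil {
						// A real service would retry or dead-letter
						// here rather than just log.
						log.Printf("message %s failed: %v", m.ID, err)
					}
				}
			}
		}()
	}
	wg.Wait()
}
```

Because the queue buffers spikes and each consumer is stateless, scaling becomes a matter of adding instances rather than applying backpressure to the third party.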