This exercise is to model a pensions system.
Original text from the client (named ACME):
ACME operates a Master Trust Pension. An ACME pension consists of units of Assets held on behalf of the member. These units, and any associated cash, are recorded in an internal ledger. The actual units of each asset are aggregated into a single holding held by an external Asset Custodian on behalf of ACME. Asset units are bought and sold on behalf of members. The process of buying and selling assets takes several days: trades are only placed periodically and take several days to settle. The basic process for buying assets is:
```mermaid
flowchart TD
    member([Member])-- Invest cash -->create_buy_trade
    create_buy_trade[Create buy trade]-->aggregate
    aggregate[Aggregate into an external buy trade]-.->external_buy_trade
    external_buy_trade[Place an external buy trade]-->buy_assets
    buy_assets[Buy assets]-->custodian
    custodian([Asset custodian]) -. Trade priced .-> move_to_pots1
    custodian -. Trade settled .-> move_to_pots2
    move_to_pots1[Move assets to pots]-->move_cash_to_custodian
    move_cash_to_custodian[Move cash from pots to custodian account]
    move_to_pots2[Move assets to pots]
```
- Dotted arrow: delayed operation
- Round shape: an actor
The platform is intended to support up to 5 million members. The following are the meanings of terms used by ACME:
- Internal Account - a logical account within the internal ledger that records the holding of an amount of cash or units of assets
- Pot - a collection of Internal Accounts that belong to a Member, ACME, or an external body, e.g. an Asset Custodian
- External Account - a real account in an external organisation, e.g. a bank account with a bank or an asset account with an asset custodian.
- Re-balancing - the process of adjusting the number of units held for each of a number of assets in the asset Internal Account in a Pot to ensure the mix is correct for the chosen risk profile.
- Asset custodian - a body that holds asset units on behalf of ACME and trades them when instructed.
Consider the description above in an event driven platform with information stored in a database. How would you design such a system and what challenges would you expect to find? The solution should consider the following:
- ACME is regulated by 3 financial authorities and the data we hold must be 100% accurate
- Assets may be bought and sold
- There are some automated processes, e.g. re-balancing, that generate large numbers of buy and sell trades as the prices of assets move.
- There are some processes that must process large amounts of data, e.g. reconciliation
- This is not a real time trading environment, prices are always closing prices, external trades are only placed with asset custodians on a daily basis.
Assuming the platform is implemented in Go, what patterns would you use to enable large numbers (c. 200-300) of processes to be produced in a consistent and repeatable manner by multiple teams?
Interpretation of the text of the exercise. Assumptions made.
Some notes on the interpretation of the text and questions to ask. Main points to address are marked in bold
- There are a lot of different business processes (200-300 as per the text)
    - What is a process?
        - How unique are those processes? Do they share any common parts?
        - The original text describes a "process for buying assets"
            - Is it a typical process?
            - Is it comprised of smaller processes? This could explain the 200-300 processes if so.
        - They need to be managed by different teams across the company
            - It would be great to allow each team to work independently and yet share any improvements.
            - How to allow for code and pattern reuse while maintaining team autonomy
                - Encourage code reuse via libraries, e.g. auth, message parsing, common type definitions
            - Each team owns a subset of processes
        - How to allow each team to model a process
            - Can the system benefit from global workflow orchestration, so that running the workflows is separated from defining them?
                - Encourage investigation of workflow orchestration engines
                    - AWS
                        - Step Functions
                        - SWF
                    - Self-hosted
                        - Temporal
        - Types of processes
            - The text mentions that some processes are automated; I assume it means that any process can have both automated and manual components.
                - Make sure manual intervention is handled similarly to automated intervention, to incentivise code reuse
                    - e.g. an HTTP call from the UI is handled by putting a message on the queue, and the rest of the processing is unaware of the source of the trigger
            - Mentions of some processes needing to process large amounts of data
                - Let's try to estimate
                - Is keeping historical data (e.g. past trades) in hot storage (quickly accessible) important?
                    - I assume historical data older than N months can be moved to cold storage.
- "Data must be 100% accurate"
    - What is "accuracy" in this statement?
        - Does it mean consistency and durability?
            - Losing data is assumed very bad
                - Once something has entered the system from a third party, e.g. an order to buy or sell, it needs to be persisted durably. We can only reply to the initial call from the third party once we can guarantee durability.
                - I assume each part of the system can be at different phases of processing data, but as long as third parties view a consistent state of transactions that is acceptable.
                - Is the granularity here per-customer?
                    - I think the text mentions that all buy and sell transactions are batched up to be executed together.
                        - Is it important for these batches to be externally observable?
                            - Can a customer or a regulator demand proof that a transaction was processed at roughly the same time as others, in the same batch?
- Not a real-time system
    - I assume we care more about total system throughput than about the latency of individual processing steps
    - The original text mentions various business processes, some of them reading a lot of data, so I assume the load on the system can be "spiky"
        - **Is it worth doing capacity planning, autoscaling, or a mixture of both?**
            - I assume it is worth presenting a solution that can scale automatically.
- Additional criteria
    - Long term velocity & staff retention
        - The system should be built in a way that allows for continuous fast iteration, and the length of the feedback loop should not deteriorate with the age of the system.
            - Focus on a good mapping between isolated problem areas and teams; make sure the team structure reflects the desired architecture (inverse Conway maneuver)
            - Allow for autonomy, mastery and purpose to be maximised in each team
                - encourage continuous learning via coaching, lunch and learns, etc.
            - Testing
                - Continuous fake transactions flowing through a non-prod environment, with metrics gathered
            - Encourage starting with a problem statement and only proceeding to discuss solutions once everyone agrees. Apply at different scales.
                - encourage problem and solution separation at the project/programme definition level
                - encourage programmers to start with a task definition and high-level tests, and only then proceed with the solution/implementation
    - Build trust in the system/external branding
        - It needs to work.
            - Observability & alerting
            - Measure and minimise impact on any external third party. Prefer impact on internal teams to impact on customers.
            - Could we set up canary/fake flows on production? E.g. fake customers selling and buying. This helps a lot with incident response as well as understanding which part of the system is struggling, whether it's related to third parties or our system itself.
                - See testing above
Proposals
Proposed is a system architecture template, with a number of optional components to be chosen from depending on the actual needs of the system. The proposed shape allows for spikes in load, as the main compute components are horizontally scalable and are isolated by queues that act as buffers. One interesting thing to note is that this system shape encourages accepting as much as possible from external third parties up front and then processing it locally. Thus it discourages backpressure, which can be detrimental in some cases, especially during disaster scenarios.