
This exercise is to model a pensions system.

Version control via git

Original text from the client (named ACME)

ACME operates a Master Trust Pension. An ACME pension consists of units of Assets held on behalf of the member. These units, and any associated cash, are recorded in an internal ledger. The actual units of each asset are aggregated into a single holding held by an external Asset Custodian on behalf of ACME. Asset units are bought and sold on behalf of members. The process of buying and selling assets takes several days: trades are only placed periodically and take several days to settle. The basic process for buying assets is:

```mermaid
flowchart TD
    member([Member])-- Invest cash -->create_buy_trade
    create_buy_trade[Create buy trade]-->aggregate
    aggregate[Aggregate into an external buy trade]-.->external_buy_trade
    external_buy_trade[Place an external buy trade]-->buy_assets
    buy_assets[Buy assets]-->custodian
    custodian([Asset custodian]) -. Trade priced .-> move_to_pots1
    custodian -. Trade settled .-> move_to_pots2
    move_to_pots1[Move assets to pots]-->move_cash_to_custodian
    move_cash_to_custodian[Move cash from pots to custodian account]
    move_to_pots2[Move assets to pots]
```

A dotted arrow indicates a delayed operation; a round shape indicates an actor.
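Assuming an event-driven implementation in Go (as the exercise later suggests), the stages of the buy flow above could map onto event types roughly like the following sketch. All type names and fields here are illustrative assumptions, not part of the client's specification:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative event types for the buy flow; names and fields are assumptions.
type CashInvested struct {
	MemberID string
	Amount   int64 // minor units, e.g. pence
	At       time.Time
}

type BuyTradeCreated struct {
	TradeID  string
	MemberID string
	Amount   int64
}

type ExternalBuyTradePlaced struct {
	ExternalTradeID string
	TradeIDs        []string // internal trades aggregated into this one
}

// TradePriced and TradeSettled arrive later from the custodian
// (the dotted arrows in the diagram).
type TradePriced struct {
	ExternalTradeID string
	UnitPrice       int64
}

type TradeSettled struct {
	ExternalTradeID string
	Units           int64
}

// aggregate groups internal buy trades into one external trade.
// The external trade ID is hardcoded here purely for illustration.
func aggregate(trades []BuyTradeCreated) ExternalBuyTradePlaced {
	ids := make([]string, 0, len(trades))
	for _, t := range trades {
		ids = append(ids, t.TradeID)
	}
	return ExternalBuyTradePlaced{ExternalTradeID: "ext-1", TradeIDs: ids}
}

func main() {
	ext := aggregate([]BuyTradeCreated{
		{TradeID: "t1", MemberID: "m1", Amount: 100_00},
		{TradeID: "t2", MemberID: "m2", Amount: 250_00},
	})
	fmt.Println(len(ext.TradeIDs))
}
```

Keeping each stage as a distinct event type is what later lets the "trade priced" and "trade settled" callbacks from the custodian be handled asynchronously, days after the trade was placed.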

The platform is intended to support up to 5 million members. The following are the meanings of terms used by ACME:

  • Internal Account - a logical account within the internal ledger that records the holding of an amount of cash or units of assets
  • Pot - a collection of Internal Accounts that belong to a Member, ACME, or an external body, e.g. an Asset Custodian
  • External Account - a real account in an external organisation, e.g. a bank account with a bank or an asset account with an asset custodian.
  • Re-balancing - the process of adjusting the number of units held for each of a number of assets in the asset Internal Account in a Pot to ensure the mix is correct for the chosen risk profile.
  • Asset custodian - a body that holds asset units on behalf of ACME and trades them when instructed.

Consider the description above in an event-driven platform with information stored in a database. How would you design such a system, and what challenges would you expect to find? The solution should consider the following:

  • ACME is regulated by 3 financial authorities and the data we hold must be 100% accurate
  • Assets may be bought and sold
  • There are some automated processes, e.g. re-balancing, that generate large numbers of buy and sell trades as the prices of assets move.
  • There are some processes that must process large amounts of data, e.g. reconciliation
  • This is not a real-time trading environment: prices are always closing prices, and external trades are only placed with asset custodians on a daily basis.

Assuming the platform is implemented in Go, what patterns would you use to enable large numbers (c. 200-300) of processes to be produced in a consistent and repeatable manner by multiple teams?

Interpretation of the text of the exercise. Assumptions made.

Some notes on the interpretation of the text and questions to ask. Main points to address are marked in bold.

  • There are a lot of different business processes (200-300 as per the text)

    • What is a process?
      • How unique are those processes? Do they share any common parts?
        • Three processes are mentioned in the original text - buying assets, re-balancing and reconciliation. I assume that the following is true for the majority of the processes in the system:
          • They need to process multiple records in batches
          • They are not real-time, but need to finish within a specified amount of time, i.e. throughput is important
          • On one hand, if multiple teams work on different types of processes independently, we could expect the momentary system load to stay close to the average, i.e. spikes cancel each other out. However, I am going to assume that this is not the case, especially if the business processes are tied to a country-specific wall-clock.
          • New types of processes will appear in the future - we cannot design a system hard-tied to only the known process types.
      • The original text describes a "process for buying assets"
        • Is it a typical process?
        • Is it composed of smaller processes? If so, this could explain the 200-300 processes.
    • They need to be managed by different teams across the company
      • Would be great to allow for each team to work independently and yet share any improvements.
      • How to allow for code and pattern reuse while maintaining team autonomy
        • Encourage code reuse via libraries, e.g. auth, message parsing, common type definitions
        • Each team owns a subset of processes
          • How to allow each team to model a process
            • Can the system benefit from global workflow orchestration, so that running the workflows is separated from defining them?
              • Encourage investigation of workflow orchestration engines
                • AWS
                  • Step functions
                  • SWF
                • Self-hosted
                  • temporal
    • Types of processes
      • The text mentions that some processes are automated; I assume this means that any process can have both automated and manual components.
        • Make sure manual intervention is handled similarly to automated processing, to incentivise code reuse
          • e.g. an HTTP call from the UI is handled by putting a message on the queue, and the rest of the processing is unaware of the source of the trigger
      • Mentions of some processes needing to process large amounts of data
      • Let's try to estimate the data volumes
        • Is keeping historical data (e.g. past trades) in hot storage (quickly accessible) important?
          • I assume historical data older than N months can be moved to cold storage.
  • "Data must be 100% accurate"

    • What is "accuracy" in this statement ?
      • Does it mean consistency and durability ?
        • Losing data is assumed very bad
          • Once something has entered a system from a third party, e.g. an order to buy or sell it need to be persisted durably. We can only reply to the initial call from the third party once we can guarantee durability.
            • What is considered an acceptable loss of data ? Under what circumstances ?
              • Natural disasters, war etc covering a whole DC region
        • I assume each part of the system can be at different phases of processing data, but as long as the third parties view a consistent state of transactions it is allowed.
          • Is the granularity here per-customer ?
            • I think the text mentions that all buy and sell transactions are batched up to be executed together.
              • Is it important for these batches to be externally observable ?
                • Can a customer or a regulator demand proof that a transaction was processed at roughly the same time as others, in the same batch ?
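The "reply only after durable persistence" rule above can be sketched minimally in Go. The in-memory store, function names, and string reply are all stand-ins of my own; a real implementation would write to a replicated database or log and only acknowledge once that write is durable:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a stand-in for a durable store (e.g. a replicated DB or log).
type store struct {
	mu     sync.Mutex
	orders map[string]string
}

// persist records an order. In a real system this would return nil only
// once the write is known to be durable (e.g. replicated/fsynced).
func (s *store) persist(orderID, payload string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.orders == nil {
		s.orders = map[string]string{}
	}
	s.orders[orderID] = payload
	return nil
}

// acceptOrder acknowledges to the caller only after the order is persisted,
// so an "accepted" reply guarantees durability.
func acceptOrder(s *store, orderID, payload string) (string, error) {
	if err := s.persist(orderID, payload); err != nil {
		return "", fmt.Errorf("not accepted: %w", err)
	}
	return "accepted", nil
}

func main() {
	s := &store{}
	reply, _ := acceptOrder(s, "ord-1", "buy 10 units of FUND-A")
	fmt.Println(reply)
}
```

The key property is the ordering: the side effect (persist) strictly precedes the acknowledgement, so the third party never holds an "accepted" reply for data we could still lose.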
  • Not a real-time system

    • I assume we care more about total system throughput than about the latency of individual processing steps
    • The original text mentions various business processes, some of them reading a lot of data; I assume the load on the system can be "spiky"
      • Is it worth doing capacity planning, autoscaling, or a mixture of both?
        • I assume it is worth presenting a solution that can scale automatically.
  • Additional criteria

    • Long term velocity & staff retention
      • The system should be built in a way that allows for continuous fast iteration and the size of the feedback loop should not deteriorate with the age of the system.
        • Focus on a good mapping between isolated problem areas and teams; make sure the team structure reflects the desired architecture (inverse Conway maneuver)
        • Allow for autonomy, mastery and purpose to be maximised in each team
          • encourage continuous learning via coaching, lunch-and-learns, etc.
        • Testing
          • Continuous fake transactions flowing through a non-prod environment with metrics gathered
          • Encourage starting with a problem statement and only once everyone agrees proceed to discuss solutions. Apply at different scales.
            • encourage problem and solution separation at project/programme definition level
            • encourage programmers to start with task definition and writing high-level tests and only then proceed with solution/implementation
    • Build trust in the system/external branding
      • It needs to work.
        • Observability & alerting
        • Measure and minimise impact on any external third party. Prefer impact on internal teams to impact on customers.
          • Could we set up canary/fake flows on production? E.g. fake customers selling and buying. This helps a lot with incident response, as well as with understanding which part of the system is struggling, and whether issues relate to third parties or to our system itself.
        • See testing above
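One concrete pattern from the notes above is that manual and automated triggers should converge on the same queue, so downstream processing is unaware of the source. This can be sketched in Go, with a buffered channel standing in for the message bus; all names here are hypothetical:

```go
package main

import "fmt"

// Trigger is the single event type placed on the queue; downstream
// processing does not know or care where it came from.
type Trigger struct {
	Kind   string // e.g. "rebalance"
	Source string // "ui" or "scheduler"; informational only
}

// enqueueFromHTTP is what a UI-facing handler would call after receiving
// an HTTP request from an operator.
func enqueueFromHTTP(q chan<- Trigger, kind string) {
	q <- Trigger{Kind: kind, Source: "ui"}
}

// enqueueFromScheduler is what an automated job would call.
func enqueueFromScheduler(q chan<- Trigger, kind string) {
	q <- Trigger{Kind: kind, Source: "scheduler"}
}

// process handles n triggers identically regardless of their source.
func process(q <-chan Trigger, n int) []string {
	var handled []string
	for i := 0; i < n; i++ {
		t := <-q
		handled = append(handled, t.Kind)
	}
	return handled
}

func main() {
	q := make(chan Trigger, 10) // buffered channel standing in for a queue
	enqueueFromHTTP(q, "rebalance")
	enqueueFromScheduler(q, "rebalance")
	fmt.Println(process(q, 2))
}
```

Because the worker sees only `Trigger` values, adding a new trigger source (a CLI tool, a backfill job) requires no changes downstream, which is exactly the code-reuse incentive described above.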

Proposal

System architecture template

Proposed is a system architecture template, with many optional components to be chosen from depending on the actual needs of the system. The proposed shape allows for spikes in load, as the main compute components are horizontally scalable and are isolated by queues that act as buffers. One interesting thing to note is that this system shape encourages accepting as much input from external third parties as possible and then processing it locally. It thus discourages backpressure, which can be detrimental in some cases, especially during disaster scenarios.

Implementation notes:

  • Start small: this template aims to inspire and provide an overview of possibilities; it does not mean we should build everything present in it on day one.
    • A good place to start would be:
      • One process, not too big, but representative
      • external HTTP trigger
      • one lambda to handle the HTTP call and put a message on the bus
      • a small number of lambdas to handle steps of processing (re-using lambda, not introducing new deployment concepts)
        • one lambda to handle informing external party of our intent (e.g. buy trades)
      • one lambda to handle informing the original caller about the progress/state of processing - effectively forwarding events and transforming them into the correct format (HTTP calls? a queue?)
      • just one data store for everything, with synchronous DB calls - we need to get the event emission vs DB call semantics right, e.g. via the outbox pattern
    • From there we could try and experiment with other things mentioned in the template
  • Automated testing is extremely important, especially whenever we have a distributed system.
  • Idempotency matters, both externally as well as internally. Never assume exactly-once delivery of a message, test extensively with non-zero delivery error rate scenarios.
  • Avoid synchronous communication with the actors requesting changes - e.g. if using HTTP, don't make clients wait for us to do any processing before we issue an HTTP response. Store the intent and process it later rather than processing in place.
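A minimal sketch of the idempotency point above, in Go: deduplicate by message ID so a redelivered message is acknowledged without reprocessing. The in-memory set is a stand-in of my own; in practice the seen-IDs would live in the database, ideally updated in the same transaction as the side effect:

```go
package main

import (
	"fmt"
	"sync"
)

// dedup tracks processed message IDs. In production this set would be
// persisted, ideally in the same transaction as the handler's side effect.
type dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

// handleOnce applies fn at most once per message ID and reports whether
// fn ran. Redelivered messages are acknowledged without reprocessing.
func (d *dedup) handleOnce(msgID string, fn func()) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen == nil {
		d.seen = map[string]bool{}
	}
	if d.seen[msgID] {
		return false // duplicate delivery: ack, but do nothing
	}
	d.seen[msgID] = true
	fn()
	return true
}

func main() {
	d := &dedup{}
	count := 0
	inc := func() { count++ }
	d.handleOnce("msg-1", inc) // first delivery: processed
	d.handleOnce("msg-1", inc) // redelivery: skipped
	fmt.Println(count)
}
```

This is exactly the property worth testing under a non-zero delivery error rate: however many times the bus redelivers `msg-1`, the side effect happens once.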

System architecture template diagram (source: system_architecture_template.svg)

Example processes - modeling and notes

To validate the system template and choose specific components from it, let's analyse the processes we know of and see how easy they are to implement. It would be good to draw sequence diagrams to understand the different processes.

(I ran out of time here, but based on my intuition/experience the template should still work)

Data design/modelling

It would be interesting to model how data is stored and see how we can make things reusable.