added more assumptions and started writing additional criteria and proposal sections

This commit is contained in:
Cyryl Płotnicki 2024-07-08 11:35:00 +01:00
parent b69c122639
commit b6100750b4


@@ -1,5 +1,7 @@
# This exercise is to model a pensions system.
[Version control via git](https://git.cyplo.dev/cyplo/exercises/src/branch/main/fi1/README.md)
## Original text from the client (named ACME)
ACME operates a Master Trust Pension. An ACME pension consists of units of Assets held on behalf of the member. These units, and any associated cash, are recorded in an internal ledger. The actual units of each asset are aggregated into a single holding held by an external Asset Custodian on behalf of ACME.
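The ledger and custodian arrangement described above can be sketched as a minimal data model. This is only an illustrative sketch; the class, member, and asset names are hypothetical and not part of the client's text:

```python
from collections import defaultdict

class Ledger:
    """Hypothetical minimal internal ledger: per-member unit holdings,
    aggregated into a single custodian-level position per asset."""

    def __init__(self) -> None:
        # member -> asset -> units held on behalf of that member
        self.holdings = defaultdict(lambda: defaultdict(float))

    def record(self, member: str, asset: str, units: float) -> None:
        self.holdings[member][asset] += units

    def custodian_position(self, asset: str) -> float:
        # The external Asset Custodian holds one aggregate position per asset,
        # so we sum every member's units for it.
        return sum(per_member[asset] for per_member in self.holdings.values())

ledger = Ledger()
ledger.record("member-1", "GLOBAL_EQUITY", 10.0)
ledger.record("member-2", "GLOBAL_EQUITY", 5.5)
```

The internal ledger keeps the per-member breakdown, while `custodian_position` reflects the single external holding.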
@@ -56,13 +58,26 @@ Some notes on the interpretation of the text and questions to ask.
* Is it composed of smaller processes ? This could explain the 200-300 processes if so.
* They need to be managed by different teams across the company
* Would be great to allow for each team to work independently and yet share any improvements.
* How to allow for code and pattern reuse while maintaining team autonomy
* **Encourage code reuse via libraries, e.g. auth, message parsing, common type definitions**
* Each team owns a subset of processes
* How to allow each team to model a process
* Can the system benefit from global workflow orchestration, so that running of the workflows is separated from defining them ?
* **Encourage investigation of workflow orchestration engines**
* AWS
* Step functions
* SWF
* Self-hosted
* Temporal
* Types of processes
* The text mentions that some processes are automated; **I assume this means any process can have both automated and manual components.**
* Make sure manual intervention is handled similarly to the automated path, to incentivise code reuse
* e.g. HTTP call from UI is handled by putting a message on the queue and the rest of the processing is unaware of the source of the trigger
* Mentions of some processes needing to process large amounts of data
* Let's try to estimate
* Is keeping historical data (e.g. past trades) in hot storage (quickly accessible) important ?
* **I assume historical data older than N months can be moved to cold storage.**
* "Data must be 100% accurate"
* What is "accuracy" in this statement ?
* Does it mean consistency and durability ?
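The trigger-unification idea above (an HTTP call from the UI puts a message on a queue, so downstream processing never knows whether the trigger was manual or automated) can be sketched as follows. All names are hypothetical, and a single in-process `queue.Queue` stands in for a real message broker:

```python
import json
import queue

# In-process stand-in for a real broker (e.g. SQS, RabbitMQ).
work_queue: "queue.Queue[str]" = queue.Queue()

def enqueue_trade(source: str, member_id: str, units: int) -> None:
    """Manual (UI) and automated triggers share this single entry point."""
    work_queue.put(json.dumps(
        {"member_id": member_id, "units": units, "source": source}))

def handle_next_trade() -> dict:
    """The worker is unaware of whether the trigger was manual or automated."""
    msg = json.loads(work_queue.get())
    # ... actual trade processing would happen here ...
    return msg

# Manual trigger, e.g. from an HTTP handler behind the UI:
enqueue_trade("ui", "member-42", 10)
# Automated trigger, e.g. from a scheduled job:
enqueue_trade("scheduler", "member-43", 5)

first = handle_next_trade()
```

Because both paths converge on the same message shape, any improvement to the worker benefits manual and automated flows alike.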
@@ -73,7 +88,33 @@ Some notes on the interpretation of the text and questions to ask.
* I think the text mentions that all buy and sell transactions are batched up to be executed together.
* Is it important for these batches to be externally observable ?
* **Can a customer or a regulator demand proof that a transaction was processed at roughly the same time as others, in the same batch ?**
* Not a real-time system
* **I assume we care more about total system throughput than about latency of individual steps of processing**
* Original text mentions various business processes, some of them reading a lot of data, **I assume the load on the system can be "spiky"**
* Is it worth doing capacity planning, autoscaling, or a mixture of both ?
* **I assume it is worth presenting a solution that can scale automatically.**
* Additional criteria
* Long term velocity & staff retention
* The system should be built in a way that allows for continuous fast iteration; the feedback loop should not lengthen as the system ages.
* Focus on good mapping between isolated problem areas and teams
* **Make sure team structure reflects the desired architecture (inverse Conway maneuver)**
* Allow for autonomy, mastery and purpose to be maximised in each team
* **encourage continuous learning via coaching, lunch-and-learns, etc.**
* Testing
* Continuous fake transactions flowing through a non-prod environment, with metrics gathered
* Encourage starting with a problem statement and, only once everyone agrees on it, proceeding to discuss solutions. Apply this at different scales.
* **encourage problem and solution separation at project/programme definition level**
* **encourage programmers to start with task definition and writing high-level tests, and only then proceed with solution/implementation**
* Build trust in the system/external branding
* It needs to work.
* Observability & alerting
* Measure and minimise impact on any external third party. Prefer impact on internal teams to impact on customers.
* **Could we set up canary/fake flows on production ? E.g. fake customers selling and buying** This helps a lot with incident response as well as understanding which part of the system is struggling, whether it's related to third parties or our system itself.
* See `testing` above
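The batching question raised above (can a customer or regulator demand proof that a transaction was executed with others in the same batch ?) suggests stamping every order with a shared batch id and a single execution timestamp. A minimal sketch, with all class and field names hypothetical:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class TransactionBatch:
    """Hypothetical batch of buy/sell orders executed together; the shared
    batch_id and execution timestamp make the grouping externally auditable."""
    batch_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    executed_at: Optional[datetime] = None
    orders: List[dict] = field(default_factory=list)

    def add(self, member_id: str, side: str, units: float) -> None:
        assert self.executed_at is None, "cannot add to an executed batch"
        self.orders.append(
            {"member_id": member_id, "side": side, "units": units})

    def execute(self) -> List[dict]:
        # Stamp every order with the batch id and one execution time,
        # so co-batching can be verified after the fact.
        self.executed_at = datetime.now(timezone.utc)
        stamp = self.executed_at.isoformat()
        return [dict(o, batch_id=self.batch_id, executed_at=stamp)
                for o in self.orders]

batch = TransactionBatch()
batch.add("member-1", "buy", 10.0)
batch.add("member-2", "sell", 4.0)
records = batch.execute()
```

Persisting these stamped records would give the externally observable batch evidence the question asks about.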
## Proposals
Proposed is a system architecture template, with many optional components to be chosen from depending on the actual needs of the system. The proposed shape accommodates spikes in load, as the main compute components are horizontally scalable and are isolated by queues that act as buffers. One interesting thing to note is that this system shape encourages accepting as much as possible from external third parties up front and then processing it locally where possible. It thus avoids exerting backpressure on those parties, which can be detrimental in some cases, especially during disaster scenarios.
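The buffering behaviour described above can be sketched in a few lines: ingestion accepts third-party events immediately, a queue absorbs the burst, and workers drain it at their own pace. All names are hypothetical, and an in-process `queue.Queue` stands in for a real broker:

```python
import queue

# The queue is the buffer between ingestion and processing, so a slow
# processor never pushes back on the external third party.
inbox: "queue.Queue[dict]" = queue.Queue()

def ingest(event: dict) -> None:
    """Accept the third-party event immediately; no backpressure outward."""
    inbox.put(event)

def process(event: dict) -> dict:
    # Placeholder for real business processing.
    return {**event, "processed": True}

def drain() -> list:
    """Workers (horizontally scalable in a real deployment) drain the buffer."""
    results = []
    while not inbox.empty():
        results.append(process(inbox.get()))
    return results

# A spiky burst of events is absorbed by the buffer...
for i in range(3):
    ingest({"seq": i})
# ...and processed at the workers' own pace.
results = drain()
```

In a real deployment the same shape holds with a managed queue and multiple worker instances scaled on queue depth.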