From 2bac3a3619ac4bfeeec6dec6810086ac2a00d469 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Cyryl=20P=C5=82otnicki?= <cyplo@cyplo.dev>
Date: Mon, 8 Jul 2024 15:05:02 +0100
Subject: [PATCH] Implementation notes

---
 fi1/README.md | 52 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 40 insertions(+), 12 deletions(-)

diff --git a/fi1/README.md b/fi1/README.md
index e8ae5fc..61b749f 100644
--- a/fi1/README.md
+++ b/fi1/README.md
@@ -8,9 +8,6 @@ ACME operates a Master Trust Pension. A ACME pension consist of units of Assets
 Asset units are bought and sold on behalf of members. The process of buying and selling assets takes several days trades are only placed periodically and take several days to settle.
 The basic process for buying assets is:
 
-
-
-
 ```mermaid
 flowchart TD
     member([Member])-- Invest cash -->create_buy_trade
@@ -53,9 +50,14 @@ Some notes on the interpretation of the text and questions to ask.
 * There are a lot of different business processes (200-300 as per the text)
   * What is a process ?
     * How unique are those processes ? Do they share any common parts ?
+      * Three processes are mentioned in the original text: buying assets, rebalancing and reconciliation. **I assume that the following is true for the majority of the processes in the system:**
+        * They need to process multiple records in batches
+        * They are not real-time, but need to finish within a specified amount of time, i.e. throughput is important
+        * If multiple teams work on different types of processes independently, we might expect the momentary system load to stay close to the average, i.e. spikes cancel each other out. However, I am going to assume that this is not the case, especially if the business processes are tied to a country-specific wall-clock.
+        * New types of processes will appear in the future - we cannot design a system that is hard-tied to only the known process types.
     * The original text describes a "process for buying assets"
-    * Is it a typical process ?
-    * Is it comprised of smaller processes ? This could explain the 200-300 processes if so.
+      * Is it a typical process ?
+      * Is it comprised of smaller processes ? This could explain the 200-300 processes if so.
   * They need to be managed by different teams across the company
     * Would be great to allow for each team to work independently and yet share any improvements.  
     * How to allow for code and pattern reuse while maintaining team autonomy
@@ -75,14 +77,16 @@ Some notes on the interpretation of the text and questions to ask.
         * e.g. HTTP call from UI is handled by putting a message on the queue and the rest of the processing is unaware of the source of the trigger
     * Mentions of some processes needing to process large amoutns of data
     * Let's try to estimate
-        * Is keeping historical data (e.g. past trades) in hot storage (quickly accessible) important ?
+      * Is keeping historical data (e.g. past trades) in hot storage (quickly accessible) important ?
         * **I assume historical data older than N months can be moved to cold storage.**
-        
+
 * "Data must be 100% accurate"
   * What is "accuracy" in this statement ?
     * Does it mean consistency and durability ? 
       * **Losing data is assumed very bad**
         * Once something has entered a system from a third party, e.g. an order to buy or sell it need to be persisted durably. **We can only reply to the initial call from the third party once we can guarantee durability.**
+          * What is considered an acceptable loss of data ? Under what circumstances ?
+            * Natural disasters, war etc covering a whole DC region
       * **I assume each part of the system can be at different phases of processing data, but as long as the third parties view a consistent state of transactions it is allowed.**
         * Is the granularity here per-customer ?
           * I think the text mentions that all buy and sell transactions are batched up to be executed together.
@@ -97,16 +101,16 @@ Some notes on the interpretation of the text and questions to ask.
 
 * Additional criteria
   * Long term velocity & staff retention
-    * The system should be built in a way that allows for continous fast iteration and the size of the feedback loop should not deteriorate with the age of the system.
+    * The system should be built in a way that allows for continuous fast iteration and the size of the feedback loop should not deteriorate with the age of the system.
       * Focus on good mapping between isolated problem areas and teams
         **make sure team structure reflects desired architecture (inverse Conway maneuver)**
       * Allow for autonomy, mastery and purpose to be maximised in each team
         * **encourage continous learning via coaching, lunch and learns etc**
       * Testing
-        * Continous fake transactions flowing through a non-prod environment with metrics gathered
+        * Continuous fake transactions flowing through a non-prod environment with metrics gathered
         * Encourage starting with a problem statement and only once everyone agrees proceed to discuss solutions. Apply at different scales.
           * **encourage problem and solution separation at project/programme definition level**
-          * **encourage programmers to start with task definition and writing high-level tests and only then proceed with solution/implemntation**
+          * **encourage programmers to start with task definition and writing high-level tests and only then proceed with solution/implementation**
   * Build trust in the system/external branding
     * It needs to work.
       * Observability & alerting
@@ -114,7 +118,31 @@ Some notes on the interpretation of the text and questions to ask.
         * **Could we set up canary/fake flows on production ? E.g. fake customers selling and buying** This helps a lot with incident response as well as understanding which part of the system is struggling, whether it's related to third parties or our system itself.
       * See `testing` above
 
-## Proposals
+## Proposal
 
-Proposed is a system architecture template, with a lot of optional components to be chosen from depending on the actual needs of the system. The proposed shape allows for spikes in load as main compute components are horizontally scalable and are isolated by queues that act as buffers. Once interesting thing to note is that this system shape encourages accepting as much from external third parties and then processing it locally when possible. Thus it discourages backpressure, which can be detrimental in some cases, esp during disaster scenarios.
+### System architecture template
+Proposed is a system architecture template, with a lot of optional components to be chosen from depending on the actual needs of the system. The proposed shape allows for spikes in load as main compute components are horizontally scalable and are isolated by queues that act as buffers. One interesting thing to note is that this system shape encourages accepting as much as possible from external third parties and then processing it locally. Thus it discourages backpressure, which can be detrimental in some cases, esp during disaster scenarios.
 
+Implementation notes:
+* Start small. This template aims to inspire and provide an overview of possibilities; it does not mean we should build everything present there on day one.
+  * A good place to start would be:
+    * One process, not too big, but representative
+    * external HTTP trigger
+    * one lambda to handle the HTTP call and put a message on the bus
+    * a small number of lambdas to handle steps of processing (re-using lambda, not introducing new deployment concepts)
+      * one lambda to handle informing external party of our intent (e.g. buy trades)
+    * one lambda to handle informing the original caller about progress/state of processing - effectively forwarding events and transforming into correct format (HTTP calls ? queue ?)
+    * just one data store for everything, synchronous DB calls - need to get the event emission vs DB call semantics right - e.g. outbox pattern ?
+  * From there we could try and experiment with other things mentioned in the template
+* Automated testing is extremely important, especially whenever we have a distributed system.
+* Idempotency matters, both externally and internally. Never assume exactly-once delivery of a message; test extensively with non-zero delivery error rate scenarios.
+* Avoid synchronous communication with the actors requesting changes - e.g. if using HTTP - don't make clients wait for us to do any processing before we can issue an HTTP response. Store the intent and process later rather than processing in place.
+
+
+## Example processes - modeling and notes
+To be able to validate the system template and choose specific components from it, let's analyse the processes we know of and see how easy it is to implement them.
+
+(I ran out of time here, but based on my intuition/experience the template should still work)
+
+### Data design/modelling
+It would be interesting to model how data is stored and see how we can make things reusable.