Proper Tools help you tackle complexity

kalpesh patel
Jul 11, 2020

Be it engineering or life, I have observed that proper tools are one way to tame complexity, and it's OK to spend some money on them if they save you time or help you avoid mistakes. Here are some of my observations:

Sourdough bread: This is considered the holy grail of bread making. During the Covid-19 lockdown the stores ran out of the brand of bread I was eating. Inspired by a senior engineering leader at the company, I recently ditched store-bought bread and started making my own every weekend. A loaf typically takes anywhere from 24 to 36 hours to make, but the end product is worth it. I have had many failures over the last 3 months and I am still perfecting the recipe to enrich the flavor. Two of the tools that helped me are:

  1. Kitchen scale: Earlier I was measuring with cups, which is not a good way to measure salt, water, flour and starter because all four have a different consistency. I bought a kitchen scale, and now everything is measured in grams. That lets me hit the right flour-to-water (hydration), flour-to-salt and flour-to-starter ratios in the dough, and the right flour-to-water ratio in the starter.
  2. iPhone alarms/timers: Trust me, I am a forgetful person. I get deep into some thought and forget that the starter needs feeding, that the bread is in the oven, or that it's time to do a stretch and fold on the dough. This used to lead to mistakes, so now I set a timer or alarm to measure the exact time between stages, and the end product is becoming more and more consistent.
  3. The next tool I am thinking of buying is a Dutch oven, because today I burnt my fingers on my makeshift one :).
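The ratios the scale makes possible are usually expressed as baker's percentages, where each ingredient is a percentage of the flour weight. Here is a minimal sketch of that arithmetic. The function name and the quantities are hypothetical and are not my actual recipe.

```python
def bakers_percentages(flour_g, water_g, salt_g, starter_g):
    """Return each ingredient as a percentage of the flour weight (all in grams)."""
    if flour_g <= 0:
        raise ValueError("flour weight must be positive")
    return {
        "hydration": round(100 * water_g / flour_g, 1),
        "salt": round(100 * salt_g / flour_g, 1),
        "starter": round(100 * starter_g / flour_g, 1),
    }


# Hypothetical quantities for illustration only.
print(bakers_percentages(flour_g=500, water_g=375, salt_g=10, starter_g=100))
# {'hydration': 75.0, 'salt': 2.0, 'starter': 20.0}
```

Weighing in grams is what makes these numbers repeatable from one bake to the next, which cups cannot give you.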

Engineering: When production is under pressure or down, you need to react fast. In a complex system many things can go wrong, so you need a way to quickly eliminate suspects. We aren't there yet, but we are getting better at it. Some tools we use are:

  1. New Relic: Together with Nagios/Icinga, this quickly tells us that something is wrong and in which part of the datacenter. It's good, but as the platform grows more complex and uses many components, it won't tell us exactly what is wrong or why. Every time, we needed to get senior engineers on the call, and they would take some time to check stack traces and other things before weeding out the offending component. This can waste 10–15 minutes, and I could see a pattern.
[Image: New Relic telling us something is wrong, but unable to tell us quickly what is wrong]

  2. Heartbeat dashboard: So we created a heartbeat dashboard. Every component within a service reports a heartbeat status every minute, and the tool aggregates these from thousands of nodes into a consolidated view. SREs can quickly check which component (Redis in the example below) is misbehaving and reach out to the appropriate on-call engineer to debug it.

  3. Kibana/ELK: Before ELK, logs were mounted onto some central servers, and because the platform was complex, only developers knew which logs to scrape. It took a lot of grinding by our PE/OPS team over the past years to make the ELK stack work, but it has given superpowers to SRE, support, CSM and PMs. They can now search the logs with a human-readable query language, and most of the time developers need not be involved in the investigation. Developers and architects can keep breaking the system into parts and moving things around, but because everyone else still debugs through Kibana and the same query language, they aren't exposed to that complexity.

  4. OpenTSDB/Grafana: This stack ingests billions of metrics a day, but it's worth it because the platform grows more complex every day and we can't introduce architecture changes without first obsessing over the data. We recently improved our search performance from 14 sec to 2 sec. It took 6 months (more on this in a separate post), and every week we would run multiple canaries to publish metrics first and then decide, based on the data, which strategy to pursue. Without OpenTSDB and Grafana we couldn't have done it.

  5. Feature flag system: We have a massive feature flag system that lets developers introduce code behind a feature flag, watch the metrics, and turn features on or off in production. We moved from bi-weekly releases to 2–3 releases a week, which wouldn't have been possible without our settings infrastructure. Again, with a complex code base and a complex infrastructure, this settings system brings some sanity to the chaos, so the tool was worth the investment.
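Two of the ideas in this list, the heartbeat aggregation (item 2) and the feature flag check (item 5), are small enough to sketch in a few lines. This is only an illustration under assumptions, not our production code: the tuple layout, the 10% threshold, the `mysql` component, the `FeatureFlags` class and the `new_search_path` flag name are all hypothetical.

```python
from collections import defaultdict


def unhealthy_components(heartbeats, threshold=0.1):
    """Aggregate one interval of heartbeats into {component: failing fraction}.

    heartbeats: iterable of (node, component, ok) tuples (assumed layout).
    Only components whose failing fraction exceeds threshold are returned.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for _node, component, ok in heartbeats:
        totals[component] += 1
        if not ok:
            failures[component] += 1
    report = {}
    for component, total in totals.items():
        fraction = failures[component] / total
        if fraction > threshold:
            report[component] = fraction
    return report


class FeatureFlags:
    """In-memory flag store; a real system would read shared settings instead."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name, default=False):
        return self._flags.get(name, default)

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)


def search(query, flags):
    # New code ships dark behind a (hypothetical) flag and is switched on later.
    if flags.is_enabled("new_search_path"):
        return f"new:{query}"
    return f"old:{query}"


# Hypothetical heartbeats: Redis fails on two of three nodes.
sample = [
    ("node-1", "redis", False),
    ("node-2", "redis", False),
    ("node-3", "redis", True),
    ("node-1", "mysql", True),
    ("node-2", "mysql", True),
]
print(sorted(unhealthy_components(sample)))  # ['redis']

flags = FeatureFlags()
print(search("shoes", flags))  # old:shoes
flags.set("new_search_path", True)
print(search("shoes", flags))  # new:shoes
```

The point in both cases is the same: the person on call looks at one consolidated answer, or flips one switch, without needing to know how the pieces underneath are wired together.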

Personal finance: I wrote about this before here, but one of the big complexities I had was the multiple retirement accounts that my wife and I hold. Tracking proper asset allocation was a problem. I was using spreadsheets and what not to keep track of them, and it was messy and time consuming. I recently started using aggregators like Mint, Portfolio Visualizer and Personal Capital, and now it's simple to see, through all these market ups and downs, whether my asset allocation is out of whack and needs adjustment. That has removed a lot of anxiety from the process.
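What these aggregators compute for me is, roughly, the drift between my actual and target allocation across all accounts. Here is a minimal sketch of that calculation. The function name, account balances and target split are made up for illustration, and this is not how any of those products actually work internally.

```python
def allocation_drift(accounts, targets):
    """Return {asset_class: actual_fraction - target_fraction}.

    accounts: list of {asset_class: dollars} dicts, one per account.
    targets:  {asset_class: target fraction of the whole portfolio}.
    """
    combined = {}
    for account in accounts:
        for asset_class, dollars in account.items():
            combined[asset_class] = combined.get(asset_class, 0) + dollars
    total = sum(combined.values())
    if total <= 0:
        raise ValueError("portfolio total must be positive")
    classes = sorted(set(combined) | set(targets))
    return {
        c: round(combined.get(c, 0) / total - targets.get(c, 0), 4)
        for c in classes
    }


# Hypothetical balances in two retirement accounts and a 60/40 target.
accounts = [
    {"stocks": 60000, "bonds": 20000},
    {"stocks": 15000, "bonds": 5000},
]
print(allocation_drift(accounts, {"stocks": 0.60, "bonds": 0.40}))
# {'bonds': -0.15, 'stocks': 0.15}
```

A positive number means I am overweight in that asset class and may need to rebalance.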
