Monday, 10 March 2025

Distributing Problems

 


Patterns and practices in software engineering can be very cyclical; the names and terminology change, but we often go back to old ideas that had previously been dismissed.

I think this is indicative of the fact that many problems in software development don't have perfect answers. It is often a balance of pros and cons, and sometimes the weighting applied to them varies over time, so we flip between different competing approaches.

One of these areas relates to whether it is better to centralise or distribute. Should our applications be monolithic in nature or a collection of loosely coupled distributed parts? For a long time the argument has been seen to be won by distributed computing and the microservices approach. However, in recent times the monolith has started to be seen as not always the wrong idea.

This article shouldn't be seen as an argument against a distributed approach; it should be viewed more as an argument against premature optimisation. By understanding the drawbacks of distribution you can make a better judgement on whether it's the right approach for your application right now.

Interconnected Services

Distributed computing is a relatively broad term, but within the context of this article we are taking it to mean the pattern of dividing an application up into a collection of microservices.

Usually microservices are built to align to a single business domain or process, with the application being the sum of these parts communicating via a lightweight protocol such as RESTful APIs or, increasingly, an event driven architecture.

You can see from this that the term microservice is quite loosely defined, and a lot of the issues created when applying this approach can be traced back to the fact that deciding where to divide microservices, and how large each should be, is quite a hard problem.

The best explanation I've seen for this is that a microservice should be as easy to replace as to refactor, basically meaning microservices shouldn't be so large as to negate the option of starting again with their design. 

I think this idea is much easier to apply when starting with a blank sheet of paper. When splitting up an existing application it is often more pragmatic to not subdivide too quickly, as further splitting an existing service is often easier than trying to coalesce several services back into one once they've been split.

Fallacies of Distributed Computing

In 1994 L. Peter Deutsch at Sun Microsystems devised a list of seven fallacies of distributed computing building on top of the work of Bill Joy and Dave Lyon.

The fallacies represent seven assumptions that often cause the architecture and development of a distributed system to head in the wrong direction. 

The first is that the Network is Reliable. This often leads to services being written without network-related error handling in mind, meaning that when network errors do occur, services stall and become stuck consuming resources while waiting for a response that isn't forthcoming.

The second and third are related: Latency is Zero and Bandwidth is Infinite can both cause developers to give little thought to the nature of the data that is propagating through the network.

Number four is that the Network is Secure, which can lead to a complacency where possible intrusion from malicious insiders isn't considered.

Number five is that Network Topology Doesn't Change, which, in a similar way to two and three, is indicative of us not thinking about the fact that the network our applications operate in is a dynamic element in the same way as our code.

Number six is There is One Administrator. This can cause us to fail to recognise inconsistent or contradictory policies around network traffic and routing.

Number seven is that Transport Cost is Zero; here we need to factor into our thinking that an API call and the resultant transfer of data has a cost in terms of transmission time.

Strategies to Compensate

The fallacies described in the previous section shouldn't be seen as arguments for why distributed systems shouldn't be built; they are things that should be considered when they are.

We can often think that our software is deployed into a homogeneous environment with perfect conditions, but this is often not the case.

Errors in transport can occur, so we should have an effective strategy to detect these errors and retry calls if we believe this may lead to a successful outcome, but also a strategy for when calls continue to fail, such as a circuit breaker, to avoid filling the network with requests that are unlikely to receive the desired response.
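
To make that concrete, here is a minimal sketch of the retry-plus-circuit-breaker idea in Python. The CircuitBreaker class, the thresholds and the call_with_retries helper are names invented for this illustration rather than any particular library; in a real system you would more likely reach for an established resilience framework.

import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a failing dependency for a cool-off period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # time the circuit was opened, None while closed

    def call(self, func, *args, **kwargs):
        # If the circuit is open, fail fast until the cool-off period has passed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None  # half-open: allow a trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        else:
            self.failure_count = 0  # a success closes the circuit again
            return result

def call_with_retries(breaker, func, attempts=3, backoff=0.5):
    """Retry a call a bounded number of times with simple exponential backoff."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))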

We must realise that as load increases on our system the size of the data we are passing between elements may start to be a factor in their performance. Even if each individual request and response is small, in sufficient quantities their impact may be felt.

We have to maintain a healthy distrust of the elements of our system we are interacting with. A zero trust approach means we do not inherently trust any element simply because it is inside the network, all elements must properly authenticate and be authorized.

We must also consider that when we subdivide our system into more elements we are introducing a cost, in that those elements will need to communicate; this cost must be balanced with the benefit the change in architecture would bring.

These are only some of the things we need to think about with a distributed approach. This post is too short to cover them in great detail, but the main takeaway should be that a distributed approach isn't cost free and sometimes it might not offer advantages over a monolithic approach. Getting a distributed approach right is hard and not an exact science, many things need to be considered and some missteps along the way should be expected.

As with any engineering decision it's not right or wrong; it's a grey area where pros and cons must be balanced.

Tuesday, 25 February 2025

Solid State World

 


It's often said that technology moves quickly; I'm actually of the opinion that it's our learning of how to utilise technology that grows rapidly.

The invention of certain pieces of technology is transformative, so much so that in the initial stages we struggle to grasp their potential. Then over time the realisation of what's possible grows and more and more applications materialise.

The World Wide Web was one of those technological jumps forward, and Artificial Intelligence will undoubtedly, if it isn't already, be another notable point in history. However, I believe the true giant leap forward is often not talked about as much as it should be, since it underpins everything that has come since and kick-started a revolution that has lasted more than 75 years.

That giant leap forward was the invention of solid state electronics.

Triodes to Transistors

The early building block of electronics was the vacuum tube triode. These devices had applications across radar, radio, television and many other spheres. However, they were large and power-hungry devices, making it difficult to use them reliably in increasingly complex equipment.

The key to the miniaturisation of electronics was the invention of the field effect transistor. The theory behind these devices was first formulated in the 1920s, but it wasn't until the late 1940s that the first practical examples were built.

Many different scientists independently discovered and worked on the transistor concept; however, William Shockley, John Bardeen and Walter Brattain are widely credited with its invention, demonstrating the first working transistor at Bell Laboratories in the late 1940s. The Metal Oxide Semiconductor Field Effect Transistor (MOSFET) followed at Bell Laboratories in the late 1950s, developed by Mohamed Atalla and Dawon Kahng.

These transistors had many applications, one of which was being able to act as a "digital" switch. With this the era of the semiconductor and solid state electronics was born.

Printing with Light

Once the MOSFET had been invented the next challenge was developing techniques for reliably manufacturing them whilst also being able to continue their miniaturisation. The ability to produce transistors at a smaller size would enable them to be more densely packed as well as reducing power consumption.

The idea of photolithography, using light to print patterns into materials, had been around for some time. But in the late 1950s Jules Andrus at Bell Laboratories looked to use similar techniques to build solid state electronics (Moe Abramson and Stanislaus Danko of the US Army Signal Corps are also credited with inventing similar techniques in the same period).

Using this technique a semiconductor substrate is covered with a material called photoresist. A mask is then placed over the top such that only certain areas of the material are exposed to a light source. The exposed areas go through a chemical change that renders them either soluble or insoluble in a developer solution depending on the type of photoresist that is used. Finally the material then goes through an etching process to leave the desired pattern of the semiconductor based components.

Ultraviolet light is now typically used in the process, and much of the push towards ever smaller transistors has come from using ever decreasing wavelengths of light in the photolithography process.

Moore's Law

Gordon Moore was an American engineer who originally worked at Shockley Semiconductor Laboratory, a company founded by William Shockley, the co-inventor of the transistor. He would later go on to be a founding member of both Fairchild Semiconductor and Intel.

In 1965, while working at Fairchild Semiconductor, he was asked what he thought might happen to semiconductor technology over the next ten years. His answer to that question would eventually come to be known as Moore's Law.

Put simply Moore's Law states that the number of transistors that can fit into a given size of integrated circuit (or "chip") will double roughly every two years.

Originally Moore predicted this would be the case for ten years; remarkably, it has continued to hold for at least 55 years, with industry voices only in the last couple of years starting to question whether we may now be reaching the limit of Moore's original prediction.
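
As a back-of-the-envelope illustration of what that doubling rate implies (an approximation only, not a claim about any particular chip), doubling every two years compounds to a growth factor of two raised to the power of the elapsed years divided by two:

# Rough illustration of Moore's Law: transistor count doubling every two years.
def moores_law_growth(years, doubling_period=2):
    """Growth factor after a given number of years, assuming a fixed doubling period."""
    return 2 ** (years / doubling_period)

print(f"10 years: roughly {moores_law_growth(10):,.0f}x the transistor count")
print(f"55 years: roughly {moores_law_growth(55):,.0f}x the transistor count")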

The fact that the industry has followed Moore's Law for so long has been a major driver of continually increasing processing power, which in turn has been a catalyst for innovation. The transistor is now believed to be the most widely manufactured device in human history; in 2018 it was estimated that as many as 13 sextillion (13 followed by 21 zeros) had been manufactured.

The lives of virtually every human on the planet have been touched by solid state electronics and the technologies it underpins. It is fair to say that the birth of solid state electronics marked the birth of our technological age and changed the world forever. Of all the transformative technologies that have come since, and will continue to be developed in the future, I don't believe any will be as impactful as this initial giant leap forward.


Saturday, 9 November 2024

Network of Networks

 


When we're searching for analogies to describe the operation of the internet we often fall back on that of posting a letter. Each packet of data we need to send and receive can be compared to a parcel we need to get from A to B.

In order to achieve that you need to identify the best route to take between those two points. For parcels these routes are relatively static; for the internet that isn't the case, with routes being much more dynamic.

Border Gateway Protocol (BGP) is a path vector routing protocol that makes routing decisions on how to get data packets to their intended destination. As internet users we may not directly interact with it but without it the internet wouldn't be functional across the globe as it is today.

Autonomous Systems

The internet, as the name suggests, is a network of networks; it isn't one cohesive system, it is a large number of individual networks. These individual networks, in the context of routing traffic, are referred to as Autonomous Systems (AS).

If we continue our postal analogy then each AS is like an individual postal region covering a town or county.

Each AS consists of routers that know how to route traffic internally but rely on connections to neighbouring AS to route traffic outside their region. Because the internet is a dynamic system, with routes appearing and disappearing, each AS needs to be kept up to date with the routes every other AS can help with.

This is achieved via peering sessions, where AS connect to each other in order to build up the full picture of how to route traffic to destinations outside each other's boundaries. BGP is the mechanism that allows this routing information to be exchanged.

Not Just The Shortest

In order to become an AS you must register with the Internet Assigned Numbers Authority (IANA), which will assign the AS to a Regional Internet Registry (RIR) as well as allocating a 16 or 32 bit identifier. As of the late 2010s there were around 64,000 AS registered worldwide, a number which will have continued to grow.

AS tend to be managed and run by large organisations, typically Internet Service Providers (ISP) but also large tech companies, governments or large institutions such as universities.

Often there are multiple possible routes for a packet to reach its destination. In order to allow the best route to be chosen BGP allows AS to apply attributes to each route that can be factored into the decision on which route to take.

These attributes may indicate hop count (how many steps are involved in getting to the destination) as well as weight where an AS can indicate which route it would prefer traffic to take. 

Because some AS are managed by commercial businesses this creates an interesting quirk in how traffic is routed. You might assume that the shortest or fastest route would always be chosen. But because some companies will charge for their AS to handle traffic, or may not want to help competitors, commercial relationships are sometimes factored into which route to take.
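
The real BGP best-path selection process involves a long list of tie-breakers (local preference, AS path length, origin, MED and more), but the general idea that attributes and policy, not just distance, decide the winner can be sketched in a few lines of Python. The route data and local_pref values below are made up purely for illustration.

# Illustrative only: pick a "best" route from candidates using simple attributes.
# Real BGP best-path selection has many more tie-breakers; this sketch just shows
# that policy, not distance alone, drives the decision.

routes = [
    # as_path is the chain of AS numbers the route traverses; local_pref is a
    # policy weight set by the operator, where higher is preferred.
    {"prefix": "203.0.113.0/24", "as_path": [64500, 64510], "local_pref": 100},
    {"prefix": "203.0.113.0/24", "as_path": [64500, 64520, 64530], "local_pref": 200},
]

def best_route(candidates):
    # Prefer the highest local preference first (a commercial/policy choice),
    # and only then the shortest AS path.
    return max(candidates, key=lambda r: (r["local_pref"], -len(r["as_path"])))

chosen = best_route(routes)
print("chosen path:", chosen["as_path"])  # the longer but policy-preferred path wins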

When BGP Goes Wrong

As described earlier the management of routing traffic through the internet is autonomous; it is far too complicated and dynamic to be managed by hand. An AS uses BGP to announce which routes it can provide access to, and other AS then use this information to route traffic from inside their own networks.

There have been several examples of AS accidentally announcing they can provide access to routes which they can't in fact handle. In 2004 a Turkish ISP accidentally announced it could handle any route on the internet; as this misinformation spread to more and more AS, internet access ground to a halt across a large part of the globe. In 2008 a Pakistani ISP which was blocking traffic internally to YouTube accidentally started announcing externally that it could handle connections to YouTube, causing an outage of that service in the region.

These are both examples of BGP hijacking. In these cases the announcement of false routes was accidental, but this isn't always the case. It can sometimes be done maliciously in order to lure traffic meant for a legitimate site to an imposter; in 2018 hackers announced bad BGP routes for traffic hosted in Amazon AWS and were able to steal a large amount of cryptocurrency.

In an effort to combat this kind of activity Resource Public Key Infrastructure (RPKI) allows BGP data to be cryptographically signed in order to validate that an AS is authorised to announce a route for a particular resource.

The internet is such an important part of our day to day lives that it's easy to consider it a slick homogeneous system that always works, and it is true to say that it's a miracle of engineering that it's been able to scale to the network of networks that we use today. However, it is actually a very large collection of individual elements, and sometimes its fragility is exposed and we see it is all too easy for it to falter.

Saturday, 19 October 2024

Underpinning Kubernetes

 


Kubernetes is the de facto choice for deploying containerized applications at scale. Because of that we are all now familiar with its building blocks that allow us to build our applications such as deployments, ingresses, services and pods.

But what is it that underpins these entities and how does Kubernetes manage this infrastructure? The answer lies in the Kubernetes control plane and the nodes it deploys our applications to.

The control plane manages and makes decisions related to the management of the cluster; in this sense it acts as the cluster's brain. It also provides an interface to allow us to interact with the cluster for monitoring and management.

The nodes are the workhorses of the cluster where infrastructure and applications are deployed and run.

Both the control plane and the nodes comprise a number of elements each with their own role in providing us with an environment in which to run our applications.

Control Plane Components

The control plane is made up of a number of components responsible for the management of the cluster; in general these components run on dedicated infrastructure away from the pods running applications.

The kube-apiserver provides the control plane with a front end via a suite of REST APIs. These APIs are resource based allowing for interactions with the various elements of Kubernetes such as deployments, services and pods.
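
As a hedged illustration, the official Kubernetes Python client talks to these same kube-apiserver endpoints. Assuming the kubernetes package is installed and a kubeconfig points at a cluster, a sketch like the following lists the cluster's nodes and pods.

# Sketch: talking to the kube-apiserver via the official Kubernetes Python client.
# Assumes the 'kubernetes' package is installed and a kubeconfig is available.
from kubernetes import client, config

config.load_kube_config()          # read credentials from ~/.kube/config
v1 = client.CoreV1Api()            # client for the core/v1 API group

for node in v1.list_node().items:
    print("node:", node.metadata.name)

for pod in v1.list_pod_for_all_namespaces().items:
    print("pod:", pod.metadata.namespace, pod.metadata.name)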

In order to manage the cluster the control plane needs to be able to store data related to its state; this is provided by etcd in the form of a highly available key value store.

The kube-scheduler is responsible for detecting when new pods are required and allocating a node for them to run on. Many factors are taken into account when allocating a node including resource and affinity requirements, software or hardware restrictions and data locality.

The control plane contains a number of controllers responsible for different aspects of the management of the cluster, and these controllers are all managed by the kube-controller-manager. In general each controller is responsible for monitoring and managing one or more resources within the cluster; as an example, the Node Controller monitors for and responds to nodes failing.

By far the most common way of standing up a Kubernetes cluster is via a cloud provider. The cloud-controller-manager provides a bridge between the internal concepts of Kubernetes and the cloud provider specific API that is helping to implement them. An example of this would be the Route Controller responsible for configuring routes in the underlying cloud infrastructure.

Node Components

The node components run on every node in the cluster that is running pods, providing the runtime environment for applications.

The kubelet is responsible for receiving PodSpecs and ensuring that the pods and containers they describe are running and healthy on the node.

An important element in being able to run containers is the Container Runtime. This runtime provides the mechanism for the node to act as a host for containers. This includes pulling the images from a container registry as well as managing their lifecycle. Kubernetes supports a number of different runtimes with this being a choice to be made when you are constructing your cluster.

An optional component is kube-proxy, which maintains network rules on the node and plays an important role in implementing the services concept within the cluster.

Add Ons

In order to allow the functionality of a cluster to be extended Kubernetes provides the ability to define Add ons.

Add ons cover many different pieces of functionality.

Some relate to networking, by providing internal DNS for the cluster allowing for service discovery, or by providing load balancers to distribute traffic among the cluster's nodes. Others relate to the provisioning of storage for use by the applications running within the cluster. Another important aspect is security, with some add ons allowing for security policies to be applied to cluster resources and applications.

Any add ons you choose to use are installed within the cluster with the above examples by no means being an exhaustive list.

As an application developer deploying code into a cluster you don't necessarily need a working knowledge of how this infrastructure is underpinned. But I'm a believer that having an understanding of the environment where your code will run will help you write better code.

That isn't to say that you need to become an expert, but a working knowledge of the building blocks and the roles they play will help you become a more well rounded engineer and enable you to make better decisions. 

Sunday, 13 October 2024

Compiling Knowledge

 


Any software engineer who works with a compiled language will know the almost religious concept of the build. Whether you've broken the build, whether you've declared that it builds on my machine, or whether you've ever felt like you are in a life or death struggle with the compiler. The build is a process that must happen to turn your toil into something useful for users to interact with.

But what is actually happening when your code is being compiled? In this post we are certainly not going to take a deep dive into compiler theory, it takes a special breed of engineer to work in that realm, but an understanding of the various processes involved can be helpful on the road to becoming a well rounded developer.

From One to Another

To start at the beginning, why is a compiler needed? Software, by the time it runs on a CPU, is a set of pretty simple operations known as the instruction set. These instructions involve simple mathematical and logical operations alongside moving data between registers and areas of memory.

Whilst it is possible to program at this level using assembly language, it would be an impossibly difficult task to write software at scale. As engineers we want to be able to code at a higher level.

Compilers give us the ability to do that by acting as translators, they take the software we've written in a high level language such as C++, C#, Java etc and turn this into a program the CPU can run.

That simple description of what a compiler does belies the complexity of what it takes to achieve that outcome, so it shouldn't be too much of a surprise that implementing it takes several different processes and phases.

Phases

The first phase of compilation is Lexical Analysis, this involves reading the input source code and breaking it up into its constituent parts, usually referred to as tokens. 

The next phase is Syntax Analysis, also known as parsing. This is where the compiler ensures that the tokens representing the input source code conform to the grammar of the programming language. The output of this stage is something called the Abstract Syntax Tree (AST), which represents the structure of the code as a series of interconnected nodes in a tree describing paths through the code.
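
You can see the output of these first two phases for yourself using Python's standard library, which exposes its own lexer and parser. The snippet below is just an illustration of the concepts, not how a production compiler is structured.

# Peek at lexical and syntax analysis using Python's own tooling.
import ast
import io
import tokenize

source = "total = price * quantity + 1"

# Lexical analysis: break the source into tokens.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok.type, repr(tok.string))

# Syntax analysis: parse the source into an Abstract Syntax Tree.
tree = ast.parse(source)
print(ast.dump(tree, indent=2))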

Next comes Semantic Analysis. It is at this stage that the compiler checks that the code actually makes sense and obeys the rules of the programming language, including its type system. The compiler checks that variables are declared correctly, that functions are called correctly, and for any other semantic errors that may exist in the source code.

Once these analysis phases are complete the compiler can move onto Intermediate Code Generation. At this stage the compiler generates an intermediate representation of what will become the final program that is easier to translate into the machine code the CPU can run.

The compiler will then run an Optimisation stage to apply certain optimisations to the intermediate code to improve overall performance.

Finally the compiler moves onto Code Generation in order to produce the final binary, at this stage the high level language of the input source code has been converted into an executable that can be run on the target CPU.    

Front End, Middle End and Backend

The phases described above are often segregated into front end, middle end and backend. This enables a layered approach to be taken to the architecture of the compiler and allows for a certain degree of independence. This means different teams can work on these areas of the compiler as well as making it possible for different parts of compilers to be re-used and shared.

Front end usually refers to the initial analysis phases and is specific to a particular programming language. Should any of the code fail this analysis, errors and warnings will be generated to indicate to the developer which lines of source code are incorrect. In this sense the front end is the most visible part of the compiler that developers will interact with.

The middle end is generally responsible for optimisation. Many compilers will have settings for how aggressive this optimisation is and depending on the target environments may also distinguish between optimising for speed or memory footprint.

The backend represents the final stage where code that can actually run on the CPU is generated.

This layering allows, for example, front ends for different programming languages to be combined with different backends that produce code for particular families of CPUs, with the intermediate representation acting as the glue that binds them together.

As we said at the start of this post, understanding exactly how compilers work is a large undertaking. But having an appreciation of the basic architecture and phases will help you deal with those battles you may have when trying to build your software. Compiler messages may sometimes seem unclear or frustrating, so this knowledge may save valuable time in figuring out what you need to do to keep the compiler happy.

Saturday, 14 September 2024

Terraforming Your World

 


Software Engineers are very good at managing source code. We have developed effective strategies and tools to allow us to branch, version, review, cherry pick and revert changes. It is for this reason we've been keen to try and control all aspects of our engineering environment in the same way.

One such area is the infrastructure on which our code runs.

Infrastructure as Code (IaC) is the process of managing and deploying compute, network and storage resources via files that can be part of the source control process.

Traditionally these resources may have been managed via manual interactions with a portal or front end application from your cloud provider of choice. But manual processes are prone to error and inconsistency making it difficult and time consuming to manage and operate infrastructure in this way.

Environments might not always be created in exactly the same way, and in the event of a problem there is a lack of an effective change log to enable changes to be reverted back to a known good state.

One of the tools that attempts to allow infrastructure to be developed in the same way as software is Terraform by HashiCorp.

Terraform

Terraform allows required infrastructure to be defined in configuration files where the indicated resources are created within the cloud provider via interaction with their APIs.

These interactions are encapsulated via a Provider which defines the resources that can be created and managed within that cloud. These providers are declared within the configuration files and can be pulled in from the Terraform Registry which acts like a package manager for Terraform.

Working with Terraform follows a three stage process.

Firstly, in the Write phase, the configuration files which describe the required infrastructure are created. These files can span multiple cloud providers and can include anything from a VPC to compute resources to networking infrastructure.

Next comes the Plan phase. Terraform is a state driven tool, it records the current state of your infrastructure and applies the necessary changes based on the updated desired state in the configuration files. As part of the plan phase Terraform creates a plan of the actions that must be taken to move the current state of the infrastructure to match the desired state, whether this be creating, changing or deleting elements. This plan can then be reviewed to ensure it matches with the intention behind the configuration changes.

Finally in the Apply phase Terraform uses the cloud provider APIs, via the associated provider, to ensure the deployed infrastructure aligns with the new state of the configuration files.
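
Conceptually the Plan phase is a diff between the current state and the desired state. The sketch below is purely an illustration of that idea in Python, not how Terraform is actually implemented; the resource names and attributes are invented for the example.

# Conceptual sketch of a "plan": diff desired state against current state.
# This is only an illustration of the idea, not Terraform's actual behaviour.
def plan(current: dict, desired: dict) -> list:
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions

current_state = {"web_server": {"size": "small"}}
desired_state = {"web_server": {"size": "large"}, "database": {"size": "medium"}}

for action, resource in plan(current_state, desired_state):
    print(action, resource)   # update web_server, create database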

Objects

Terraform supports a number of different objects that can be described in the configuration files, presented below is not an exhaustive list but describes some of the more fundamental elements that are used to manage common infrastructure requirements.

Firstly we have Resources which are used to describe any object that should be created in the cloud provider, this could be anything from a compute instance to a DNS record.

A Data Source provides Terraform with access to data defined outside of Terraform. This might be data from pre-existing infrastructure, configuration held outside of Terraform or a database.

Input Variables allow configuration files to be customised to a specific use case increasing their reusability. Output Variables allow the Terraform process to return certain data about the infrastructure that has been created, this might be to further document the infrastructure that is now in place or act as an input to another Terraform process managing separate but connected infrastructure. 

Modules act like libraries in a software development context to allow infrastructure configuration to be packaged and re-used across many different applications.

Syntax

We have made many references in this article to configuration files but what do these actually consist of?

They are defined using the HashiCorp Configuration Language (HCL) and follow a format similar to JSON; in fact it is also possible for Terraform to work with JSON directly.

All content is defined with structures called blocks:

resource "my_provider_object_type" "my_object" {
  some_argument = "abc123"
}

Presented above is an extremely simple example of a block.

Firstly we define the type of the block, in this case a resource. Then comes a series of labels with the number of labels being dependent on the block type. For a resource block two labels are required, the first describing the type of the resource as defined by the provider, the second being a name that can be used to refer to the resource in other areas of the configuration files.

Inside the block the resource may require a series of arguments to be provided to it in order to configure and control the underlying cloud resource that will be created.

This post hasn't been intended to be a deep dive into Terraform, instead I've been trying to stoke your interest in the ways an IaC approach can help you apply the same rigour and process to your infrastructure management as you do to your source code.

Many of the concepts within Terraform have a close alignment to those in software engineering. Using an IaC approach alongside traditional source code management can help foster a DevOps mentality where the team responsible for writing the software can also be responsible for managing the infrastructure it runs on. Not only will this allow their knowledge of the software to shape the creation of the infrastructure but also in reverse knowing where and on what infrastructure their code will run may well allow them to write better software.


Tuesday, 3 September 2024

Being at the Helm

 


The majority of containerized applications that are being deployed at any reasonable scale will likely be using some flavour of Kubernetes.

As a container orchestration platform Kubernetes allows the deployment of multiple applications to be organised around the concepts of pods, services, ingress, deployments etc defined in YAML configuration files. 

In this post we won't go into detail around these concepts and will assume a familiarity with their purpose and operation.

Whilst Kubernetes makes this process simpler, when it's being used for multiple applications managing the large number of YAML files can come with its own challenges.

This is where Helm comes into the picture. Described as a package manager for Kubernetes, Helm provides a way to manage updates to the YAML configuration files and version them to ensure consistency and allow for re-use.

I initially didn't quite understand the notion of Helm being a package manager, but as I've used it more I've come to realise why this is how it's described.

Charts and Releases

The Helm architecture consists of two main elements, the client and the library.

The Helm client provides a command line interface (CLI) to indicate what needs to be updated in a cluster via a collection of standard Kubernetes YAML files, the library then contains the functionality to interact with the cluster to make this happen.

The collection of YAML files passed to the client are referred to as Helm Charts, they define the Kubernetes objects such as deployments, ingress, services etc.

The act of the library using these YAML files to update the cluster is referred to as a Release.

So far you may be thinking that you can achieve the same outcome by applying the same YAML files to Kubernetes directly using the kubectl CLI. Whilst this is true, where Helm adds value is where you need to deploy the same application into multiple environments with certain configuration or set-up differences.

Values, Parametrisation and Repositories

It is common practice to need to deploy an application to multiple environments with differing numbers of instances, servicing requests on different domains, or any other differences between testing and production environments.

Using Kubernetes directly means either maintaining multiple copies of YAML files or having some process to alter them prior to them being applied to the cluster. Both of these approaches have the potential to cause inconsistency and errors.

To avoid this Helm provides a templating engine to allow a single set of YAML files to become parameterised. The syntax of this templating can be quite daunting when you first view it, and while we won't go into detail about it here, like any language it will eventually click as you use it more.

Alongside these parameterised YAML files you specify a Values YAML that defines the environment specific values that should be applied to the parameterised YAML defining the Kubernetes objects.

This allows the YAML files to be consistent between all environments in terms of overall structure whilst varying where they need to. This combination of a Values YAML file and the parameterised YAML defining the Kubernetes objects is what we refer to as a Helm Chart.
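
Helm's real templates use Go's templating syntax, but the underlying idea, one parameterised definition rendered with different values per environment, can be illustrated with a few lines of Python. This is only an analogy; the field names and values are invented and it is not Helm syntax.

# Analogy only: one parameterised manifest, different values per environment.
from string import Template

deployment_template = Template(
    "replicas: $replicas\n"
    "host: $host\n"
)

values = {
    "test":       {"replicas": 1, "host": "app.test.example.com"},
    "production": {"replicas": 5, "host": "app.example.com"},
}

for environment, env_values in values.items():
    print(f"--- {environment} ---")
    print(deployment_template.substitute(env_values))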

It may be that your application is something that needs to be deployed in multiple clusters; for these situations your Helm charts can be provided via a repository, in a similar way that we might make re-usable Docker images available.

I think it's at this point that describing Helm as a package manager starts to make sense.

When we think about code package managers we think of re-usable libraries where functionality can be shared and customised in multiple use cases. Helm is allowing us to achieve the same thing with Kubernetes. Without needing to necessarily understand all that is required to deploy the application we can pull down Helm charts, specify our custom values and start using the functionality in our cluster.   

When to Use

The benefit you will achieve by using Helm is largely tied to the scale of the Kubernetes cluster you are managing and the number of applications you are deploying.

If you have a simple cluster with a minimal number of applications deployed, maybe the overhead of dealing with Kubernetes directly is manageable. If you have a large cluster with multiple applications, or you have many clusters each with different applications deployed, you will benefit more due to the ability it offers to ensure consistency.

You will also benefit from using Helm if you need to deploy a lot of 3rd party applications into your cluster, whether this might be to manage databases, ingress controllers, certificate stores, monitoring or any other cross cutting concern you need to have available in the cluster.

The package manager nature of Helm will reduce the overhead in managing all these dependencies in the same way that you manage dependencies at a code level.

As with many tools your need for Helm may grow over time as your estate increases in complexity. If like me you didn't immediately comprehend the nature and purpose of Helm then hopefully this post has helped you recognise what it can offer and how it could benefit your use case.

Monday, 26 August 2024

Avoiding Toiling

 


Site Reliability Engineering (SRE) is the practice of applying software engineering principles to the management of infrastructure and operations.

Originating at Google in the early 2000s the sorts of things an SRE team might work on include system availability, latency and performance, efficiency, monitoring and the ability to deliver change.

Optimising these kinds of system aspects covers many different topics and areas, one of which is the management of toil.

Toil in this context is not simply work we don't particularly enjoy doing or don't find stimulating; it has a specific meaning defined by aspects other than our enjoyment of the tasks it involves.

What is Toil?

Toil is work that exhibits some or all of the following attributes.

It is Manual in nature; even if a human isn't necessarily doing the work, it requires human initiation, monitoring or some other involvement that means a team member has to oversee its operation.

Toil is Repetitive; the times the work has to be done may vary and may not be at regular intervals, but the task needs to be performed multiple times and will never necessarily be deemed finished.

It is Tactical, meaning it is generally reactive; it has to be undertaken in relation to something happening within the system, for example when monitoring highlights that something is failing or is sub-optimal.

It has No Enduring Value, this means it leaves the system in the same state as before the work happened. It hasn't improved any aspect of the system or eliminated the need for the work to happen again in the future.

It Scales with Service Growth. Some work items need to happen regardless of how much a system is used. This tends to be viewed as overhead and is simply the cost of having the system in the first place. Toil scales with system use meaning the more users you attract the greater the impact of the toil on your team.

Finally toil can be Automated, some tasks will always require human involvement, but for a task to be toil it must be possible for it to be automated.

What is Toil's Impact?

It would be wrong to suggest that toil can be totally eliminated, having a production system being used by large numbers of people is always going to incur a certain amount of toil, and it is unlikely that the whole engineering effort of your organisation can be dedicated to removing it.

Also, much like technical debt, even if you do reach a point where you feel it's eliminated, the chances are a future change in the system will re-introduce it.

But also like technical debt the first step is to acknowledge toil exists, develop ways to be able to detect it and have a strategy for managing it and trying to keep it to a reasonable minimum.

Toil's impact is that it engages your engineering resource on tasks that don't add to or improve your system. It may keep it up and running, but that is a low ambition to have for any system.

It's also important to recognise that large amounts of toil are likely to impact a team's morale; very few engineers will embark on their career looking to spend large amounts of time on repetitive tasks that lead to no overall value.

The Alternative to Toil

The alternative to spending time on toil is to spend time on engineering. Engineering is a broad concept but in this context it means work that improves the system itself or enables it to be managed in a more efficient way.

As we said previously, completely eliminating toil is probably an unrealistic aim. But it is possible to measure how much time your team is spending on toil related tasks. Once you are able to estimate this, it is possible both to set a sensible limit on how much time is spent on these tasks and to measure the effectiveness of any engineering activities designed to reduce it.
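
The sketch below shows the kind of simple measurement this implies; the work items and the 50% budget are illustrative values only, not a recommendation.

# Sketch: track how much of a team's time goes on toil against an agreed budget.
# The work log entries and the 50% budget are illustrative values only.
work_log = [
    {"task": "restart stuck batch job", "hours": 3, "toil": True},
    {"task": "automate batch job restarts", "hours": 8, "toil": False},
    {"task": "manually rotate credentials", "hours": 2, "toil": True},
    {"task": "feature development", "hours": 27, "toil": False},
]

toil_hours = sum(item["hours"] for item in work_log if item["toil"])
total_hours = sum(item["hours"] for item in work_log)
toil_fraction = toil_hours / total_hours

print(f"toil: {toil_fraction:.0%} of tracked time")
if toil_fraction > 0.5:   # agreed team budget - adjust to suit your context
    print("over budget: prioritise engineering work that removes this toil")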

This engineering activity might relate to software engineering, refactoring code for performance or reliability, automating testing or certain aspects of the build and deployment pipeline. It might also be more aimed at system engineering, analysing the correctness of the infrastructure the system is running on, analysing the nature of system failures or automating the management of infrastructure.

As previously stated we can view toil as a form of technical debt. In the early days of a system we may take certain shortcuts that at the time are manageable but as the system grows come with a bigger and bigger impact. Time spent trying to fix this debt will set you on a path for gradual system improvement, both for your users and the teams that work on the system.

Saturday, 13 July 2024

The Language of Love

 


Software engineers are often polyglots who will learn or be exposed to multiple programming languages over the course of their career. But I think most will always hold a special affection for the first language they learn, most likely because it's the first time they realise they have the ability to write code and achieve an outcome. Once that bug bites it steers you towards a path where you continue to hone that craft.

For me that language is C and its successor C++.

Potentially my view is biased because of the things I've outlined above but I believe C is a very good language for all potential developers to start with. If you learn how to code close to the metal it will develop skills and a way of thinking that will be of benefit to you as you progress onto high level languages with greater levels of abstraction from how your code is actually running.

In The Beginning

In the late 1960s and early 1970s, as the Unix operating system was being developed, engineers realised that they needed a programming language that could be used to write utilities and programs to run on the newly forming platform.

One of the initial protagonists in this field was Ken Thompson.

After dismissing existing programming languages such as Fortran he started to develop a variant of an existing language called BCPL. He concentrated on simplifying the language structures and making it less verbose. He called this new language B, with the first version being released around 1969.

In 1971 Dennis Ritchie continued to develop B to utilise features of more modern computers as well as adding new data types. This culminated in the release of New B. Throughout 1972 the development continued adding more data types, arrays and pointers and the language was renamed C.

In 1973 Unix was re-written in C, with even more data types being added as C continued to be developed through the 1970s. This eventually resulted in the release of what many consider to be the definitive book on the C programming language, written by Brian Kernighan and Dennis Ritchie. The version of C described in The C Programming Language became known as K&R C and served as the unofficial specification for the language.

C has continued to be under active development right up until the present with C23 expected to be released in 2024.

C with Classes

In 1979 Bjarne Stroustrup began work on what he deemed "C with Classes".

Adding classes to C turned it into an object oriented language; where C had found a home in embedded programming running close to the metal, adding classes made it more suitable for large scale software development.

In 1982 Stroustrup began work on C++, adding new features such as inheritance, polymorphism, virtual functions and operator overloading. In 1985 he released the book The C++ Programming Language, which became the unofficial specification for the language, with the first commercial version being released later that year.

Much like C, C++ has continued to be developed with new versions being released up until the present day.

Usage Today

Software Engineering is often considered to be a fast moving enterprise, and while many other programming languages have been developed over the lifetime of C and C++ both are still very widely used.

Often being used when performance is critical, the fact they run close to the metal allows for highly optimised code for use cases such as gaming, network appliances and operating systems.

Usage of C and C++ can often strike fear into the hearts of developers who aren't experienced in their use. However, the skills that using C and C++ can develop will prove invaluable even when working with higher level languages, so I would encourage all software engineers to spend some time exposing themselves to the languages.

Good engineers can apply their skills using any programming language, the principles and practices of good software development don't vary that much between languages or paradigms. But often there are better choices of language for certain situations, and C and C++ are still the correct choice for many applications.

Friday, 28 June 2024

Vulnerable From All Sides

 


Bugs in software engineering are a fact of life; no engineer, whatever they perceive their skill level to be, has ever written a non-trivial piece of software that didn't on some level have some bugs.

These may be logical flaws, undesirable performance characteristics or unintended consequences when given bad input. As the security of software products has grown ever more important, the presence of a particular kind of bug has become something we have to be very vigilant about.

A vulnerability is a bug that has a detrimental impact on the security of software. Whereas any other bug might cause annoyance or frustration a vulnerability may have more meaningful consequences such as data loss, loss of system access or handing over control to untrusted third parties.

If bugs, and therefore vulnerabilities, are a fact of life then what can be done about them? Well as with anything being aware of them, and making others aware, is a step in the right direction.

Common Vulnerabilities and Exposures 

Run by the Mitre Corporation with funding from the US Department of Homeland Security, Common Vulnerabilities and Exposures (CVE) is a glossary that catalogues software vulnerabilities and, via the Common Vulnerability Scoring System (CVSS), provides them with a score to indicate their seriousness.

 In order for a vulnerability to be added to the CVE it needs to meet the following criteria.

It must be independent of any other issues, meaning it must be possible to fix or patch the vulnerability without needing to fix issues elsewhere.

The software vendor must be aware of the issue and acknowledge that it represents a security risk. 

It must be a proven risk, when a vulnerability is submitted it must be alongside evidence of the nature of the exploit and the impact it has.

It must exist in a single code base; if multiple vendors are affected by the same or similar issues then each will receive its own CVE identifier.

Common Vulnerability Scoring System (CVSS)

Once a vulnerability is identified it is given a score via the Common Vulnerability Scoring System (CVSS). This score ranges from 0 to 10, with 10 representing the most severe.

CVSS itself is based on a reasonably complicated mathematical formula. I won't present all the elements that go into this score here, but the factors outlined below give a flavour of the aspects of a vulnerability that are taken into account; they are sometimes referred to as the base factors.

Firstly, Access Vector relates to the access that an attacker needs to be able to exploit the vulnerability. Do they, for example, need physical access to a device, or can they exploit it remotely from inside or outside a network? Related to this is Access Complexity: does an attack need certain conditions to exist at the time of the attack, or for the system to be in a certain configuration?

Authentication takes into account the level of authentication an attacker needs. This might range from none to admin level.

Confidentiality assesses the impact in terms of data loss of an attacker exploiting the vulnerability. This could range from trivial data loss to the mass export of large amounts of data.

In a similar vein, Integrity assesses an attacker's ability to change or modify data held within the system, and Availability looks at the ability to affect the availability of the system to legitimate users.

Two other important factors are Exploitability and Remediation Level. The first relating to whether code is known to exist that enables the vulnerability to be exploited, and the latter referring to whether the software vendor has a fix or workaround to provide protection.

These and other factors are weighted within the calculation to provide the overall score.
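
The genuine CVSS formula is published by FIRST and is considerably more involved, but the idea of rolling weighted base factors up into a 0 to 10 score can be sketched as below. The weights and ratings here are invented purely for illustration and do not reproduce the real calculation.

# Simplified illustration of combining weighted factors into a 0-10 score.
# This is NOT the real CVSS formula; it only demonstrates the idea of scoring
# a vulnerability from rated base factors.
FACTOR_WEIGHTS = {
    "access_vector": 0.20,
    "access_complexity": 0.15,
    "authentication": 0.15,
    "confidentiality": 0.20,
    "integrity": 0.15,
    "availability": 0.15,
}

def illustrative_score(ratings: dict) -> float:
    """ratings: each factor rated 0.0 (least severe) to 1.0 (worst case)."""
    weighted = sum(FACTOR_WEIGHTS[name] * value for name, value in ratings.items())
    return round(10 * weighted, 1)

example = {
    "access_vector": 1.0,      # remotely exploitable
    "access_complexity": 0.7,  # few preconditions required
    "authentication": 1.0,     # no credentials needed
    "confidentiality": 0.6,
    "integrity": 0.3,
    "availability": 0.3,
}
print(illustrative_score(example))  # a mid-range score on this made-up scale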

Patching, Zero Days and Day Ones

The main defence against the exploitation of vulnerabilities is via installing patches to the affected software. Provided by the vendor these are software changes that address the underlying flaws causing the vulnerability in order to stop it being exploited.

This leads to important aspects about the lifetime of a vulnerability.

The life of a vulnerability starts when it is first inadvertently introduced into the codebase. Eventually it is discovered by the vendor, or by a so-called white hat hacker who notifies the vendor. The vendor will disclose the issue via CVE while also working on a patch to address it.

This leads to a period of time where the vulnerability is known about, by the vendor, white hat hackers and possibly the more nefarious black hat hackers, but where a patch isn't yet available. At this point vulnerabilities are referred to as zero days; they may or may not be being exploited, and no patch exists to make the software safe again.

It may seem like once the patch is available the danger has passed. However, once a patch is released, the nature of the patch often provides evidence of the original vulnerability and gives ideas on how it is exploitable. At this point the vulnerability is referred to as a Day One; the circle of those who may have the ability to exploit it has increased, and vulnerable systems are not safe until the patch has been installed.

CVE provides an invaluable resource in documenting vulnerabilities, forewarned is forearmed. Knowing a vulnerability exists means the defensive action can start, and making sure you stay up to date with all available patches means you are as protected as you can be.