Notes on Web Scalability

Here are some notes that I took while reading the book Web Scalability for Startup Engineers, by Artur Ejsmont.


Vertical vs Horizontal Scalability

Vertical Scalability

Web scalability can generally be divided into two categories – vertical and horizontal scalability. Vertical Scalability involves scaling individual pieces of hardware, such as adding more RAM to a server, or upgrading to a faster CPU.

This approach quickly runs into the Law of Diminishing Returns: each further step up in hardware becomes increasingly expensive.

Horizontal Scalability

Horizontal Scalability involves scaling by adding more pieces of hardware. This tends to be more flexible. This allows you to start a web server and add it to your server pool as needed, and then to gracefully terminate it when it isn’t needed. This is far more feasible than hot swapping and upgrading the components in a single web server when it is under load.

Planning for Scalability

Planning for scalability should begin before the application enters its development phase. For a web application to be scaled efficiently, it should be functionally partitioned, and it is generally a good idea to plan that partitioning before actual development starts.

Take a search engine page as an example.

There are a few things going on here:

  • An HTTP request is sent to
  • The front-end server will receive requests and send requests to back-end servers for necessary information.
  • A back-end server is contacted to retrieve information related to my account, such as username and profile picture.


The frontend server and the backend server can be functionally partitioned into two separate services. By decoupling them, we benefit in multiple ways:

  • We can scale the front end and the backend service separately. A request to generate an HTML file generally isn’t expensive, but a database record retrieval can be. The backend should be scaled separately from the frontend.

Basic Web Scalability Tools

Load Balancer

Load balancers make horizontal scalability possible. A load balancer will distribute incoming requests amongst multiple servers. A load balancer should be able to tell when a server is down, and send requests only to working servers. In doing so, it can reduce single points of failure.

A few advantages to using a load balancer include:

  • Hidden maintenance: If your frontend servers are truly stateless, you can take any of them in and out of rotation, and have your load balancer redirect traffic, without the client ever noticing any downtime.
  • Increase capacity: A load balancer allows you to easily increase capacity by adding a new server, and then routing traffic to it.
  • Failure management: If a server starts to fail, it can be taken out of rotation, and the load balancer simply stops sending requests to it.
  • Automated scaling: If a load balancer detects that we’re running out of servers to serve our users, it can automatically start scaling and adding more servers, without any human action.
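
The rotation and failure-management behavior described above can be sketched as a toy round-robin balancer. This is only an illustration: the class name, method names, and server names are made up for this example, and a real load balancer would detect failures through health checks rather than explicit calls.

```python
class RoundRobinBalancer:
    """Toy load balancer: rotates through servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._index = 0

    def mark_down(self, server):
        # Take a server out of rotation (maintenance or failure).
        self.healthy.discard(server)

    def mark_up(self, server):
        # Put a known server back into rotation.
        if server in self.servers:
            self.healthy.add(server)

    def add_server(self, server):
        # Increase capacity by adding a new server to the pool.
        self.servers.append(server)
        self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call.
        for _ in range(len(self.servers)):
            server = self.servers[self._index % len(self.servers)]
            self._index += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
lb.mark_down("web2")  # clients never see web2 while it is out of rotation
print([lb.next_server() for _ in range(4)])
```

Because the servers are assumed to be stateless, it does not matter which one a given request lands on.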


Content Delivery Network

A content delivery network (or CDN) will serve static files (such as images, JavaScript, and CSS) for your application, reducing the load on your servers. CDNs generally already distribute their load across multiple servers, taking the burden of that responsibility off you. However, if a requested file is not on the CDN, the request will be forwarded to your server. A CDN hosts data on external servers that are geographically distributed to provide faster delivery, reduce bandwidth, reduce load times, and even provide DDoS mitigation.

Web Cache

Instead of buying faster servers to respond to requests faster, caching can be used to avoid hitting the server with these requests in the first place. It can be hard to cache entire HTTP responses in a dynamic application, so carefully cache static content, and HTML snippets when possible.

Web caches will temporarily save responses from previous requests (such as HTML files, images, scripts, and CSS files), so they do not have to be retrieved again. Caching can happen at the browser level (your browser will cache files for you), or at the server level (a server may cache popular files).
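
As a rough sketch of the idea, here is a tiny in-memory cache that keeps a response for a fixed time before fetching it again. The `TTLCache` class, the `fetch` callback, and the 60-second lifetime are assumptions made for this example, not something taken from the book.

```python
import time


class TTLCache:
    """Keeps each response for ttl_seconds, then fetches it again."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # url -> (expires_at, response)

    def get(self, url, fetch):
        entry = self._store.get(url)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the server is never contacted
        response = fetch(url)  # cache miss or expired entry
        self._store[url] = (now + self.ttl, response)
        return response


def slow_fetch(url):
    # Stand-in for a real request to the origin server.
    return "contents of " + url


cache = TTLCache(ttl_seconds=60)
cache.get("/style.css", slow_fetch)  # goes to the server
cache.get("/style.css", slow_fetch)  # served from the cache
```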

Designing for Scalability

Single Responsibility

The single responsibility principle states that your classes should have only one responsibility. Another popular way of thinking of it is as a single reason to change. In other words, if you can think of at least two reasons to change a class, it can probably be split into more granular classes with more discrete responsibilities. This generally makes each class more readable and easier to refactor and test.

Inversion of Control and Dependency Injection

Inversion of Control removes dependencies from your code by having classes not know who will create or use them. This keeps these classes as generic as possible. Code that follows the Inversion of Control principle will use dependency injection: that is, the things a class or function depends on are passed into it, instead of being hardcoded in it. This helps you functionally partition your application.
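
A minimal sketch of dependency injection, using made-up class names (`SignupService`, `LoggingMailer`, `InMemoryMailer`): the service never constructs its own mailer, so any object with a `send` method can be swapped in.

```python
class LoggingMailer:
    """Hypothetical mailer that just formats the message it would send."""

    def send(self, to, body):
        return f"mail to {to}: {body}"


class InMemoryMailer:
    """Hypothetical mailer for tests: records messages instead of sending."""

    def __init__(self):
        self.sent = []

    def send(self, to, body):
        self.sent.append((to, body))
        return "queued"


class SignupService:
    def __init__(self, mailer):
        # The dependency is injected; SignupService does not know or care
        # which mailer implementation it receives.
        self.mailer = mailer

    def register(self, email):
        return self.mailer.send(email, "Welcome!")


print(SignupService(LoggingMailer()).register("a@example.com"))
```

Swapping `LoggingMailer` for `InMemoryMailer` requires no change to `SignupService`, which is what makes the class easy to test and to move behind a separate service later.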


Functional Partitioning

Functional partitioning refers to decomposing a large application into smaller independent systems. This allows each smaller system to be developed and operated independently of the others: they can scale independently and even be written in different frameworks.

Data Partitioning

Data partitioning refers to keeping small subsets of data on a machine, instead of the entire application’s dataset. For example, if a user login server has been set up to handle users with email starting from A through F, it only needs to store those users in its database.
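
The email example could be routed with something like the sketch below. The letter ranges beyond A through F and all of the server names are invented for illustration.

```python
# Hypothetical letter ranges and server names.
RANGES = [
    ("a", "f", "login-1"),
    ("g", "m", "login-2"),
    ("n", "s", "login-3"),
    ("t", "z", "login-4"),
]


def server_for_email(email):
    """Pick the login server responsible for this email's first letter."""
    first = email.strip().lower()[:1]
    for low, high, server in RANGES:
        if low <= first <= high:
            return server
    # Emails that start with a digit or symbol go to a catch-all server.
    return "login-other"


print(server_for_email("Alice@example.com"))
```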


A clone holds a copy of an application’s component, or of a server. It should be nearly identical to other clones, so a request can be fulfilled by any clone with identical results. This allows them to be swapped in and out of production as needed.


Keep your clones stateless

To efficiently use clones, they should be as stateless as possible. There should be no data stored on a clone that another clone would need to fulfill that same request – in other words, no session data. However, many applications will require session data of some sort (such as authentication). In this case, data will need to be synchronized between the clones.

Synchronizing sessions

If a web request does require session data to be persisted, it should be persisted on a server that is separate from the one serving the request. For example, it can be persisted on a machine that is solely dedicated to storing session data, or on the load balancer.

Storing session data in cookies

Session data, such as a UUID (Universally Unique Identifier), can be stored in a browser cookie. This adds overhead, as the encrypted cookie has to be sent with every request. This is not always negligible. Sending a 1KB cookie is especially wasteful when requesting static files, which do not need it.

Storing session data in cookies is also a security concern. Users can modify data stored in their cookies (cookie tampering), and an attacker who steals a session cookie can impersonate its owner, an attack known as Session Hijacking.
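
One common way to detect modified cookie data is to sign the value on the server and verify the signature on every request. The sketch below uses Python's standard `hmac` module; the function names, the `value.signature` cookie layout, and the secret key are assumptions for this example. Signing only detects tampering: it does not hide the value, and it does not help if the whole cookie is stolen.

```python
import hashlib
import hmac

# Assumption: in a real system this key is long, random, and kept secret.
SECRET_KEY = b"replace-with-a-real-secret"


def _signature(value):
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_cookie(value):
    """Return the cookie value with an HMAC-SHA256 signature appended."""
    return f"{value}.{_signature(value)}"


def verify_cookie(cookie):
    """Return the original value, or None if the cookie was modified."""
    value, sep, mac = cookie.rpartition(".")
    if not sep:
        return None  # no signature present
    expected = _signature(value)
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return value
    return None


cookie = sign_cookie("user_id=42")
print(verify_cookie(cookie))                       # the original value
print(verify_cookie(cookie.replace("42", "1", 1)))  # None: value was changed
```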

Session data in databases

Session data can be stored in databases that can be accessed from any server. However, this will add overhead to each request, as each one will have to read from and/or write to the database. This can put significant load on the database.

Session data in load balancers

Load balancers can also handle session data. If the load balancer always routes a user to the same server (often called sticky sessions), session information does not need to be shared across nodes. However, this introduces a single point of failure: if the load balancer goes down, the session data it stored goes down with it.

Additionally, since each user's requests are forwarded to the same server each time, the load balancer may not distribute traffic equally across nodes, particularly when a node is added or removed.

Architectural Layers

Web Application Layer

The web application layer is responsible only for generating the HTML that is served to the client. It should not actually contain any business logic. All it does is handle user interactions and translate them into internal web services calls.

Web Services Layer

The web services layer contains much of the application logic. For example, it could host what you might know as a microservice. Interactions between the web application layer and the web services layer are usually done through HTTP requests to a REST API.

At the highest level, an application backend can be split into smaller independent services. For example, a login service should be distinct from a file storage service.

Scaling MySQL Servers

Replication allows you to have duplicate copies of a piece of data, shared across multiple machines.

Single Master Machine

In the simplest MySQL scaling scenario, there is a single master machine, and one or more slave servers. The slave servers are read only, and the master server is the only server that can modify the database.

When the master machine handles any sort of update to the database, it writes these changes to a binlog, and each change has an ID associated with it. Slave servers can request changes from the master machine asynchronously. Since IDs are associated with changes, a slave can request binlogs from a checkpoint, asking for binlogs since a certain ID.
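
The checkpoint idea can be sketched with a toy in-memory model. This is not MySQL's actual replication protocol; the `Master` and `Replica` classes and their methods are invented to show how a replica catches up by asking for every change after the last ID it applied.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.binlog = []  # list of (change_id, key, value)

    def write(self, key, value):
        change_id = len(self.binlog) + 1
        self.data[key] = value
        self.binlog.append((change_id, key, value))
        return change_id

    def changes_since(self, last_id):
        # Everything the replica has not seen yet.
        return [change for change in self.binlog if change[0] > last_id]


class Replica:
    def __init__(self):
        self.data = {}
        self.last_applied_id = 0  # the checkpoint

    def sync(self, master):
        # Pull changes whenever convenient; the master does not wait for us.
        for change_id, key, value in master.changes_since(self.last_applied_id):
            self.data[key] = value
            self.last_applied_id = change_id


master, replica = Master(), Replica()
master.write("name", "alice")
master.write("name", "bob")
replica.sync(master)
print(replica.data, replica.last_applied_id)
```

Until `sync` runs, the replica serves older data, which is the replication lag that asynchronous replication implies.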

Handling a slave failure

Handling a failing MySQL slave machine is not simple. It cannot simply be rebooted and repopulated from binlogs. It must be restored from a backup of the database, along with the ID of the last binlog entry that the backup had synchronized from the master, so that replication can resume from that point.

In other words, you’re going to need to back up your database periodically.

Master-Master setup

A master-master setup is not used to increase write scalability; it is used to increase availability. At the end of the day, every write will still have to be performed by both master servers.

With a master-master setup, all writes done by a master are recorded to its binlog, and then sent to the other master server's relay log. The second master then applies everything in its relay log, and also writes those changes to its own binlog, so that its slave servers can replicate them. Information is also stored about the server that originated each change, so that changes are not re-sent to the original machine.

Keep in mind, this means the master machines have the same data set. There is no data partitioning in this scenario. In fact, there will also be replication lag because of the overhead associated with adding more master servers.

Replication does not help you scale your writes, but it does help you scale your reads. If your application is read-heavy, it is a good idea to replicate so that you can have multiple slave machines handling reads.

Data Partitioning

Data partitioning involves splitting the data set into multiple smaller sets that can be distributed across multiple machines. That prevents any single server from being responsible for the full data set. This is also known as sharding.

Sharding Key

When you shard information, your application needs to know which server has the information that is being requested. This is what sharding keys are for.

For example, if you were splitting a database of users' purchases, the purchases could be sharded across multiple servers, and the sharding key would be the User ID associated with each purchase.

ACID Transactions

ACID refers to transaction properties that are supported by most relational database engines.

  • Atomicity: A transaction is executed in its entirety or not at all; it is either completed or rejected.
  • Consistency: Every transaction transforms the data set from one consistent state to another.
  • Isolation: Transactions can run in parallel, independently of one another.
  • Durability: Data is saved before the request returns to the client, so if the server fails the transaction shouldn’t be lost.

When you distribute your data across multiple machines, you can no longer rely on the database engine to provide ACID guarantees for transactions that span more than one machine.

What to do with Shard keys when adding more servers

Long story short… You can use modulo on IDs to determine which server to send requests to. But this becomes a massive pain when you add a server, and you have to recalculate which requests go to which server.
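
To see how much of a pain, here is a small sketch that counts how many keys change servers when a modulo-sharded cluster grows from 4 to 5 servers. The function names and the choice of 1,000 consecutive user IDs are assumptions for this example.

```python
def shard_for(user_id, num_servers):
    """Modulo sharding: the remainder picks the server index."""
    return user_id % num_servers


def moved_keys(user_ids, old_count, new_count):
    """IDs whose server changes when the server count changes."""
    return [
        uid for uid in user_ids
        if shard_for(uid, old_count) != shard_for(uid, new_count)
    ]


moved = moved_keys(range(1000), old_count=4, new_count=5)
print(len(moved))  # 800 of the 1000 IDs now map to a different server
```

With these numbers, 80% of the data would have to be migrated just to add one server.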

Instead, store key-server mapping data on another server, that’s responsible only for this mapping process.
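
A rough sketch of such a mapping service follows. The `ShardDirectory` class, its least-loaded placement rule, and the server names are assumptions for illustration; the point is only that existing keys keep their assignment when a server is added.

```python
class ShardDirectory:
    """Toy key-to-server lookup table."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._mapping = {}  # sharding key -> server name

    def _load(self, server):
        return sum(1 for s in self._mapping.values() if s == server)

    def server_for(self, key):
        if key not in self._mapping:
            # Assumption: new keys go to whichever server holds the fewest.
            self._mapping[key] = min(self.servers, key=self._load)
        return self._mapping[key]

    def add_server(self, server):
        # Existing assignments are untouched; only new keys land here.
        self.servers.append(server)


directory = ShardDirectory(["db1", "db2"])
before = directory.server_for("user:1")
directory.add_server("db3")
print(before == directory.server_for("user:1"))  # True: no remapping needed
```

The trade-off is that the mapping server itself now has to be available and fast, since every request consults it.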

Scaling NoSQL

NoSQL servers drop ACID property guarantees but allow you to scale data in interesting ways. Choosing a NoSQL data store isn’t a trivial choice – it depends on your priorities and what guarantees are important to you. Which brings us to the CAP theorem…

CAP Theorem

According to the CAP Theorem, it is impossible to build a distributed system that simultaneously guarantees all three of consistency, availability, and partition tolerance.

Consistency: all nodes have the same data at the same time. Reading from a server will return the most up to date version of a piece of data.

Availability: A node will return a response within a reasonable amount of time, without error-ing or timing out.

Partition tolerance: The system will continue to function even when a network failure splits the nodes into groups that cannot communicate with one another.

Of the three, the CAP theorem states that we can only have two at once. Thing is, failures happen, so we always need partition tolerance. So this means we have to choose between consistency and availability.

Eventual Consistency

Some NoSQL stores sacrifice consistency to scale databases. Instead of having full consistency, they have eventual consistency. When you retrieve a piece of data from a server, you might get a stale version of it.

Servers will accept reads and writes at all times, and try to replicate state changes to their peers eventually. However, this could lead to conflicting changes. One policy is to accept the most recent write, but it could lead to data loss.

Dynamo handles this differently: instead of having the database handle conflicts, it puts that responsibility onto the client. The client will then decide how to resolve the conflict.
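
The two policies can be contrasted with a small sketch. The data shape (a list of `(timestamp, items)` versions of a shopping cart) and both function names are assumptions for this example, and the union merge is just one possible client-side strategy.

```python
def last_write_wins(versions):
    """Keep only the version with the newest timestamp."""
    return max(versions, key=lambda version: version[0])[1]


def client_merge(versions):
    """Client-side resolution: combine the items from every version."""
    merged = set()
    for _, items in versions:
        merged |= set(items)
    return sorted(merged)


# Two replicas accepted different writes to the same cart.
conflicting = [(100, ["book"]), (101, ["pen"])]

print(last_write_wins(conflicting))  # ['pen'], the book is lost
print(client_merge(conflicting))     # ['book', 'pen']
```

A union merge like this keeps both additions, but it can also bring back an item that one of the writes had deliberately removed, so the right strategy depends on the application.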

Quorum consistency is when you propagate changes to peers, and the majority of servers need to confirm that they have persisted your change. 
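
A minimal sketch of a quorum write, with an invented `Node` class and `quorum_write` function: the write counts as successful only when more than half of the nodes confirm that they persisted it.

```python
class Node:
    def __init__(self, up=True):
        self.up = up
        self.data = {}

    def persist(self, key, value):
        """Return True if the change was saved, False if the node is down."""
        if not self.up:
            return False
        self.data[key] = value
        return True


def quorum_write(nodes, key, value):
    """Succeed only if a majority of nodes acknowledge the write."""
    acks = sum(1 for node in nodes if node.persist(key, value))
    majority = len(nodes) // 2 + 1
    return acks >= majority


cluster = [Node(), Node(), Node(up=False)]
print(quorum_write(cluster, "cart", ["book"]))  # True: 2 of 3 confirmed
```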


Cassandra is a NoSQL data store developed by Facebook. In the Cassandra architecture, all nodes are considered to be equal. Each node holds a subset of the overall data set, and the nodes communicate with one another so that they know which nodes are responsible for which parts of the data set. In other words, there is no key-mapping server, because each node knows this information. As a result, any node can delegate queries to the appropriate servers.

When a Cassandra node fails, it does not need to be restored from backup like a MySQL server does. Each piece of data is stored on multiple servers, so Cassandra just needs to know which node a new node is replacing, and the proper data will be transferred over to it.


MongoDB trades availability for consistency. It does not guarantee that all clients can read and write data all of the time, but it does focus on failure recovery. Each piece of data belongs to a single server, so if that server goes down, MongoDB will reject all writes to that data.

Replica sets are used to mitigate the hit to availability. Replica sets allow you to share data across multiple servers, with one being elected as the primary. When the primary fails, one of the replicas takes over. But if a primary node fails before it can send its changes over to the replica nodes, then the data is permanently lost.

In other words, even though MongoDB trades off availability, it’s not exactly consistent. There can be data loss.