Turning IPTables into a TCP load balancer for fun and profit

In this technical deep dive into iptables, the Linux network security configuration utility, we'll see why and how to build a sophisticated TCP router and load balancer capable of handling IoT application traffic.

Most Platform as a Service solutions are limited to web application hosting, reachable via the HTTP protocol. However, in memory-, CPU- and battery-constrained environments, like the IoT world, folks don't use HTTP. Instead, a custom, fast, and lightweight TCP-based protocol is usually preferred.

When you think about it, the application "BUILD" and "RUN" stages are very similar to those of a web application. Programming languages (NodeJS especially) and databases are usually shared as well. Therefore, the only limiting factor preventing a PaaS from hosting IoT apps is the lack of a TCP routing layer.

This TCP routing layer must be able to do the following operations:

Route raw TCP packets to the correct application
Load balance those connections across multiple containers

For HTTP routing, Scalingo uses OpenResty. However, it cannot be used for TCP routing (or so we thought, see the conclusion). That's why we chose another approach based on iptables.

Network infrastructure and goals

First, let's define our different networks. In this article, we will consider two distinct networks:

Our public network: 192.168.1.0/24 where the clients are
Our private network: 10.0.0.0/24 where the servers are (they host the app containers)

The public network has one client with the IP: 192.168.1.2 and the private network has three servers with the IPs: 10.0.0.2, 10.0.0.3 and 10.0.0.4.

The last part of the setup is a front server that links both networks, with the IPs 10.0.0.1 (private side) and 192.168.1.1 (public side).

In the following sections, we’ll assume that all operations and commands take place on the front server unless stated otherwise.
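One prerequisite worth checking before going further: a Linux host only forwards packets between its interfaces when the net.ipv4.ip_forward kernel setting is 1. The sketch below is a small check for that setting; the forwarding_enabled helper and its optional path argument are hypothetical names introduced here for illustration, not part of our tooling.

```shell
# Report whether IPv4 forwarding is enabled on this host.
# The optional argument (a file path) is a hypothetical hook for testing;
# by default the standard Linux procfs entry is read.
forwarding_enabled() {
  state_file=${1:-/proc/sys/net/ipv4/ip_forward}
  [ "$(cat "$state_file" 2>/dev/null)" = "1" ]
}

if forwarding_enabled; then
  echo "IPv4 forwarding: on"
else
  echo "IPv4 forwarding: off"
fi

# To turn forwarding on until the next reboot (as root):
#   sysctl -w net.ipv4.ip_forward=1
```

If the check reports "off", the NAT rules in the following sections will be accepted by iptables but no packet will cross from one network to the other.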

NAT

Let's start by trying to redirect all traffic coming to the TCP port 27017 on the 192.168.1.1 IP to the port 1234 of the 10.0.0.2 server in the private network.

This is done via a process called Network Address Translation (or NAT). In this article we will focus on two different NAT methods: DNAT and SNAT.

DNAT

DNAT (Destination NAT) rewrites the destination of a packet: the destination address in the IP header and the destination port in the TCP header.

Here, the destination IP of our packet should be rewritten to 10.0.0.2 and the destination port should be rewritten to 1234.

The following transformation happens:

   PACKET RECEIVED                   PACKET FORWARDED
|---------------------|           |---------------------|
|    IP PACKET        |           |    IP PACKET        |
|                     |           |                     |
| SRC: 192.168.1.2    |           | SRC: 192.168.1.2    |
| DST: 192.168.1.1    |           | DST: 10.0.0.2       |
| |---------------|   |           | |---------------|   |
| |   TCP PACKET  |   | =(DNAT)=> | |   TCP PACKET  |   |
| | DPORT: 27017  |   |           | | DPORT: 1234   |   |
| | SPORT: 23456  |   |           | | SPORT: 23456  |   |
| | ... DATA ...  |   |           | | ... DATA ...  |   |
| |---------------|   |           | |---------------|   |
|---------------------|           |---------------------|

To do so, we will need to use the PREROUTING chain in the nat table of iptables.

# Append (-A) a rule to the PREROUTING chain of the nat table (-t nat).
# Match: TCP packets (-p tcp) sent to 192.168.1.1 (-d), port 27017 (--dport).
# Action: the DNAT target (-j DNAT) rewrites the IP and TCP destination
# to 10.0.0.2:1234 (--to-destination).
iptables -t nat -A PREROUTING \
  -p tcp -d 192.168.1.1 --dport 27017 \
  -j DNAT --to-destination 10.0.0.2:1234

That's all. Now if we try to connect to the iptables host on the port 27017 our traffic will be redirected to our server.

If we try that on the client:

user@client ~ $ echo "Hi from client" | nc 192.168.1.1 27017

This command hangs, and the server shows nothing.

By looking at the packets received by Server 1, we can see that the iptables rule worked and the traffic has been redirected to the correct destination.

user@server-1 ~ $ tcpdump -i eth1
15:19:17.832609 IP 192.168.1.2.23456 > 10.0.0.2.1234: Flags [S],
  seq 37761180, win 29200, options [mss 1460,sackOK,
  TS val 21306607 ecr 0,nop,wscale 6], length 0

SNAT

The command hung because the server does not know how to respond to that client: the source IP is still set to 192.168.1.2, which is not on its network.

The solution is to also modify the source IP and source port headers on the front server. This is done using the SNAT method.

The following transformations will occur:

  PACKET RECEIVED                                             PACKET FORWARDED
|-------------------|         |-------------------|         |-------------------|
|    IP PACKET      |         |     IP PACKET     |         |     IP PACKET     |
|                   |         |                   |         |                   |
| SRC: 192.168.1.2  |         | SRC: 192.168.1.2  |         | SRC: 10.0.0.1     |
| DST: 192.168.1.1  |         | DST: 10.0.0.2     |         | DST: 10.0.0.2     |
| |---------------| |         | |---------------| |         | |---------------| |
| |   TCP PACKET  | |=(DNAT)=>| |   TCP PACKET  | |=(SNAT)=>| |   TCP PACKET  | |
| | DPORT: 27017  | |         | | DPORT: 1234   | |         | | DPORT: 1234   | |
| | SPORT: 23456  | |         | | SPORT: 23456  | |         | | SPORT: 38921  | |
| | ... DATA ...  | |         | | ... DATA ...  | |         | | ... DATA ...  | |
| |---------------| |         | |---------------| |         | |---------------| |
|-------------------|         |-------------------|         |-------------------|

SNAT takes place after the routing decision has been made (and therefore after our DNAT rule has been applied), so the SNAT rule goes in the POSTROUTING chain of the nat table.

# Append a rule to the POSTROUTING chain of the nat table.
# Match: TCP packets going to 10.0.0.2 (-d), port 1234 (--dport).
# Action: the SNAT target (-j SNAT) rewrites the source IP
# to 10.0.0.1 (--to-source).
iptables -t nat -A POSTROUTING \
  -p tcp -d 10.0.0.2 --dport 1234 \
  -j SNAT --to-source 10.0.0.1

The kernel's connection tracking keeps a translation table in memory, so packets returning from the server are automatically rewritten and sent back to the client.

By retrying our previous nc command, we get:

user@client ~ $ echo "Hi from client" | nc 192.168.1.1 27017
Hi from server

By looking at the packets received by Server 1, we can see that both the source and the destination have been changed by our front server.

user@server-1 ~ $ tcpdump -i eth1
15:29:37.384773 IP 10.0.0.1.38921 > 10.0.0.2.1234:
  Flags [S], seq 3215489734, win 29200, options [mss 1460,sackOK,
  TS val 21461495 ecr 0,nop,wscale 6], length 0
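Since the DNAT and SNAT rules always come as a pair, it can be convenient to generate them together. The following is a minimal sketch, not our production tooling: nat_pair_rules is a hypothetical helper that only prints the two commands shown in this section, so it can be run without root to review the result.

```shell
# Hypothetical helper: print (do not apply) the DNAT + SNAT rule pair
# that forwards one public IP:port to one backend IP:port.
# Arguments: public IP, public port, backend IP, backend port, SNAT source IP.
nat_pair_rules() {
  vip=$1; vport=$2; backend_ip=$3; backend_port=$4; snat_ip=$5
  echo "iptables -t nat -A PREROUTING -p tcp -d $vip --dport $vport -j DNAT --to-destination $backend_ip:$backend_port"
  echo "iptables -t nat -A POSTROUTING -p tcp -d $backend_ip --dport $backend_port -j SNAT --to-source $snat_ip"
}

# Values taken from the example in this article.
nat_pair_rules 192.168.1.1 27017 10.0.0.2 1234 10.0.0.1
```

Once the printed commands look right, they can be applied by piping the output to sh as root.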

Securing the system

Iptables is commonly used as a firewall. It's time to use its main feature by adding some rules to drop every forwarded packet not explicitly allowed.

Each iptables chain has a default policy. Any packet that doesn’t match a rule in the chain follows that default policy. With a DROP default policy, any connection that is not explicitly accepted will be dropped.

iptables -t filter -P FORWARD DROP

The SNAT and DNAT rules written previously only modify the packet headers; they have no effect on filtering. With the default policy set to DROP, we now need to explicitly accept traffic going to and coming from Server 1:

# Accept traffic to Server 1 (--dport and --sport require -p tcp)
iptables -t filter -A FORWARD -p tcp -d 10.0.0.2 --dport 1234 -j ACCEPT
# Accept traffic from Server 1
iptables -t filter -A FORWARD -p tcp -s 10.0.0.2 --sport 1234 -j ACCEPT

We are now able to forward traffic going to the TCP port 27017 of our front server to a server hosting a single node application.

Load balancing

The next step is now to distribute connections across multiple nodes hosting our application.

In order to load balance between multiple hosts, a solution is to change the DNAT rule so it won't always redirect the clients to a single node but distribute them across multiple nodes.

To distribute those connections between Server 1, Server 2 and Server 3, we could be tempted to define those rules:

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -j DNAT --to-destination 10.0.0.2:1234

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -j DNAT --to-destination 10.0.0.3:1234

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -j DNAT --to-destination 10.0.0.4:1234

However, the iptables engine is deterministic: the first matching rule is always used. In this example, Server 1 would get all the connections.

To address this issue, iptables includes a module called statistic, which skips or accepts a rule based on statistical conditions.

The statistic module supports two different modes:

random: the rule is skipped based on a probability
nth: the rule is skipped based on a round robin algorithm

Note that the load balancing will only be done during the connection phase of the TCP protocol. Once the connection has been established, the connection will always be routed to the same server.

Random balancing

To really load balance traffic on 3 different servers, the previous three rules become:

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -m statistic --mode random --probability 0.33            \
         -j DNAT --to-destination 10.0.0.2:1234

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -m statistic --mode random --probability 0.5             \
         -j DNAT --to-destination 10.0.0.3:1234

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -j DNAT --to-destination 10.0.0.4:1234

Notice that the rules use different probabilities rather than 0.33 everywhere, and that the last rule has no condition at all. The reason is that the rules are evaluated sequentially.

With a probability of 0.33, the first rule is applied 33% of the time and skipped 67% of the time.

With a probability of 0.5, the second rule is applied 50% of the time and skipped 50% of the time. However, since this rule is placed after the first one, it is only evaluated for the 67% of connections that the first rule skipped. Hence this rule is applied to only \(50\% \times 67\% \approx 33\%\) of connections.

Since only the remaining 33% of the traffic reaches the last rule, that rule must always be applied.
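As a check on the arithmetic, here are the shares each server receives with the exact values used in the rules (0.33 and 0.5). Nothing here is measured; it is only the product of the probabilities along the chain:

\[
\begin{aligned}
P(\text{10.0.0.2}) &= 0.33 \\
P(\text{10.0.0.3}) &= (1 - 0.33) \times 0.5 = 0.335 \\
P(\text{10.0.0.4}) &= (1 - 0.33) \times (1 - 0.5) = 0.335
\end{aligned}
\]

The three shares sum to 1, and the small imbalance comes only from rounding \(1/3\) down to 0.33.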

You can compute the probability to set on each rule from the number of rules \(n\) and the rule index \(i\) (starting at 1) with \(p=\frac{1}{n-i+1}\). For \(n=3\), this gives \(1/3\), \(1/2\) and \(1\).
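The formula can be applied mechanically for any number of backends. The sketch below assumes a hypothetical helper, random_balancing_rules, which prints (without applying) one DNAT rule per backend and computes each probability with awk. The five-decimal precision is an arbitrary choice made here.

```shell
# Hypothetical helper: print the random-mode DNAT rules for a list of backends.
# Arguments: public IP, public port, then one "ip:port" per backend.
random_balancing_rules() {
  vip=$1; vport=$2; shift 2
  n=$#
  i=1
  for backend in "$@"; do
    if [ "$i" -lt "$n" ]; then
      # p = 1 / (n - i + 1), with i starting at 1
      p=$(LC_ALL=C awk -v n="$n" -v i="$i" 'BEGIN { printf "%.5f", 1 / (n - i + 1) }')
      stat="-m statistic --mode random --probability $p "
    else
      # The last rule must always match, so it carries no statistic condition.
      stat=""
    fi
    echo "iptables -t nat -A PREROUTING -p tcp -d $vip --dport $vport ${stat}-j DNAT --to-destination $backend"
    i=$((i + 1))
  done
}

random_balancing_rules 192.168.1.1 27017 10.0.0.2:1234 10.0.0.3:1234 10.0.0.4:1234
```

With three backends this prints the same three rules as above, with 0.33333 and 0.50000 as probabilities.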

Round Robin

The other way to do this is to use the nth mode, which implements a round robin algorithm.

This mode takes two parameters: every (n) and packet (p). The rule matches one packet out of every n, starting at packet p.

To load balance between three different hosts you will need to create those three rules:

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -m statistic --mode nth --every 3 --packet 0              \
         -j DNAT --to-destination 10.0.0.2:1234

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -m statistic --mode nth --every 2 --packet 0              \
         -j DNAT --to-destination 10.0.0.3:1234

iptables -A PREROUTING -t nat -p tcp -d 192.168.1.1 --dport 27017 \
         -j DNAT --to-destination 10.0.0.4:1234
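To see why the values 3, 2 and "no condition" yield a round robin, the three rules can be simulated in plain shell. This is only a model, written under the assumption that each rule keeps its own counter and counts only the packets that actually reach it, which is how the statistic match behaves to our knowledge. The simulate_nth function is a name made up for this sketch.

```shell
# Simulate the three nth rules for a given number of new connections.
simulate_nth() {
  total=$1
  seen1=0   # packets seen by rule 1 (--every 3 --packet 0)
  seen2=0   # packets seen by rule 2 (--every 2 --packet 0)
  conn=0
  while [ "$conn" -lt "$total" ]; do
    hit1=$((seen1 % 3)); seen1=$((seen1 + 1))
    if [ "$hit1" -eq 0 ]; then
      target=10.0.0.2
    else
      # Rule 2 only sees the packets that rule 1 skipped.
      hit2=$((seen2 % 2)); seen2=$((seen2 + 1))
      if [ "$hit2" -eq 0 ]; then
        target=10.0.0.3
      else
        # Rule 3 has no condition and takes everything left.
        target=10.0.0.4
      fi
    fi
    echo "connection $conn -> $target"
    conn=$((conn + 1))
  done
}

simulate_nth 6
```

In this model the connections go to 10.0.0.2, 10.0.0.3 and 10.0.0.4 in turn, then the cycle repeats.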

Allowing the traffic to pass

Since we have a DROP default policy on our FORWARD chain in the filter table, we need to allow the three remote servers. This can be done with 6 iptables rules:

iptables -t filter -A FORWARD -p tcp -d 10.0.0.2 --dport 1234 -j ACCEPT
iptables -t filter -A FORWARD -p tcp -d 10.0.0.3 --dport 1234 -j ACCEPT
iptables -t filter -A FORWARD -p tcp -d 10.0.0.4 --dport 1234 -j ACCEPT
iptables -t filter -A FORWARD -p tcp -s 10.0.0.2 --sport 1234 -j ACCEPT
iptables -t filter -A FORWARD -p tcp -s 10.0.0.3 --sport 1234 -j ACCEPT
iptables -t filter -A FORWARD -p tcp -s 10.0.0.4 --sport 1234 -j ACCEPT
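With more backends, writing these rules by hand gets tedious. A minimal sketch of a generator follows; forward_accept_rules is a hypothetical helper that prints two ACCEPT rules per backend (one per direction) instead of applying them. Note that it passes -p tcp, which iptables requires before --dport or --sport.

```shell
# Hypothetical helper: print the FORWARD accept rules for a list of backends.
# Arguments: backend port, then one backend IP per argument.
forward_accept_rules() {
  port=$1; shift
  for ip in "$@"; do
    # Traffic going to the backend
    echo "iptables -t filter -A FORWARD -p tcp -d $ip --dport $port -j ACCEPT"
    # Traffic coming back from the backend
    echo "iptables -t filter -A FORWARD -p tcp -s $ip --sport $port -j ACCEPT"
  done
}

forward_accept_rules 1234 10.0.0.2 10.0.0.3 10.0.0.4
```

The rules come out grouped by backend rather than by direction, which makes no difference here since they are all ACCEPT rules.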

Now if our client tries to contact our application, we get the following output from our client:

user@client ~ $ echo "Hi from client" | nc 192.168.1.1 27017
Hi from 10.0.0.2
user@client ~ $ echo "Hi from client" | nc 192.168.1.1 27017
Hi from 10.0.0.3
user@client ~ $ echo "Hi from client" | nc 192.168.1.1 27017
Hi from 10.0.0.4
user@client ~ $ echo "Hi from client" | nc 192.168.1.1 27017
Hi from 10.0.0.2
[...]
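To judge the distribution over more than a handful of connections, the responses can be tallied per backend. The sketch below assumes each server answers with a line ending in its own IP, as in the output above; tally_backends is a hypothetical helper name.

```shell
# Count how many responses came from each backend.
# Reads one response per line on stdin and uses the last word as the key.
tally_backends() {
  awk '{ count[$NF]++ } END { for (b in count) print b, count[b] }' | sort
}

# Demo using the four responses shown above.
printf '%s\n' "Hi from 10.0.0.2" "Hi from 10.0.0.3" \
              "Hi from 10.0.0.4" "Hi from 10.0.0.2" | tally_backends

# Against the real setup, from the client:
#   for i in $(seq 1 30); do
#     echo "Hi from client" | nc 192.168.1.1 27017
#   done | tally_backends
```

With the nth rules the counts should stay within one of each other, while the random rules only even out over a large number of connections.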

Conclusion

In this article, we saw how to build a TCP load balancer based on iptables and the Linux kernel. We use this method to create a TCP gateway that is currently used by production IoT applications. The same method is used to provide direct Internet access to databases.

In light of Cloudflare’s recent work on their Spectrum product we may incorporate some of their ideas in our own TCP load balancer.

Stay tuned, we'll announce official support of TCP apps in the following weeks!
