This document discusses how to work with external passthrough Network Load Balancers by using the User Datagram Protocol (UDP). The document is intended for app developers, app operators, and network administrators.
About UDP
UDP is commonly used in apps. The protocol, which is described in RFC 768, implements a stateless, unreliable datagram packet service. For example, Google's QUIC protocol improves the user experience by using UDP to speed up stream-based apps.
The stateless part of UDP means that the transport layer doesn't maintain state. Therefore, each packet in a UDP "connection" is independent. In fact, there is no real connection in UDP. Instead, its participants usually use a 2-tuple (ip:port) or a 4-tuple (src-ip:src-port, dest-ip:dest-port) to recognize each other.
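To make this concrete, the following minimal Python sketch shows that a UDP exchange involves no handshake; the peers are identified only by their address tuples. The server address and port here are placeholders, not values from this document's setup.

```
#!/usr/bin/python3
# Minimal sketch: a UDP exchange is identified only by address tuples.
# The address and port below are placeholders for illustration.
import socket

client = socket.socket(type=socket.SOCK_DGRAM)

# No handshake occurs; sendto() simply emits one datagram to an (ip, port) 2-tuple.
client.sendto(b"hello", ("192.0.2.10", 60002))

# The kernel matches the reply only by the local and remote (ip, port) pairs
# (the 4-tuple), not by any connection state.
data, addr = client.recvfrom(1500)
print(data, addr)
```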
Like TCP-based apps, UDP-based apps can also benefit from a load balancer, which is why external passthrough Network Load Balancers are used in UDP scenarios.
External passthrough Network Load Balancer
External passthrough Network Load Balancers are passthrough load balancers; they process incoming packets and deliver them to backend servers with the packets intact. The backend servers then send the returning packets directly to the clients. This technique is called Direct Server Return (DSR). On each Linux virtual machine (VM) running on Compute Engine that is a backend of a Google Cloud external passthrough Network Load Balancer, an entry in the local routing table routes traffic that's destined for the load balancer's IP address to the network interface controller (NIC). The following example demonstrates this technique:
```
root@backend-server:~# ip ro ls table local
local 10.128.0.2 dev eth0 proto kernel scope host src 10.128.0.2
broadcast 10.128.0.2 dev eth0 proto kernel scope link src 10.128.0.2
local 198.51.100.2 dev eth0 proto 66 scope host
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
```
In the preceding example, 198.51.100.2
is the load balancer's IP address. The google-network-daemon.service
agent is responsible for adding this entry.
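For reference, a local route like the one shown for the load balancer's IP address can be created manually with a command similar to the following. This is only a sketch; the google-network-daemon.service agent normally manages this entry for you, and the proto 66 value simply matches the preceding output:

```
ip route add local 198.51.100.2/32 dev eth0 proto 66 table local
```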
However, as the following example shows, the VM does not actually have an
interface that owns the load balancer's IP address:
```
root@backend-server:~# ip ad ls
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:80:00:02 brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.2/32 brd 10.128.0.2 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::4001:aff:fe80:2/64 scope link
       valid_lft forever preferred_lft forever
```
The external passthrough Network Load Balancer transmits the incoming packets, with the destination address untouched, to the backend server. The local routing table entry routes the packet to the correct app process, and the response packets from the app are sent directly to the client.
The following diagram shows how external passthrough Network Load Balancers work. The incoming packets are processed by a load balancer called Maglev, which distributes the packets to the backend servers. Outgoing packets are then sent directly to the clients through DSR.
An issue with UDP return packets
When you work with DSR, there is a slight difference between how the Linux kernel treats TCP and UDP connections. Because TCP is a stateful protocol, the kernel has all the information it needs about the TCP connection, including the client address, client port, server address, and server port. This information is recorded in the socket data structure that represents the connection. Thus, each returning packet of a TCP connection has the source address correctly set to the server address. For a load balancer, that address is the load balancer's IP address.
However, recall that UDP is stateless, so the socket objects that are created in the app process for UDP connections don't have the connection information. The kernel doesn't have information about the source address of an outgoing packet, and it doesn't know how that packet relates to a previously received one. For the packet's source address, the kernel can only fill in the address of the interface that the returning UDP packet goes out through. Or, if the app previously bound the socket to a certain address, the kernel uses that address as the source address.
The following code shows a simple echo program:
```
#!/usr/bin/python3
import socket, struct


def loop_on_socket(s):
    while True:
        d, addr = s.recvfrom(1500)
        print(d, addr)
        s.sendto("ECHO: ".encode('utf8') + d, addr)


if __name__ == "__main__":
    HOST, PORT = "0.0.0.0", 60002
    sock = socket.socket(type=socket.SocketKind.SOCK_DGRAM)
    sock.bind((HOST, PORT))
    loop_on_socket(sock)
```
Following is the tcpdump
output during a UDP conversation:
```
14:50:04.758029 IP 203.0.113.2.40695 > 198.51.100.2.60002: UDP, length 3
14:50:04.758396 IP 10.128.0.2.60002 > 203.0.113.2.40695: UDP, length 2
```
198.51.100.2
is the load balancer's IP address, and 203.0.113.2
is the
client IP address.
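If you want to reproduce such a capture on the backend VM, a command along the following lines works; the interface name eth0 is taken from the earlier examples and might differ in your environment:

```
tcpdump -n -i eth0 udp port 60002
```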
After the packets leave the VM, another NAT device in the Google Cloud network, a Compute Engine gateway, translates the source address to an external address. The gateway doesn't know which external address should be used, so it can only use the VM's external address (not the load balancer's).
From the client side, if you check the output from tcpdump, the packets from the server look like the following:
```
23:05:37.072787 IP 203.0.113.2.40695 > 198.51.100.2.60002: UDP, length 5
23:05:37.344148 IP 198.51.100.3.60002 > 203.0.113.2.40695: UDP, length 4
```
198.51.100.3
is the VM's external IP address.
From the client's point of view, the UDP packets are not coming from the address that the client sent them to. This causes problems: the kernel drops these packets, and if the client is behind a NAT device, so does the NAT device. As a result, the client app gets no response from the server. The following diagram shows this process, in which the client rejects the returning packets because of the address mismatch.
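The symptom on the client side is simply a timeout. The following minimal client sketch, which reuses the load balancer address and port from the earlier examples in this document, illustrates what an affected client observes:

```
#!/usr/bin/python3
# Minimal client sketch to observe the no-response symptom.
# 198.51.100.2 and 60002 are the load balancer address and port used in the
# earlier examples.
import socket

c = socket.socket(type=socket.SOCK_DGRAM)
c.settimeout(5)  # give up after 5 seconds
c.sendto(b"hi", ("198.51.100.2", 60002))
try:
    data, addr = c.recvfrom(1500)
    print("reply from", addr, ":", data)
except socket.timeout:
    # The server did send a reply, but the client kernel (or an intermediate
    # NAT device) dropped it because the source address didn't match
    # 198.51.100.2, the address the request was sent to.
    print("no reply: the response was dropped because of the address mismatch")
```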
Solving the UDP problem
To solve the no-response problem, you must rewrite the source address of
outgoing packets to the load balancer's IP address at the server that's hosting
the app. Following are several options that you can use to accomplish this
header rewrite. The first solution uses a Linux-based approach with iptables;
the other solutions take app-based approaches.
The following diagram shows the core idea of these options: rewrite the source IP address of the returning packets in order to match the load balancer's IP address.
Use NAT policy in the backend server
The NAT policy solution is to use the Linux iptables
command to rewrite the
destination address from the load balancer's IP address to the VM's IP address.
In the following example, you add an iptables
DNAT rule to change the
destination address of the incoming packets:
```
iptables -t nat -A POSTROUTING -j RETURN -d 10.128.0.2 -p udp --dport 60002
iptables -t nat -A PREROUTING -j DNAT --to-destination 10.128.0.2 -d 198.51.100.2 -p udp --dport 60002
```
These commands add two rules to the NAT table of the iptables
system. The
first rule bypasses all incoming packets that target the local eth0
address.
As a result, traffic that doesn't come from the load balancer isn't affected.
The second rule changes the destination IP address of incoming packets to the
VM's internal IP address. The DNAT rules are stateful, which means that the
kernel tracks the connections and rewrites the returning packets' source address
automatically.
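To confirm that the rules are in place, you can list the NAT table. This is an optional check; the packet and byte counters in the output vary:

```
iptables -t nat -L -n -v
```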
Pros | Cons |
---|---|
The kernel translates the address, with no change required to apps. | Extra CPU is used to do the NAT. And because DNAT is stateful, memory consumption might also be high. |
Supports multiple load balancers. |
Use nftables to statelessly mangle the IP header fields
In the nftables
solution, you use the nftables
command to mangle the source
address in the IP header of outgoing packets. This mangling is stateless, so it
consumes fewer resources than using DNAT. To use nftables
, you need a Linux
kernel version greater than 4.10.
You use the following commands:
```
nft add table raw
nft add chain raw postrouting { type filter hook postrouting priority 300 \; }
nft add rule raw postrouting ip saddr 10.128.0.2 udp sport 60002 ip saddr set 198.51.100.2
```
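You can verify the result by printing the active ruleset:

```
nft list ruleset
```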
Pros | Cons |
---|---|
The kernel translates the address, with no change required to apps. | Does not support multiple load balancers. |
The address translation process is stateless, so resource consumption is much lower. | Extra CPU is used to do the NAT. |
| nftables is available only in newer Linux kernel versions. Some distros, like CentOS 7.x, cannot use nftables. |
Let the app explicitly bind to the load balancer's IP address
In the binding solution, you modify your app so that it binds explicitly to the
load balancer's IP address. For a UDP socket, the bind
operation lets the
kernel know which address to use as the source address when sending UDP packets
that use that socket.
The following example shows how to bind to a specific address in Python:
```
#!/usr/bin/python3
import socket


def loop_on_socket(s):
    while True:
        d, addr = s.recvfrom(1500)
        print(d, addr)
        s.sendto("ECHO: ".encode('utf8') + d, addr)


if __name__ == "__main__":
    # Instead of setting HOST to "0.0.0.0", we set HOST to the load
    # balancer's IP address. 198.51.100.2 is the load balancer's IP address;
    # you can also use the DNS name of the load balancer.
    HOST, PORT = "198.51.100.2", 60002
    sock = socket.socket(type=socket.SocketKind.SOCK_DGRAM)
    sock.bind((HOST, PORT))
    loop_on_socket(sock)
```
The preceding code is a UDP server; it echoes back the bytes it receives, prefixed with "ECHO: ". Pay attention to the lines in the main block where HOST is set to 198.51.100.2, the load balancer's IP address, and the socket is then bound to that address.
Pros | Cons |
---|---|
Can be achieved with a simple code change to the app. | Does not support multiple load balancers. |
Use recvmsg/sendmsg instead of recvfrom/sendto to specify the address
In this solution, you use recvmsg
/ sendmsg
calls instead of recvfrom
/ sendto
calls. In comparison to recvfrom
/ sendto
calls, the recvmsg
/ sendmsg
calls can handle ancillary control messages along with the
payload data. These ancillary control messages include the source or destination
address of the packets. This solution lets you fetch destination addresses from
incoming packets, and because those addresses are real load balancer
addresses, you can use them as source addresses when sending replies.
The following example program demonstrates this solution:
```
#!/usr/bin/python3
import socket, struct


def loop_on_socket(s):
    while True:
        d, ctl, flg, addr = s.recvmsg(1500, 1024)
        # ctl contains the destination address information
        s.sendmsg(["ECHO: ".encode("utf8"), d], ctl, 0, addr)


if __name__ == "__main__":
    HOST, PORT = "0.0.0.0", 60002
    s = socket.socket(type=socket.SocketKind.SOCK_DGRAM)
    s.setsockopt(0,  # level is 0 (IPPROTO_IP)
                 8,  # optname is 8 (IP_PKTINFO)
                 1)
    s.bind((HOST, PORT))
    loop_on_socket(s)
```
This program demonstrates how to use recvmsg
/ sendmsg
calls. In order to
fetch address information from packets, you must use the setsockopt
call to
set the IP_PKTINFO
option.
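The numeric arguments 0 and 8 are the Linux values of IPPROTO_IP and IP_PKTINFO. As a minimal sketch, you can use named constants where your Python build exposes them and fall back to the raw values otherwise:

```
import socket

# IPPROTO_IP is always available; IP_PKTINFO might not be exposed by every
# Python build, so fall back to its Linux value (8) if it's missing.
IPPROTO_IP = socket.IPPROTO_IP
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)

s = socket.socket(type=socket.SOCK_DGRAM)
s.setsockopt(IPPROTO_IP, IP_PKTINFO, 1)
```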
Pros | Cons |
---|---|
Works even if there are multiple load balancers, for example, when both internal and external load balancers are configured for the same backend. | Requires you to make complex changes to the app. In some cases, this might not be possible. |
What's next
- Learn how to configure an external passthrough Network Load Balancer and distribute traffic in Set up an external passthrough Network Load Balancer.
- Read more about external passthrough Network Load Balancers.
- Read more about the Maglev technique behind external passthrough Network Load Balancers.