This document discusses how to work with external passthrough Network Load Balancers by using the User Datagram Protocol (UDP). The document is intended for app developers, app operators, and network administrators.
About UDP
UDP is commonly used in apps. The protocol, which is described in RFC 768, implements a stateless, unreliable datagram packet service. For example, Google's QUIC protocol improves the user experience by using UDP to speed up stream-based apps.
The stateless part of UDP means that the transport layer doesn't maintain state. Therefore, each packet in a UDP "connection" is independent. In fact, there is no real connection in UDP. Instead, its participants usually use a 2-tuple (ip:port) or a 4-tuple (src-ip:src-port, dest-ip:dest-port) to recognize each other.
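To make this concrete, the following minimal Python sketch shows that a UDP exchange involves no handshake; the peers are identified only by their address tuples. The server address and port here are placeholders, not values from this document's setup.

```
#!/usr/bin/python3
# Minimal sketch: a UDP exchange is identified only by address tuples.
# The address and port below are placeholders for illustration.
import socket

client = socket.socket(type=socket.SOCK_DGRAM)

# No handshake occurs; sendto() simply emits one datagram to an (ip, port) 2-tuple.
client.sendto(b"hello", ("192.0.2.10", 60002))

# The kernel matches the reply only by the local and remote (ip, port) pairs
# (the 4-tuple), not by any connection state.
data, addr = client.recvfrom(1500)
print(data, addr)
```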
Like TCP-based apps, UDP-based apps can also benefit from a load balancer, which is why external passthrough Network Load Balancers are used in UDP scenarios.
External passthrough Network Load Balancer
External passthrough Network Load Balancers are passthrough load balancers; they process incoming packets and deliver them to backend servers with the packets intact. The backend servers then send the returning packets directly to the clients. This technique is called Direct Server Return (DSR). On each Linux virtual machine (VM) running on Compute Engine that is a backend of a Google Cloud external passthrough Network Load Balancer, an entry in the local routing table routes traffic that's destined for the load balancer's IP address to the network interface controller (NIC). The following example demonstrates this technique:
```
root@backend-server:~# ip ro ls table local
local 10.128.0.2 dev eth0 proto kernel scope host src 10.128.0.2
broadcast 10.128.0.2 dev eth0 proto kernel scope link src 10.128.0.2
local 198.51.100.2 dev eth0 proto 66 scope host
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
```
In the preceding example, 198.51.100.2
is the load balancer's IP address. The google-network-daemon.service
agent is responsible for adding this entry.
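For reference, a local route like the one shown for the load balancer's IP address can be created manually with a command similar to the following. This is only a sketch; the google-network-daemon.service agent normally manages this entry for you, and the proto 66 value simply matches the preceding output:

```
ip route add local 198.51.100.2/32 dev eth0 proto 66 table local
```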
However, as the following example shows, the VM does not actually have an
interface that owns the load balancer's IP address:
```
root@backend-server:~# ip ad ls
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:80:00:02 brd ff:ff:ff:ff:ff:ff
    inet 10.128.0.2/32 brd 10.128.0.2 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::4001:aff:fe80:2/64 scope link
       valid_lft forever preferred_lft forever
```
The external passthrough Network Load Balancer transmits the incoming packets, with the destination address untouched, to the backend server. The local routing table entry routes the packet to the correct app process, and the response packets from the app are sent directly to the client.
The following diagram shows how external passthrough Network Load Balancers work. The incoming packets are processed by a load balancer called Maglev, which distributes the packets to the backend servers. Outgoing packets are then sent directly to the clients through DSR.
An issue with UDP return packets
When you work with DSR, there is a slight difference between how the Linux kernel treats TCP and UDP connections. Because TCP is a stateful protocol, the kernel has all the information it needs about the TCP connection, including the client address, client port, server address, and server port. This information is recorded in the socket data structure that represents the connection. Thus, each returning packet of a TCP connection has the source address correctly set to the server address. For a load balancer, that address is the load balancer's IP address.
However, recall that UDP is stateless, so the socket objects that are created in the app process for UDP connections don't have the connection information. The kernel doesn't have information about the source address of an outgoing packet, and it doesn't know how that packet relates to a previously received one. For the packet's source address, the kernel can only fill in the address of the interface that the returning UDP packet goes out through. Or, if the app previously bound the socket to a certain address, the kernel uses that address as the source address.
The following code shows a simple echo program:
```
#!/usr/bin/python3
import socket, struct


def loop_on_socket(s):
    while True:
        d, addr = s.recvfrom(1500)
        print(d, addr)
        s.sendto("ECHO: ".encode('utf8') + d, addr)


if __name__ == "__main__":
    HOST, PORT = "0.0.0.0", 60002
    sock = socket.socket(type=socket.SocketKind.SOCK_DGRAM)
    sock.bind((HOST, PORT))
    loop_on_socket(sock)
```
Following is the tcpdump
output during a UDP conversation:
```
14:50:04.758029 IP 203.0.113.2.40695 > 198.51.100.2.60002: UDP, length 3
14:50:04.758396 IP 10.128.0.2.60002 > 203.0.113.2.40695: UDP, length 2
```
198.51.100.2
is the load balancer's IP address, and 203.0.113.2
is the
client IP address.
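If you want to reproduce such a capture on the backend VM, a command along the following lines works; the interface name eth0 is taken from the earlier examples and might differ in your environment:

```
tcpdump -n -i eth0 udp port 60002
```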
After the packets leave the VM, another NAT device in the Google Cloud network, a Compute Engine gateway, translates the source address to an external address. The gateway doesn't know which external address should be used, so it can only use the VM's external address (not the load balancer's).
From the client side, if you check the output from tcpdump, the packets from the server look like the following:
```
23:05:37.072787 IP 203.0.113.2.40695 > 198.51.100.2.60002: UDP, length 5
23:05:37.344148 IP 198.51.100.3.60002 > 203.0.113.2.40695: UDP, length 4
```
198.51.100.3
is the VM's external IP address.
From the client's point of view, the UDP packets are not coming from the address that the client sent them to. This causes problems: the kernel drops these packets, and if the client is behind a NAT device, so does the NAT device. As a result, the client app gets no response from the server. The following diagram shows this process, in which the client rejects the returning packets because of the address mismatch.
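The symptom on the client side is simply a timeout. The following minimal client sketch, which reuses the load balancer address and port from the earlier examples in this document, illustrates what an affected client observes:

```
#!/usr/bin/python3
# Minimal client sketch to observe the no-response symptom.
# 198.51.100.2 and 60002 are the load balancer address and port used in the
# earlier examples.
import socket

c = socket.socket(type=socket.SOCK_DGRAM)
c.settimeout(5)  # give up after 5 seconds
c.sendto(b"hi", ("198.51.100.2", 60002))
try:
    data, addr = c.recvfrom(1500)
    print("reply from", addr, ":", data)
except socket.timeout:
    # The server did send a reply, but the client kernel (or an intermediate
    # NAT device) dropped it because the source address didn't match
    # 198.51.100.2, the address the request was sent to.
    print("no reply: the response was dropped because of the address mismatch")
```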
Solving the UDP problem
To solve the no-response problem, you must rewrite the source address of
outgoing packets to the load balancer's IP address at the server that's hosting
the app. Following are several options that you can use to accomplish this
header rewrite. The first solution uses a Linux-based approach with iptables;
the other solutions take app-based approaches.
The following diagram shows the core idea of these options: rewrite the source IP address of the returning packets in order to match the load balancer's IP address.
Use NAT policy in the backend server
The NAT policy solution is to use the Linux iptables
command to rewrite the
destination address from the load balancer's IP address to the VM's IP address.
In the following example, you add an iptables
DNAT rule to change the
destination address of the incoming packets:
```
iptables -t nat -A POSTROUTING -j RETURN -d 10.128.0.2 -p udp --dport 60002
iptables -t nat -A PREROUTING -j DNAT --to-destination 10.128.0.2 -d 198.51.100.2 -p udp --dport 60002
```
These commands add two rules to the NAT table of the iptables
system. The
first rule bypasses all incoming packets that target the local eth0
address.
As a result, traffic that doesn't come from the load balancer isn't affected.
The second rule changes the destination IP address of incoming packets to the
VM's internal IP address. The DNAT rules are stateful, which means that the
kernel tracks the connections and rewrites the returning packets' source address
automatically.
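To confirm that the rules are in place, you can list the NAT table. This is an optional check; the packet and byte counters in the output vary:

```
iptables -t nat -L -n -v
```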
Pros | Cons |
---|---|
The kernel translates the address, with no change required to apps. | Extra CPU is used to do the NAT. And because DNAT is stateful, memory consumption might also be high. |
Supports multiple load balancers. |
Use nftables to statelessly mangle the IP header fields
In the nftables
solution, you use the nftables
command to mangle the source
address in the IP header of outgoing packets. This mangling is stateless, so it
consumes fewer resources than using DNAT. To use nftables
, you need a Linux
kernel version greater than 4.10.
You use the following commands:
```
nft add table raw
nft add chain raw postrouting { type filter hook postrouting priority 300 \; }
nft add rule raw postrouting ip saddr 10.128.0.2 udp sport 60002 ip saddr set 198.51.100.2
```
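You can verify the result by printing the active ruleset:

```
nft list ruleset
```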
Pros | Cons |
---|---|
The kernel translates the address, with no change required to apps. | Does not support multiple load balancers. |
The address translation process is stateless, so resource consumption is much lower. | Extra CPU is used to do the NAT. |
| nftables is available only in newer Linux kernel versions. Some distros, like CentOS 7.x, cannot use nftables. |
Let the app explicitly bind to the load balancer's IP address
In the binding solution, you modify your app so that it binds explicitly to the
load balancer's IP address. For a UDP socket, the bind
operation lets the
kernel know which address to use as the source address when sending UDP packets
that use that socket.
The following example shows how to bind to a specific address in Python:
```
#!/usr/bin/python3
import socket


def loop_on_socket(s):
    while True:
        d, addr = s.recvfrom(1500)
        print(d, addr)
        s.sendto("ECHO: ".encode('utf8') + d, addr)


if __name__ == "__main__":
    # Instead of setting HOST to "0.0.0.0", we set HOST to the load
    # balancer's IP address. 198.51.100.2 is the load balancer's IP address;
    # you can also use the DNS name of the load balancer.
    HOST, PORT = "198.51.100.2", 60002
    sock = socket.socket(type=socket.SocketKind.SOCK_DGRAM)
    sock.bind((HOST, PORT))
    loop_on_socket(sock)
```
The preceding code is a UDP server; it echoes back the bytes it receives, prefixed with "ECHO: ". Pay attention to the lines in the main block where HOST is set to 198.51.100.2, the load balancer's IP address, and the socket is then bound to that address.
Pros | Cons |
---|---|
Can be achieved with a simple code change to the app. | Does not support multiple load balancers. |
Use recvmsg/sendmsg instead of recvfrom/sendto to specify the address
In this solution, you use recvmsg
/ sendmsg
calls instead of recvfrom
/ sendto
calls. In comparison to recvfrom
/ sendto
calls, the recvmsg
/ sendmsg
calls can handle ancillary control messages along with the
payload data. These ancillary control messages include the source or destination
address of the packets. This solution lets you fetch destination addresses from
incoming packets, and because those addresses are real load balancer
addresses, you can use them as source addresses when sending replies.
The following example program demonstrates this solution:
```
#!/usr/bin/python3
import socket, struct


def loop_on_socket(s):
    while True:
        d, ctl, flg, addr = s.recvmsg(1500, 1024)
        # ctl contains the destination address information
        s.sendmsg(["ECHO: ".encode("utf8"), d], ctl, 0, addr)


if __name__ == "__main__":
    HOST, PORT = "0.0.0.0", 60002
    s = socket.socket(type=socket.SocketKind.SOCK_DGRAM)
    s.setsockopt(0,  # level is 0 (IPPROTO_IP)
                 8,  # optname is 8 (IP_PKTINFO)
                 1)
    s.bind((HOST, PORT))
    loop_on_socket(s)
```
This program demonstrates how to use recvmsg
/ sendmsg
calls. In order to
fetch address information from packets, you must use the setsockopt
call to
set the IP_PKTINFO
option.
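The numeric arguments 0 and 8 are the Linux values of IPPROTO_IP and IP_PKTINFO. As a minimal sketch, you can use named constants where your Python build exposes them and fall back to the raw values otherwise:

```
import socket

# IPPROTO_IP is always available; IP_PKTINFO might not be exposed by every
# Python build, so fall back to its Linux value (8) if it's missing.
IPPROTO_IP = socket.IPPROTO_IP
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)

s = socket.socket(type=socket.SOCK_DGRAM)
s.setsockopt(IPPROTO_IP, IP_PKTINFO, 1)
```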
Pros | Cons |
---|---|
Works even if there are multiple load balancers, for example, when both internal and external load balancers are configured for the same backend. | Requires you to make complex changes to the app. In some cases, this might not be possible. |
What's next
- Learn how to configure an external passthrough Network Load Balancer and distribute traffic in Set up an external passthrough Network Load Balancer.
- Read more about external passthrough Network Load Balancers.
- Read more about the Maglev technique behind external passthrough Network Load Balancers.