How to improve the performance of unicast-to-broadcast UDP packet transformation using eBPF/XDP

Background

We have some devices sending unicast UDP packets that we want to receive on multiple computers by transforming those packets to broadcast. We can't make any changes to the devices sending the packets or to the computers receiving the data. All of the UDP packets have the same headers and the same size (less than 1500 bytes). The transmitting devices are on the 192.168.1.0/24 subnet and the receiving computers are on the 192.168.40.0/24 subnet.

Current Setup

We need a standalone device that can forward the unicast packets to the 192.168.40.0/24 subnet as broadcast packets. My current approach is to reuse an RB760iGS RouterBoard, which has one SFP port and five gigabit Ethernet ports. The incoming packets arrive at the SFP port and the receiving computers are connected to the Ethernet ports. The receiving computers will not send any data back.

The bandwidth is relatively high at around 250 Mbps, so I wrote an eBPF XDP program to transform the packets:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>

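#define IP_HLEN  sizeof(struct iphdr)   // IP header length (assuming no options)
#define UDP_HLEN sizeof(struct udphdr)  // UDP header length
#define TOTAL_HLEN (ETH_HLEN + IP_HLEN + UDP_HLEN + 24) // Headers plus 24 further bytes

#define TARGET_INTERFACE_INDEX 9  // interface index for br-lan

// These constants are stored as written, so they are only in network byte
// order on a little-endian CPU.
#define NEW_IP_DST      0xFF28A8C0 // 192.168.40.255 (network byte order)
#define NEW_UDP_DST_PORT 0xB315    // 5555 = 0x15B3 (network byte order)
#define CHECKSUM_DIFF 0xFFFFD828 // pcn_csum_add(0xC0A80128, ~0xC0A828FF & 0xFFFFFFFF);

// One's-complement add of two 32-bit values, folded down to 16 bits
static inline __u16 pcn_csum_add(__u32 csum, __u32 addend) {
    __u32 result;
    __u8 carry = __builtin_uadd_overflow(csum, addend, &result);
    result += carry;  // Add carry if overflow occurred
    // Fold the upper 16 bits into the lower 16 instead of truncating them
    result = (result & 0xFFFF) + (result >> 16);
    result = (result & 0xFFFF) + (result >> 16);
    return (__u16)result;
}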

SEC("xdp_redirect")
int xdp_prog(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    if (data + TOTAL_HLEN > data_end)
        return XDP_PASS;

    struct ethhdr *eth = data;
    struct iphdr *ip = data + ETH_HLEN;
    struct udphdr *udp = data + ETH_HLEN + IP_HLEN;

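    // Every frame that is long enough is rewritten; EtherType and IP protocol are not checked.
    ip->daddr = NEW_IP_DST;
    udp->dest = NEW_UDP_DST_PORT;
    // udp->check is left as is, which is only valid if the senders use a zero (disabled) UDP checksum.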

    *((__u32*)&eth->h_dest[0]) = 0xffffffff;  // First four bytes
    *((__u16*)&eth->h_dest[4]) = 0xffff;      // Last two bytes

    // The CHECKSUM_DIFF is a precomputed value based on the change from the original destination IP
    ip->check = bpf_htons(pcn_csum_add(bpf_ntohs(ip->check), CHECKSUM_DIFF));

    // Redirect packet to the specified interface index
    return bpf_redirect(TARGET_INTERFACE_INDEX, 0);
}

char _license[] SEC("license") = "GPL";

This works when I only have one device connected to the br-lan interface, but when I connect more devices, a lot of packets are dropped. The router has an 880 MHz dual-core CPU and 256 MB of RAM.
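For reference, the CHECKSUM_DIFF trick above is a form of the incremental checksum update from RFC 1624. The following is only a user-space sketch for checking the arithmetic, not part of the XDP program. The helper names (csum_fold, csum_full, csum_update32) are made up for this illustration, and all values are 16-bit words in host order. The old destination 192.168.1.40 is taken from the 0xC0A80128 in the CHECKSUM_DIFF comment.

```c
#include <stdint.h>
#include <stddef.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t csum_fold(uint32_t sum) {
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over n 16-bit words (checksum field set to 0). */
static uint16_t csum_full(const uint16_t *words, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~csum_fold(sum);
}

/* RFC 1624, eqn. 3: HC' = ~(~HC + ~m + m') for one changed 32-bit field. */
static uint16_t csum_update32(uint16_t old_check, uint32_t old_val, uint32_t new_val) {
    uint32_t sum = (uint16_t)~old_check;
    sum += (uint16_t)~(old_val >> 16);     /* ~m, upper word */
    sum += (uint16_t)~(old_val & 0xFFFF);  /* ~m, lower word */
    sum += new_val >> 16;                  /* m', upper word */
    sum += new_val & 0xFFFF;               /* m', lower word */
    return (uint16_t)~csum_fold(sum);
}
```

Calling csum_update32(old_check, 0xC0A80128, 0xC0A828FF) should give the same value as recomputing the header checksum from scratch after the destination is changed to 192.168.40.255. Unlike a precomputed constant, it does not depend on all senders using the same original destination address.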

Questions

  • Are there additional optimizations that could be made to the XDP program?
  • Are there any settings that could be tuned for high bandwidth (like using a more efficient qdisc)?
  • Is there a router that supports OpenWrt with better performance than the RB760iGS in a similar size?

Thank you for your advice!

(Update: I was able to handle up to 250 Mbps without drops, with multiple receivers, when I put a gigabit switch between the router and the receivers. The drawback of this approach is that I need two devices for the task instead of one, and it introduces additional latency. I am still looking for optimizations that would let me use a single device.)


You are describing a reinvention of multicast.

Well, multicast would be the way to go if I could modify the devices, but as I said, I need to solve this by transforming the packets, which have a hardcoded destination IP.

I think the easiest solution is to find faster hardware. This one looks promising: https://www.pcengines.ch/apu6b4.htm. It has an SFP port and 3x RJ45, and it has 4 cores instead of the 2 in the RB760iGS.

I would first try broadcast, just in case the receivers can handle it. Otherwise you would need to take a multicast multipoint distributor and patch it to produce unicast streams.
You can also duplicate packets in nftables and so on. I'm not sure, but probably in tc too.

I can confirm broadcast works: the receivers are able to pick up the packets when listening on port 5555, and that's what I am doing with the XDP program. Duplicating packets with nftables, iptables, or tc is what I did initially, but it is too slow. Since the XDP program intercepts the packets before they reach the rest of the network stack, I believe it is likely the fastest approach.

nftables offload follows right after that... there is no hardware offload towards br-lan, though.

I mean you get 2-3x better throughput with software offload from the same CPU resources. Hardware offload would run with no significant CPU usage and would be faster than normal nftables; how much faster is something you have to measure on the hardware.

You can see with conntrack -E / -L whether the connection is ASSURED or OFFLOAD.

Hardware offloading would be interesting to try. Does nftables support modifying the destination MAC? I know it can be done using tc with the pedit action, but that would be slower than my current approach.

Offload takes an ingress packet and adjusts all addresses according to what the NAT conntrack state dictates.

Does conntrack allow for broadcast addresses? I would need all incoming UDP packets on port 5555 and port 5556 to be adjusted to 192.168.40.255:5555, with the destination MAC changed to ff:ff:ff:ff:ff:ff.
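To make that requirement concrete, here is a minimal sketch of the rewrite in plain user-space C, operating on a raw Ethernet frame in a byte buffer. It is an illustration of the rule, not conntrack or XDP code. The function names (rewrite_to_broadcast, ip_checksum, rd16, wr16) are hypothetical. It assumes IPv4 without options and unfragmented packets, and it clears the UDP checksum, which IPv4 permits (a zero value means "no checksum").

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

enum { ETH_LEN = 14, IP_LEN = 20, UDP_LEN = 8 };

/* Read/write a 16-bit big-endian (network order) value. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }
static void wr16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v; }

/* IPv4 header checksum over a 20-byte header whose checksum field is zero. */
static uint16_t ip_checksum(const uint8_t *ip) {
    uint32_t sum = 0;
    for (int i = 0; i < IP_LEN; i += 2)
        sum += rd16(ip + i);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 1 if the frame was rewritten, 0 if it was left untouched. */
static int rewrite_to_broadcast(uint8_t *pkt, size_t len) {
    static const uint8_t bcast_ip[4] = {192, 168, 40, 255};

    if (len < ETH_LEN + IP_LEN + UDP_LEN) return 0;
    uint8_t *ip = pkt + ETH_LEN;
    uint8_t *udp = ip + IP_LEN;

    if (rd16(pkt + 12) != 0x0800) return 0;   /* EtherType must be IPv4 */
    if (ip[0] != 0x45) return 0;              /* IPv4 with no options */
    if (rd16(ip + 6) & 0x3FFF) return 0;      /* skip fragments */
    if (ip[9] != 17) return 0;                /* protocol must be UDP */
    uint16_t dport = rd16(udp + 2);
    if (dport != 5555 && dport != 5556) return 0;

    memset(pkt, 0xFF, 6);                     /* dest MAC ff:ff:ff:ff:ff:ff */
    memcpy(ip + 16, bcast_ip, 4);             /* dest IP 192.168.40.255 */
    wr16(udp + 2, 5555);                      /* dest port 5555 */
    wr16(udp + 6, 0);                         /* UDP checksum 0 = not used */
    wr16(ip + 10, 0);                         /* recompute IP header checksum */
    wr16(ip + 10, ip_checksum(ip));
    return 1;
}
```

Frames that fail any of the checks are left unmodified, so other traffic arriving on the same port is not turned into broadcast by accident.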

The APU2+ hardware uses a rather dated and slow AMD Jaguar SoC. It just barely manages routing at 1 GBit/s wirespeed, but it's way too slow for anything beyond that (and this already applies to VPN or SQM at 1 GBit/s).

Are there any alternatives? I found this thread, but 10 Gbit seems a bit overkill for my use case (250 Mbps with up to 3 receivers): Fanless x86 PC with SFP+ slot?
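As a rough sizing aid for that use case, the arithmetic can be sketched as below. The assumptions are loud ones: full 1500-byte frames (the thread only says "less than 1500 bytes"), no allowance for Ethernet preamble or inter-frame gap, and each receiver on its own port so that every frame is sent out once per receiver. The function names are hypothetical.

```c
#include <stdint.h>

/* Frames per second for a given bit rate and frame size in bytes. */
static uint64_t frames_per_second(uint64_t bits_per_second, uint32_t frame_bytes) {
    return bits_per_second / ((uint64_t)frame_bytes * 8);
}

/* Aggregate egress bit rate when every frame is copied to each receiver port. */
static uint64_t replicated_bps(uint64_t ingress_bps, uint32_t receivers) {
    return ingress_bps * receivers;
}
```

Under these assumptions, frames_per_second(250000000, 1500) comes to 20833 frames per second on ingress, and replicated_bps(250000000, 3) comes to 750 Mbps of total egress. Smaller frames raise the frame rate proportionally.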

The SoC switch hardware does not handle that broadcast rate. You probably need some swconfig/ip-bridge tricks to permit high-rate broadcast before writing it off.

Apart from being EOLed, this device is really not something I'd buy in 2024 - for anything. You can get better, for less. The hardware is solid, but very dated.

I'll look into swconfig/ip-bridge; it's not something I have worked with before. I'm still open to purchasing faster hardware. Maybe the Raspberry Pi 5 could work for this if there is a HAT with SFP and 3x RJ45 ports.

The following thread suggests the Banana Pi BPi R4 for speeds up to 10 Gbit: A new dual 10G router based on Filogic 880 (Banana Pi BPi-R4)

It seems like a good alternative to the APU2+.