Linux Routing: Configure Internet Link Failover for High Availability

When the second internet link stops being “extra”

In enterprise networks, the first internet link is rarely the problem. It is the second one—the “backup” that nobody touches for months—that quietly becomes the risk. At first, it is comforting: a second circuit, a second ISP, a second path out. Then the business grows. More SaaS. More VPNs. More remote users. More uptime expectations. And one day, a fiber cut or upstream routing incident hits, and we discover the uncomfortable truth: having two links is not the same as having failover.

Failover is not a checkbox. It is a controlled behavior: traffic exits the right interface, NAT follows the active path, return traffic stays symmetric enough to avoid breaking sessions, and the system recovers without a human logging in at 2 a.m. In this guide, we will build that control using software routing on Linux—production-grade, persistent across reboots, and verifiable at every step.

Architecture we are implementing

We will configure a Linux router with two upstream internet links:

WAN1: primary internet uplink
WAN2: secondary internet uplink
LAN: internal network interface

We will implement:

Policy-based routing so each WAN has its own routing table and default route.
Connection marking so return traffic follows the same WAN it left on (reduces asymmetric routing issues).
NAT per WAN so outbound traffic is translated correctly on the active link.
Health checking using keepalived with VRRP-style priority changes driven by real reachability tests.
Persistence across reboots using systemd and nftables.

This approach is software routing end-to-end. No reliance on hardware-only features, and no assumptions about proprietary appliances.

Prerequisites and assumptions

Before we touch commands, we need to be explicit about the environment. Failover routing is unforgiving when assumptions are vague.

Platform: Linux (enterprise-grade). The steps below are written for Debian 12 / Ubuntu 22.04+ style systems using systemd. The same concepts apply to RHEL-family systems, but package names and file locations may differ.
Role: This host is acting as an internet edge router for a LAN. It has at least three interfaces: WAN1, WAN2, LAN.
Addressing:
- WAN interfaces receive IP configuration via DHCP or static addressing from each ISP.
- LAN has a static RFC1918 subnet (for example, 10.10.0.0/24).
Privileges: We need root privileges. We will use sudo -i to avoid partial permission failures.
Change window: Applying routing and NAT changes can interrupt active sessions. We should schedule a maintenance window.
Security posture: This guide enables forwarding and NAT. We must also enforce a firewall policy to avoid turning the router into an exposed transit host.
Persistence: We will make changes persistent using sysctl, nftables, and systemd units.

Step 1: Confirm interfaces, addresses, and current routes

Before we change anything, we will inventory the current state. This prevents the most common enterprise outage pattern: “we assumed interface names.” We will capture interface names, IPs, and default routes as they exist right now.

sudo -i
ip -br link
ip -br addr
ip route show
ip rule show

We are now operating as root, and we have a concise view of interfaces and addressing. The route and rule output tells us whether policy routing is already in use. If we see existing custom rules or multiple default routes, we should pause and reconcile them before proceeding.
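
The “multiple default routes” check can be made mechanical instead of visual. The following is a minimal sketch, not part of the original procedure: the helper name count_default_routes is hypothetical, and it only counts lines beginning with "default" in ip route output supplied on standard input.

```shell
# Hypothetical helper: count default routes in `ip route show` output (stdin).
count_default_routes() {
  # "n + 0" forces a numeric 0 when no default route was seen.
  awk '$1 == "default" { n++ } END { print n + 0 }'
}

# Read-only usage on the router:
#   ip route show | count_default_routes
```

A result above 1 on a dual-WAN host is expected once both uplinks are configured; the point is to know the number before we start, not to discover it afterwards.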

Step 2: Define copy/paste-safe variables for WAN and LAN

We will set shell variables so the rest of the commands are consistent and less error-prone. Because interface names vary (for example, ens18, enp1s0, eth0), we will first print candidates and then set variables explicitly. This keeps commands copy/paste-safe without guessing.

ip -br link

From the output, we will identify the three interfaces. Now we will set variables. We will also print them back to confirm we did not mistype anything.

WAN1_IFACE="ens18"
WAN2_IFACE="ens19"
LAN_IFACE="ens20"

printf "WAN1=%s\nWAN2=%s\nLAN=%s\n" "$WAN1_IFACE" "$WAN2_IFACE" "$LAN_IFACE"

We have now pinned the interface names for the rest of the configuration. If these are wrong, everything that follows will be wrong in a very consistent way, so this is the moment to be strict.
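
Being strict here can also be scripted. The sketch below is an optional addition under stated assumptions: the helper names ifaces_distinct and iface_exists are hypothetical, and the existence test relies on the kernel exposing each interface under /sys/class/net.

```shell
# Hypothetical helpers for sanity-checking the three interface variables.

# Succeeds only if all three names are non-empty and pairwise different.
ifaces_distinct() {
  [ -n "$1" ] && [ -n "$2" ] && [ -n "$3" ] &&
  [ "$1" != "$2" ] && [ "$1" != "$3" ] && [ "$2" != "$3" ]
}

# Succeeds if the kernel currently exposes the interface in sysfs.
iface_exists() {
  [ -e "/sys/class/net/$1" ]
}

# Usage:
#   ifaces_distinct "$WAN1_IFACE" "$WAN2_IFACE" "$LAN_IFACE" \
#     || echo "duplicate or empty interface name" >&2
#   for i in "$WAN1_IFACE" "$WAN2_IFACE" "$LAN_IFACE"; do
#     iface_exists "$i" || echo "no such interface: $i" >&2
#   done
```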

Step 3: Enable IPv4 forwarding persistently

A Linux host does not route packets between interfaces unless forwarding is enabled. We will enable IPv4 forwarding immediately and persist it across reboots using sysctl. We will also apply a couple of sane kernel settings that reduce common routing edge cases in multi-WAN environments.

cat > /etc/sysctl.d/99-routing-failover.conf <<'EOF'
net.ipv4.ip_forward=1

# Reduce surprises with asymmetric routing and multi-homing
net.ipv4.conf.all.rp_filter=2
net.ipv4.conf.default.rp_filter=2

# Keep routing behavior consistent
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.default.send_redirects=0
EOF

sysctl --system

We have written a persistent sysctl file and applied it immediately. Forwarding is now enabled, reverse path filtering is set to “loose” mode (important when traffic can legitimately return via a different interface during failover), and ICMP redirects are disabled to avoid clients learning unstable paths.

We will verify the effective values.

sysctl net.ipv4.ip_forward
sysctl net.ipv4.conf.all.rp_filter
sysctl net.ipv4.conf.default.rp_filter

If ip_forward is not 1, routing will not work. If rp_filter is strict (1), we may see intermittent drops during failover or when ISPs behave unexpectedly.
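
For repeatable audits, the same verification can be expressed as a pass/fail check. This is a sketch only: check_sysctl is a hypothetical helper, and SYSCTL_ROOT is an assumption introduced so the function can be exercised against a scratch directory instead of /proc/sys. Interface names that contain dots are not handled.

```shell
# Hypothetical helper: compare one sysctl key against an expected value.
# SYSCTL_ROOT defaults to /proc/sys and is overridable for dry runs.
check_sysctl() {
  cs_path="${SYSCTL_ROOT:-/proc/sys}/$(printf '%s' "$1" | tr . /)"
  cs_actual=$(cat "$cs_path" 2>/dev/null) || { echo "MISSING $1"; return 1; }
  if [ "$cs_actual" = "$2" ]; then
    echo "OK $1=$cs_actual"
  else
    echo "MISMATCH $1=$cs_actual (expected $2)"
    return 1
  fi
}

# Usage:
#   check_sysctl net.ipv4.ip_forward 1
#   check_sysctl net.ipv4.conf.all.rp_filter 2
```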

Step 4: Install required packages

We will install the tooling needed for production-grade failover: nftables for firewall/NAT, keepalived for health-driven priority changes, and a few utilities for verification and health checks.

apt-get update
apt-get install -y nftables keepalived iproute2 iputils-ping curl

The system now has the firewall/NAT engine, the failover daemon, and basic network tools. Next, we will build the routing logic first, then enforce it with firewall/NAT, and finally automate failover decisions.

Step 5: Create dedicated routing tables for WAN1 and WAN2

In multi-WAN routing, a single main routing table becomes ambiguous. We will create two additional routing tables so each WAN has a clean, deterministic default route. Then we can steer traffic into the correct table using policy rules.

We will register table names in /etc/iproute2/rt_tables so the configuration is readable and maintainable.

grep -qE '^[[:space:]]*100[[:space:]]+wan1[[:space:]]*$' /etc/iproute2/rt_tables || echo '100 wan1' >> /etc/iproute2/rt_tables
grep -qE '^[[:space:]]*200[[:space:]]+wan2[[:space:]]*$' /etc/iproute2/rt_tables || echo '200 wan2' >> /etc/iproute2/rt_tables

tail -n 5 /etc/iproute2/rt_tables

We have now defined two routing tables: wan1 (ID 100) and wan2 (ID 200). This does not change traffic yet; it only prepares the system for deterministic routing rules.
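
If the registration needs to run from configuration management, the grep-or-append idiom can be wrapped in a function. A minimal sketch, assuming a hypothetical helper named ensure_rt_table; it compares whole fields with awk rather than a regular expression, and takes the file path as an optional third argument.

```shell
# Hypothetical helper: register a routing table name exactly once.
# Arguments: table ID, table name, rt_tables path (defaults to this guide's path).
ensure_rt_table() {
  rt_id=$1
  rt_name=$2
  rt_file=${3:-/etc/iproute2/rt_tables}
  if awk -v id="$rt_id" -v name="$rt_name" \
       '$1 == id && $2 == name { found = 1 } END { exit !found }' \
       "$rt_file" 2>/dev/null; then
    return 0   # already registered, nothing to do
  fi
  printf '%s %s\n' "$rt_id" "$rt_name" >> "$rt_file"
}

# Usage:
#   ensure_rt_table 100 wan1
#   ensure_rt_table 200 wan2
```

Running it twice leaves a single entry, which is the property we want from anything that may be re-applied during a change window.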

Step 6: Discover WAN gateways and source addresses safely

To build correct default routes, we need each WAN’s gateway and the WAN interface’s source IP. We will extract them from the current kernel state rather than guessing.

First, we will read the default route for each WAN interface if it exists.

ip route show default dev "$WAN1_IFACE" || true
ip route show default dev "$WAN2_IFACE" || true

If either command prints nothing, that WAN may not have a default route yet (common if it is down or not configured). In that case, we must fix link addressing before continuing.

Now we will set variables for gateways and source IPs using the routing and address information. These commands are designed to fail loudly if the required values are missing.

WAN1_GW=$(ip route show default dev "$WAN1_IFACE" | awk '/default/ {print $3; exit}')
WAN2_GW=$(ip route show default dev "$WAN2_IFACE" | awk '/default/ {print $3; exit}')

WAN1_SRC=$(ip -4 -o addr show dev "$WAN1_IFACE" | awk '{print $4}' | cut -d/ -f1 | head -n1)
WAN2_SRC=$(ip -4 -o addr show dev "$WAN2_IFACE" | awk '{print $4}' | cut -d/ -f1 | head -n1)

printf "WAN1_GW=%s WAN1_SRC=%s\n" "$WAN1_GW" "$WAN1_SRC"
printf "WAN2_GW=%s WAN2_SRC=%s\n" "$WAN2_GW" "$WAN2_SRC"

[ -n "$WAN1_GW" ] && [ -n "$WAN2_GW" ] && [ -n "$WAN1_SRC" ] && [ -n "$WAN2_SRC" ] || echo "ERROR: a WAN gateway or source address is empty" >&2

We have now captured the gateways and source IPs for both WANs. The final test ensures none of these are empty; if it fails, we should stop and correct WAN connectivity before proceeding.
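
One caveat about the extraction above: printing the third field assumes the route has the form "default via GATEWAY ...". On point-to-point uplinks the default route can be device-only (for example "default dev ppp0 scope link"), and the third field is then the interface name rather than a gateway. The sketch below is a more defensive parser; gateway_from_default_route is a hypothetical name, and it reads ip route output from standard input.

```shell
# Hypothetical helper: print the gateway of the first default route on stdin,
# or nothing at all if that route has no "via" hop.
gateway_from_default_route() {
  awk '$1 == "default" {
         for (i = 2; i < NF; i++) {
           if ($i == "via") { print $(i + 1); exit }
         }
         exit
       }'
}

# Usage:
#   WAN1_GW=$(ip route show default dev "$WAN1_IFACE" | gateway_from_default_route)
```

An empty result then means "no gateway to use", which the emptiness test above already treats as a stop condition.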

Step 7: Build policy routing rules and routes

Now we will create the actual routing behavior. The goal is simple:

Traffic marked as “WAN1” uses the wan1 table and exits WAN1.
Traffic marked as “WAN2” uses the wan2 table and exits WAN2.
We keep the main table default route as the currently preferred path, but we do not rely on it for correctness.

We will add default routes to each custom table, including an explicit source address. This reduces ambiguity in multi-homed systems.

ip route replace default via "$WAN1_GW" dev "$WAN1_IFACE" src "$WAN1_SRC" table wan1
ip route replace default via "$WAN2_GW" dev "$WAN2_IFACE" src "$WAN2_SRC" table wan2

ip route show table wan1
ip route show table wan2

Each table now has a clean default route. This still does not steer traffic by itself; it only defines where traffic goes once it is placed into a table.

Next, we will add policy rules that select a table based on a firewall mark. We will use two marks: 0x1 for WAN1 and 0x2 for WAN2.

ip rule add fwmark 0x1 lookup wan1 priority 100 || true
ip rule add fwmark 0x2 lookup wan2 priority 110 || true

ip rule show | sed -n '1,200p'

The system will now consult wan1 or wan2 tables when packets carry the corresponding mark. The marks themselves will be applied by nftables in the next step.
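
Even before nftables applies any marks, we can ask the kernel which path a marked packet would take, because ip route get accepts a mark argument. The sketch below only parses that output; egress_dev is a hypothetical helper name, and the destination 1.1.1.1 is an arbitrary example.

```shell
# Hypothetical helper: print the egress device named in `ip route get` output.
egress_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Usage on the router once tables and rules exist:
#   ip route get 1.1.1.1 mark 0x1 | egress_dev   # should name the WAN1 interface
#   ip route get 1.1.1.1 mark 0x2 | egress_dev   # should name the WAN2 interface
```

If both commands print the same interface, the rules or the per-table default routes are not in effect yet.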

Step 8: Configure nftables for NAT, stateful firewalling, and connection marking

This is where production setups either become stable or become mysterious. We need three things:

Stateful firewalling to allow established traffic and restrict inbound exposure.
NAT masquerading on both WAN interfaces for LAN egress.
Connection marking so replies follow the same WAN as the original flow, which reduces session breakage during normal operation.

We will implement a conservative baseline firewall:

Allow forwarding from LAN to WANs.
Allow established/related traffic back.
Drop unsolicited inbound from WAN to LAN by default.

We will also mark new LAN-originated connections to prefer WAN1 by default, and we will preserve the mark for the life of the connection using conntrack marks.

We will write a full /etc/nftables.conf so it is persistent and auditable.

cat > /etc/nftables.conf <<EOF
#!/usr/sbin/nft -f

flush ruleset

define WAN1 = $WAN1_IFACE
define WAN2 = $WAN2_IFACE
define LAN  = $LAN_IFACE

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    ct state established,related accept
    iif "lo" accept

    # Allow essential ICMP for troubleshooting and PMTU
    ip protocol icmp accept
    ip6 nexthdr icmpv6 accept

    # Allow SSH from LAN only (adjust to enterprise policy)
    iifname \$LAN tcp dport 22 accept

    # Optional: allow keepalived VRRP advertisements on LAN if used there
    # ip protocol vrrp accept

    counter drop
  }

  chain forward {
    type filter hook forward priority 0; policy drop;

    ct state established,related accept

    # Allow LAN to reach the internet via either WAN
    iifname \$LAN oifname \$WAN1 counter accept
    iifname \$LAN oifname \$WAN2 counter accept

    counter drop
  }

  chain output {
    type filter hook output priority 0; policy accept;
  }
}

table ip mangle {
  chain prerouting {
    type filter hook prerouting priority -150; policy accept;

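    # Restore packet mark from conntrack mark for established flows.
    # Limited to LAN ingress: replies arriving from a WAN must stay unmarked
    # so the main table routes them back to the LAN.
    iifname \$LAN ct mark 0x1 meta mark set 0x1
    iifname \$LAN ct mark 0x2 meta mark set 0x2

    # For new LAN-originated flows, prefer WAN1 by default
    iifname \$LAN ct state new meta mark set 0x1
    iifname \$LAN ct state new ct mark set meta mark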
  }
}

table ip nat {
  chain postrouting {
    type nat hook postrouting priority 100; policy accept;

    # NAT for LAN egress on each WAN
    oifname \$WAN1 counter masquerade
    oifname \$WAN2 counter masquerade
  }
}
EOF

chmod 600 /etc/nftables.conf
systemctl enable --now nftables
nft -f /etc/nftables.conf

We have now:

Enabled a default-drop firewall on input and forward chains.
Allowed LAN-to-WAN forwarding while keeping unsolicited WAN-to-LAN blocked.
Enabled NAT on both WANs.
Implemented connection marking so flows keep their WAN affinity.

We will verify that nftables is active and the ruleset is loaded.

systemctl status nftables --no-pager
nft list ruleset | sed -n '1,200p'

If the ruleset is empty or the service is not active, routing may still work partially, but NAT and marking will not, and failover behavior will be unreliable.
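
Because the configuration starts with flush ruleset, a syntax error is something we want to catch before loading, and nft can parse a file without applying it when given -c (check). The wrapper below is a sketch under assumptions: safe_nft_load is a hypothetical name, and the NFT variable exists only so the function can be exercised without touching the kernel.

```shell
# Hypothetical wrapper: validate a ruleset file before loading it.
# NFT is overridable for dry runs; it defaults to the real nft binary.
safe_nft_load() {
  snl_file=${1:-/etc/nftables.conf}
  if ! "${NFT:-nft}" -c -f "$snl_file"; then
    echo "syntax check failed, ruleset NOT loaded: $snl_file" >&2
    return 1
  fi
  "${NFT:-nft}" -f "$snl_file"
}

# Usage:
#   safe_nft_load /etc/nftables.conf
```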

Step 9: Make routing rules persistent with a systemd unit

Routes and rules added with ip commands do not automatically persist across reboots unless we manage them. In enterprise environments, “it worked until reboot” is not acceptable. We will create a small systemd oneshot service that re-applies the policy routing configuration after networking is up.

First, we will create a script that re-detects gateways and source IPs at boot time. This matters because WAN addressing may be DHCP-based and can change.

cat > /usr/local/sbin/apply-multiwan-routing.sh <<'EOF'
#!/bin/sh
set -eu

WAN1_IFACE="${WAN1_IFACE:-ens18}"
WAN2_IFACE="${WAN2_IFACE:-ens19}"

WAN1_GW="$(ip route show default dev "$WAN1_IFACE" | awk '/default/ {print $3; exit}')"
WAN2_GW="$(ip route show default dev "$WAN2_IFACE" | awk '/default/ {print $3; exit}')"

WAN1_SRC="$(ip -4 -o addr show dev "$WAN1_IFACE" | awk '{print $4}' | cut -d/ -f1 | head -n1)"
WAN2_SRC="$(ip -4 -o addr show dev "$WAN2_IFACE" | awk '{print $4}' | cut -d/ -f1 | head -n1)"

if [ -z "$WAN1_GW" ] || [ -z "$WAN2_GW" ] || [ -z "$WAN1_SRC" ] || [ -z "$WAN2_SRC" ]; then
  echo "multiwan-routing: missing WAN gateway or source address" >&2
  exit 1
fi

ip route replace default via "$WAN1_GW" dev "$WAN1_IFACE" src "$WAN1_SRC" table wan1
ip route replace default via "$WAN2_GW" dev "$WAN2_IFACE" src "$WAN2_SRC" table wan2

ip rule add fwmark 0x1 lookup wan1 priority 100 2>/dev/null || true
ip rule add fwmark 0x2 lookup wan2 priority 110 2>/dev/null || true

exit 0
EOF

chmod 750 /usr/local/sbin/apply-multiwan-routing.sh

We have created a boot-safe script that re-applies routes and rules based on the live WAN configuration. Next, we will provide the interface names to the script via an environment file so we do not hardcode them in multiple places.

cat > /etc/default/multiwan-routing <<EOF
WAN1_IFACE=$WAN1_IFACE
WAN2_IFACE=$WAN2_IFACE
EOF

chmod 640 /etc/default/multiwan-routing

Now we will create the systemd unit that runs after the network is online.

cat > /etc/systemd/system/multiwan-routing.service <<'EOF'
[Unit]
Description=Apply multi-WAN policy routing tables and rules
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
EnvironmentFile=/etc/default/multiwan-routing
ExecStart=/usr/local/sbin/apply-multiwan-routing.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now multiwan-routing.service

The routing policy is now persistent across reboots. We will verify the service and confirm the rules and tables are present.

systemctl status multiwan-routing.service --no-pager
ip rule show | sed -n '1,200p'
ip route show table wan1
ip route show table wan2

Step 10: Implement health-driven failover with keepalived

At this point, we have two WAN paths and deterministic routing tables. What we still need is a decision-maker: when WAN1 is unhealthy, we must shift new flows to WAN2, and when WAN1 recovers, we should shift back in a controlled way.

We will use keepalived not for classic VRRP between two routers, but as a robust health-check engine that can run scripts and adjust priority. In a single-router design, the “priority” becomes our internal signal to switch behavior.

We will implement this switching by changing which mark is applied to new LAN connections. The cleanest way to do that is to update an nftables set or rule. We will keep it simple and safe: we will maintain a small file that indicates the preferred WAN mark, and a script will update nftables accordingly.

Create a controlled switch mechanism for the preferred WAN

We will add a dedicated nftables chain that sets the default mark for new LAN flows, and we will make it easy to flip between WAN1 and WAN2 without rewriting the whole ruleset.

nft cannot read the preferred mark from an external variable file at run time, so we will keep the marking rule in a tiny include file and let a script rewrite that file and reload the ruleset. This keeps each change minimal and auditable.

First, we will create the include directory and then write an include file that contains only the “preferred WAN” marking rule.

mkdir -p /etc/nftables.d

cat > /etc/nftables.d/preferred-wan.nft <<'EOF'
# This file is managed by keepalived scripts.
# It controls which WAN mark is applied to NEW LAN-originated connections.
#
# 0x1 = WAN1
# 0x2 = WAN2
#
# Default: WAN1
add rule ip mangle prerouting iifname $LAN ct state new meta mark set 0x1
add rule ip mangle prerouting iifname $LAN ct state new ct mark set meta mark
EOF

chmod 600 /etc/nftables.d/preferred-wan.nft

Now we will adjust the main nftables config to remove the earlier “prefer WAN1” lines and include this file instead. We will rewrite /etc/nftables.conf fully to keep it consistent and production-auditable.

cat > /etc/nftables.conf <<EOF
#!/usr/sbin/nft -f

flush ruleset

define WAN1 = $WAN1_IFACE
define WAN2 = $WAN2_IFACE
define LAN  = $LAN_IFACE

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    ct state established,related accept
    iif "lo" accept

    ip protocol icmp accept
    ip6 nexthdr icmpv6 accept

    iifname \$LAN tcp dport 22 accept

    counter drop
  }

  chain forward {
    type filter hook forward priority 0; policy drop;

    ct state established,related accept

    iifname \$LAN oifname \$WAN1 counter accept
    iifname \$LAN oifname \$WAN2 counter accept

    counter drop
  }

  chain output {
    type filter hook output priority 0; policy accept;
  }
}

table ip mangle {
  chain prerouting {
    type filter hook prerouting priority -150; policy accept;

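    # Restore packet mark from conntrack mark for established flows.
    # Limited to LAN ingress: replies arriving from a WAN must stay unmarked
    # so the main table routes them back to the LAN.
    iifname \$LAN ct mark 0x1 meta mark set 0x1
    iifname \$LAN ct mark 0x2 meta mark set 0x2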
  }
}

include "/etc/nftables.d/preferred-wan.nft"

table ip nat {
  chain postrouting {
    type nat hook postrouting priority 100; policy accept;

    oifname \$WAN1 counter masquerade
    oifname \$WAN2 counter masquerade
  }
}
EOF

chmod 600 /etc/nftables.conf
nft -f /etc/nftables.conf

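We have now separated “preferred WAN selection” into a small include file. This is important operationally: a failover changes one small, controlled file while the rest of the firewall policy stays identical. The reload still replaces the whole ruleset, but nft -f applies a file as a single transaction, so there is no window in which a half-loaded policy is active.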

We will verify that the include rules are present.

nft list ruleset | sed -n '1,260p'

Create scripts that switch preferred WAN safely

We will create two scripts: one to prefer WAN1 and one to prefer WAN2. Each script will rewrite only the include file and then reload nftables. This is deterministic and easy to audit.

cat > /usr/local/sbin/prefer-wan1.sh <<'EOF'
#!/bin/sh
set -eu

cat > /etc/nftables.d/preferred-wan.nft <<'RULES'
# Managed by keepalived scripts
add rule ip mangle prerouting iifname $LAN ct state new meta mark set 0x1
add rule ip mangle prerouting iifname $LAN ct state new ct mark set meta mark
RULES

nft -f /etc/nftables.conf
EOF

cat > /usr/local/sbin/prefer-wan2.sh <<'EOF'
#!/bin/sh
set -eu

cat > /etc/nftables.d/preferred-wan.nft <<'RULES'
# Managed by keepalived scripts
add rule ip mangle prerouting iifname $LAN ct state new meta mark set 0x2
add rule ip mangle prerouting iifname $LAN ct state new ct mark set meta mark
RULES

nft -f /etc/nftables.conf
EOF

chmod 750 /usr/local/sbin/prefer-wan1.sh /usr/local/sbin/prefer-wan2.sh

We now have two controlled switch scripts. They do not guess interface names; they rely on the $LAN definition inside nftables, which is already set in /etc/nftables.conf. Reloading nftables applies the new preference immediately for new connections, while existing connections keep their conntrack mark and continue on their original WAN.
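
The two scripts differ only in the mark value, and they overwrite the include file in place. One possible refinement, sketched below, is a single parameterized writer that builds the file under a temporary name and renames it into place, so a reload never reads a half-written file. The helper name write_preferred_wan is hypothetical and not part of the scripts above.

```shell
# Hypothetical helper: write the preferred-WAN include file atomically.
# Arguments: mark (0x1 or 0x2) and target path.
write_preferred_wan() {
  pw_mark=$1
  pw_file=${2:-/etc/nftables.d/preferred-wan.nft}
  case "$pw_mark" in
    0x1|0x2) ;;
    *) echo "unsupported mark: $pw_mark" >&2; return 1 ;;
  esac
  pw_tmp=$(mktemp "${pw_file}.XXXXXX") || return 1
  {
    echo '# Managed by keepalived scripts'
    # $LAN stays literal here; nft resolves it from /etc/nftables.conf.
    printf 'add rule ip mangle prerouting iifname $LAN ct state new meta mark set %s\n' "$pw_mark"
    echo 'add rule ip mangle prerouting iifname $LAN ct state new ct mark set meta mark'
  } > "$pw_tmp"
  chmod 600 "$pw_tmp"
  mv -f "$pw_tmp" "$pw_file"   # rename within one directory replaces the file in one step
}

# Usage (followed by the same reload the scripts above perform):
#   write_preferred_wan 0x2 && nft -f /etc/nftables.conf
```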

Configure keepalived health checks

Now we will define what “WAN is healthy” means. In production, link state alone is not enough. We can have an interface that is up while upstream routing is broken. We will check reachability to multiple stable public endpoints and require success before we consider a WAN healthy.

We will create two health-check scripts, one per WAN, that force the check to egress via the correct interface. This avoids false positives where the check accidentally exits the other WAN.

cat > /usr/local/sbin/check-wan1.sh <<'EOF'
#!/bin/sh
set -eu

# Read interface names from the shared environment file, if it exists.
[ -r /etc/default/multiwan-routing ] && . /etc/default/multiwan-routing

WAN1_IFACE="${WAN1_IFACE:-ens18}"

# Check multiple targets; success if any one responds quickly.
ping -I "$WAN1_IFACE" -c 1 -W 1 1.1.1.1 >/dev/null 2>&1 && exit 0
ping -I "$WAN1_IFACE" -c 1 -W 1 8.8.8.8 >/dev/null 2>&1 && exit 0
ping -I "$WAN1_IFACE" -c 1 -W 1 9.9.9.9 >/dev/null 2>&1 && exit 0

exit 1
EOF

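cat > /usr/local/sbin/check-wan2.sh <<'EOF'
#!/bin/sh
set -eu

# Read interface names from the shared environment file, if it exists.
[ -r /etc/default/multiwan-routing ] && . /etc/default/multiwan-routing

WAN2_IFACE="${WAN2_IFACE:-ens19}"

# Check multiple targets; success if any one responds quickly.
ping -I "$WAN2_IFACE" -c 1 -W 1 1.1.1.1 >/dev/null 2>&1 && exit 0
ping -I "$WAN2_IFACE" -c 1 -W 1 8.8.8.8 >/dev/null 2>&1 && exit 0
ping -I "$WAN2_IFACE" -c 1 -W 1 9.9.9.9 >/dev/null 2>&1 && exit 0

exit 1
EOF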

chmod 750 /usr/local/sbin/check-wan1.sh /usr/local/sbin/check-wan2.sh

We have created deterministic health checks that validate real upstream reachability per WAN. Next, we will configure keepalived to run these checks and switch preference accordingly.
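
The two check scripts share the same "any one target answers" logic. For illustration, that logic can be written once as a function that takes the interface and a list of targets. This is a sketch, not a replacement for the scripts above: wan_healthy is a hypothetical name, and PING_CMD is an assumption added so the logic can be exercised without sending traffic.

```shell
# Hypothetical helper: succeed as soon as any target answers one ping sent
# through the given interface. PING_CMD is overridable for dry runs.
wan_healthy() {
  wh_iface=$1
  shift
  for wh_target in "$@"; do
    if "${PING_CMD:-ping}" -I "$wh_iface" -c 1 -W 1 "$wh_target" >/dev/null 2>&1; then
      return 0
    fi
  done
  return 1   # no target answered, or no targets were given
}

# Usage:
#   wan_healthy "$WAN1_IFACE" 1.1.1.1 8.8.8.8 9.9.9.9 && echo "WAN1 healthy"
```

Requiring only one success keeps a single unreachable public resolver from triggering a failover; the worst case per check is one second per target.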

We will provide interface names to these scripts via the same environment file we already created.

cat >> /etc/default/multiwan-routing <<EOF
LAN_IFACE=$LAN_IFACE
EOF

chmod 640 /etc/default/multiwan-routing
cat /etc/default/multiwan-routing

Now we will configure keepalived. We will run it locally and let the tracked WAN1 health check drive the instance state. When the WAN1 check fails, we will switch to WAN2. When WAN1 recovers, we will switch back.

cat > /etc/keepalived/keepalived.conf <<'EOF'
global_defs {
  enable_script_security
  script_user root
}

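vrrp_script chk_wan1 {
  script "/usr/local/sbin/check-wan1.sh"
  interval 2
  timeout 2
  fall 3
  rise 5
  # weight 0: a failed check moves the instance to FAULT instead of only
  # lowering its priority. With no VRRP peer, a priority change alone
  # would never cause a state transition.
  weight 0
}

vrrp_script chk_wan2 {
  script "/usr/local/sbin/check-wan2.sh"
  interval 2
  timeout 2
  fall 3
  rise 5
  # Non-zero weight: a WAN2 failure only adjusts priority and does not
  # push the instance into FAULT.
  weight 10
}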

vrrp_instance WAN_FAILOVER {
  state BACKUP
  interface lo
  virtual_router_id 51
  priority 100
  advert_int 1

  track_script {
    chk_wan1
    chk_wan2
  }

  notify_master "/usr/local/sbin/prefer-wan1.sh"
  notify_backup "/usr/local/sbin/prefer-wan2.sh"
  notify_fault "/usr/local/sbin/prefer-wan2.sh"
}
EOF

systemctl enable --now keepalived

We have configured keepalived to run health checks continuously and to call our switch scripts based on state transitions. We used lo as the interface because we are not advertising VRRP on a physical network; we are using keepalived’s state machine and script handling locally.

Now we will verify keepalived status and logs.

systemctl status keepalived --no-pager
journalctl -u keepalived --no-pager -n 200

If keepalived is running correctly, we should see log entries when a health check changes result and when the instance changes state during a simulated failure. The preferred WAN marking will change only when keepalived triggers the notify scripts.

Step 11: Verification from routing to real traffic

We will verify in layers. In production, we do not trust a single “ping worked” as proof. We will validate:

Policy rules exist
Routing tables have correct defaults
nftables is loaded and NAT is active
Traffic from LAN is forwarded and NATed
Failover changes the preferred WAN for new flows

Verify policy routing objects

We will confirm the rules and tables are present and readable.

ip rule show | sed -n '1,200p'
ip route show table wan1
ip route show table wan2

We should see rules for fwmark 0x1 and fwmark 0x2, and each table should have a default route via its respective gateway.

Verify nftables and NAT counters

We will confirm nftables is active and then watch counters increase as traffic flows.

systemctl status nftables --no-pager
nft list ruleset | sed -n '1,260p'

Now we will generate outbound traffic from a LAN client and then inspect NAT and forward counters. If we do not have a LAN client available, we can still validate on the router by forwarding tests later, but real validation should come from a LAN host.

nft list table inet filter
nft list table ip nat

We should see counters increment on the forward accept rules and on the masquerade rules corresponding to the active WAN.

Verify preferred WAN selection

We will check which preferred-wan rule is currently installed. This tells us what new LAN connections will use.

cat /etc/nftables.d/preferred-wan.nft
nft -a list chain ip mangle prerouting | sed -n '1,200p'

If the include file sets mark 0x1, new flows prefer WAN1. If it sets 0x2, new flows prefer WAN2. Existing flows will continue based on conntrack marks.
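
For monitoring or runbooks, the same check can return a name instead of requiring someone to read the file. A minimal sketch, assuming a hypothetical helper called preferred_wan that looks for the first "meta mark set 0x..." rule in the include file:

```shell
# Hypothetical helper: report which WAN the include file currently prefers.
preferred_wan() {
  pf_file=${1:-/etc/nftables.d/preferred-wan.nft}
  pf_mark=$(sed -n 's/^add rule .* meta mark set \(0x[0-9a-fA-F]*\)$/\1/p' "$pf_file" 2>/dev/null | head -n 1)
  case "$pf_mark" in
    0x1) echo WAN1 ;;
    0x2) echo WAN2 ;;
    *)   echo unknown; return 1 ;;
  esac
}

# Usage:
#   preferred_wan
```

This reports what the file says, not what the kernel has loaded; the nft list chain command above remains the authoritative view.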

Simulate failover safely

In a controlled window, we can simulate WAN1 failure by bringing the interface down. This is disruptive, so we do it intentionally and watch keepalived react.

ip link set dev "$WAN1_IFACE" down
sleep 8
journalctl -u keepalived --no-pager -n 80
cat /etc/nftables.d/preferred-wan.nft

We should see keepalived transition and the preferred WAN switch to WAN2. New LAN connections should now exit via WAN2.

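Now we restore WAN1 and confirm it switches back after the rise threshold. Because the kernel removes routes through an interface when that interface goes down, we also re-apply the policy routes once the link and its default route are back.

ip link set dev "$WAN1_IFACE" up
sleep 5
systemctl restart multiwan-routing.service
sleep 20
journalctl -u keepalived --no-pager -n 120
cat /etc/nftables.d/preferred-wan.nft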

We should see the preferred WAN return to WAN1. The delay is intentional; it prevents flapping during unstable upstream conditions.
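
The size of that delay follows from the check settings: a result only counts after fall consecutive failures or rise consecutive successes, one per interval. As a worked example with the values used in this guide (interval 2, fall 3, rise 5), the sketch below computes the approximate minimum time each threshold needs. The function name threshold_seconds is hypothetical, and the figures ignore script timeouts and keepalived's own state transition time.

```shell
# Hypothetical helper: consecutive-result threshold expressed in seconds.
# Arguments: check interval in seconds, then the fall or rise count.
threshold_seconds() {
  echo $(( $1 * $2 ))
}

threshold_seconds 2 3   # fall: 3 consecutive failures at a 2 s interval
threshold_seconds 2 5   # rise: 5 consecutive successes at a 2 s interval
```

With these values, failure detection takes on the order of 6 seconds and recovery on the order of 10 seconds, which is why recovery is deliberately slower than failover.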

Security and firewall considerations for enterprises

Routing failover is a network availability feature, but it can accidentally become a security regression if we treat the router like a “dumb pipe.” A few enterprise-grade considerations we should keep in place:

Default-drop inbound: We implemented a default-drop input policy and only allowed SSH from LAN. We should further restrict SSH to a management subnet and enforce key-based auth.
Logging strategy: Excessive firewall logging can become a denial-of-service vector. If we add logging, we should rate-limit it.
Management plane separation: Ideally, management access is on a dedicated interface/VLAN, not the general LAN.
Upstream exposure: If we need inbound services (VPN concentrator, published apps), we should add explicit DNAT and forward rules per service, per WAN, and document expected behavior during failover.
Change control: Keep /etc/nftables.conf, /etc/keepalived/keepalived.conf, and /usr/local/sbin scripts under configuration management and peer review.

Troubleshooting

Symptom: LAN clients have no internet access

Likely cause: IPv4 forwarding is disabled.
Fix: Verify and re-apply sysctl.

sysctl net.ipv4.ip_forward
sysctl --system

If forwarding was off, enabling it immediately restores routing, and the sysctl file ensures it stays enabled after reboot.

Symptom: Internet works on the router itself, but LAN clients cannot reach anything

Likely cause: NAT is missing or nftables is not loaded.
Fix: Verify nftables service and NAT rules.

systemctl status nftables --no-pager
nft list table ip nat
nft -f /etc/nftables.conf

If masquerade rules are missing, LAN traffic leaves the router with its private (RFC 1918) source address. Upstream networks cannot route replies to that address, so the replies never come back.

Symptom: Failover does not happen when WAN1 is down

Likely cause: keepalived is not running, scripts are not executable, or script security is blocking execution.
Fix: Check service status, logs, and permissions.

systemctl status keepalived --no-pager
journalctl -u keepalived --no-pager -n 200
ls -l /usr/local/sbin/check-wan1.sh /usr/local/sbin/check-wan2.sh /usr/local/sbin/prefer-wan1.sh /usr/local/sbin/prefer-wan2.sh

If scripts are not executable or keepalived cannot run them, we will see explicit errors in the journal. Correcting permissions and restarting keepalived typically resolves it.

Symptom: Failover happens, but some sessions break immediately

Likely cause: Existing sessions were established via WAN1 and cannot survive a path change, especially with NAT and stateful upstreams.
Fix: Accept that some sessions will reset on hard failover, and design critical applications with reconnection logic. For higher continuity, consider application-layer resilience or multi-path designs.

What we have built is controlled failover for new flows and best-effort continuity for existing flows. That is the realistic boundary for many NAT-based enterprise edges.

Symptom: Traffic exits WAN2 even when WAN1 is healthy

Likely cause: Preferred WAN include file is set to WAN2, or keepalived is in BACKUP/FAULT state due to failing WAN1 checks.
Fix: Inspect the preferred-wan file and keepalived logs, then validate WAN1 reachability from the correct interface.

cat /etc/nftables.d/preferred-wan.nft
journalctl -u keepalived --no-pager -n 120
ping -I "$WAN1_IFACE" -c 2 -W 1 1.1.1.1

If the ping fails from WAN1 specifically, the issue is real upstream reachability, not routing preference. If ping succeeds but keepalived still fails, we should review script timeouts and DNS dependencies (we intentionally used IPs to avoid DNS as a dependency).

Common mistakes

Mistake: Using the wrong interface names

Symptom: NAT counters do not increment; forwarding rules never match; traffic silently drops.
Fix: Re-check interface names and rewrite variables, then reload nftables.

ip -br link
printf "WAN1=%s WAN2=%s LAN=%s\n" "$WAN1_IFACE" "$WAN2_IFACE" "$LAN_IFACE"
nft -f /etc/nftables.conf

When interface names are wrong, the rules exist but never match, which looks like “routing is broken” even though the configuration is simply targeting the wrong devices.

Mistake: Forgetting persistence for policy routing

Symptom: Everything works until reboot, then failover logic becomes inconsistent.
Fix: Ensure the systemd unit is enabled and successful.

systemctl is-enabled multiwan-routing.service
systemctl status multiwan-routing.service --no-pager
ip rule show | sed -n '1,120p'

If the service is disabled or failing, rules and routes may not be present after boot, and nftables marking will have nothing deterministic to steer into.

Mistake: Strict reverse path filtering in multi-WAN

Symptom: Intermittent drops, especially during failover or when return traffic arrives on a different interface.
Fix: Set rp_filter to loose mode and re-apply sysctl.

sysctl net.ipv4.conf.all.rp_filter
sysctl -w net.ipv4.conf.all.rp_filter=2
sysctl -w net.ipv4.conf.default.rp_filter=2
sysctl --system

Loose mode keeps basic spoofing protection while allowing legitimate multi-homed routing behavior.
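
One detail worth knowing when reading these values: for source validation on a given interface, the kernel uses the higher of conf.all.rp_filter and that interface's own rp_filter, and the default value only seeds interfaces created later. The sketch below illustrates the rule; effective_rp_filter is a hypothetical helper name.

```shell
# Hypothetical helper: effective rp_filter for one interface, given the
# "all" value and the per-interface value (the kernel applies the maximum).
effective_rp_filter() {
  if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Usage:
#   effective_rp_filter "$(sysctl -n net.ipv4.conf.all.rp_filter)" \
#                       "$(sysctl -n "net.ipv4.conf.$WAN1_IFACE.rp_filter")"
```

This is why setting all to 2 is sufficient for loose mode even if an individual interface still reports 1, and why an interface set to 1 stays strict when all is 0.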

How do we at NIILAA look at this

This setup is not impressive because it is complex. It is impressive because it is controlled. Every component is intentional. Every configuration has a reason. This is how infrastructure should scale — quietly, predictably, and without drama.

At NIILAA, we help enterprises design, deploy, secure, and maintain software-based routing and high-availability internet edge patterns like this in real production environments. That includes architecture validation, security hardening, operational runbooks, monitoring, change control, and failure testing so the first real outage is not the first real test.

Website: https://www.niilaa.com
LinkedIn: https://www.linkedin.com/company/niilaa
Facebook: https://www.facebook.com/niilaa.llc