K3s, kube-vip, and metallb

Man, this blog seems to be all about my trials and tribulations with Kubernetes at this point. Well, to add to it, here’s another issue I stumbled into…

Tl;dr – when using metallb with kube-vip, do not use the --services switch when generating the daemonset manifest, as it will conflict with metallb’s load balancing of your services.


I built a new cluster (based on Minisforum’s awesome MS-01) about two months ago. As part of this new build, I wanted to load balance the control plane instead of the DNS round-robin I had been using. This led me to kube-vip.

(Un)Fortunately, kube-vip can also do what metallb does – it can load balance services running in the cluster on bare metal without an external load balancer. However, I was happy with metallb in the old cluster and didn’t want to change that part.

So I installed k3s and generated the manifest per the kube-vip k3s instructions, which link to their manifest creation instructions. I even looked at other, similar articles and saw basically the same steps.
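For reference, the setup was roughly along these lines – the VIP and interface values below are placeholders for my network, and the version lookup and ctr alias follow the kube-vip docs:

export VIP=192.168.1.50      # virtual IP for the control plane (placeholder)
export INTERFACE=eth0        # NIC that should carry the VIP (placeholder)
export KVVERSION=$(curl -sL https://api.github.com/repos/kube-vip/kube-vip/releases | jq -r ".[0].name")

# run the kube-vip image under containerd so the kube-vip command is available for manifest generation
alias kube-vip="ctr image pull ghcr.io/kube-vip/kube-vip:$KVVERSION; ctr run --rm --net-host ghcr.io/kube-vip/kube-vip:$KVVERSION vip /kube-vip"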

All was pretty good until I started having some weird issues where my ingresses would just sort of go offline. When I’d try to hit a website (like this one), I’d never see the request make it to the ingress, but I could ping the IP. Not seeing the request in the ingress logs made me think it was something with metallb not doing its L2 advertisement correctly. This seemed to happen whenever I had to restart the nodes for any reason (usually patches).

Knowing the only real difference in this part of the setup was kube-vip, I knew something was going on between kube-vip and metallb. I just didn’t know what. I tried upgrading kube-vip and downgrading metallb, but nothing seemed to work. Figuring the two were fighting, I disabled kube-vip’s service load balancing (even though I didn’t want it touching services in the first place). Thinking I had fixed it, I left it alone. Not more than 3 hours later, the ingresses went down again. In fact, I had actually made it worse: every 2-4 hours the ingresses would go down for 20 minutes, then fix themselves. It was incredibly nerve-racking.
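If I remember right, the generated daemonset toggles service load balancing with an svc_enable environment variable, so disabling it amounted to something like this (daemonset name and variable are from the manifest kube-vip generated for me; yours may differ):

kubectl -n kube-system set env daemonset/kube-vip-ds svc_enable=false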

Metallb even has a whole troubleshooting section on its website for this exact issue. Sadly, nothing there really helped, but there were some weird findings, like arping returning multiple MAC addresses for the first few replies before it settled on the right one. And then yesterday, while the ingresses were down, I cleared the ARP cache on my router on a whim, and the problem immediately fixed itself. Hmmm, could it be something with the router?!
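The check itself is simple enough; run it from another machine on the same L2 segment (the interface and service IP below are placeholders):

arping -I eth0 192.168.1.240
# healthy: every reply comes from the same MAC; more than one MAC answering points at a conflict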

In a fit of frustration, I deleted the kube-vip daemonset from the cluster. Surely, that would fix it?! No, 2 hours later it was flapping again!
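For the record, that was just the following (the daemonset name comes from the generated manifest):

kubectl -n kube-system delete daemonset kube-vip-ds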

Thinking through the router angle, the only explanation I could come up with was that it was getting conflicting information, and the only way that happens is if there are duplicate IPs on the network. I logged into each of the servers and ran ip -o -f inet addr show. Lo and behold, two different servers had the same IP address bound. Metallb doesn’t bind the service IP to a network interface, but kube-vip does, so it was kube-vip that was causing the issues! Good thing I had deleted it, but now I needed to restart the servers to remove the stale IP bindings. Thankfully, after the restart the IPs were gone.
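A quick loop makes the duplicate easy to spot across nodes (hostnames and subnet are placeholders):

for host in node1 node2 node3; do
    ssh "$host" 'hostname; ip -o -f inet addr show | grep 192.168.1.'
done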

However, I really liked having the control plane load balanced instead of pointing at an individual node or relying on round-robin DNS. Digging into the configuration a bit more, I saw that there are two main features: --controlplane and --services. Sadly, the default instructions include --services, which duplicates what metallb was already doing for me. Therefore, I updated the manifest command to the following:

# note: no --services flag here, so metallb keeps handling the service load balancing
kube-vip manifest daemonset \
    --interface $INTERFACE \
    --address $VIP \
    --inCluster \
    --taint \
    --controlplane \
    --arp \
    --leaderElection
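On k3s, redeploying amounts to regenerating the manifest in the server’s auto-deploy directory and letting k3s pick it up (same flags as above; the exact filename is whatever was used the first time):

kube-vip manifest daemonset \
    --interface $INTERFACE --address $VIP --inCluster --taint \
    --controlplane --arp --leaderElection \
    > /var/lib/rancher/k3s/server/manifests/kube-vip.yaml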

Redeployed, and over 24 hours later, all is resolved! Man, a rough 2 months dealing with that…
