PTS 2024 - day 2 and 3... the bad days

Following on from day 1

Joel and I spent some more time working out disk provisioning and then decided to upgrade the nodes in the cluster... this is where the problems started...

I shut down a node to resize it... and the site went down. Fastly (our CDN) then displayed "no healthy backends" to all users for any content that wasn't in its cache. This is not meant to happen!

We also couldn't connect to Argo (a web UI for Kubernetes deployments and a view on the K8s API status) or even use the kubectl command line tool.

We started the node back up (after having upgraded it) and everything came back. We quickly realised that everything was using round-robin DNS to all 3 node IPs. There was a Traefik setup, but it was tied to those IPs and something was not happy. We then looked at alternative tooling and thought it might be worth using rke2 instead of k3s as the underlying flavour of K8s, as this would give us a little more flexibility.
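To illustrate why plain round-robin DNS gives no failover, here is a minimal sketch. The node addresses are made up (documentation range, not our real IPs) and `round_robin_failures` is a hypothetical helper. It only models clients rotating through every address in the record, so it does not explain the full outage we saw, just why a stopped node keeps receiving traffic.

```python
from itertools import cycle

# Hypothetical node addresses (documentation range), not the real cluster IPs.
NODES = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]


def round_robin_failures(nodes, down, requests):
    """Count requests that land on a stopped node when clients simply
    rotate through every address in the DNS record."""
    rotation = cycle(nodes)
    return sum(1 for _ in range(requests) if next(rotation) in down)


# One node shut down for a resize, nothing removes it from the rotation.
failed = round_robin_failures(NODES, down={"192.0.2.12"}, requests=300)
print(f"{failed} of 300 requests hit the stopped node")
```

Nothing in the rotation knows a node is gone, so something has to take it out of the pool, which is what health checks are for.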

Then things just got worse... we couldn't even get rke2 running... time went by, many options were tried and lots of docs / blogs / issues were read. We even used other example projects to spin up what seemed to be a working rke2, so the theory was there... rke2 seems to want to do things on private IPs, on a specific interface, and that seems to clash with Hetzner's way of provisioning private IPs. We also pulled in Robert, who currently uses rke1, and he pointed us at a few things to try...

At one point we told the underlying rke2 networking to use wireguard to provide a private connection over the public interface. This worked - on a throwaway Hetzner box in a test project! Which we threw away - and then could not reproduce again! We proved that we could start things up in a provider other than Hetzner, but that didn't help at this point.

We didn't want to waste more time, so we went back to thinking about using k3s, which meant we needed some sort of high availability tooling, ultimately with Hetzner handling this... we could use their Load Balancer with health checks... except that to have certificates you then have to have Hetzner running the DNS, and we already rely on Cloudflare for that for other things.
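For anyone wondering what "health checks" buys you here, this is a rough sketch of the idea, not how Hetzner's Load Balancer is actually implemented. The function names are hypothetical and a bare TCP connect is assumed to be a good enough probe (a real check would more likely be an HTTP request).

```python
import socket


def is_healthy(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat the backend as down.
        return False


def healthy_backends(backends, timeout=1.0):
    """Keep only the (host, port) pairs that currently accept connections."""
    return [b for b in backends if is_healthy(b[0], b[1], timeout=timeout)]
```

A load balancer runs something like `healthy_backends` on a schedule and only sends traffic to what comes back, so a node that is shut down for a resize simply drops out of the pool.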

To say we were disheartened doesn't even come close!

... there is better news on day 4!

Sponsors who make this work possible

Monetary sponsors: Booking.com, The Perl and Raku Foundation, Deriv, cPanel, Inc, Japan Perl Association, Perl-Services, Simplelists Ltd, Ctrl O Ltd, Findus Internet-OPAC, Harald Joerg, Steven Schubiger.

In kind sponsors: Fastmail, Grant Street Group, Deft, Procura, Healex GmbH, SUSE, Zoopla.

About Ranguard

London Perl developer