Were you at AWS re:Invent 2019?
I was, and it was a revelation.
“Will you reboot your Linux server in the next 30 days?”
That’s what I asked almost everyone who came to the KernelCare stand.
A third of you said yes. The main reason? Compliance.
This makes sense. Most compliance policies tell companies they must install security patches within 30 days of issue. The problem is, on Linux, patching the kernel means rebooting the server—there’s simply no way around it. (Or is there?)
I must have asked around 2000 people. Think about how many servers each of them looks after. Then multiply by how long a reboot takes, not only to perform, but to plan and coordinate. Even with this small sample, it comes to a startling amount of human effort. And money.
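To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Only the figure of 2000 people comes from the conversations above. The servers per person, hours per reboot, reboots per year, and hourly cost are placeholder assumptions, not survey data, so swap in your own values.

```python
def reboot_effort(admins, servers_per_admin, hours_per_reboot,
                  reboots_per_year, hourly_cost):
    """Return (total_hours, total_cost) spent on reboots per year."""
    total_hours = (admins * servers_per_admin
                   * hours_per_reboot * reboots_per_year)
    return total_hours, total_hours * hourly_cost


# 2000 is the number of people mentioned in the text.
# Every other value is a hypothetical placeholder.
hours, cost = reboot_effort(
    admins=2000,
    servers_per_admin=50,    # assumption
    hours_per_reboot=0.5,    # assumption: planning plus execution
    reboots_per_year=12,     # assumption: one patch cycle per month
    hourly_cost=40.0,        # assumption
)
print(f"{hours:,.0f} hours and ${cost:,.0f} per year")
```

The point is the shape of the multiplication, not the result: each factor is small on its own, but the product grows quickly.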
Rebooting: An Inconvenient Necessity
Linux Patch Management is the fancy way of saying “keep a server up to date by installing the latest software”. When that software is the Linux kernel, updating it is the most common reason for a system reboot. For large firms, security patching isn’t optional; it’s an obligation.
A typical IT security compliance policy will have a clause that says something like:
“…security patches must be applied to the system within X days of them being released by the vendor…”
X is usually 30 days. That’s why, in practice, sysadmins batch up patches and install them in bulk during a pre-planned maintenance window, a time when everyone has agreed it’s okay for the system to go down for a short while. (Sadly, that window is often Saturday night/Sunday morning, when the impact on customers is smallest.)
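The deadline arithmetic behind such a clause can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the function names are made up for this example, and it assumes that “within X days” includes day X itself, which your own policy may define differently.

```python
from datetime import date, timedelta


def patch_deadline(released, max_days=30):
    """Last day on which a patch released on `released` can be applied."""
    return released + timedelta(days=max_days)


def is_compliant(released, applied, max_days=30):
    """True if the patch was applied on or before the deadline."""
    return applied <= patch_deadline(released, max_days)


def last_sunday_on_or_before(day):
    """Latest Sunday that is not after `day` (Monday is weekday 0, Sunday is 6)."""
    days_back = (day.weekday() - 6) % 7
    return day - timedelta(days=days_back)


# Hypothetical patch released on the Monday of AWS re:Invent 2019.
released = date(2019, 12, 2)
deadline = patch_deadline(released)
window = last_sunday_on_or_before(deadline)
print(f"Deadline: {deadline}, last Sunday window: {window}")
```

Working backwards from the deadline to the last usable weekend shows how little slack a 30-day clause leaves once a maintenance window has to be negotiated.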
Planning that window takes a lot of effort: time spent in meetings, words spilled in emails, hearts broken in compromise. And that’s without the lost weekends of cold coffee and stale pizza. From my time at AWS re:Invent, I got the feeling that 30% of Linux system administrators are suffering needlessly.
Automation: An Absolute Necessity
No one should patch-manage manually. Even for one server, it’s a huge amount of effort keeping it up to date.
Sysadmins, being the technical creatures they are, love to automate such routine tasks, especially if they manage large server fleets. Often, the work of automating is more fun than what is being automated. There are other benefits, too:
- It’s safer, less prone to human error.
- The process can be recorded and audited.
- It’s easier to share the work and responsibility.
There are many ways to automate. Deciding which to use can itself be an obstacle. For example, should you:
- Build your own automation tool? There are plenty of scripting languages to choose from, but which one is best, and do you have the skills, patience, and time?
- Use the vendor’s preferred support tool? Red Hat have Satellite (and Spacewalk, the open-source equivalent), and Canonical have Landscape, but these are for their own platforms, and come only as part of a support bundle.
- Use a service? Again, there are many to choose from: Automox, GFI, Ivanti, Kaseya, ManageEngine, Pulseway, to name but a few. Someone has to look at them, figure out what they do, how they do it, and whether they’ll be suitable. Meanwhile, patches keep coming and need installing.
- Use an orchestration tool? Ansible, Chef, or Puppet are some options, but they, too, need checking out, and have a learning curve to climb before their power pays dividends.
There is one other tempting option if you work with managed cloud services like VMware Cloud on AWS, where the provider looks after a virtualized platform for you (for a fee, naturally). But virtualization doesn’t eliminate reboots. It just makes them quicker and less painful, lessening their effect through clustering and other system redundancy approaches.
A reboot, no matter how short, stops all processes and closes file handles, network connections, and user sessions. For many classes of application and application user, a reboot will go unnoticed; sessions restore, connections re-establish. But for other types of application, interruptions are ruinous. Think of long-running scientific computations, real-time analytics, live gaming servers, IoT devices, …
A Cure for Reboot Syndrome
Live patching is a way of installing Linux kernel security patches, automatically, and without rebooting. Even though the technique has been around since 2010, it hasn’t found the fame that surrounds Linux.
That’s mainly because it’s hard to pull off. It requires a deep understanding of the kernel source code, and the ability to code patches quickly and accurately. Because of this, only the big Linux vendors can offer live patching solutions: Oracle (with Ksplice), Red Hat (with kpatch), SUSE (with SUSE Live Patching, née kGraft), and Canonical (with Livepatch).
How to patch the Linux kernel without rebooting
KernelCare works the same way Ksplice, kpatch and kGraft do. The difference between KernelCare and those tools is in how the patches are created. (Another difference is that those products are part of vendors’ service contracts, while KernelCare subscriptions are stand-alone, and therefore significantly more cost-effective and flexible.) Also, KernelCare supports a vast selection of kernel versions and kinds, many more than all the other vendors combined.
Here’s a simple overview of how KernelCare does its stuff.
Why Sysadmins Love KernelCare
I’ve spoken to many customers in my time as Marketing Manager for KernelCare. Here’s what they tell me are their favorite benefits.
- Reduced reporting. Less need for management and compliance reports saying what was done, why, and who by.
- Fewer emails. Rebooting less means less communication and negotiation with dozens of stakeholders from diverse departments.
- More sleep. Rebooting Saturday night/Sunday morning makes sense for anyone impacted by downtime. But it’s hell for the people who have to do it. It doesn’t matter if they’re onsite or remote, someone still has to be awake to run the commands, to check the automation runs smoothly, and to do sanity checks on the system after it comes up. Management paranoia is rife in such situations, rightly so, considering what’s at stake.
- Life is simpler. Having fewer scheduled reboots streamlines the automation config, irrespective of the method. System playbooks are simpler and easier to understand.
Why Service Providers Love KernelCare
If you think KernelCare is just for techies, you’re wrong. Anyone with basic Linux console skills can install KernelCare with a couple of commands. This is great news for a class of people I call Service Providers. These are the managers and executives and small business owners who can’t be bothered with the technicalities of live patching. They simply want to know their Linux servers stay patched, compliant, and safe. I’ve talked to those people too, and, as you’d expect, they see KernelCare’s benefits from a different perspective.
- There’s less downtime. That means less disruption to customers, and fewer reasons for them to move on or complain.
- There’s less risk that the system won’t be the same after a reboot, and less chance of having to roll everything back just to keep a business running.
- There’s less money spent on administration. Live patching is a self-sufficient approach to Linux kernel patch management.
The End (of Reboots)
AWS re:Invent 2019 was a revelation, not just because it was my first, but because I saw for myself the scale of effort and expense that Linux server admins suffer dealing with pesky reboots. I felt I should be shouting from our KernelCare stand at the top of my voice:
Don’t be a slave to reboots! We have the answer! Join us!
But, of course, I didn’t. I just handed out some brochures, stickers, and t-shirts, and felt sorry for the many hundreds of really nice, hard-working people who had to reboot their servers to apply security updates.