Bug 1940939 - Wrong OpenShift node IP: kubelet sets the VIP as the node IP
Summary: Wrong OpenShift node IP: kubelet sets the VIP as the node IP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks: 1944394
 
Reported: 2021-03-19 15:22 UTC by Arnab Ghosh
Modified: 2021-07-30 08:29 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: kubelet picks the wrong node IP, in unclear circumstances.
Consequence: The node becomes NotReady immediately after boot, until it is rebooted.
Fix: kubelet should reliably use the correct node IP.
Result: No NotReady nodes.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:54:33 UTC
Target Upstream Version:
Embargoed:


Links
Github openshift/machine-config-operator pull 2470 (open) - Bug 1940939: Do "systemctl daemon-reload" after running "runtimecfg node-ip" (2021-03-19 19:43:49 UTC)
Red Hat Product Errata RHSA-2021:2438 (2021-07-27 22:55:01 UTC)

Comment 2 Dan Winship 2021-03-19 17:21:23 UTC
Try running on the affected node:

  systemctl daemon-reload
  systemctl restart kubelet

Then after kubelet has restarted, do "ps wwaux | grep kubelet | grep --color node-ip" and make sure you see "--node-ip=192.168.48.21 --address=192.168.48.21 ..." rather than "--node-ip= --address= ..."

I think that should fix it at least until the next upgrade... I'm not sure if they'll need to run the commands again after a z-stream upgrade or not.
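
Roughly, as a single copy-paste (a sketch only; 192.168.48.21 is the IP from this cluster's sosreport, so substitute whatever 20-nodenet.conf says on the node being fixed):

    # Expected node IP, taken from /etc/systemd/system/kubelet.service.d/20-nodenet.conf
    EXPECTED_IP=192.168.48.21

    # Re-read unit files so systemd picks up the KUBELET_NODE_IP drop-in, then restart kubelet
    systemctl daemon-reload
    systemctl restart kubelet

    # Verify kubelet came back with the right IP; an empty "--node-ip=" means the drop-in still was not applied
    ps wwaux | grep kubelet | grep --color "node-ip=${EXPECTED_IP}"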

----

From the sosreport, we can see that it found the right IP:

    $ cat ./etc/systemd/system/kubelet.service.d/20-nodenet.conf
    [Service]
    Environment="KUBELET_NODE_IP=192.168.48.21" "KUBELET_NODE_IPS=192.168.48.21"

but those values aren't being passed to kubelet:

    $ grep ' kubelet ' ./sos_commands/process/ps_auxwww
    root        2113 11.7  1.0 3212924 168416 ?      Ssl  Mar17 184:29 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=rhcos --node-ip= --address= --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider= --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a5340e9f5a326ca05e038b2997c10fdb5b9da2ab4c11d59a37b2415665f311b --v=2

(specifically "--node-ip= --address="). I think the problem is that the node-ip-handling code was simplified a bit in 4.7 and lost a necessary "systemctl daemon-reload" to make sure that systemd reads the file with KUBELET_NODE_IP after it gets written out. Then since kubelet is started with "--node-ip=", it runs its own node-ip-detecting code, which is not reliable on hosts with multiple IPs, and sometimes picks the wrong IP.
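
In other words, the fix (openshift/machine-config-operator PR 2470) restores roughly this ordering on boot. The sketch below is illustrative, not the actual MCO unit or script, but it shows why the daemon-reload matters:

    # Illustrative ordering only, not the real MCO script.

    # 1. "runtimecfg node-ip" picks the node IP and writes the kubelet drop-in
    #    /etc/systemd/system/kubelet.service.d/20-nodenet.conf containing
    #    Environment="KUBELET_NODE_IP=..." "KUBELET_NODE_IPS=..."

    # 2. Without this step, systemd keeps whatever it read earlier, and the kubelet
    #    unit (presumably expanding KUBELET_NODE_IP) ends up with "--node-ip= --address="
    systemctl daemon-reload

    # 3. Only then start kubelet, so it gets a non-empty --node-ip and skips its own
    #    node-IP autodetection, which is unreliable on hosts with multiple IPs
    systemctl start kubelet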

Comment 7 W. Trevor King 2021-04-01 15:36:50 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug.  When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this; we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Comment 8 Dan Winship 2021-04-05 14:01:19 UTC
(In reply to W. Trevor King from comment #7)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.

(This ImpactStatementRequest was inspired by someone suggesting on Slack that this bz might be related to recent upgrade failures, but I don't think it is, particularly given that we can't even reproduce it.)
 
> Who is impacted?  If we have to block upgrade edges based on this issue,
> which edges would need blocking?

Only one cluster is _known_ to be impacted. We have not managed to reproduce the bug, though it's reasonably clear what's going on. (We are relying on undefined behavior on the part of systemd, and it flipped from undefined-but-doing-what-we-wanted to undefined-and-not-doing-what-we-wanted in this one cluster.) My best guess is that they have some other MachineConfig that is somehow triggering the change in the undefined behavior. In theory other customers might do the same thing, but we don't know what this thing is (the customer has not responded in the support case since the initial bug filing).

The maximum possible impact is "everyone on bare-metal/on-prem platforms with nodes that have multiple potentially-valid node IPs". (But of course, lots of people meeting that definition have already upgraded to 4.7 and not encountered this bug, so we know the impact isn't actually that big.)

> What is the impact?  Is it serious enough to warrant blocking edges?

Every time a node reboots, there's a chance it will hit the bug and then be NotReady.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

Manually reboot any failing nodes until they come up in a working state. (But next time they reboot, they may fail again.)
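
(As a rough sketch of what that looks like in practice, the per-node steps are the same as in comment 2; commands below are illustrative, not a tested procedure:)

    # Find nodes that are currently NotReady
    oc get nodes | grep NotReady

    # On each affected node (via SSH, or "oc debug node/<name>" followed by "chroot /host"),
    # either reboot it or apply the comment 2 workaround:
    systemctl daemon-reload
    systemctl restart kubelet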

> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?

The underlying bug (failing to call "systemctl daemon-reload") has apparently always existed. It is not clear why it affects this one customer but not anyone else, or why it affects them in 4.7 but not 4.6. (The customer never responded to our earlier questions.) In theory, it could be a regression (e.g., a change in systemd between 4.6's RHCOS and 4.7's resulted in a change to the undefined behavior). Given that the problem does not appear to be widespread and has not been reproducible locally, it seems more likely to me that it's being triggered by something specific to the customer cluster.

Comment 9 W. Trevor King 2021-04-05 17:19:37 UTC
Makes sense to me.  Dropping UpgradeBlocker for now, and we can revisit if our understanding changes later.

Comment 12 errata-xmlrpc 2021-07-27 22:54:33 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

