Bug 1940939

Summary: Wrong OpenShift node IP: kubelet sets the VIP as the node IP
Product: OpenShift Container Platform
Reporter: Arnab Ghosh <arghosh>
Component: Networking
Assignee: Dan Winship <danw>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: aconstan, bbennett, danw, openshift-bugs-escalate, rcarrier, wking
Version: 4.7
Target Milestone: ---
Target Release: 4.8.0
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Doc Text:
Cause: kubelet picks the wrong node IP under unclear circumstances.
Consequence: The node becomes NotReady immediately after boot, until it is rebooted.
Fix: kubelet should reliably use the correct node IP.
Result: No NotReady nodes.
Story Points: ---
Last Closed: 2021-07-27 22:54:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 1944394

Comment 2 Dan Winship 2021-03-19 17:21:23 UTC
Try running on the affected node:

  systemctl daemon-reload
  systemctl restart kubelet

Then, after kubelet has restarted, run "ps wwaux | grep kubelet | grep --color node-ip" and make sure you see "--node-ip=192.168.48.21 --address=192.168.48.21 ..." rather than "--node-ip= --address= ...".

I think that should fix it at least until the next upgrade... I'm not sure if they'll need to run the commands again after a z-stream upgrade or not.
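
Putting the workaround and the check together as a minimal sketch (the IP shown is the expected node IP for this cluster, and the grep -o form is just one way to isolate the flag; adjust for the affected node):

    # Re-read unit files so the kubelet node-IP drop-in is picked up,
    # then restart kubelet with the refreshed environment.
    systemctl daemon-reload
    systemctl restart kubelet

    # Confirm kubelet was launched with an explicit node IP rather than empty flags.
    ps wwaux | grep '[k]ubelet' | grep -o -- '--node-ip=[^ ]*'
    # Expected: --node-ip=192.168.48.21
    # If this prints just "--node-ip=", the workaround did not take effect.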

----

From the sosreport, we can see that it found the right IP:

    $ cat ./etc/systemd/system/kubelet.service.d/20-nodenet.conf
    [Service]
    Environment="KUBELET_NODE_IP=192.168.48.21" "KUBELET_NODE_IPS=192.168.48.21"

but those values aren't being passed to kubelet:

    $ grep ' kubelet ' ./sos_commands/process/ps_auxwww
    root        2113 11.7  1.0 3212924 168416 ?      Ssl  Mar17 184:29 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=rhcos --node-ip= --address= --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider= --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5a5340e9f5a326ca05e038b2997c10fdb5b9da2ab4c11d59a37b2415665f311b --v=2

(specifically "--node-ip= --address="). I think the problem is that the node-ip-handling code was simplified a bit in 4.7 and lost a necessary "systemctl daemon-reload" to make sure that systemd reads the file with KUBELET_NODE_IP after it gets written out. Then since kubelet is started with "--node-ip=", it runs its own node-ip-detecting code, which is not reliable on hosts with multiple IPs, and sometimes picks the wrong IP.
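
For clarity, the sequence that has to hold on boot, sketched as shell (illustrative only; this is not the actual node-ip configuration script in the release image, and the IP is a placeholder):

    # Illustrative only: whatever detects the node IP writes the kubelet drop-in...
    NODE_IP=192.168.48.21   # placeholder; detected per node in practice
    mkdir -p /etc/systemd/system/kubelet.service.d
    printf '[Service]\nEnvironment="KUBELET_NODE_IP=%s" "KUBELET_NODE_IPS=%s"\n' \
        "$NODE_IP" "$NODE_IP" > /etc/systemd/system/kubelet.service.d/20-nodenet.conf

    # ...and systemd has to re-read unit files before kubelet is (re)started.
    # Without the daemon-reload, kubelet can be launched with a stale or empty
    # KUBELET_NODE_IP ("--node-ip= --address=") and falls back to its own detection.
    systemctl daemon-reload
    systemctl restart kubelet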

Comment 7 W. Trevor King 2021-04-01 15:36:50 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug.  When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this; we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Comment 8 Dan Winship 2021-04-05 14:01:19 UTC
(In reply to W. Trevor King from comment #7)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.

(This ImpactStatementRequest was inspired by someone suggesting on Slack that this bz might be related to recent upgrade failures, but I don't think it is, particularly given that we can't even reproduce it.)
 
> Who is impacted?  If we have to block upgrade edges based on this issue,
> which edges would need blocking?

Only one cluster is _known_ to be impacted. We have not managed to reproduce the bug, though it's reasonably clear what's going on. (We are relying on undefined behavior on the part of systemd, and it flipped from undefined-but-doing-what-we-wanted to undefined-and-not-doing-what-we-wanted in this one cluster.) My best guess is that they have some other MachineConfig that is somehow triggering the change in the undefined behavior. In theory other customers might do the same thing, but we don't know what this thing is (the customer has not responded in the support case since the initial bug filing).

The maximum possible impact is "everyone on bare-metal/on-prem platforms with nodes that have multiple potentially-valid node IPs". (But of course, lots of people meeting that definition have already upgraded to 4.7 and not encountered this bug, so we know the impact isn't actually that big.)

> What is the impact?  Is it serious enough to warrant blocking edges?

Every time a node reboots, there's a chance it will hit the bug and then be NotReady.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

Manually reboot any failing nodes until they come up in a working state. (But next time they reboot, they may fail again.)
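
As a rough illustration of that remediation (the node name below is a placeholder; rebooting over SSH or the console works just as well):

    # Ready/NotReady status and the INTERNAL-IP each node reported.
    oc get nodes -o wide

    # Reboot an affected node from the cluster side.
    oc debug node/worker-0 -- chroot /host systemctl reboot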

> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?

The underlying bug (failing to call "systemctl daemon-reload") has apparently always existed. It is not clear why it affects this one customer but no one else, or why it affects them in 4.7 but not 4.6. (The customer never responded to our earlier questions.) In theory, it could be a regression (e.g., a change in systemd between 4.6's RHCOS and 4.7's resulted in a change to the undefined behavior). Given that the problem does not appear to be widespread and has not been reproducible locally, it seems more likely to me that it's being triggered by something specific to the customer cluster.

Comment 9 W. Trevor King 2021-04-05 17:19:37 UTC
Makes sense to me.  Dropping UpgradeBlocker for now, and we can revisit if our understanding changes later.

Comment 12 errata-xmlrpc 2021-07-27 22:54:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438