Description of problem:

If the vSphere cloud provider is active and a secondary IP is added to the main interface, the node fails to post its status to the master. The node eventually becomes NotReady and messages like these are shown in the node logs:

Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: E0123 16:27:44.720954 27538 kubelet_node_status.go:391] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"addresses\":[{\"address\":\"10.74.138.55\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.55\",\"type\":\"InternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"InternalIP\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"Ready\"}]}}" for node "node-0.local.lab": The order in patch list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP address:10.74.138.55] map[address:10.74.138.217 type:ExternalIP] map[address:10.74.138.55 type:InternalIP] map[address:10.74.138.217 type:InternalIP]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: doesn't match $setElementOrder list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: E0123 16:27:44.733055 27538 kubelet_node_status.go:391] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"addresses\":[{\"address\":\"10.74.138.55\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.55\",\"type\":\"InternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"InternalIP\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"Ready\"}]}}" for node "node-0.local.lab": The order in patch list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP address:10.74.138.55] map[address:10.74.138.217 type:ExternalIP] map[address:10.74.138.55 type:InternalIP] map[address:10.74.138.217 type:InternalIP]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: doesn't match $setElementOrder list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: E0123 16:27:44.733072 27538 kubelet_node_status.go:379] Unable to update node status: update node status exceeds retry count

The secondary IP can be added either manually (with the `ip address add` command) or by setting up the node to host a static egress IP for a project; both reproduce the issue.

This seems to be the very same problem as the one described at https://bugzilla.redhat.com/show_bug.cgi?id=1552644, but in this case it is reproducible on 3.11 and ipfailover is not needed to reproduce it; adding a secondary IP to the interface is enough.

Version-Release number of selected component (if applicable):
3.11

How reproducible:
Always, unless nodeIP is configured in node-config.yaml.

Steps to Reproduce:
1. Pick a healthy node that does not have nodeIP set in node-config.yaml.
2. Run `ip address add $CIDR dev $IFNAME` (where $CIDR is an IP with a prefix in the same subnet as the main one and $IFNAME is the interface name).
3. Wait until the node becomes NotReady and the log messages above appear.

Actual results:
Nodes without nodeIP set in node-config.yaml and with more than one IP on the main interface cannot post their status. This includes nodes hosting static egress IPs for projects.

Expected results:
Nodes without nodeIP set in node-config.yaml and with more than one IP on the main interface should be able to post their status.

Additional info:
Setting nodeIP in node-config.yaml works around this issue. However, this should not be required when using the vSphere cloud provider.
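For context on why the patch is rejected: status.addresses is a list merged by the key "type", and a strategic merge patch carries a $setElementOrder/addresses directive whose order the items in the patch list have to follow. Because the vSphere cloud provider here reports each IP as both ExternalIP and InternalIP, the merge key is duplicated and the order check can no longer line the two lists up. Below is a minimal, self-contained Go sketch, only an approximation of the validation done in k8s.io/apimachinery's strategicpatch package (not the actual code), fed with the two lists from the log above:

// Minimal sketch: approximates the list-order validation that strategic merge
// patch performs for lists merged by key; NOT the real apiserver code.
package main

import "fmt"

// addr mirrors one entry of node.status.addresses; "type" is the merge key.
type addr struct {
	Type    string
	Address string
}

// matchesSetElementOrder reports whether the items in patchList appear as a
// subsequence of setOrder when compared by the merge key (the address type).
// This is roughly the check behind the "doesn't match $setElementOrder list"
// error in the kubelet log above.
func matchesSetElementOrder(patchList []addr, setOrder []string) bool {
	p, s := 0, 0
	for p < len(patchList) && s < len(setOrder) {
		if patchList[p].Type == setOrder[s] {
			p++
		}
		s++
	}
	return p == len(patchList)
}

func main() {
	// "addresses" list from the rejected patch: duplicates grouped by type.
	patchList := []addr{
		{Type: "ExternalIP", Address: "10.74.138.55"},
		{Type: "ExternalIP", Address: "10.74.138.217"},
		{Type: "InternalIP", Address: "10.74.138.55"},
		{Type: "InternalIP", Address: "10.74.138.217"},
	}
	// "$setElementOrder/addresses" list from the same patch: duplicates interleaved.
	setOrder := []string{"ExternalIP", "InternalIP", "ExternalIP", "InternalIP", "Hostname"}

	if !matchesSetElementOrder(patchList, setOrder) {
		fmt.Println("order in patch list doesn't match $setElementOrder list -> patch rejected")
	}
}

With these inputs the subsequence check fails, which matches the error above; the kubelet then retries the patch until it hits the retry limit and the node goes NotReady.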
I have a customer experiencing the same problem at multiple sites. I have attached the support case to this BZ. This quote from my customer in the support case might be useful:

> BTW, I checked the issue with my colleagues ... in our R&D that working with OCP longer than me 😊
> They have the same behavior, but they said that it was working on the previous minor version of OCP 3.11
> However, all our OCP clusters now at version 3.11.98 where it is not working.
> Looks like this bug was introduced in the latest update.

Would it be possible to get a fix for this into the next 3.11 z-stream? Thanks
Hi, is this fixed in the latest build of 3.11? We are using 3.11.98 and are facing the same issue reported here.
Moving this to the node component. If the kubelet requires a specific ordering of addresses, please let us know what it is. Ideally, the kubelet should order these as it sees fit; enforcing an ordering in a list seems like a poor fit.
The kubelet uses a priority to figure out the node's host IP:

func GetNodeHostIP(node *v1.Node) (net.IP, error) {
	addresses := node.Status.Addresses
	addressMap := make(map[v1.NodeAddressType][]v1.NodeAddress)
	for i := range addresses {
		addressMap[addresses[i].Type] = append(addressMap[addresses[i].Type], addresses[i])
	}
	// First choice: the first InternalIP reported for the node.
	if addresses, ok := addressMap[v1.NodeInternalIP]; ok {
		return net.ParseIP(addresses[0].Address), nil
	}
	// Fallback: the first ExternalIP.
	if addresses, ok := addressMap[v1.NodeExternalIP]; ok {
		return net.ParseIP(addresses[0].Address), nil
	}
	return nil, fmt.Errorf("host IP unknown; known addresses: %v", addresses)
}

https://github.com/kubernetes/kubernetes/blob/eb3405877799b770c72848c11aef967bda887eac/pkg/util/node/node.go#L96

The preference is for the _first_ InternalIP, with a fallback to the _first_ ExternalIP. The cloud provider should keep this ordering stable so that the preferred IP address does not change, i.e. a new IP should be added to the end of the list.
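To make the ordering point concrete, here is a hypothetical, self-contained sketch (not the vSphere provider's or kubelet's actual code; the type and helper names are made up for illustration) of an order-preserving append: existing entries keep their position and only genuinely new addresses go to the end, so the first InternalIP, which GetNodeHostIP prefers, does not change when a secondary IP such as a static egress IP shows up on the interface.

// Hypothetical sketch of order-preserving address reporting; these names are
// illustrative only and are not Kubernetes APIs.
package main

import "fmt"

type nodeAddress struct {
	Type    string // "InternalIP", "ExternalIP" or "Hostname"
	Address string
}

// appendIfMissing appends a only if an identical entry is not already
// present, leaving the order of existing entries untouched.
func appendIfMissing(addrs []nodeAddress, a nodeAddress) []nodeAddress {
	for _, existing := range addrs {
		if existing == a {
			return addrs
		}
	}
	return append(addrs, a)
}

// firstOfType mirrors the preference in GetNodeHostIP above: the first
// address of the requested type wins.
func firstOfType(addrs []nodeAddress, t string) (string, bool) {
	for _, a := range addrs {
		if a.Type == t {
			return a.Address, true
		}
	}
	return "", false
}

func main() {
	addrs := []nodeAddress{
		{Type: "InternalIP", Address: "10.74.138.55"},
		{Type: "ExternalIP", Address: "10.74.138.55"},
		{Type: "Hostname", Address: "node-0.local.lab"},
	}

	// A secondary IP (e.g. a static egress IP) appears on the interface;
	// appending it keeps the preferred host IP stable.
	addrs = appendIfMissing(addrs, nodeAddress{Type: "InternalIP", Address: "10.74.138.217"})
	addrs = appendIfMissing(addrs, nodeAddress{Type: "ExternalIP", Address: "10.74.138.217"})

	hostIP, _ := firstOfType(addrs, "InternalIP")
	fmt.Println("preferred host IP is still", hostIP) // 10.74.138.55
}

Reporting each address only once per type in a stable order would also avoid the duplicated merge keys that break the status patch described above.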
*** Bug 1650392 has been marked as a duplicate of this bug. ***