Description of problem:

If the vSphere cloud provider is active and a secondary IP is added to the main interface, the node fails to post its status to the master. The node eventually becomes NotReady and messages like these are shown in the node logs:

Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: E0123 16:27:44.720954 27538 kubelet_node_status.go:391] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"addresses\":[{\"address\":\"10.74.138.55\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.55\",\"type\":\"InternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"InternalIP\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"Ready\"}]}}" for node "node-0.local.lab": The order in patch list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP address:10.74.138.55] map[address:10.74.138.217 type:ExternalIP] map[address:10.74.138.55 type:InternalIP] map[address:10.74.138.217 type:InternalIP]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: doesn't match $setElementOrder list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: E0123 16:27:44.733055 27538 kubelet_node_status.go:391] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"addresses\":[{\"address\":\"10.74.138.55\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"ExternalIP\"},{\"address\":\"10.74.138.55\",\"type\":\"InternalIP\"},{\"address\":\"10.74.138.217\",\"type\":\"InternalIP\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2019-01-23T15:27:44Z\",\"type\":\"Ready\"}]}}" for node "node-0.local.lab": The order in patch list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP address:10.74.138.55] map[address:10.74.138.217 type:ExternalIP] map[address:10.74.138.55 type:InternalIP] map[address:10.74.138.217 type:InternalIP]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: doesn't match $setElementOrder list:
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: [map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]
Jan 23 16:27:44 node-0.local.lab atomic-openshift-node[27538]: E0123 16:27:44.733072 27538 kubelet_node_status.go:379] Unable to update node status: update node status exceeds retry count

The secondary IP can be added either manually (with the `ip address add` command) or by setting up the node to host a static egress IP for a project; both reproduce the issue.

This seems to be the very same problem as the one described at https://bugzilla.redhat.com/show_bug.cgi?id=1552644, but in this case it is reproducible on 3.11 and ipfailover is not needed to reproduce it; adding a secondary IP to the interface is enough.

Version-Release number of selected component (if applicable):
3.11

How reproducible:
Always, unless nodeIP is configured in node-config.yaml.

Steps to Reproduce:
1. Pick a healthy node that does not have nodeIP set in node-config.yaml.
2. Run `ip address add $CIDR dev $IFNAME` (where $CIDR is an IP with a prefix in the same subnet as the main one and $IFNAME is the interface name).
3. Wait until the node becomes NotReady and the log messages above appear.

Actual results:
Nodes without nodeIP set in node-config.yaml and with more than one IP on the main interface cannot post their status. This includes nodes hosting static egress IPs for projects.

Expected results:
Nodes without nodeIP set in node-config.yaml and with more than one IP on the main interface should be able to post their status.

Additional info:
Setting nodeIP in node-config.yaml works around this issue. However, this should not be required when using the vSphere cloud provider.
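For context on why the patch is rejected: status.addresses is a list merged by the key "type", and a strategic merge patch carries a $setElementOrder/addresses directive whose order the items in the patch list have to follow. Because the vSphere cloud provider here reports each IP as both ExternalIP and InternalIP, the merge key is duplicated and the order check can no longer line the two lists up. Below is a minimal, self-contained Go sketch, only an approximation of the validation done in k8s.io/apimachinery's strategicpatch package (not the actual code), fed with the two lists from the log above:

// Minimal sketch: approximates the list-order validation that strategic merge
// patch performs for lists merged by key; NOT the real apiserver code.
package main

import "fmt"

// addr mirrors one entry of node.status.addresses; "type" is the merge key.
type addr struct {
	Type    string
	Address string
}

// matchesSetElementOrder reports whether the items in patchList appear as a
// subsequence of setOrder when compared by the merge key (the address type).
// This is roughly the check behind the "doesn't match $setElementOrder list"
// error in the kubelet log above.
func matchesSetElementOrder(patchList []addr, setOrder []string) bool {
	p, s := 0, 0
	for p < len(patchList) && s < len(setOrder) {
		if patchList[p].Type == setOrder[s] {
			p++
		}
		s++
	}
	return p == len(patchList)
}

func main() {
	// "addresses" list from the rejected patch: duplicates grouped by type.
	patchList := []addr{
		{Type: "ExternalIP", Address: "10.74.138.55"},
		{Type: "ExternalIP", Address: "10.74.138.217"},
		{Type: "InternalIP", Address: "10.74.138.55"},
		{Type: "InternalIP", Address: "10.74.138.217"},
	}
	// "$setElementOrder/addresses" list from the same patch: duplicates interleaved.
	setOrder := []string{"ExternalIP", "InternalIP", "ExternalIP", "InternalIP", "Hostname"}

	if !matchesSetElementOrder(patchList, setOrder) {
		fmt.Println("order in patch list doesn't match $setElementOrder list -> patch rejected")
	}
}

With these inputs the subsequence check fails, which matches the error above; the kubelet then retries the patch until it hits the retry limit and the node goes NotReady.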
I have a customer experiencing the same problem at multiple sites. I have attached the support case to this BZ. This quote from my customer in the support case might be useful:

> BTW, I checked the issue with my colleagues ... in our R&D that working with OCP longer than me 😊
> They have the same behavior, but they said that it was working on the previous minor version of OCP 3.11
> However, all our OCP clusters now at version 3.11.98 where it is not working.
> Looks like this bug was introduced in the latest update.

Would it be possible to get a fix for this into the next 3.11 z-stream? Thanks
Hi, is this fixed in the latest build of 3.11? We are using 3.11.98 and are facing the same issue reported here.
Moving this to the node component. If the kubelet requires a specific ordering of addresses, please let us know what it is. Ideally, the kubelet should order these as it sees fit; enforcing an ordering in a list seems like a poor fit.
The kubelet uses a priority to figure out the node's host IP:

func GetNodeHostIP(node *v1.Node) (net.IP, error) {
	addresses := node.Status.Addresses
	addressMap := make(map[v1.NodeAddressType][]v1.NodeAddress)
	for i := range addresses {
		addressMap[addresses[i].Type] = append(addressMap[addresses[i].Type], addresses[i])
	}
	// First choice: the first InternalIP reported for the node.
	if addresses, ok := addressMap[v1.NodeInternalIP]; ok {
		return net.ParseIP(addresses[0].Address), nil
	}
	// Fallback: the first ExternalIP.
	if addresses, ok := addressMap[v1.NodeExternalIP]; ok {
		return net.ParseIP(addresses[0].Address), nil
	}
	return nil, fmt.Errorf("host IP unknown; known addresses: %v", addresses)
}

https://github.com/kubernetes/kubernetes/blob/eb3405877799b770c72848c11aef967bda887eac/pkg/util/node/node.go#L96

The preference is for the _first_ InternalIP, with a fallback to the _first_ ExternalIP. The cloud provider should keep this ordering stable so that the preferred IP address does not change, i.e. a new IP should be added to the end of the list.
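To make the ordering point concrete, here is a hypothetical, self-contained sketch (not the vSphere provider's or kubelet's actual code; the type and helper names are made up for illustration) of an order-preserving append: existing entries keep their position and only genuinely new addresses go to the end, so the first InternalIP, which GetNodeHostIP prefers, does not change when a secondary IP such as a static egress IP shows up on the interface.

// Hypothetical sketch of order-preserving address reporting; these names are
// illustrative only and are not Kubernetes APIs.
package main

import "fmt"

type nodeAddress struct {
	Type    string // "InternalIP", "ExternalIP" or "Hostname"
	Address string
}

// appendIfMissing appends a only if an identical entry is not already
// present, leaving the order of existing entries untouched.
func appendIfMissing(addrs []nodeAddress, a nodeAddress) []nodeAddress {
	for _, existing := range addrs {
		if existing == a {
			return addrs
		}
	}
	return append(addrs, a)
}

// firstOfType mirrors the preference in GetNodeHostIP above: the first
// address of the requested type wins.
func firstOfType(addrs []nodeAddress, t string) (string, bool) {
	for _, a := range addrs {
		if a.Type == t {
			return a.Address, true
		}
	}
	return "", false
}

func main() {
	addrs := []nodeAddress{
		{Type: "InternalIP", Address: "10.74.138.55"},
		{Type: "ExternalIP", Address: "10.74.138.55"},
		{Type: "Hostname", Address: "node-0.local.lab"},
	}

	// A secondary IP (e.g. a static egress IP) appears on the interface;
	// appending it keeps the preferred host IP stable.
	addrs = appendIfMissing(addrs, nodeAddress{Type: "InternalIP", Address: "10.74.138.217"})
	addrs = appendIfMissing(addrs, nodeAddress{Type: "ExternalIP", Address: "10.74.138.217"})

	hostIP, _ := firstOfType(addrs, "InternalIP")
	fmt.Println("preferred host IP is still", hostIP) // 10.74.138.55
}

Reporting each address only once per type in a stable order would also avoid the duplicated merge keys that break the status patch described above.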
*** Bug 1650392 has been marked as a duplicate of this bug. ***