Bug 1643348

Summary: [vsphere] The "Internal IP/Host IP" of the infra nodes keeps changing on its own, constantly and seemingly at random, to one of the VIPs on eth0 (confirmed by oc get hostsubnet output).
Product: OpenShift Container Platform
Reporter: Miheer Salunke <misalunk>
Component: Cloud Compute
Assignee: Dan Winship <danw>
Status: CLOSED ERRATA
QA Contact: Meng Bo <bmeng>
Severity: high
Docs Contact:
Priority: high
Version: 3.11.0
CC: adeshpan, aos-bugs, danw, emahoney, jcrumple, jokerman, jrosenta, knakai, misalunk, mmccomas, openshift-bugs-escalate, wsun, zzhao
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A change introduced in Kubernetes 1.11 affected nodes with many IP addresses in vSphere deployments.
Consequence: Under vSphere, a node hosting several Egress IPs or Router HA addresses would sporadically "forget" which of the IPs was its official "node IP" (even if that node IP had been explicitly specified in the node configuration) and start using one of the others, causing networking problems.
Fix: If a "node IP" is specified in the node configuration, it will be used correctly, regardless of how many other IPs the node has.
Result: Networking should work reliably.
Story Points: ---
Clone Of:
: 1666820 (view as bug list)
Environment:
Last Closed: 2019-06-04 10:40:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Comment 11 Dan Winship 2018-11-01 15:35:55 UTC
Can you get "oc get node NODENAME -o yaml; oc get hostsubnet NODENAME -o yaml" for one of the nodes, both before and after a "flap"? (There's lots of "oc get hostsubnet" output here, but no "oc get node" as far as I've seen.)

It seems likely that the kubelet code is getting confused about what the node's real IP is and is incorrectly updating its Node resource, which then causes other things to be updated incorrectly based on that.

Comment 14 Dan Winship 2018-11-05 20:19:54 UTC
OK... I'm provisionally calling this a kubelet bug (which I think corresponds to "Pod" in bugzilla?), though one could argue it was a vSphere CloudProvider bug instead.

attachment 1500320 [details] shows that the Node resource is being updated with an incorrect InternalIP (while still keeping the node's correct IP as its ExternalIP):

    status:
      addresses:
      - address: x.y.z.17  # Right
        type: ExternalIP
      - address: x.y.z.30  # Wrong
        type: InternalIP
      - address: ...
        type: Hostname

What is happening is that the kubelet periodically calls setNodeAddress() to update its node address. Since the node in question has a CloudProvider, setNodeAddress() first calls the cloud NodeAddresses() method, then keeps the first Address in the returned list that matches kl.nodeIP (the configmap-specified node IP), along with the first Address of each other type.

The vSphere provider's NodeAddresses() just pulls all of the IP addresses off the default interface, and then returns each one as both an ExternalIP and an InternalIP. For each IP address, the ExternalIP always appears first, which means it will always be the one that matches kubelet.setNodeAddress()'s "first Address that matches kl.nodeIP" rule. But that means that kubelet will only end up picking the right InternalIP if kl.nodeIP is the first IP in the list returned by NodeAddresses(). Apparently, due to the vagaries of either pkg/net or the kernel APIs, the node's oldest IP gets returned first up until there are more than 5 IPs on the interface, at which point the return value gets reordered for some reason and a different IP is listed first, throwing things into chaos.
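
To make the ordering interaction concrete, here is a minimal, self-contained Go sketch of the selection rule described above (a paraphrase for illustration, not the actual kubelet or vSphere provider source; the address values and hostname are made up), fed a hypothetical NodeAddresses() result in which another IP has been reordered ahead of the configured node IP:

    package main

    import "fmt"

    // NodeAddressType and NodeAddress mirror the shape of the Kubernetes
    // v1.NodeAddress API type, redefined here so the sketch stands alone.
    type NodeAddressType string

    const (
        ExternalIP NodeAddressType = "ExternalIP"
        InternalIP NodeAddressType = "InternalIP"
        Hostname   NodeAddressType = "Hostname"
    )

    type NodeAddress struct {
        Type    NodeAddressType
        Address string
    }

    // pickAddressesOld paraphrases the pre-fix rule described above: keep the
    // first address (of any type) whose value equals the configured node IP,
    // then the first address of every other type, in provider order.
    func pickAddressesOld(cloudAddrs []NodeAddress, nodeIP string) []NodeAddress {
        var picked []NodeAddress
        seen := map[NodeAddressType]bool{}
        for _, a := range cloudAddrs {
            if a.Address == nodeIP {
                picked = append(picked, a)
                seen[a.Type] = true
                break
            }
        }
        for _, a := range cloudAddrs {
            if !seen[a.Type] {
                seen[a.Type] = true
                picked = append(picked, a)
            }
        }
        return picked
    }

    func main() {
        // Hypothetical NodeAddresses() output once the interface has enough
        // IPs that an egress/VIP address (x.y.z.30) sorts ahead of the real
        // node IP (x.y.z.17); each IP appears as both ExternalIP and
        // InternalIP, ExternalIP first.
        cloudAddrs := []NodeAddress{
            {Type: ExternalIP, Address: "x.y.z.30"},
            {Type: InternalIP, Address: "x.y.z.30"},
            {Type: ExternalIP, Address: "x.y.z.17"},
            {Type: InternalIP, Address: "x.y.z.17"},
            {Type: Hostname, Address: "node-1.example.com"},
        }
        for _, a := range pickAddressesOld(cloudAddrs, "x.y.z.17") {
            fmt.Printf("%-10s %s\n", a.Type, a.Address)
        }
        // Prints:
        //   ExternalIP x.y.z.17            <- right: it matched the node IP
        //   InternalIP x.y.z.30            <- wrong: just the first InternalIP
        //   Hostname   node-1.example.com
    }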


I'm not sure if vSphere's behavior here is correct: most other cloud providers do not return the same IP as both InternalIP and ExternalIP. (AFAICT only ovirt does.) However, the docs do not appear to forbid this, and https://kubernetes.io/docs/concepts/architecture/nodes/#addresses outright declares that "The usage of these fields varies depending on your cloud provider".

So given that, I think that kubelet's logic should be changed so that instead of taking "the first address of any type that matches kl.nodeIP, followed by the first address of each other type", it should take "*every* address that matches kl.nodeIP, followed by the first address of each other type". And then in this case it would always return the kl.nodeIP-based ExternalIP and InternalIP, regardless of the order that the CloudProvider returned the addresses in.
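
As a sketch of that proposal (reusing the NodeAddress/NodeAddressType definitions from the sketch above; again an illustration of the idea, not the actual upstream patch):

    // pickAddressesNew keeps *every* address whose value equals the configured
    // node IP (so both its ExternalIP and InternalIP entries survive), then
    // the first address of each remaining type. Fed the cloudAddrs list from
    // the previous sketch, it reports x.y.z.17 for both ExternalIP and
    // InternalIP, regardless of how the provider ordered the list.
    func pickAddressesNew(cloudAddrs []NodeAddress, nodeIP string) []NodeAddress {
        var picked []NodeAddress
        seen := map[NodeAddressType]bool{}
        for _, a := range cloudAddrs {
            if a.Address == nodeIP {
                picked = append(picked, a)
                seen[a.Type] = true
            }
        }
        for _, a := range cloudAddrs {
            if !seen[a.Type] {
                seen[a.Type] = true
                picked = append(picked, a)
            }
        }
        return picked
    }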

Alternatively, vSphere could be changed to not claim the IPs as both internal and external, but that would require doc updates to explain what it *should* be doing...

Comment 20 Miheer Salunke 2018-11-07 04:12:35 UTC
(In reply to Dan Winship from comment #14)
> OK... I'm provisionally calling this a kubelet bug (which I think
> corresponds to "Pod" in bugzilla?), though one could argue it was a vSphere
> CloudProvider bug instead.
> 
> attachment 1500320 [details] shows that the Node resource is being updated
> with an incorrect InternalIP (while still keeping the node's correct IP as
> its ExternalIP):
> 
>     status:
>       addresses:
>       - address: x.y.z.17  # Right
>         type: ExternalIP
>       - address: x.y.z.30  # Wrong
>         type: InternalIP
>       - address: ...
>         type: Hostname
> 
> What is happening is that the kubelet periodically calls setNodeAddress() to
> update its node address. Since the node in question has a CloudProvider,
> setNodeAddress() first calls the cloud NodeAddresses() method, then keeps
> the first Address in the returned list that matches kl.nodeIP (the
> configmap-specified node IP), along with the first Address of each other
> type.
> 
> The vSphere provider's NodeAddresses() just pulls all of the IP addresses
> off the default interface, and then returns each one as both an ExternalIP
> and an InternalIP. For each IP address, the ExternalIP always appears first,
> which means it will always be the one that matches
> kubelet.setNodeAddress()'s "first Address that matches kl.nodeIP" rule. But
> that means that kubelet will only end picking the right InternalIP if
> kl.nodeIP is the first IP in the list returned by NodeAddresses().
> Apparently, due to the vagaries of either pkg/net or the kernel APIs, the
> node's oldest IP gets returned first up until there are more than 5 IPs on
> the interface, at which point the return value gets reordered for some
> reason and a different IP is listed first, throwing things into chaos.
> 
> 
> I'm not sure if vSphere's behavior here is correct: most other cloud
> providers do not return the same IP as both InternalIP and ExternalIP.
> (AFAICT only ovirt does.) However, the docs do not appear to forbid this,
> and https://kubernetes.io/docs/concepts/architecture/nodes/#addresses
> outright declares that "The usage of these fields varies depending on your
> cloud provider".
> 
> So given that, I think that kubelet's logic should be changed so that
> instead of taking "the first address of any type that matches kl.nodeIP,
> followed by the first address of each other type", it should take "*every*
> address that matches kl.nodeIP, followed by the first address of each other
> type". And then in this case it would always return the kl.nodeIP-based
> ExternalIP and InternalIP, regardless of the order that the CloudProvider
> returned the addresses in.
> 

I think this will need a fix in the kubelet code, which might need some time.

> Alternatively, vSphere could be changed to not claim the IPs as both
> internal and external, but that would require doc updates to explain what it
> *should* be doing...

How can we achieve this? Any pointers on this will be highly appreciated.

Comment 21 Dan Winship 2018-11-07 13:50:05 UTC
(In reply to Miheer Salunke from comment #20)
> (In reply to Dan Winship from comment #14)
> > Alternatively, vSphere could be changed to not claim the IPs as both
> > internal and external, but that would require doc updates to explain what it
> > *should* be doing...
> 
> How can we achieve this? Any pointers on this will be highly appreciated.

No, that would also be a code change. As I commented in the support case, there is no workaround for the customer, other than limiting the number of failover/egress IPs on each node.

Comment 22 Dan Winship 2018-11-08 15:51:24 UTC
Filed https://github.com/kubernetes/kubernetes/pull/70805

Comment 35 errata-xmlrpc 2019-06-04 10:40:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758