Bug 1552644 - Node can't post Ready status using ipfailover and cloudprovider vSphere
Summary: Node can't post Ready status using ipfailover and cloudprovider vSphere
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.10.0
Assignee: ravig
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-07 13:26 UTC by Borja Aranda
Modified: 2019-01-23 15:48 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-23 04:59:41 UTC
Target Upstream Version:
Embargoed:



Description Borja Aranda 2018-03-07 13:26:43 UTC
Description of problem:
A node using the vSphere cloud provider fails to post Ready status when ipfailover is configured.

atomic-openshift-node appears to duplicate the node's IP addresses and fails with:
- "doesn't match $setElementOrder list: [map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]"



Full logs here (node IP: 159.103.104.245, VIP: 159.103.104.40):

vsphere.go:503] Find local IP address 159.103.104.245 and set type to
vsphere.go:503] Find local IP address 159.103.104.40 and set type to
vsphere.go:503] Find local IP address 172.17.0.1 and set type to
vsphere.go:503] Find local IP address 10.214.0.1 and set type to
round_trippers.go:405] PATCH https://console-ocptestinfra.julisbaer.com:443/api/v1/nodes/srp10556lx/status 500 Internal Server Error in 1 milliseconds 
                                                                            
kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"Ready\"}],\"addresses\":[{\"address\":\"159.103.104.40\",\"type\":\"ExternalIP\"},{\"address\":\"159.103.104.245\",\"type\":\"ExternalIP\"},{\"address\":\"159.103.104.40\",\"type\":\"InternalIP\"},{\"address\":\"159.103.104.245\",\"type\":\"InternalIP\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2018-03-06T21:43:42Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2018-03-06T21:43:42Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2018-03-06T21:43:42Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2018-03-06T21:43:42Z\",\"type\":\"Ready\"}],\"volumesInUse\":[\"kubernetes.io/vsphere-volume/[t-zrh07-dc1-simple/t-zrh07-dc1-03-openshift] kubevols/kubernetes-dynamic-pvc-65f51bfb-215e-11e8-873c-005056a81271.vmdk\",\"kubernetes.io/vsphere-volume/[t-zrh07-dc1-simple/t-zrh07-dc1-03-openshift] kubevols/kubernetes-dynamic-pvc-4e384322-215e-11e8-873c-005056a81271.vmdk\"]}}" for node "srp10556lx": The order in patch list:
    
atomic-openshift-node[64170]: [map[address:159.103.104.40 type:ExternalIP] map[address:159.103.104.245 type:ExternalIP] map[address:159.103.104.40 type:InternalIP] map[address:159.103.104.245 type:InternalIP]]
                                                                            
atomic-openshift-node[64170]: doesn't match $setElementOrder list:                                                         

atomic-openshift-node[64170]: [map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]


Version-Release number of selected component (if applicable):
3.7

How reproducible:
This happens every time the vSphere cloud provider is active on the node; when it is not, ipfailover works as expected.

Comment 3 Seth Jennings 2018-03-07 19:01:57 UTC
Potentially related
https://github.com/kubernetes/contrib/issues/2761

Related bz:
https://bugzilla.redhat.com/show_bug.cgi?id=1527315

There are a few vSphere related fixes coming out in the upcoming 3.7 errata (scheduled to release tomorrow).  I'd be interested to see if those fixes address this issue as well.

Comment 4 Seth Jennings 2018-03-07 19:10:03 UTC
The yet-to-be-released version with the potential fixes is v3.7.36.

Comment 20 ravig 2018-04-23 04:59:41 UTC
Thanks Borja, I am closing this bug now.

Comment 21 cg 2018-08-27 06:10:11 UTC
Hi

Is this still likely to be the issue in 3.10.14? I see the exact same errors and logs as above in this environment, using ipfailover with the vSphere provider. Everything works great until IP failover is added.

Sorry to resurrect an old report, I can open another if needed.

Thanks for your time.

Comment 22 cg 2018-09-12 21:45:49 UTC
FYI for anyone who comes across this: the cause is nodeIP not being defined on the node. The openshift_set_node_ip mechanism that handled this was removed in 3.10 but worked fine in 3.9.

RFE for this: https://bugzilla.redhat.com/show_bug.cgi?id=1624679

As a workaround, setting nodeIP: {{ ip_address }} manually in /etc/origin/node/node-config.yaml and quickly restarting atomic-openshift-node.service before node-config.yaml is overwritten again sorts this out (a sketch of the fragment is below).
However, every time atomic-openshift-node.service is restarted (manually or on a reboot), the issue reappears.
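
A minimal sketch of the node-config.yaml fragment referred to above, assuming the 3.x NodeConfig schema where nodeIP is a top-level key (the address is a placeholder):

# /etc/origin/node/node-config.yaml (fragment, placeholder value)
# Pin the address the kubelet registers so the ipfailover VIP on the
# interface is not picked up as a node address.
nodeIP: 10.x.x.x   # this node's real primary IP, not the VIP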

Comment 23 luxq 2018-11-16 04:01:11 UTC
OpenShift 3.11 with the vSphere cloud provider has the same problem; adding IP failover produces this error. Is there a way to avoid it?


E1116 02:56:54.095820       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address:10.6.0.197 type:ExternalIP] map[address:10.6.0.192 type:ExternalIP] map[address:10.6.0.197 type:InternalIP] map[address:10.6.0.192 type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}
E1116 02:56:54.105421       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address:10.6.0.197 type:ExternalIP] map[address:10.6.0.192 type:ExternalIP] map[type:InternalIP address:10.6.0.197] map[address:10.6.0.192 type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}
E1116 02:56:54.133776       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address:10.6.0.197 type:ExternalIP] map[address:10.6.0.192 type:ExternalIP] map[address:10.6.0.197 type:InternalIP] map[type:InternalIP address:10.6.0.192]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}
E1116 02:56:54.159910       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address:10.6.0.197 type:ExternalIP] map[address:10.6.0.192 type:ExternalIP] map[type:InternalIP address:10.6.0.197] map[address:10.6.0.192 type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}
E1116 02:56:54.183218       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address:10.6.0.197 type:ExternalIP] map[address:10.6.0.192 type:ExternalIP] map[address:10.6.0.197 type:InternalIP] map[address:10.6.0.192 type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}
E1116 02:57:04.192781       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[type:ExternalIP address:10.6.0.197] map[address:10.6.0.192 type:ExternalIP] map[address:10.6.0.197 type:InternalIP] map[address:10.6.0.192 type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}
E1116 02:57:04.203555       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address:10.6.0.197 type:ExternalIP] map[type:ExternalIP address:10.6.0.192] map[address:10.6.0.197 type:InternalIP] map[address:10.6.0.192 type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}
E1116 02:57:04.232610       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address:10.6.0.197 type:ExternalIP] map[address:10.6.0.192 type:ExternalIP] map[address:10.6.0.197 type:InternalIP] map[address:10.6.0.192 type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}

Comment 24 cg 2018-11-17 21:40:56 UTC
@luxq  We had to create a config map for every single node in the cluster, with the nodeIP configured as an "edit" entry in the map (we're using 3.11 too, but this has been the case since 3.10).

https://access.redhat.com/solutions/3625721

Expanding slightly, we kept the first three default maps for master, compute and infra as-is, but added our own:


openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}, {'name': 'cm-osm01', 'labels': ['node-role.kubernetes.io/master=true'], 'edits': [{'key': 'nodeIP','value': '10.x.x.x'}]}, {'name': 'cm-osm02', 'labels': ['node-role.kubernetes.io/master=true'], 'edits': [{'key': 'nodeIP','value': '10.x.x.x'}]} ..... etc

where "cs-osm01" is our first master, cs-osm02 was second .. etc
Edit the labels according to that server's role.
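
For clarity, each 'edits' key/value above is applied to that group's node config, so the rendered node-config.yaml the node pulls should end up equivalent to the manual workaround in comment 22. A sketch, assuming the edit targets the top-level nodeIP key and using a placeholder address:

# node-config.yaml as rendered for the 'cm-osm01' group (sketch)
nodeIP: 10.x.x.x   # placeholder; the node's real primary IP, not the VIP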

The next problem you'll hit is probably yet another issue we have lodged for the vSphere cloud provider in 3.11, https://bugzilla.redhat.com/show_bug.cgi?id=1643348, where more than 5 IPs on the primary interface (ipfailover, egressIP, etc.) will make the vSphere cloud provider go into a spin.

Comment 25 luxq 2018-11-21 06:47:56 UTC
(In reply to cg from comment #24)
> @luxq  We had to create a config map for every single node in the cluster,
> with the nodeIP configured as an "edit" part in the map (we're using 3.11
> too, but this is the case since 3.10)
> 
> https://access.redhat.com/solutions/3625721
> 
> Expanding slightly, we kept the first three default maps for master, compute
> and infra as is, but added our own in;
> 
> 
> openshift_node_groups=[{'name': 'node-config-master', 'labels':
> ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra',
> 'labels': ['node-role.kubernetes.io/infra=true']}, {'name':
> 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']},
> {'name': 'cm-osm01', 'labels': ['node-role.kubernetes.io/master=true'],
> 'edits': [{'key': 'nodeIP','value': '10.x.x.x'}]}, {'name': 'cm-osm02',
> 'labels': ['node-role.kubernetes.io/master=true'], 'edits': [{'key':
> 'nodeIP','value': '10.x.x.x'}]} ..... etc
> 
> where "cs-osm01" is our first master, cs-osm02 was second .. etc
> Edit the labels according to that server's role.
> 
> The next problem you'll hit is probably yet another PR we have lodged for
> vSphere cloud provider;  https://bugzilla.redhat.com/show_bug.cgi?id=1643348
> in 3.11, where any more than 5 IP's on the primary interface (ipfailover,
> egressIP etc) will make vSphere cloud provider go into a spin.

@cg Is that right? If the master and infra roles are put together on the same nodes, will that trigger this problem? ipfailover runs on the masters.

[OSEv3:vars]
...
...
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}, {'name': 'node-master01', 'labels': ['node-role.kubernetes.io/master=true'],'edits': [{'key': 'nodeIP','value': '10.x.x.1'}]},{'name': 'node-master02', 'labels': ['node-role.kubernetes.io/master=true'],'edits': [{'key': 'nodeIP','value': '10.x.x.2'}]},{'name': 'node-master03', 'labels': ['node-role.kubernetes.io/master=true'],'edits': [{'key': 'nodeIP','value': '10.x.x.3'}]}]


[nodes]
master1.openshift.sz.clio openshift_schedulable=true openshift_node_group_name='node-master01'
master2.openshift.sz.clio openshift_schedulable=true openshift_node_group_name='node-master02'
master3.openshift.sz.clio openshift_schedulable=true openshift_node_group_name='node-master03'
master[1:3].openshift.sz.clio openshift_schedulable=true openshift_node_group_name='node-config-infra'
node1.openshift.sz.clio openshift_schedulable=true openshift_node_group_name='node-config-compute'  
node2.openshift.sz.clio openshift_schedulable=true openshift_node_group_name='node-config-compute'  
node3.openshift.sz.clio openshift_schedulable=true openshift_node_group_name='node-config-compute'

