Bug 1886854 - metal3 pod got started on two masters
Summary: metal3 pod got started on two masters
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Derek Higgins
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-09 14:20 UTC by Derek Higgins
Modified: 2020-12-07 12:39 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-07 12:39:45 UTC
Target Upstream Version:



Description Derek Higgins 2020-10-09 14:20:08 UTC
Description of problem:

During cluster deploy, the OpenShift API didn't come up (or at least we couldn't access it):

E1009 09:59:43.938183   22779 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ClusterVersion: Get "https://api.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 10.19.16.20:6443: connect: no route to host
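
For reference, a quick way to confirm from the installer host whether the API endpoint is reachable at all (a hedged sketch; the hostname and VIP are taken from the error above, and /readyz is the standard kube-apiserver health endpoint):

# Check raw TCP reachability of the API VIP (10.19.16.20, from the error above)
nc -zv 10.19.16.20 6443
# If the port is open, query the apiserver health endpoint (-k skips cert validation)
curl -ks https://api.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/readyz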

Looking closer, we found that the metal3 pod had been started on 2 master nodes:

[root@cnfdb3-installer ~]# for x in 12 13 14 ; do ssh -o UserKnownHostsFile=/dev/null core@10.19.17.$x sudo crictl ps -a | grep static ; done 
Warning: Permanently added '10.19.17.12' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.19.17.13' (ECDSA) to the list of known hosts.
0c2983663ce36       135e0bf4bc081b3b01daf52513a3b3fd9ec684f3c46e9510b545a9e4a98d1889                                                         3 hours ago         Running             metal3-static-ip-manager                      0                   623a1320d6a13
8e63f6979bee3       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:550a230293580919793a404afe9de6bf664c335e5ea9b6211b07bccf0f80efc7   3 hours ago         Exited              metal3-static-ip-set                          0                   623a1320d6a13
Warning: Permanently added '10.19.17.14' (ECDSA) to the list of known hosts.
101af1dccb32a       135e0bf4bc081b3b01daf52513a3b3fd9ec684f3c46e9510b545a9e4a98d1889                                                         3 hours ago         Running             metal3-static-ip-manager                      0                   945761d8fcd90
846dcfc6dbcd3       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:550a230293580919793a404afe9de6bf664c335e5ea9b6211b07bccf0f80efc7   3 hours ago         Exited              metal3-static-ip-set                          0                   945761d8fcd90

Comment 5 Zane Bitter 2020-10-13 16:41:13 UTC
It looks like the metal3 pod was initially started on 10.19.17.13 at 10:58:13. At around 11:01, 10.19.17.13 and 10.19.17.12 lose access to the API:

Oct 09 11:01:34 dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3102]: E1009 11:01:34.215755    3102 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com": Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes/dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com?resourceVersion=0&timeout=10s": dial tcp 10.19.16.20:6443: connect: no route to host
Oct 09 11:01:15 dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3158]: E1009 11:01:15.836453    3158 reflector.go:127] k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com&resourceVersion=16854&timeoutSeconds=533&watch=true": dial tcp 10.19.16.20:6443: connect: no route to host


At 11:04:09 a duplicate pod is started on 10.19.17.14, which is probably the node with the API VIP since it's the only one that can still contact the API. The original pod appears to get restarted but not removed.

So I would guess here that the cause is a loss of connectivity between the masters, and since there is no fencing we end up with duplicates.
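
For what it's worth, a hedged way to confirm which master currently holds the API VIP (assuming the VIP is 10.19.16.20, as in the kubelet errors above, and that it is assigned as an extra address on one of the masters):

# Look for the API VIP as an address on each master; only the holder should match
for x in 12 13 14 ; do
  ssh -o UserKnownHostsFile=/dev/null core@10.19.17.$x \
    "ip -4 addr show | grep -q '10.19.16.20' && echo 10.19.17.$x holds the API VIP"
done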

Comment 6 Derek Higgins 2020-10-20 14:18:13 UTC
(In reply to Zane Bitter from comment #5)
> It looks like the metal3 pod was initially started on 10.19.17.13 at
> 10:58:13. At around 11:01, 10.19.17.13 and 10.19.17.12 lose access to the
> API:
> 
> Oct 09 11:01:34 dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com
> hyperkube[3102]: E1009 11:01:34.215755    3102 kubelet_node_status.go:442]
> Error updating node status, will retry: error getting node
> "dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com": Get
> "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes/dhcp19-
> 17-13.clus2.t5g.lab.eng.bos.redhat.com?resourceVersion=0&timeout=10s": dial
> tcp 10.19.16.20:6443: connect: no route to host
> Oct 09 11:01:15 dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com
> hyperkube[3158]: E1009 11:01:15.836453    3158 reflector.go:127]
> k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: Get
> "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/
> nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddhcp19-17-12.
> clus2.t5g.lab.eng.bos.redhat.
> com&resourceVersion=16854&timeoutSeconds=533&watch=true": dial tcp
> 10.19.16.20:6443: connect: no route to host
> 
> 
> At 11:04:09 a duplicate pod is started on 10.19.17.14, which is probably the
> node with the API VIP since it's the only one that can still contact the
> API. The original pod appears to get restarted but not removed.

Also, looking at both metal3 pods I don't see any indication that any of the
containers died; we've ended up with 2 fully running pods, which supports what
you're saying (see the sketch at the end of this comment).

> 
> So I would guess here that the cause is a loss of connectivity between the
> masters, and since there is no fencing we end up with duplicates.

So do we close this as not a bug, given there was no fencing to deal with the
loss of connectivity?
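
A quick way to double-check the restart counts mentioned above (a hedged sketch; crictl's ATTEMPT column on each master shows how often a container was restarted, and the same information should be visible via oc while the API is reachable):

# On each master, the ATTEMPT column shows per-container restart attempts
sudo crictl ps -a --name metal3
# Or, from a host with API access, check pod restart counts directly
oc -n openshift-machine-api get pods -o wide | grep metal3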

Comment 7 Derek Higgins 2020-12-07 12:39:45 UTC
Closing as it hasn't been reported/reproduced since and was probably caused by loss of connectivity.

