Bug 1886854

Summary: metal3 pod got started on two masters
Product: OpenShift Container Platform
Component: Installer
Installer sub component: OpenShift on Bare Metal IPI
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: high
Keywords: Triaged
Reporter: Derek Higgins <derekh>
Assignee: Derek Higgins <derekh>
QA Contact: Amit Ugol <augol>
CC: rpittau, tsedovic
Type: Bug
Last Closed: 2020-12-07 12:39:45 UTC

Description Derek Higgins 2020-10-09 14:20:08 UTC
Description of problem:

During cluster deploy, the OpenShift API didn't come up (or at least we couldn't access it):

E1009 09:59:43.938183   22779 reflector.go:153] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.ClusterVersion: Get "https://api.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 10.19.16.20:6443: connect: no route to host
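
For reference, a rough way to confirm the symptom from each master would be something like the following sketch (assuming core SSH access to the masters at 10.19.17.12-14; the /readyz probe and timeout values are just illustrative):

for x in 12 13 14 ; do
  # Probe the API VIP (10.19.16.20 from the error above) from each master;
  # /readyz is served by kube-apiserver, so any HTTP code back means the VIP is reachable.
  ssh core@10.19.17.$x "curl -k -sS -o /dev/null --connect-timeout 5 -w '%{http_code}\n' https://10.19.16.20:6443/readyz" \
    || echo "10.19.17.$x: cannot reach the API VIP"
done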

Looking closer, we found that the metal3 pod had been started on 2 master nodes:

[root@cnfdb3-installer ~]# for x in 12 13 14 ; do ssh -o UserKnownHostsFile=/dev/null core@10.19.17.$x sudo crictl ps -a | grep static ; done
Warning: Permanently added '10.19.17.12' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.19.17.13' (ECDSA) to the list of known hosts.
0c2983663ce36       135e0bf4bc081b3b01daf52513a3b3fd9ec684f3c46e9510b545a9e4a98d1889                                                         3 hours ago         Running             metal3-static-ip-manager                      0                   623a1320d6a13
8e63f6979bee3       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:550a230293580919793a404afe9de6bf664c335e5ea9b6211b07bccf0f80efc7   3 hours ago         Exited              metal3-static-ip-set                          0                   623a1320d6a13
Warning: Permanently added '10.19.17.14' (ECDSA) to the list of known hosts.
101af1dccb32a       135e0bf4bc081b3b01daf52513a3b3fd9ec684f3c46e9510b545a9e4a98d1889                                                         3 hours ago         Running             metal3-static-ip-manager                      0                   945761d8fcd90
846dcfc6dbcd3       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:550a230293580919793a404afe9de6bf664c335e5ea9b6211b07bccf0f80efc7   3 hours ago         Exited              metal3-static-ip-set                          0                   945761d8fcd90
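
For completeness, the pod sandboxes can be listed the same way to show one metal3 static pod per affected master (a sketch with the same assumed core SSH access; crictl pods and its --name filter are standard crictl options):

for x in 12 13 14 ; do
  # List the metal3 pod sandboxes on each master; a duplicate shows up as
  # a Ready sandbox on more than one node.
  ssh core@10.19.17.$x "sudo crictl pods --name metal3"
done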

Comment 5 Zane Bitter 2020-10-13 16:41:13 UTC
It looks like the metal3 pod was initially started on 10.19.17.13 at 10:58:13. At around 11:01, 10.19.17.13 and 10.19.17.12 lose access to the API:

Oct 09 11:01:34 dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3102]: E1009 11:01:34.215755    3102 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com": Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes/dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com?resourceVersion=0&timeout=10s": dial tcp 10.19.16.20:6443: connect: no route to host
Oct 09 11:01:15 dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3158]: E1009 11:01:15.836453    3158 reflector.go:127] k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com&resourceVersion=16854&timeoutSeconds=533&watch=true": dial tcp 10.19.16.20:6443: connect: no route to host


At 11:04:09 a duplicate pod is started on 10.19.17.14, which is probably the node with the API VIP since it's the only one that can still contact the API. The original pod appears to get restarted but not removed.
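
A quick way to see which master holds the VIP at a given time would be something like this (a sketch, again assuming core SSH access; 10.19.16.20 is the api-int VIP from the errors above):

for x in 12 13 14 ; do
  echo "== 10.19.17.$x =="
  # The keepalived-managed API VIP shows up as an extra address on the node that currently owns it.
  ssh core@10.19.17.$x "ip -brief addr show | grep 10.19.16.20 || true"
done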

So I would guess here that the cause is a loss of connectivity between the masters, and since there is no fencing we end up with duplicates.

Comment 6 Derek Higgins 2020-10-20 14:18:13 UTC
(In reply to Zane Bitter from comment #5)
> It looks like the metal3 pod was initially started on 10.19.17.13 at
> 10:58:13. At around 11:01, 10.19.17.13 and 10.19.17.12 lose access to the
> API:
> 
> Oct 09 11:01:34 dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com
> hyperkube[3102]: E1009 11:01:34.215755    3102 kubelet_node_status.go:442]
> Error updating node status, will retry: error getting node
> "dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com": Get
> "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes/dhcp19-
> 17-13.clus2.t5g.lab.eng.bos.redhat.com?resourceVersion=0&timeout=10s": dial
> tcp 10.19.16.20:6443: connect: no route to host
> Oct 09 11:01:15 dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com
> hyperkube[3158]: E1009 11:01:15.836453    3158 reflector.go:127]
> k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: Get
> "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/
> nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddhcp19-17-12.
> clus2.t5g.lab.eng.bos.redhat.
> com&resourceVersion=16854&timeoutSeconds=533&watch=true": dial tcp
> 10.19.16.20:6443: connect: no route to host
> 
> 
> At 11:04:09 a duplicate pod is started on 10.19.17.14, which is probably the
> node with the API VIP since it's the only one that can still contact the
> API. The original pod appears to get restarted but not removed.

Also, looking at both metal3 pods I don't see any indication that any of the
containers died; we've ended up with 2 fully running pods, which supports what
you're saying.
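
(The ATTEMPT column being 0 in the crictl output above says the same thing; a more explicit check would be something along these lines, assuming jq is available on the hosts and crictl's JSON output keeps its usual shape:)

for x in 13 14 ; do
  # Print name / state / restart attempt for each metal3 container on the two masters running the pod.
  ssh core@10.19.17.$x "sudo crictl ps -a --name metal3 -o json" \
    | jq -r '.containers[] | [.metadata.name, .state, (.metadata.attempt|tostring)] | @tsv'
done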

> 
> So I would guess here that the cause is a loss of connectivity between the
> masters, and since there is no fencing we end up with duplicates.

So do we close this as not a bug, given there was no fencing to deal with the
loss of connectivity?

Comment 7 Derek Higgins 2020-12-07 12:39:45 UTC
Closing as it hasn't been reported/reproduced since and was probably caused by loss of connectivity.