Bug 1886854
| Summary: | metal3 pod got started on two masters | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Derek Higgins <derekh> |
| Component: | Installer | Assignee: | Derek Higgins <derekh> |
| Installer sub component: | OpenShift on Bare Metal IPI | QA Contact: | Amit Ugol <augol> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | rpittau, tsedovic |
| Version: | 4.6 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-07 12:39:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Derek Higgins
2020-10-09 14:20:08 UTC
It looks like the metal3 pod was initially started on 10.19.17.13 at 10:58:13. At around 11:01, 10.19.17.13 and 10.19.17.12 lose access to the API:

Oct 09 11:01:34 dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3102]: E1009 11:01:34.215755 3102 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com": Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes/dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com?resourceVersion=0&timeout=10s": dial tcp 10.19.16.20:6443: connect: no route to host
Oct 09 11:01:15 dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3158]: E1009 11:01:15.836453 3158 reflector.go:127] k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com&resourceVersion=16854&timeoutSeconds=533&watch=true": dial tcp 10.19.16.20:6443: connect: no route to host

At 11:04:09 a duplicate pod is started on 10.19.17.14, which is probably the node with the API VIP, since it's the only one that can still contact the API. The original pod appears to get restarted but not removed.

So I would guess here that the cause is a loss of connectivity between the masters, and since there is no fencing we end up with duplicates.

(In reply to Zane Bitter from comment #5)
> It looks like the metal3 pod was initially started on 10.19.17.13 at 10:58:13. At around 11:01, 10.19.17.13 and 10.19.17.12 lose access to the API:
>
> Oct 09 11:01:34 dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3102]: E1009 11:01:34.215755 3102 kubelet_node_status.go:442] Error updating node status, will retry: error getting node "dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com": Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes/dhcp19-17-13.clus2.t5g.lab.eng.bos.redhat.com?resourceVersion=0&timeout=10s": dial tcp 10.19.16.20:6443: connect: no route to host
> Oct 09 11:01:15 dhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com hyperkube[3158]: E1009 11:01:15.836453 3158 reflector.go:127] k8s.io/kubernetes/pkg/kubelet/kubelet.go:438: Failed to watch *v1.Node: Get "https://api-int.cnfdb3.t5g.lab.eng.bos.redhat.com:6443/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Ddhcp19-17-12.clus2.t5g.lab.eng.bos.redhat.com&resourceVersion=16854&timeoutSeconds=533&watch=true": dial tcp 10.19.16.20:6443: connect: no route to host
>
> At 11:04:09 a duplicate pod is started on 10.19.17.14, which is probably the node with the API VIP, since it's the only one that can still contact the API. The original pod appears to get restarted but not removed.

Also, looking at both metal3 pods I don't see any indication that any of the containers died; we've ended up with 2 fully running pods, which supports what you're saying.

> So I would guess here that the cause is a loss of connectivity between the masters, and since there is no fencing we end up with duplicates.

So do we close this as not a bug, as there was no fencing to deal with the loss of connectivity?

Closing as it hasn't been reported/reproduced since and was probably caused by loss of connectivity.
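For anyone triaging a similar report, a minimal check along these lines can confirm the condition described above: whether more than one metal3 pod is running, and which master currently holds the internal API VIP. This is a sketch, assuming the metal3 deployment lives in the usual openshift-machine-api namespace and that the VIP is the 10.19.16.20 address seen in the kubelet logs; adjust for your cluster.

```
# List metal3 pods and the nodes they are scheduled on; two Running pods
# on different masters indicates the duplicate situation described above.
oc -n openshift-machine-api get pods -o wide | grep metal3

# On each master, check whether the internal API VIP (10.19.16.20 in the
# logs above) is currently assigned to one of its interfaces.
ip -4 addr show | grep 10.19.16.20
```

The node that reports the VIP address is the one that keeps API access during a partition, which matches the observation that the duplicate pod was started on 10.19.17.14.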