Bug 1918531
| Summary: | [Assisted-4.6][Staging] Cluster deployment failed due the Rebooting took longer than expected 1h10m0s |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | assisted-installer |
| assisted-installer sub component: | Installer |
| Status: | CLOSED DUPLICATE |
| Severity: | urgent |
| Priority: | unspecified |
| Version: | 4.6 |
| Keywords: | Reopened |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | AI-Team-Core |
| Reporter: | Yuri Obshansky <yobshans> |
| Assignee: | Igal Tsoiref <itsoiref> |
| QA Contact: | Yuri Obshansky <yobshans> |
| CC: | alazar, aos-bugs, itsoiref, mfilanov, ohochman, slavie, vemporop |
| Last Closed: | 2021-05-02 09:22:28 UTC |
| Type: | Bug |
Description (Yuri Obshansky, 2021-01-20 23:47:27 UTC)

@slavie can someone on triage duty take a look?

Yuri, is this VM still running? It will be much easier to understand the issue. If it is not moving to the configuring state, it most likely didn't start.

Yuri, 3 out of 3 times when we had access to the VMs it turned out that the node got an IPv6 address after rebooting from RHCOS, and that is why it could not communicate with the API VIP to pull ignition. To verify this there is a procedure Vitali wrote for breaking into a shell, collecting the journal logs, and inspecting the IPs (roughly as sketched after the attachment list below). This is the document that specifies it, but at least for the first time please consult with him beforehand: https://docs.google.com/document/d/1kID6AkdA96MokavO4IxLIVK4spCaHQ-JLbcqalxG5YU/edit?usp=sharing (page 2). It is very important for us to understand when such incidents happen, because it is related to timing. Did you use special VMs or bare metal machines that you do not use regularly? Any problem with the network? Were you working with a different DHCP server? Any info can help.

Yuri, please contact me so we can troubleshoot it together. That will be easier than having you learn the procedure.

Unfortunately, the VM is not accessible, since we ran a performance test and used the same pool of hypervisors to deploy the cluster. In each iteration the VM is destroyed and started again. We only have the logs downloaded from the cluster.

Created attachment 1751853 [details]
installation logs
Created attachment 1751854 [details]
libvirtd
Created attachment 1751855 [details]
messages
Created attachment 1751856 [details]
qemu
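
The node-side checks described above (inspecting which addresses the node received after rebooting into RHCOS, whether it can reach the API VIP to pull its ignition, and collecting the journal) roughly amount to something like the following. This is only a sketch: the API VIP address below is a placeholder, and the authoritative procedure is the Google document linked above.

```shell
# On the affected node, after breaking into a shell post-reboot:

# Which addresses did the node actually get? (IPv6-only is the suspected failure mode)
ip -4 addr show
ip -6 addr show

# Can the node reach the API VIP to pull its ignition config?
# 192.0.2.5 is a placeholder for the cluster's API VIP.
curl -kI https://192.0.2.5:22623/config/worker

# Save the journal for the current boot so it can be attached to the bug.
journalctl -b --no-pager > /tmp/journal-current-boot.log
```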
@yobshans do we have VM console logs?

@slavie I think this is a case for the new state. If @yobshans's code that downloads logs from the cluster waits until the must-gather is added, it will help to debug, because right now I can't say anything about the issue.

Unfortunately I don't have those machines anymore. Feel free to close the bug as insufficient data.

@itsoiref @alazar I collected new data for the failed servers using must-gather:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/3ac34054-c9df-4248-8b1f-4532ce1423d9
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/a2fbda7b-738c-4181-8429-2842a952ca4a

3/13/2021, 3:43:45 AM error Host worker-0-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/13/2021, 3:43:45 AM error Host worker-0-1: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/13/2021, 3:43:45 AM error Host master-0-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/13/2021, 3:42:11 AM Updated status of cluster ocp-cluster-f04-h15-0 to error
3/13/2021, 3:40:26 AM error Host master-0-2: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)
3/13/2021, 3:40:25 AM error Host master-0-1: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)

See the new attachments.

Created attachment 1763389 [details]
installation logs
Created attachment 1763390 [details]
must-gather
Created attachment 1763391 [details]
installation logs
Created attachment 1763392 [details]
must-gather
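
For context, must-gather archives like the ones attached above are typically collected with the standard command below; this is a generic sketch of the collection step (destination directory is arbitrary), not necessarily the exact invocation used for these attachments.

```shell
# Collect a must-gather archive from the partially installed cluster
oc adm must-gather --dest-dir=/tmp/must-gather

# Package it so it can be attached to the bug
tar czf must-gather.tar.gz -C /tmp must-gather
```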
worker-2-0's kubelet joined and its CSRs got approved, but the node didn't become Ready. From the events we can see that its kubelet stopped posting its status:

Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "StaticPodsDegraded: pod/kube-controller-manager-master-2-2 container \"kube-controller-manager-recovery-controller\" is terminated: Completed: \nStaticPodsDegraded: pods \"kube-controller-manager-worker-2-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"worker-2-0\" not ready since 2021-03-13 04:21:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "StaticPodsDegraded: pods \"kube-controller-manager-worker-2-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"worker-2-0\" not ready since 2021-03-13 04:21:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"

@yobshans we need the kubelet log from the failed host, and we need to check whether it was up at that point or not (typical ways to pull it are sketched after the attachment list below).

@itsoiref @alazar I collected new data for the failed servers using must-gather/sos-report/kubelet:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/20d87bbb-b56e-4b83-a201-36d79840b555

3/25/2021, 9:49:51 PM error Host worker-1-1: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/25/2021, 9:49:51 PM error Host worker-1-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/25/2021, 9:49:51 PM error Host master-1-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/25/2021, 9:47:37 PM Updated status of cluster ocp-cluster-f33-h19-1 to error
3/25/2021, 9:46:21 PM error Host master-1-2: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)
3/25/2021, 9:46:21 PM error Host master-1-1: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)

See the attachments with the NEW prefix.

Created attachment 1766629 [details]
NEW installation logs
Created attachment 1766630 [details]
NEW must-gather
Created attachment 1766631 [details]
NEW sos-report master 0
Created attachment 1766632 [details]
NEW sos-report master 1
Created attachment 1766633 [details]
NEW sos-report master 2
Created attachment 1766634 [details]
NEW kubelet master 0
Created attachment 1766635 [details]
NEW kubelet master 1
Created attachment 1766636 [details]
NEW kubelet master 2
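
Kubelet logs like the ones attached above can typically be pulled either through the API server or directly from the node. A minimal sketch, assuming API or SSH access and using master-1-0 as an example node name from this cluster:

```shell
# Via the API server, fetch the node's kubelet journal
oc adm node-logs master-1-0 -u kubelet > kubelet-master-1-0.log

# Or, if the API is unreachable, read the unit journal over SSH
ssh core@master-1-0 journalctl -u kubelet --no-pager > kubelet-master-1-0.log

# Also worth checking whether the node's CSRs were approved and whether it ever went Ready
oc get csr
oc get nodes -o wide
```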
The latest failure happened due to:

time="2021-03-26T00:44:42Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to create service for dns default: failed to create dns service: Service \"dns-default\" is invalid: spec.clusterIPs: Invalid value: []string{\"172.30.0.10\"}: failed to allocated ip:172.30.0.10 with error:provided IP is already allocated"

We have seen this issue already, so the latest failure is not really related to the one this ticket was opened for. It was a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1891979#c7
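
For anyone hitting the same symptom, a quick way to see which object is already holding the service IP mentioned in that error. The IP below is the one from the log line above; these commands are a generic diagnostic sketch, not the workaround from the linked bug.

```shell
# Find which Service already owns the clusterIP the DNS operator is trying to use
oc get svc -A -o wide | grep 172.30.0.10

# Inspect (or confirm the absence of) the dns-default Service the operator failed to create
oc get svc dns-default -n openshift-dns -o yaml
```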
Here is another failed cluster with the same error: https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/af0c427c-6426-4edb-9e65-97747c00bfef. Installation logs attached.

Created attachment 1767372 [details]
installation logs
It is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1891979#c7. Assisted Installer provides an automatic way to deploy OCP, and the OCP deployment will always fail on that issue. Assisted Installer should apply the mentioned workaround to prevent the failure; there is no manual way to do it when we use the Assisted Installer Service in the cloud.

It looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1943258, which I'm currently working on. @itsoiref please confirm and mark as Duplicate if needed.

*** This bug has been marked as a duplicate of bug 1943258 ***