Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1918531

Summary: [Assisted-4.6][Staging] Cluster deployment failed because the Rebooting stage took longer than expected (1h10m0s)
Product: OpenShift Container Platform
Reporter: Yuri Obshansky <yobshans>
Component: assisted-installer
Assignee: Igal Tsoiref <itsoiref>
assisted-installer sub component: Installer
QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: urgent
Priority: unspecified
CC: alazar, aos-bugs, itsoiref, mfilanov, ohochman, slavie, vemporop
Version: 4.6
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Core
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-02 09:22:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments (Description - Flags):
    installation logs - none
    installation logs - none
    libvirtd - none
    messages - none
    qemu - none
    installation logs - none
    must-gather - none
    installation logs - none
    must-gather - none
    NEW installation logs - none
    NEW must-gather - none
    NEW sos-report master 0 - none
    NEW sos-report master 1 - none
    NEW sos-report master 2 - none
    NEW kubelet master 0 - none
    NEW kubelet master 1 - none
    NEW kubelet master 2 - none
    installation logs - none

Description Yuri Obshansky 2021-01-20 23:47:27 UTC
Created attachment 1749257 [details]
installation logs

Description of problem:
Cluster deployment failed on a timeout:
1/20/2021, 6:36:18 PM	
error Host master-0-2: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Rebooting took longer than expected 1h10m0s)
Detected during a performance test.

Version-Release number of selected component (if applicable):
v1.0.15.1
Assisted-ui-lib version:  1.5.4

How reproducible:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/c576466c-4093-4f73-ac38-c0221ca2b368
user:nshidlin-aiqe1-u1
password:L7uzs7oUcRJ/SgY4qi9Aupk7u425cFa2

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Michael Filanov 2021-01-21 06:42:00 UTC
@slavie can someone on triage duty take a look?

Comment 2 Igal Tsoiref 2021-01-21 08:15:12 UTC
Yuri, is this VM still running? It will be much easier to understand the issue. If it is not moving to the configuring state, it most likely didn't start.

Comment 3 Sarah Lavie 2021-01-21 08:24:35 UTC
Yuri, 3 out of 3 times when we had access to the VMs, it turned out that the node got an IPv6 address after rebooting into RHCOS, and that is why it could not communicate with the API VIP to pull ignition. To verify this, there is a procedure Vitali wrote for breaking into a shell, collecting the journal logs, and inspecting the IPs. This is the document that specifies it, but at least for the first time please consult with him beforehand:
https://docs.google.com/document/d/1kID6AkdA96MokavO4IxLIVK4spCaHQ-JLbcqalxG5YU/edit?usp=sharing (page 2)
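(For reference, a rough sketch of the kind of checks that procedure describes, run from a shell or console on the stuck node; <api-vip> is a placeholder, not an address taken from this cluster:)

    # Check which addresses the node actually got after rebooting into RHCOS
    ip -4 addr show
    ip -6 addr show

    # Scan the boot journal for ignition/DHCP/VIP related errors
    journalctl -b --no-pager | grep -iE 'ignition|dhcp|vip' | tail -n 50

    # Check whether the Machine Config Server behind the API VIP is reachable at all
    curl -kIv https://<api-vip>:22623/config/master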

It is very important for us to understand when such incidents happen because it is related to timing. Did you use special VMs or bare-metal machines that you do not use regularly? Any problems with the network? Did you work with a different DHCP server? Any info can help.

Comment 4 vemporop 2021-01-21 08:41:38 UTC
Yuri, please contact me so we can troubleshoot it together. It will be easier than having you learn the procedure.

Comment 5 Yuri Obshansky 2021-01-21 13:22:55 UTC
Unfortunately, the VM is not accessible, since we ran the performance test and used the same pool of hypervisors to deploy the cluster.
On each iteration the VMs are destroyed and started again.
We only have the logs downloaded from the cluster.

Comment 9 Yuri Obshansky 2021-01-28 21:31:24 UTC
Created attachment 1751853 [details]
installation logs

Comment 10 Yuri Obshansky 2021-01-28 21:31:58 UTC
Created attachment 1751854 [details]
libvirtd

Comment 11 Yuri Obshansky 2021-01-28 21:32:19 UTC
Created attachment 1751855 [details]
messages

Comment 12 Yuri Obshansky 2021-01-28 21:32:41 UTC
Created attachment 1751856 [details]
qemu

Comment 13 Igal Tsoiref 2021-02-01 19:47:59 UTC
@yobshans do we have VM console logs?
@slavie I think this is the case for the new state. If @yobshans' code that downloads logs from the cluster waits until the must-gather is added, it will help debug, because right now I can't say anything regarding the issue.

Comment 14 Yuri Obshansky 2021-02-01 21:02:13 UTC
Unfortunately, I don't have those machines anymore.
Feel free to close the bug as insufficient data.

Comment 15 Yuri Obshansky 2021-03-15 14:34:42 UTC
@itsoiref
@alazar

I collected new data for the failed servers using must-gather:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/3ac34054-c9df-4248-8b1f-4532ce1423d9
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/a2fbda7b-738c-4181-8429-2842a952ca4a

3/13/2021, 3:43:45 AM	error Host worker-0-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/13/2021, 3:43:45 AM	error Host worker-0-1: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/13/2021, 3:43:45 AM	error Host master-0-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/13/2021, 3:42:11 AM Updated status of cluster ocp-cluster-f04-h15-0 to error
3/13/2021, 3:40:26 AM	error Host master-0-2: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)
3/13/2021, 3:40:25 AM	error Host master-0-1: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)

See new attachments.

Comment 16 Yuri Obshansky 2021-03-15 14:35:20 UTC
Created attachment 1763389 [details]
installation logs

Comment 17 Yuri Obshansky 2021-03-15 14:35:56 UTC
Created attachment 1763390 [details]
must-gather

Comment 18 Yuri Obshansky 2021-03-15 14:36:33 UTC
Created attachment 1763391 [details]
installation logs

Comment 19 Yuri Obshansky 2021-03-15 14:37:06 UTC
Created attachment 1763392 [details]
must-gather

Comment 20 Igal Tsoiref 2021-03-19 20:01:45 UTC
worker-2-0's kubelet joined and its CSRs got approved, but the node didn't become ready.
From the events we can see that its kubelet stopped posting its status.
Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "StaticPodsDegraded: pod/kube-controller-manager-master-2-2 container \"kube-controller-manager-recovery-controller\" is terminated: Completed: \nStaticPodsDegraded: pods \"kube-controller-manager-worker-2-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"worker-2-0\" not ready since 2021-03-13 04:21:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "StaticPodsDegraded: pods \"kube-controller-manager-worker-2-0\" not found\nNodeControllerDegraded: The master nodes not ready: node \"worker-2-0\" not ready since 2021-03-13 04:21:49 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
@yobshans we need the kubelet log from the failed host, and we need to check whether it was up at that point or not.
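(A rough sketch of how this can be checked, assuming oc access to the cluster and a shell on the node; the node name worker-2-0 is taken from the events above:)

    # Node readiness and pending CSRs as seen by the cluster
    oc get nodes -o wide
    oc get csr | grep -v Approved          # anything still Pending?
    oc describe node worker-2-0 | grep -A 6 Conditions

    # Kubelet log from the failed host, via oc or directly on the node
    oc adm node-logs worker-2-0 -u kubelet | tail -n 500
    journalctl -u kubelet --no-pager | tail -n 500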

Comment 21 Yuri Obshansky 2021-03-26 13:07:00 UTC
@itsoiref
@alazar


I collected new data for the failed servers using must-gather, sos-report, and the kubelet logs:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/20d87bbb-b56e-4b83-a201-36d79840b555
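(For reference, data of this kind is typically collected with commands along these lines; a sketch only, exact options may differ:)

    # Cluster-wide diagnostics
    oc adm must-gather --dest-dir=./must-gather

    # Per-node sos report (on RHCOS, run from a toolbox container on the node)
    toolbox
    sos report --batch        # 'sosreport --batch' on older images

    # Kubelet journal from each master
    journalctl -u kubelet --no-pager > kubelet-master-0.log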

3/25/2021, 9:49:51 PM	error Host worker-1-1: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/25/2021, 9:49:51 PM	error Host worker-1-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/25/2021, 9:49:51 PM	error Host master-1-0: updated status from "installing-in-progress" to "error" (Host is part of a cluster that failed to install)
3/25/2021, 9:47:37 PM	Updated status of cluster ocp-cluster-f33-h19-1 to error
3/25/2021, 9:46:21 PM	error Host master-1-2: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)
3/25/2021, 9:46:21 PM	error Host master-1-1: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)

See the attachments with the "NEW" prefix.

Comment 22 Yuri Obshansky 2021-03-26 13:18:59 UTC
Created attachment 1766629 [details]
NEW installation logs

Comment 23 Yuri Obshansky 2021-03-26 13:19:39 UTC
Created attachment 1766630 [details]
NEW must-gather

Comment 24 Yuri Obshansky 2021-03-26 13:20:15 UTC
Created attachment 1766631 [details]
NEW sos-report master 0

Comment 25 Yuri Obshansky 2021-03-26 13:21:03 UTC
Created attachment 1766632 [details]
NEW sos-report master 1

Comment 26 Yuri Obshansky 2021-03-26 13:21:36 UTC
Created attachment 1766633 [details]
NEW sos-report master 2

Comment 27 Yuri Obshansky 2021-03-26 13:22:15 UTC
Created attachment 1766634 [details]
NEW kubelet master 0

Comment 28 Yuri Obshansky 2021-03-26 13:22:52 UTC
Created attachment 1766635 [details]
NEW kubelet master 1

Comment 29 Yuri Obshansky 2021-03-26 13:23:26 UTC
Created attachment 1766636 [details]
NEW kubelet master 2

Comment 30 Igal Tsoiref 2021-03-29 14:00:19 UTC
The latest failure happened due to:
time="2021-03-26T00:44:42Z" level=error msg="failed to reconcile request /default: failed to ensure dns default: failed to create service for dns default: failed to create dns service: Service \"dns-default\" is invalid: spec.clusterIPs: Invalid value: []string{\"172.30.0.10\"}: failed to allocated ip:172.30.0.10 with error:provided IP is already allocated"

We had this issue already, so the latest failure is not really relevant to the one this ticket was opened for.
It was a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1891979#c7
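(A sketch of how this kind of conflict can be confirmed with oc; the service name and IP are the ones from the log line above:)

    # Which service currently holds the clusterIP that dns-default wants?
    oc get svc --all-namespaces -o wide | grep 172.30.0.10

    # State of the DNS operator and its default service
    oc -n openshift-dns get svc dns-default -o yaml
    oc get clusteroperator dns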

Comment 32 Yuri Obshansky 2021-03-29 14:12:08 UTC
Here is another failed cluster with the same error:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/af0c427c-6426-4edb-9e65-97747c00bfef

Attached installation logs

Comment 33 Yuri Obshansky 2021-03-29 14:12:32 UTC
Created attachment 1767372 [details]
installation logs

Comment 34 Yuri Obshansky 2021-03-29 14:49:15 UTC
It is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1891979#c7.
The Assisted Installer provides an automated way to deploy OCP.
An OCP deployment will always fail on that issue.
The Assisted Installer should apply the mentioned workaround to prevent the failure.
There is no manual way to do it when we use the Assisted Installer service in the cloud.

Comment 35 vemporop 2021-05-02 04:39:00 UTC
It looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1943258, which I'm currently working on. @itsoiref please confirm and mark as a duplicate if needed.

Comment 36 Igal Tsoiref 2021-05-02 09:22:28 UTC

*** This bug has been marked as a duplicate of bug 1943258 ***