Bug 1867043

Summary: A subset of bootstrap/master machines sometimes don't start
Product: OpenShift Container Platform
Reporter: Jan Zmeskal <jzmeskal>
Component: Installer
Assignee: Gal Zaidman <gzaidman>
Installer sub component: OpenShift on RHV
QA Contact: Lucie Leistnerova <lleistne>
Status: CLOSED NOTABUG
Docs Contact:
Severity: medium
Priority: medium
CC: gzaidman, hpopal, michal.skrivanek
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-24 09:49:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
Screenshot of RHV VMs

Description Jan Zmeskal 2020-08-07 08:42:04 UTC
Created attachment 1710760 [details]
Screenshot of RHV VMs

Description of problem:
While deploying OCP 4.6 clusters manually, it sometimes happens that one or more of the bootstrap/master machines do not start. The installer waits 10 minutes for them to start and then times out. It happens only occasionally, but often enough to affect about 30% of my deployments. I don't recall ever encountering such an issue in OCP 4.4 or 4.5.
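
For reference, when this happens the state of the machines can be checked directly on the RHV side. Below is a minimal sketch using the ovirt-engine-sdk-python package; the engine URL, credentials and the infra ID prefix are placeholders for this particular run.

import ovirtsdk4 as sdk

# Placeholders: engine API URL, credentials and the cluster infra ID prefix.
connection = sdk.Connection(
    url='https://rhv-engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,  # or ca_file='ca.pem'
)
try:
    vms_service = connection.system_service().vms_service()
    # The installer names the machines <infraID>-bootstrap and <infraID>-master-N.
    for vm in vms_service.list(search='name=primary-jtmz6-*'):
        print(vm.name, vm.status)
finally:
    connection.close()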

Version-Release number of the following components:
OCP: 4.6.0-0.nightly-2020-08-07-034746
RHV: 4.3.11.2-0.1.el7

How reproducible:
30 %

Steps to Reproduce:
Perform a very basic OCP cluster deployment. Unfortunately, the chance of reproduction is not very high.

Comment 2 Jan Zmeskal 2020-08-07 08:56:25 UTC
One more piece of information that might be useful: when you attempt to destroy such a failed cluster using openshift-install destroy cluster, you get this:

# ./openshift-install destroy cluster --dir=resources
INFO Removing Template primary-jtmz6-rhcos        
ERROR Failed to remove template: Fault reason is "Operation Failed". Fault detail is "[Cannot delete Template. Template is being used by the following VMs: primary-jtmz6-bootstrap,primary-jtmz6-master-0,primary-jtmz6-master-1,primary-jtmz6-master-2.]". HTTP response code is "409". HTTP response message is "409 Conflict". 
INFO Time elapsed: 0s
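
Until the destroy flow handles this case, the leftover VMs and the template can be cleaned up by hand. Below is a minimal sketch using the ovirt-engine-sdk-python package (engine URL and credentials are placeholders): it removes the VMs that still reference the template and then removes the template itself.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://rhv-engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    system = connection.system_service()
    vms_service = system.vms_service()

    # Stop and remove the VMs that still use the template.
    for vm in vms_service.list(search='name=primary-jtmz6-*'):
        vm_service = vms_service.vm_service(vm.id)
        if vm.status != types.VmStatus.DOWN:
            vm_service.stop()
            while vm_service.get().status != types.VmStatus.DOWN:
                time.sleep(5)
        vm_service.remove()

    # Removal is asynchronous; wait until the VMs are actually gone.
    while vms_service.list(search='name=primary-jtmz6-*'):
        time.sleep(5)

    # Now the template can be deleted without the 409 conflict.
    templates_service = system.templates_service()
    for template in templates_service.list(search='name=primary-jtmz6-rhcos'):
        templates_service.template_service(template.id).remove()
finally:
    connection.close()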

Comment 3 Michal Skrivanek 2020-09-03 07:38:10 UTC
(In reply to Jan Zmeskal from comment #2)
> One more information that might be useful. When you attempt to destroy such
> a failed cluster using openshift-install destroy cluster, you'll get this:
> 
> # ./openshift-install destroy cluster --dir=resources
> INFO Removing Template primary-jtmz6-rhcos        
> ERROR Failed to remove template: Fault reason is "Operation Failed". Fault
> detail is "[Cannot delete Template. Template is being used by the following
> VMs:
> primary-jtmz6-bootstrap,primary-jtmz6-master-0,primary-jtmz6-master-1,
> primary-jtmz6-master-2.]". HTTP response code is "409". HTTP response
> message is "409 Conflict". 
> INFO Time elapsed: 0s

I believe this is a separate issue and worth tracking as a bug. The cleanup indeed doesn't work.

Comment 4 Jan Zmeskal 2020-09-03 08:02:50 UTC
Michal, in all the other scenarios, cleanup seems to work OK for me. That's what makes it hard to separate the cleanup issue from the original one.

Comment 5 Gal Zaidman 2020-09-07 12:47:47 UTC
Some questions/observations:
- I never saw this on CI runs, and we spin up a lot of clusters... how did you install that cluster?
- Is there anything special on the oVirt side? Anything in the events/logs? Can you attach the oVirt logs?
- I'm not sure if this is the required behavior by default from OCP, but when a VM goes down for some reason, the cluster doesn't try to start it again. I think we should open a bug for that, which would resolve this one.
- On CI we have oVirt 4.4, which is the version supported for OCP 4.6. I think it is a problem that QE uses a 4.3.11 cluster when we force users to use 4.4.

Comment 6 Jan Zmeskal 2020-09-07 13:46:07 UTC
Hi Gal,

- I never saw this on CI runs, and we spin up a lot of clusters... how did you install that cluster?
I performed a very basic three-step installation:
1. openshift-install create install-config
2. Edit install-config.yaml to grant both masters and workers even more resources than the defaults
3. openshift-install create cluster

- Is there anything special on the oVirt side? Anything in the events/logs? Can you attach the oVirt logs?
Sure, I'll do that when I reproduce the bug.

- I'm not sure if this is the required behavior by default from OCP, but when a VM goes down for some reason, the cluster doesn't try to start it again. I think we should open a bug for that, which would resolve this one.
I definitely think it's worth re-attempting to start the VM. I don't know about a running cluster, but definitely during installation. It might be that sometimes our VMs don't start because of some environment issue - I'm not excluding that option. However, it's certainly not because of a lack of memory or CPUs. Maybe there's some very hard-to-catch network issue. Be that as it may, the installer waits 10 minutes for the master VMs to come up. During those 10 minutes, there's plenty of time to try starting the VMs again, but the installer does not do that.
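
For illustration, even a small watchdog run alongside the installer could do that retry during the wait window. Below is a minimal sketch using the ovirt-engine-sdk-python package (engine URL, credentials, infra ID prefix and polling interval are placeholders; this is not something the installer does today):

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://rhv-engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    vms_service = connection.system_service().vms_service()
    deadline = time.time() + 10 * 60  # mirror the installer's 10-minute wait
    while time.time() < deadline:
        all_up = True
        for vm in vms_service.list(search='name=primary-jtmz6-*'):
            if vm.status == types.VmStatus.DOWN:
                # The installer itself never retries; start the VM manually.
                try:
                    vms_service.vm_service(vm.id).start()
                except sdk.Error as err:
                    print('start failed for %s: %s' % (vm.name, err))
            if vm.status != types.VmStatus.UP:
                all_up = False
        if all_up:
            break
        time.sleep(30)
finally:
    connection.close()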

- On CI we have oVirt 4.4, which is the version supported for OCP 4.6. I think it is a problem that QE uses a 4.3.11 cluster when we force users to use 4.4.
Recently we've switched to RHV 4.4. I deployed one OCP 4.6 cluster there and did not see the issue reproduce. However, that sample is of course too small.

Comment 7 Sandro Bonazzola 2020-09-14 09:55:05 UTC
No capacity in current sprint

Comment 8 Sandro Bonazzola 2020-10-22 11:34:17 UTC
Due to capacity constraints we will be revisiting this bug in the upcoming sprint.

Comment 9 Gal Zaidman 2020-11-24 09:49:39 UTC
We were unable to reproduce the issue, and we increased the timeout to 20 minutes due to a different issue.