Bug 1867043 - A subset of bootstrap/master machines sometimes don't start
Summary: A subset of bootstrap/master machines sometimes don't start
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Gal Zaidman
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-07 08:42 UTC by Jan Zmeskal
Modified: 2020-11-24 09:49 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 09:49:39 UTC
Target Upstream Version:
Embargoed:


Attachments
Screenshot of RHV VMs (63.99 KB, image/png), 2020-08-07 08:42 UTC, Jan Zmeskal

Description Jan Zmeskal 2020-08-07 08:42:04 UTC
Created attachment 1710760 [details]
Screenshot of RHV VMs

Description of problem:
While deploying OCP 4.6 clusters manually, it sometimes happens that one or more of the bootstrap/master machines do not start. The installer waits 10 minutes for them to start and then times out. This happens only occasionally, but often enough to affect about 30 % of my deployments. I don't recall ever encountering such an issue with OCP 4.4 or 4.5.

Version-Release number of the following components:
OCP: 4.6.0-0.nightly-2020-08-07-034746
RHV: 4.3.11.2-0.1.el7

How reproducible:
30 %

Steps to Reproduce:
Perform a very basic OCP cluster deployment. Unfortunately, the reproduction chance is not very high.

Comment 2 Jan Zmeskal 2020-08-07 08:56:25 UTC
One more piece of information that might be useful: when you attempt to destroy such a failed cluster using openshift-install destroy cluster, you'll get this:

# ./openshift-install destroy cluster --dir=resources
INFO Removing Template primary-jtmz6-rhcos        
ERROR Failed to remove template: Fault reason is "Operation Failed". Fault detail is "[Cannot delete Template. Template is being used by the following VMs: primary-jtmz6-bootstrap,primary-jtmz6-master-0,primary-jtmz6-master-1,primary-jtmz6-master-2.]". HTTP response code is "409". HTTP response message is "409 Conflict". 
INFO Time elapsed: 0s
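
As a manual workaround for the stuck destroy, removing the leftover VMs first and then re-running the destroy should let it finish. A rough sketch (the engine URL and credentials below are placeholders, and jq is assumed to be available):

ENGINE="https://engine.example.com/ovirt-engine/api"   # placeholder engine URL
AUTH="admin@internal:secret"                           # placeholder credentials

# List the leftover VMs that still use the template and grab their ids.
ids=$(curl -sk -u "$AUTH" -H 'Accept: application/json' \
      "$ENGINE/vms?search=name%3Dprimary-jtmz6-*" | jq -r '.vm[]?.id')

# Remove each leftover VM (they never came up, so they should already be down).
for id in $ids; do
  curl -sk -u "$AUTH" -X DELETE "$ENGINE/vms/$id"
done

# With no VMs referencing the template any more, the destroy can remove it.
./openshift-install destroy cluster --dir=resources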

Comment 3 Michal Skrivanek 2020-09-03 07:38:10 UTC
(In reply to Jan Zmeskal from comment #2)
> One more piece of information that might be useful: when you attempt to
> destroy such a failed cluster using openshift-install destroy cluster,
> you'll get this:
> 
> # ./openshift-install destroy cluster --dir=resources
> INFO Removing Template primary-jtmz6-rhcos        
> ERROR Failed to remove template: Fault reason is "Operation Failed". Fault
> detail is "[Cannot delete Template. Template is being used by the following
> VMs:
> primary-jtmz6-bootstrap,primary-jtmz6-master-0,primary-jtmz6-master-1,
> primary-jtmz6-master-2.]". HTTP response code is "409". HTTP response
> message is "409 Conflict". 
> INFO Time elapsed: 0s

I believe this is a separate issue and worth tracking as its own bug. The cleanup indeed doesn't work.

Comment 4 Jan Zmeskal 2020-09-03 08:02:50 UTC
Michal, in all the other scenarios, cleanup seems to work OK for me. That's what makes it hard to separate the cleanup issue from the original one.

Comment 5 Gal Zaidman 2020-09-07 12:47:47 UTC
Some questions/observations:
- I have never seen this on CI runs, and we spin up a lot of clusters... how did you install that cluster?
- Is there anything special on the oVirt side? Anything in the events/logs? Can you attach the oVirt logs (something along the lines of the collection sketch after this list would help)?
- I'm not sure whether this is the expected default behavior of OCP, but when a VM goes down for some reason, the cluster doesn't try to start it again. I think we should open a bug for that, which would resolve this one.
- On CI we run oVirt 4.4, which is the version supported for OCP 4.6. I think it is a problem that QE uses a 4.3.11 cluster when we require users to use 4.4.
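
For reference, a rough log-collection sketch (it assumes the default engine/host log locations and the ovirt-log-collector tool; the name prefix is the failed cluster's infra ID and should be adjusted):

# On the engine: the engine-side events for the failed cluster's VMs.
grep 'primary-jtmz6' /var/log/ovirt-engine/engine.log > engine-primary-jtmz6.log

# On each host that was supposed to run the VMs: the vdsm side.
grep 'primary-jtmz6' /var/log/vdsm/vdsm.log > vdsm-primary-jtmz6.log

# Or collect everything in one go from the engine machine:
ovirt-log-collector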

Comment 6 Jan Zmeskal 2020-09-07 13:46:07 UTC
Hi Gal,

- I have never seen this on CI runs, and we spin up a lot of clusters... how did you install that cluster?
I performed a very basic three-step installation (sketched below):
1. openshift-install create install-config
2. Edit install-config.yaml to grant both masters and workers even more resources than the defaults
3. openshift-install create cluster
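
For reference, the commands were along these lines (the field names in the comment are from the oVirt platform documentation as I recall them, so treat them as illustrative):

./openshift-install create install-config --dir=resources

# Edit resources/install-config.yaml before creating the cluster, raising the
# machine-pool resources, e.g. controlPlane.platform.ovirt.memoryMB and
# .cpu.cores (and the same fields under compute).

./openshift-install create cluster --dir=resources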

- Is there anything special on the oVirt side? Anything in the events/logs? Can you attach the oVirt logs?
Sure, I'll do that when I reproduce the bug.

- I'm not sure whether this is the expected default behavior of OCP, but when a VM goes down for some reason, the cluster doesn't try to start it again. I think we should open a bug for that, which would resolve this one.
I definitely think it's worth re-attempting to start the VM. I don't know about a running cluster, but definitely during installation. It might be that our VMs sometimes don't start because of some environment issue - I'm not excluding that option. However, it's certainly not because of a lack of memory or CPUs; maybe there's some very hard-to-catch network issue. Be that as it may, the installer waits 10 minutes for the master VMs to come up. During those 10 minutes there's plenty of time to try to start the VM again, but the installer does not do that.
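
As a manual stopgap during that window, something along these lines can nudge VMs that stay down (the engine URL, credentials and infra ID below are placeholders, and jq is assumed to be available):

ENGINE="https://engine.example.com/ovirt-engine/api"   # placeholder engine URL
AUTH="admin@internal:secret"                           # placeholder credentials
# Every 30s, look for cluster VMs that are still down and ask the engine to start them.
while true; do
  for id in $(curl -sk -u "$AUTH" -H 'Accept: application/json' \
        "$ENGINE/vms?search=name%3Dprimary-jtmz6-*%20and%20status%3Ddown" \
        | jq -r '.vm[]?.id'); do
    curl -sk -u "$AUTH" -X POST -H 'Content-Type: application/xml' \
         -d '<action/>' "$ENGINE/vms/$id/start"
  done
  sleep 30
done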

- On CI we run oVirt 4.4, which is the version supported for OCP 4.6. I think it is a problem that QE uses a 4.3.11 cluster when we require users to use 4.4.
We've recently switched to RHV 4.4. I deployed one OCP 4.6 cluster there and did not see the issue reproduce; however, that sample is of course too small.

Comment 7 Sandro Bonazzola 2020-09-14 09:55:05 UTC
No capacity in current sprint

Comment 8 Sandro Bonazzola 2020-10-22 11:34:17 UTC
Due to capacity constraints we will be revisiting this bug in the upcoming sprint.

Comment 9 Gal Zaidman 2020-11-24 09:49:39 UTC
We were unable to reproduce the issue, and we have increased the timeout to 20 minutes due to a different issue.

