Bug 1867043
| Summary: | A subset of bootstrap/master machines sometimes don't start | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Zmeskal <jzmeskal> |
| Component: | Installer | Assignee: | Gal Zaidman <gzaidman> |
| Installer sub component: | OpenShift on RHV | QA Contact: | Lucie Leistnerova <lleistne> |
| Status: | CLOSED NOTABUG | Type: | Bug |
| Severity: | medium | Priority: | medium |
| Version: | 4.6 | Target Release: | 4.7.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| CC: | gzaidman, hpopal, michal.skrivanek | Last Closed: | 2020-11-24 09:49:39 UTC |
One more piece of information that might be useful. When you attempt to destroy such a failed cluster using `openshift-install destroy cluster`, you'll get this (see the manual cleanup sketch at the end of this thread):

```
# ./openshift-install destroy cluster --dir=resources
INFO Removing Template primary-jtmz6-rhcos
ERROR Failed to remove template: Fault reason is "Operation Failed". Fault detail is "[Cannot delete Template. Template is being used by the following VMs: primary-jtmz6-bootstrap,primary-jtmz6-master-0,primary-jtmz6-master-1,primary-jtmz6-master-2.]". HTTP response code is "409". HTTP response message is "409 Conflict".
INFO Time elapsed: 0s
```

(In reply to Jan Zmeskal from comment #2)
> One more piece of information that might be useful. When you attempt to destroy such a failed cluster using openshift-install destroy cluster, you'll get this: [...]

I believe this is a separate issue and worth tracking as a bug. The cleanup indeed doesn't work.

Michal, in all the other scenarios cleanup seems to work OK for me. That's what makes it hard to separate the cleanup issue from the original one.

Some questions/observations:
- I never saw this on CI runs, and we are spinning up a lot of clusters. How did you install that cluster?
- Is there anything special on the oVirt side? Anything in the events/logs? Can you attach the oVirt logs?
- I'm not sure whether this is the required default behavior in OCP, but when a VM goes down for some reason the cluster doesn't try to start it again. I think we should open a bug for that, which would resolve this one.
- On CI we have oVirt 4.4, which is the version supported for OCP 4.6. I think it is a problem that QE uses a 4.3.11 cluster when we force users to use 4.4.

Hi Gal,

> I never saw this on CI runs, and we are spinning up a lot of clusters. How did you install that cluster?

I performed a very basic three-step installation:
1. `openshift-install create install-config`
2. Edit install-config.yaml to grant both masters and workers even more resources than the default
3. `openshift-install create cluster`

> Is there anything special on the oVirt side? Anything in the events/logs? Can you attach the oVirt logs?

Sure, I'll do that when I reproduce the bug.

> I'm not sure whether this is the required default behavior in OCP, but when a VM goes down for some reason the cluster doesn't try to start it again.

I definitely think it's worth re-attempting to start the VM. I don't know about a running cluster, but definitely during installation. It may be that our VMs sometimes fail to start because of some environment issue; I'm not excluding that option. However, it is certainly not due to a lack of memory or CPUs. Maybe there is some very hard-to-catch network issue. Be that as it may, the installer waits 10 minutes for the master VMs to come up. During those 10 minutes there is plenty of time to try to start the VM again, but the installer does not do that.

> On CI we have oVirt 4.4, which is the version supported for OCP 4.6. I think it is a problem that QE uses a 4.3.11 cluster when we force users to use 4.4.

Recently we've switched to RHV 4.4.
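For reference, a manual workaround for the 409 above: the template cannot be removed while VMs created from it still exist, so deleting the leftover VMs first unblocks the template removal. Below is a minimal sketch using the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, and the hard-coded infra-ID prefix are placeholders taken from the log above, not part of any documented recovery procedure.

```python
# Hedged sketch: manually remove leftover cluster VMs so the RHCOS template
# can be deleted. Engine URL and credentials below are placeholders.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

INFRA_ID = 'primary-jtmz6'  # infra-ID prefix seen in the failed-destroy log

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical engine
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
try:
    vms_service = connection.system_service().vms_service()
    # Stop and remove every VM created from the cluster template.
    for vm in vms_service.list(search='name=%s-*' % INFRA_ID):
        vm_service = vms_service.vm_service(vm.id)
        if vm.status != types.VmStatus.DOWN:
            vm_service.stop()
            while vm_service.get().status != types.VmStatus.DOWN:
                time.sleep(5)
        vm_service.remove()
    # With no VMs referencing it, the template delete no longer returns 409.
    templates_service = connection.system_service().templates_service()
    for template in templates_service.list(search='name=%s-rhcos' % INFRA_ID):
        templates_service.template_service(template.id).remove()
finally:
    connection.close()
```

After the leftovers are gone, re-running `openshift-install destroy cluster` should be able to finish removing the remaining resources.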
I deployed one OCP 4.6 cluster there and did not see the issue reproduce. However, that sample is of course too small.

No capacity in the current sprint; due to capacity constraints we will be revisiting this bug in the upcoming sprint.

We were unable to reproduce the issue, and we increased the timeout to 20 minutes due to a different issue.
Created attachment 1710760 [details]: Screenshot of RHV VMs

Description of problem:
While deploying OCP 4.6 clusters manually, it sometimes happens that one or more of the bootstrap/master machines do not start. The installer waits 10 minutes for them to come up and then times out. It happens only occasionally, but often enough to break about 30% of deployments. I don't recall ever encountering such an issue in OCP 4.4 or 4.5.

Version-Release number of the following components:
OCP: 4.6.0-0.nightly-2020-08-07-034746
RHV: 4.3.11.2-0.1.el7

How reproducible: 30%

Steps to Reproduce:
Do some very basic OCP cluster deployment. Unfortunately, the reproduction chance is not very high.
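To illustrate the re-attempt behavior discussed in the thread above (the report is precisely that the installer does not do this today): during the 10-minute wait window, a watcher could poll the cluster VMs and start any that are down. A minimal sketch with the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, and name prefix are placeholders, and the loop is illustrative rather than a proposed patch.

```python
# Hedged sketch of the retry loop suggested in the discussion: within the
# installer's wait window, start any bootstrap/master VM found Down.
# Engine URL, credentials, and infra-ID prefix are hypothetical.
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()

deadline = time.time() + 10 * 60  # the installer's 10-minute window
while time.time() < deadline:
    vms = vms_service.list(search='name=primary-jtmz6-*')
    if vms and all(vm.status == types.VmStatus.UP for vm in vms):
        break
    for vm in vms:
        if vm.status == types.VmStatus.DOWN:
            # Re-attempt the start instead of silently waiting for timeout.
            vms_service.vm_service(vm.id).start()
    time.sleep(30)

connection.close()
```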