Bug 1918272 - [OCPonRHV] Cluster should be recovered after power outage
Summary: [OCPonRHV] Cluster should be recovered after power outage
Keywords:
Status: CLOSED DUPLICATE of bug 1915902
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-20 11:20 UTC by Michael Burman
Modified: 2021-01-25 14:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-25 13:46:28 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather (5.93 KB, application/zip)
2021-01-21 10:36 UTC, Janos Bonic

Description Michael Burman 2021-01-20 11:20:00 UTC
Description of problem:
[OCPonRHV] Cluster should be recovered after power outage

Two days ago I managed to install cluster 4.7 with great success.
4.7.0-0.nightly-2021-01-12-150634
The cluster was alive for 24 hours; everything was ready and nothing was degraded.

Yesterday we had a major power outage (AC dead). The physical host running the engine VM and the master and worker VMs went down.
After the host recovered (some hours later), I started the master and worker VMs manually in RHV.
One cluster operator has been degraded since then:

oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-12-150634   True        False         41h     Error while reconciling 4.7.0-0.nightly-2021-01-12-150634: the cluster operator image-registry is degraded

image-registry                             4.7.0-0.nightly-2021-01-12-150634   True        False         True       18h
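For reference, a minimal sketch of how degraded operators could be picked out of the full operator listing. It assumes the usual `oc get clusteroperators` header row (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE); the function name `degraded_operators` is hypothetical and not part of oc.

```shell
# Sketch: print the names of cluster operators whose DEGRADED column is True.
# Reads the tabular output of `oc get clusteroperators` on stdin.
degraded_operators() {
  awk '
    NR == 1 {
      # Locate the DEGRADED column from the header row.
      for (i = 1; i <= NF; i++) if ($i == "DEGRADED") col = i
      next
    }
    # Only report rows once the column has been found.
    col && $col == "True" { print $1 }
  '
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get clusteroperators | degraded_operators
```

On the cluster described here this would be expected to list only image-registry, since every other operator is reported as healthy.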

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-12-150634

How reproducible:
1/1 

Steps to Reproduce:
1. Install 4.7 cluster on RHV 4.4.4
2. The cluster installs successfully and is alive
3. An unexpected power outage kills the bare-metal host on which the HE VM and the master and worker VMs are running
4. Recover the bare-metal host, then the HE VM and engine. Start the master and worker VMs manually.

Actual results:
All master and worker VMs are running as expected, and all got IPs.
One cluster operator did not recover: image-registry remained degraded after the power outage recovery.

version   4.7.0-0.nightly-2021-01-12-150634   True        False         41h     Error while reconciling 4.7.0-0.nightly-2021-01-12-150634: the cluster operator image-registry is degraded


Expected results:
The cluster should recover after a power outage and be operational. No cluster operator should be degraded.

Additional info:
Janos from DEV team has acknowledged this bug and has collected the logs.

Comment 1 Oleg Bulatov 2021-01-20 13:37:33 UTC
Can you provide us with logs? As the description doesn't contain messages from the registry operator, I'd suggest using must-gather to collect all necessary information.

Comment 2 Michael Burman 2021-01-20 15:20:23 UTC
(In reply to Oleg Bulatov from comment #1)
> Can you provide us with logs? As the description doesn't contain messages
> from the registry operator, I'd suggest using must-gather to collect all
> necessary information.

Hi Oleg,

Janos from our development team has collected all the relevant info. He will add his findings a bit later.
Also, I'm not sure that I opened it against the right component.

Comment 3 Janos Bonic 2021-01-21 10:36:19 UTC
Created attachment 1749344
must-gather

Comment 4 Oleg Bulatov 2021-01-22 15:17:27 UTC
This must-gather archive is almost empty; it has neither cluster-scoped resources nor the openshift-image-registry namespace.

Janos, do you have something else?

Comment 5 Janos Bonic 2021-01-22 21:00:07 UTC
@obulatov no, but the cluster is still up. I can run whatever you need me to run or @Michael Burman can give you access if needed.
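As a starting point, here is a rough sketch of what could be run on the live cluster to fill the gaps noted in comment 4 (cluster-scoped resources and the openshift-image-registry namespace). The `oc` subcommands are standard ones; the `run` wrapper, the `DRY_RUN` switch, the function name, and the output directory are assumptions made for illustration only.

```shell
# Sketch: gather image-registry diagnostics from a live cluster.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

collect_registry_diagnostics() {
  out_dir="${1:-./registry-diagnostics}"   # hypothetical output directory
  ns="openshift-image-registry"

  # Full must-gather, written to a known directory so it can be zipped and attached.
  run oc adm must-gather --dest-dir="$out_dir/must-gather"
  # Operator status, including the conditions that explain the Degraded state.
  run oc get clusteroperator image-registry -o yaml
  # Image registry operator configuration.
  run oc get configs.imageregistry.operator.openshift.io cluster -o yaml
  # Pods and operator logs in the registry namespace.
  run oc -n "$ns" get pods -o wide
  run oc -n "$ns" logs deployment/cluster-image-registry-operator
}
```

Running `DRY_RUN=1; collect_registry_diagnostics` first shows exactly which commands would be issued before anything touches the cluster.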

