1882824 – 2 hours later after Disaster recovery , openshift console failed to login.

Summary: 2 hours later after Disaster recovery , openshift console failed to login.

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Ricardo Carrillo Cruz
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1869362

Reported:	2020-09-25 19:48 UTC by milei
Modified:	2020-12-02 12:02 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-02 12:02:29 UTC
Target Upstream Version:
Embargoed:

Description milei 2020-09-25 19:48:23 UTC

Description of problem:

Disaster recovery was performed on OCP 4.4.15. After following the document "https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html", the disaster recovery steps completed successfully. All three master and three worker nodes were Ready. I could log in to the OpenShift console for the first two hours after the disaster recovery. After that, I could no longer log in to the OpenShift console; it returned "Application is not available".


Environment:
UPI-installed OpenShift platform 4.3.3, upgraded to 4.4.15, on bare metal with 3 master and 3 worker nodes.
The 3 masters, 3 workers, and the bootstrap node are all on a private network; public NICs are disabled. One infra node has dual NICs to access both the public and private networks.
The three worker nodes are labeled and configured to install OCS.


Steps to Reproduce:
1. Install OCP and OCS 4.4 as described above.
2. Back up etcd following this doc: "https://docs.openshift.com/container-platform/4.4/backup_and_restore/backing-up-etcd.html"
3. Run a reboot / power cycle negative test.
4. Restore following "https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html".
5. Log in to the OCP console; the login succeeds.
6. 2 hours later, try to navigate to the OpenShift console; it returns "Application is not available".
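
To pin down when the console stops responding after step 5, the steps above could be paired with a small polling loop. This is a minimal sketch, not part of the original report: the URL is a hypothetical placeholder for the cluster's real console route, the function names are illustrative, and it assumes that "Application is not available" is the router's error page served with HTTP 503.

```shell
#!/bin/sh
# Hypothetical console URL; substitute the host of the real console route.
CONSOLE_URL="${CONSOLE_URL:-https://console-openshift-console.apps.example.com}"

# Map an HTTP status code to a coarse label.
classify_code() {
  case "$1" in
    2??|3??) echo "ok" ;;
    503)     echo "unavailable" ;;   # assumed code behind "Application is not available"
    000)     echo "unreachable" ;;   # curl prints 000 when no response was received
    *)       echo "other" ;;
  esac
}

# Fetch only the status code of the console URL (-k skips TLS verification).
probe_console() {
  curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 "$CONSOLE_URL"
}

# Log one timestamped line per probe, every INTERVAL seconds (default 300).
watch_console() {
  interval="${1:-300}"
  while :; do
    code=$(probe_console)
    echo "$(date -u '+%Y-%m-%d %H:%M:%S UTC') $code $(classify_code "$code")"
    sleep "$interval"
  done
}
```

Running `watch_console 300 >> console-probe.log` right after the restore would give a timestamp for the transition that can be matched against the must-gather logs.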


Actual results:
1. After the disaster recovery steps, logging in to the OCP console succeeds.
2. 2 hours later, navigating to the OpenShift console returns "Application is not available".



Expected results:
1. After the disaster recovery steps, it should always be possible to log in to the OCP console.


Additional info:

1. Before disaster recovery, under OpenShift console -> Overview, the cluster's status was green and the operator status was yellow.

Comment 1 milei 2020-09-25 19:51:42 UTC

Here is the must-gather tar file:
http://10.8.32.38/str/ocpdebug/must-gather_DR_console_failed.tar.gz

Comment 2 Sam Batschelet 2020-09-25 20:09:52 UTC

Thanks for the report. The latest 4.4 release is 4.4.26; 4.4.15 is a few months old now. If possible, we would like to test against the latest code, but I understand that is not always possible.

Moving to the console team, as they are the experts in their component, and the rest of the components in the cluster reconciled, including etcd.

Comment 4 milei 2020-09-28 13:32:38 UTC

[root@dell-per730-09 ~]# oc get nodes
NAME      STATUS   ROLES           AGE   VERSION
master0   Ready    master,worker   52d   v1.17.1+3288478
master1   Ready    master,worker   52d   v1.17.1+3288478
master2   Ready    master,worker   52d   v1.17.1+3288478
worker0   Ready    worker          52d   v1.17.1+3288478
worker1   Ready    worker          52d   v1.17.1+3288478
worker2   Ready    worker          52d   v1.17.1+3288478
[root@dell-per730-09 ~]# oc get pods -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE
oauth-openshift-657bb565b6-46blv   1/1     Running   0          3d19h
oauth-openshift-657bb565b6-mtlqs   1/1     Running   0          3d19h
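
The output above covers only the nodes and the openshift-authentication namespace. As a rough sketch, the same kind of check could be extended to the console and ingress pods and to the cluster operators by filtering `oc get` output for rows that are not in the expected state. The helper name `rows_not_in_state` is hypothetical; the commands in the usage comments need a logged-in `oc` session.

```shell
# Print the first column of every data row whose COLUMN-th field differs
# from EXPECTED. Reads tabular `oc get` output on stdin; the first line is
# treated as the header and skipped.
rows_not_in_state() {
  awk -v col="$1" -v want="$2" 'NR > 1 && NF >= col && $col != want { print $1 }'
}

# Usage:
#   oc get nodes | rows_not_in_state 2 Ready
#   oc get pods -n openshift-console | rows_not_in_state 3 Running
#   oc get pods -n openshift-ingress | rows_not_in_state 3 Running
#   oc get clusteroperators | rows_not_in_state 3 True   # AVAILABLE column
```

Empty output from every command would suggest that the nodes, console pods, router pods, and operators all report healthy, which narrows the search to the route or the network path in front of the router.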

Comment 5 Ricardo Carrillo Cruz 2020-11-16 14:13:27 UTC

Tried to repro on Friday and didn't hit this.
Will try a couple more times.

Comment 6 Ricardo Carrillo Cruz 2020-11-27 15:00:41 UTC

I'm afraid I'm unable to reproduce this.
Will try to get someone from QE to see if they have better luck, so I can get an environment for debugging it.

Comment 7 Ricardo Carrillo Cruz 2020-11-27 15:01:24 UTC

Hey Zhanqui

Any chance you could repro this in QE lab?
I have tried a few times and I have not been successful.

Thanks

Comment 9 Ricardo Carrillo Cruz 2020-12-02 12:02:29 UTC

Ok, then let's close.
If this occurs again, please reopen the bug.
