Description of problem:
Disaster recovery on OCP 4.4.15. After following the document "https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html", the disaster recovery steps completed successfully. All three master and three worker nodes are Ready. I could log in to the OpenShift console for the first two hours after disaster recovery. After that, I could no longer log in to the OpenShift console; it returned "Application is not available".

Environment:
UPI-installed OpenShift platform 4.3.3, upgraded to 4.4.15, on bare metal with 3 master and 3 worker nodes. The 3 masters, 3 workers, and the bootstrap node are all on a private network; their public NICs are disabled. One infra node has dual NICs to access both the public and the private network. The three worker nodes are labeled and configured to install OCS.

Steps to Reproduce:
1. Install OCP and OCS 4.4 as above.
2. Back up etcd following "https://docs.openshift.com/container-platform/4.4/backup_and_restore/backing-up-etcd.html".
3. Run the reboot / power cycle negative test.
4. Restore following "https://docs.openshift.com/container-platform/4.4/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html".
5. Log in to the OCP console; login succeeds.
6. Two hours later, try to navigate to the OpenShift console; it returns "Application is not available".

Actual results:
1. After the disaster recovery steps, login to the OCP console succeeds.
2. Two hours later, navigating to the OpenShift console returns "Application is not available".

Expected results:
1. After the disaster recovery steps, it is always possible to log in to the OCP console.

Additional info:
1. Before disaster recovery, in the OpenShift console -> Overview, the cluster's status is green and the operator status is yellow.
Here is the must-gather tar file: http://10.8.32.38/str/ocpdebug/must-gather_DR_console_failed.tar.gz
Thanks for the report. The latest 4.4 release is 4.4.26; 4.4.15 is a few months old now. If possible we would like to test against the latest code, but I understand that is not always possible. Moving to the console team, as they are the experts in their component and the rest of the components in the cluster, including etcd, reconciled.
[root@dell-per730-09 ~]# oc get nodes
NAME      STATUS   ROLES           AGE   VERSION
master0   Ready    master,worker   52d   v1.17.1+3288478
master1   Ready    master,worker   52d   v1.17.1+3288478
master2   Ready    master,worker   52d   v1.17.1+3288478
worker0   Ready    worker          52d   v1.17.1+3288478
worker1   Ready    worker          52d   v1.17.1+3288478
worker2   Ready    worker          52d   v1.17.1+3288478

[root@dell-per730-09 ~]# oc get pods -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE
oauth-openshift-657bb565b6-46blv   1/1     Running   0          3d19h
oauth-openshift-657bb565b6-mtlqs   1/1     Running   0          3d19h
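The node and oauth pod listings above look healthy, so a similar check on the cluster operators may help narrow down which component is behind "Application is not available". The following is only a sketch: the helper name unhealthy_operators is hypothetical, and it assumes the default `oc get clusteroperators` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE).

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get clusteroperators` output on stdin and
# prints the name of every operator that is not Available or is Degraded.
# Assumes the default column order:
#   NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
unhealthy_operators() {
    # NR > 1 skips the header row; $3 is AVAILABLE, $5 is DEGRADED.
    awk 'NR > 1 && ($3 != "True" || $5 != "False") { print $1 }'
}

# Intended usage on a live cluster (not run here):
#   oc get clusteroperators | unhealthy_operators
```

An empty result would suggest that every operator reports Available and not Degraded; any name printed (for example console, authentication, or ingress) would be a reasonable place to start looking at pods and logs.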
Tried to repro on Friday and didn't hit this. Will try a couple more times.
I'm afraid I'm unable to reproduce this. I will try to get someone from QE to see if they have better luck, so I can get an environment for debugging it.
Hey Zhanqui, any chance you could reproduce this in the QE lab? I have tried a few times and have not been successful. Thanks
Ok, then let's close. If this occurs again, please reopen the bug.