Created attachment 1791494 [details]
metrics

Description of problem:
The problem arises when a physical node is lost suddenly (not gracefully), i.e., we lose the leader master-N, a worker-N, and an ocs-worker-N. For some reason etcd performance degrades, and it takes OCP too long (20-25 minutes) to recover and to reschedule application containers (application containers running on the lost worker-N are not moved to another node until that time elapses!).

We have repeated the test of powering off an OCP node (master-2 and worker-2), and the result has been the same: it takes OCP 20 minutes to relaunch our application pods! We attach screenshots of the metrics and related information.

The problem points clearly to Red Hat OCP: it fails to create the new pods on another node (the new pods remain in state "ContainerCreating" for 20 minutes), and the OCP operator "authentication" turns NOT AVAILABLE when the node is lost (as seen in the screenshot with "oc get co").

Version-Release number of selected component (if applicable):
OCP 4.7.7

The cluster is based on VMware & vSphere virtualization over an on-premise HW cluster (HPE SimpliVity) of 3 physical nodes. Every physical node runs 3 virtual machines for OCP & OCS: master-N, ocs-worker-N, and worker-N. The VM datastores are distributed by HPE SimpliVity over the 3 physical nodes. 3x masters (4 CPUs and 16 GB RAM), 3x workers, and 3x infra. Load is low. Masters are on different hosts; network latency is 75 us and msg_rate is 13K/s.

Additional test:
We have tested both with the 3 master nodes on the same HW/host and on different hosts (the normal situation): the result is the same, so it is not an issue of the physical network/switches. We also increased the vCPUs from 4 to 8 in the master VMs.
As the must-gather is too big for Bugzilla, please get it from the linked case. Thanks.
Moving to etcd for a first analysis. If etcd has problems recovering, the whole cluster will certainly suffer. Am I right that the API is quickly available again, i.e. you can do `oc get clusteroperators` while the cluster is unstable?
To clarify the etcd issue: the customer claims etcd performance goes bad after a node is lost, and there are also visible spikes in the graphs at that time, even though the load is minimal. As noted by the customer last:

Our cluster provider -HPE- has checked our SimpliVity-VMware cluster in detail and has concluded that the cluster is working fine and with very high performance for our Red Hat OCP virtual machines (disks, networks, memory, CPUs...). The performance is high enough to run etcd, according to the official Red Hat documentation: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_

The load "m" etcd performance test (etcdctl check perf --load="m") should be much more demanding than what our 3-node OCP cluster requires, and it passes OK. The latency is really low (~75 us), as we told you, and the disk performance is also very high.

We have repeated the test of powering off an OCP node (master-2 and worker-2), and the result has been the same: it takes OCP 20 minutes to relaunch our application pods! We attach screenshots of the metrics and information. The problem points clearly to Red Hat OCP: it fails to create the new pods on another node (the new pods remain in state "ContainerCreating" for 20 minutes), and the OCP operator "authentication" turns NOT AVAILABLE when losing the node (as seen in the screenshot with "oc get co").
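For what it's worth, etcd is most sensitive to fsync latency on its WAL writes, which `etcdctl check perf` only exercises indirectly. Below is a minimal standalone sketch in Go that times small synced sequential writes, similar in spirit to the fio-based check in the linked documentation (the file name, iteration count, and 2300-byte block size are illustrative choices here, not values taken from any tooling). The commonly cited guideline is a 99th-percentile sync latency well under 10 ms:

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"time"
)

func main() {
	// Illustrative probe: time small sequential writes, each followed by
	// an fsync, the way etcd persists every WAL entry before acking.
	const (
		iterations = 500
		blockSize  = 2300 // illustrative; roughly a WAL-entry-sized write
	)

	f, err := os.CreateTemp(".", "fsync-probe-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, blockSize)
	latencies := make([]time.Duration, 0, iterations)

	for i := 0; i < iterations; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil { // fsync: the expensive part
			panic(err)
		}
		latencies = append(latencies, time.Since(start))
	}

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p99 := latencies[iterations*99/100]
	fmt.Printf("p99 write+fsync latency: %v (guideline: well under 10ms for etcd)\n", p99)
}
```

Run it on the master's etcd data volume; if the p99 lands in the tens of milliseconds, storage is the bottleneck regardless of how good the throughput numbers look.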
Our hypothesis is that https://github.com/kubernetes/kubernetes/pull/95981 should allow the registry to discover connectivity problems within 30 seconds. It has been merged into the registry via https://github.com/openshift/image-registry/pull/272. Given that it's supposed to be already fixed in 4.8.0, moving it to ON_QA.
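For context, that PR enables HTTP/2 ping-based health checking on client connections, so a client holding a TCP connection to an apiserver on a powered-off node notices the dead connection in roughly 30 seconds instead of waiting on kernel TCP timeouts. A minimal sketch of the underlying golang.org/x/net/http2 mechanism follows; the timeout values mirror the defaults I believe the PR introduces, and the target URL is a placeholder:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// With ReadIdleTimeout set, the http2 transport sends a PING frame
	// whenever a connection has been idle for that long; if no ack arrives
	// within PingTimeout, the connection is closed and the next request
	// dials a fresh one instead of hanging on a dead peer.
	transport := &http2.Transport{
		ReadIdleTimeout: 30 * time.Second,
		PingTimeout:     15 * time.Second,
	}

	client := &http.Client{Transport: transport}

	// Placeholder endpoint; in the real fix this wiring happens inside
	// client-go's transport setup, not in application code.
	resp, err := client.Get("https://kubernetes.default.svc/healthz")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

As far as I can tell, client-go also exposes these knobs through the HTTP2_READ_IDLE_TIMEOUT_SECONDS and HTTP2_PING_TIMEOUT_SECONDS environment variables, should tuning be needed during verification.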
Marking this bug as verified per comment #19. The image registry detects the problem and resumes processing within 30s of openshift-apiserver reporting the lost connection, and rescheduling then succeeds after 5 minutes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
Update from the customer: We had some issues with the OCP update to v4.7.22; some of its services had not been updated completely, so the update got stuck. After solving the update problems, we have verified that v4.7.22 finally works correctly, resolving the problem reported in this case.