Bug 1972565

Summary: performance issues due to lost node, pods taking too long to relaunch
Product: OpenShift Container Platform Reporter: peter ducai <pducai>
Component: Image Registry    Assignee: Oleg Bulatov <obulatov>
Status: RELEASE_PENDING --- QA Contact: XiuJuan Wang <xiuwang>
Severity: high Docs Contact:
Priority: high    
Version: 4.7    CC: aos-bugs, bjarolim, bleanhar, cldavey, lszaszki, mfojtik, oarribas, obulatov, ppostler, xiuwang, xxia
Target Milestone: ---    Keywords: Reopened
Target Release: 4.8.0    Flags: ppostler: needinfo? (rmarasch)
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:12:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1984094    
Attachments: metrics

Description peter ducai 2021-06-16 08:29:42 UTC
Created attachment 1791494 [details]
metrics

Description of problem:

The problem arises when a physical node is lost suddenly (not gracefully), i.e., we lose the leader master-N, a worker-N, and a ocs-worker-N.

For some reason, etcd seems to have low performance, and it takes OCP too long (20-25 minutes) to recover and to relaunch the application containers (application containers running on worker-N are lost and not moved to another node until that time elapses!).

We have repeated the test of powering off an OCP node (master-2 and worker-2), and the result has been the same: it takes OCP 20 minutes to relaunch our application pods! We attach screenshots of the metrics and related information. The problem clearly points to Red Hat OCP: it fails to create the new pods on another node (the new pods remain in the "ContainerCreating" state for 20 minutes), and the OCP "authentication" operator turns NOT AVAILABLE when the node is lost (as seen in the screenshot with "oc get co").


Version-Release number of selected component (if applicable):

OCP 4.7.7

The cluster is based on VMware & vSphere virtualization over an on-premises HW cluster (HPE SimpliVity) of 3 physical nodes. Every physical node has 3 virtual machines for OCP & OCS: master-N, ocs-worker-N, and worker-N. The VM datastores are distributed by HPE SimpliVity over the 3 physical nodes.

3x masters (4 CPUs and 16 GB RAM), 3x workers, and 3x infra; the load is low.

Masters are on different hosts; network latency is ~75 us and msg_rate is 13K/s.

Additional test: 
We have tested both having the 3 master nodes on the same HW/host and on different hosts (the normal situation): the result is the same, so it is not an issue of the physical network/switches. We also increased the vCPUs from 4 to 8 in the master VMs.

Comment 2 peter ducai 2021-06-16 08:34:21 UTC
As the must-gather is too big for Bugzilla, please get it from the linked case. Thanks.

Comment 5 Stefan Schimanski 2021-06-16 09:18:47 UTC
Moving to etcd for a first analysis. If etcd has problems recovering, the whole cluster will certainly suffer.

Am I right that the API is quickly available again, i.e. you can do `oc get clusteroperators` while the cluster is unstable?

Comment 7 peter ducai 2021-06-16 09:28:14 UTC
To clarify the etcd issue: the customer claims that etcd performance goes bad after the node is lost, and there are also visible spikes in the graphs at that time. Also, the load is minimal.

As last noted by the customer:

Our cluster provider, HPE, has checked our SimpliVity-VMware cluster in detail and has concluded that the cluster is working fine and with very high performance for our Red Hat OCP virtual machines (disks, networks, memory, CPUs...). The performance is high enough to run etcd, according to the official Red Hat documentation:

https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_

The load "M" of ETCD performance test ("etcdctl check perf --load="m") should be much higher than the required for our 3-node OCP  cluster, and it passes OK. The latency is really low (~75 us) as we told you, and the disk performance is also very high.

We have repeated the test of powering off an OCP node (master-2 and worker-2), and the result has been the same: it takes OCP 20 minutes to relaunch our application pods! We attach screenshots of the metrics and related information. The problem clearly points to Red Hat OCP: it fails to create the new pods on another node (the new pods remain in the "ContainerCreating" state for 20 minutes), and the OCP "authentication" operator turns NOT AVAILABLE when the node is lost (as seen in the screenshot with "oc get co").
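
For reference, the recommended-practices document linked above is mostly concerned with storage latency for etcd's write-ahead log (small sequential writes, each followed by fsync), and its guidance is based on fio. The Go snippet below is only a rough, hypothetical illustration of that I/O pattern, not the documented procedure:

// Rough illustration only: time small sequential writes, each followed by
// Fsync, to approximate the WAL-style I/O pattern etcd is sensitive to.
// This is NOT the fio-based procedure from the linked documentation.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	f, err := os.CreateTemp(".", "fsync-probe-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 2048) // small writes, roughly the size of a WAL entry
	const writes = 100
	var worst time.Duration
	start := time.Now()
	for i := 0; i < writes; i++ {
		t0 := time.Now()
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil { // fsync after every write, like the etcd WAL
			panic(err)
		}
		if d := time.Since(t0); d > worst {
			worst = d
		}
	}
	fmt.Printf("%d write+fsync cycles: avg %v, worst %v\n",
		writes, time.Since(start)/writes, worst)
}

As a rough point of comparison, etcd's general guidance is that the 99th percentile of WAL fsync latency should stay below roughly 10 ms.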

Comment 18 Oleg Bulatov 2021-07-20 14:51:34 UTC
Our hypothesis is that https://github.com/kubernetes/kubernetes/pull/95981 should allow the registry to discover connectivity problems within 30 seconds. It has been merged into the registry via https://github.com/openshift/image-registry/pull/272.
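
For context, a minimal sketch of the mechanism this hypothesis relies on, assuming it is the HTTP/2 connection health check that the linked Kubernetes PR enables in client-go: when a ReadIdleTimeout is set, the client sends an HTTP/2 ping on a connection that has seen no frames for that long and closes the connection if the ping is not answered, so a client whose peer disappeared with the lost node stops reusing the dead connection within roughly 30-45 seconds instead of waiting for long TCP timeouts. The values and helper name below are illustrative, not taken from the registry code:

// Illustrative sketch only (assumption: the fix is client-go's HTTP/2
// connection health check). Values and names are not taken from the
// image-registry code.
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHealthCheckedTransport returns an HTTP transport whose HTTP/2
// connections get pinged after 30s without any frames and are closed if
// the ping is not answered within 15s, so a dead peer is dropped quickly.
func newHealthCheckedTransport() (*http.Transport, error) {
	t := &http.Transport{}
	h2, err := http2.ConfigureTransports(t) // enable HTTP/2 on t and expose its settings
	if err != nil {
		return nil, err
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a ping after 30s of idleness
	h2.PingTimeout = 15 * time.Second     // drop the connection if the ping times out
	return t, nil
}

func main() {
	tr, err := newHealthCheckedTransport()
	if err != nil {
		panic(err)
	}
	client := &http.Client{Transport: tr}
	_ = client // requests made with this client stop reusing dead connections in ~30-45s
}

Once the dead connection is dropped, subsequent requests go out on a fresh connection to a healthy endpoint, which matches the 30-second discovery window mentioned above.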

Given that it's supposed to be already fixed in 4.8.0, moving it to ON_QA.

Comment 20 XiuJuan Wang 2021-07-22 01:02:41 UTC
Marking this bug as verified according to comment #19.
The image registry reports Progressing within 30s after the openshift-apiserver reports the lost connection, and the pods are then rescheduled successfully after 5 minutes.

Comment 28 errata-xmlrpc 2021-07-27 23:12:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 32 peter ducai 2021-08-10 12:06:44 UTC
Update from the customer:


We had some issues with the OCP update to v4.7.22: some of its services had not been updated completely, so the update got stuck.

After solving the problems with the update, we have checked that v4.7.22 finally works correctly, which solves the problem reported in this case.