Created attachment 1791494 [details]
metrics

Description of problem:
The problem arises when a physical node is lost suddenly (not gracefully), i.e., we lose the leader master-N, a worker-N, and an ocs-worker-N. For some reason etcd performance degrades, and it takes OCP too long (20-25 minutes) to recover and to reschedule application containers (application containers running on the lost worker-N are not moved to another node until that time elapses!).

We have repeated the test of powering off an OCP node (master-2 and worker-2), and the result has been the same: it takes OCP 20 minutes to relaunch our application pods! We attach screenshots of the metrics and related information.

The problem points clearly to Red Hat OCP: it fails to create the new pods on another node (the new pods remain in state "ContainerCreating" for 20 minutes), and the OCP operator "authentication" turns NOT AVAILABLE when the node is lost (as seen in the screenshot with "oc get co").

Version-Release number of selected component (if applicable):
OCP 4.7.7

The cluster is based on VMware & vSphere virtualization over an on-premise HW cluster (HPE SimpliVity) of 3 physical nodes. Every physical node runs 3 virtual machines for OCP & OCS: master-N, ocs-worker-N, and worker-N. The VM datastores are distributed by HPE SimpliVity over the 3 physical nodes. 3x masters (4 CPUs and 16 GB RAM), 3x workers, and 3x infra. Load is low. Masters are on different hosts; network latency is 75 us and msg_rate is 13K/s.

Additional test:
We have tested both with the 3 master nodes on the same HW/host and on different hosts (the normal situation): the result is the same, so it is not an issue of the physical network/switches. We also increased the vCPUs from 4 to 8 in the master VMs.
As the must-gather is too big for Bugzilla, please get it from the linked case. Thanks.
Moving to etcd for a first analysis. If etcd has problems recovering, the whole cluster will certainly suffer. Am I right that the API is quickly available again, i.e. you can do `oc get clusteroperators` while the cluster is unstable?
To clarify the etcd issue: the customer claims etcd performance goes bad after a node is lost, and there are also visible spikes in the graphs at that time, even though the load is minimal. As noted by the customer last:

Our cluster provider -HPE- has checked our SimpliVity-VMware cluster in detail and has concluded that the cluster is working fine and with very high performance for our Red Hat OCP virtual machines (disks, networks, memory, CPUs...). The performance is high enough to run etcd, according to the official Red Hat documentation: https://docs.openshift.com/container-platform/4.7/scalability_and_performance/recommended-host-practices.html#recommended-etcd-practices_

The load "m" etcd performance test (etcdctl check perf --load="m") should be much more demanding than what our 3-node OCP cluster requires, and it passes OK. The latency is really low (~75 us), as we told you, and the disk performance is also very high.

We have repeated the test of powering off an OCP node (master-2 and worker-2), and the result has been the same: it takes OCP 20 minutes to relaunch our application pods! We attach screenshots of the metrics and information. The problem points clearly to Red Hat OCP: it fails to create the new pods on another node (the new pods remain in state "ContainerCreating" for 20 minutes), and the OCP operator "authentication" turns NOT AVAILABLE when losing the node (as seen in the screenshot with "oc get co").
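For what it's worth, etcd is most sensitive to fsync latency on its WAL writes, which `etcdctl check perf` only exercises indirectly. Below is a minimal standalone sketch in Go that times small synced sequential writes, similar in spirit to the fio-based check in the linked documentation (the file name, iteration count, and 2300-byte block size are illustrative choices here, not values taken from any tooling). The commonly cited guideline is a 99th-percentile sync latency well under 10 ms:

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"time"
)

func main() {
	// Illustrative probe: time small sequential writes, each followed by
	// an fsync, the way etcd persists every WAL entry before acking.
	const (
		iterations = 500
		blockSize  = 2300 // illustrative; roughly a WAL-entry-sized write
	)

	f, err := os.CreateTemp(".", "fsync-probe-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, blockSize)
	latencies := make([]time.Duration, 0, iterations)

	for i := 0; i < iterations; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil { // fsync: the expensive part
			panic(err)
		}
		latencies = append(latencies, time.Since(start))
	}

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	p99 := latencies[iterations*99/100]
	fmt.Printf("p99 write+fsync latency: %v (guideline: well under 10ms for etcd)\n", p99)
}
```

Run it on the master's etcd data volume; if the p99 lands in the tens of milliseconds, storage is the bottleneck regardless of how good the throughput numbers look.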
Our hypothesis is that https://github.com/kubernetes/kubernetes/pull/95981 should allow the registry to discover connectivity problems within 30 seconds. It has been merged into the registry via https://github.com/openshift/image-registry/pull/272. Given that it's supposed to be already fixed in 4.8.0, moving it to ON_QA.
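For context, that PR enables HTTP/2 ping-based health checking on client connections, so a client holding a TCP connection to an apiserver on a powered-off node notices the dead connection in roughly 30 seconds instead of waiting on kernel TCP timeouts. A minimal sketch of the underlying golang.org/x/net/http2 mechanism follows; the timeout values mirror the defaults I believe the PR introduces, and the target URL is a placeholder:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// With ReadIdleTimeout set, the http2 transport sends a PING frame
	// whenever a connection has been idle for that long; if no ack arrives
	// within PingTimeout, the connection is closed and the next request
	// dials a fresh one instead of hanging on a dead peer.
	transport := &http2.Transport{
		ReadIdleTimeout: 30 * time.Second,
		PingTimeout:     15 * time.Second,
	}

	client := &http.Client{Transport: transport}

	// Placeholder endpoint; in the real fix this wiring happens inside
	// client-go's transport setup, not in application code.
	resp, err := client.Get("https://kubernetes.default.svc/healthz")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

As far as I can tell, client-go also exposes these knobs through the HTTP2_READ_IDLE_TIMEOUT_SECONDS and HTTP2_PING_TIMEOUT_SECONDS environment variables, should tuning be needed during verification.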
Marking this bug as verified per comment #19. The image registry detects the problem and resumes processing within 30s of openshift-apiserver reporting the lost connection, and rescheduling then succeeds after 5 minutes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
Update from the customer: We had some issues with the OCP update to v4.7.22; some of its services had not been updated completely, so the update got stuck. After solving the update problems, we have verified that v4.7.22 finally works correctly, resolving the problem reported in this case.