1802764 – etcdserver: request timed out during rsh image-registry mount on vmware

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	installation
Sub Component:
Version:	4.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Sébastien Han
QA Contact:	Coady LaCroix
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-02-13 19:37 UTC by Coady LaCroix
Modified:	2020-06-24 15:00 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-06-24 15:00:18 UTC
Embargoed:
Dependent Products:

Description Coady LaCroix 2020-02-13 19:37:20 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

During OCS installation, we attempt to validate that the PVC is mounted on the registry pod using the following command, which times out. This appears to happen only when installing on VMware.

E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-image-registry --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh image-registry-5f49c458bf-76zgh mount.
E           Error is Error from server: etcdserver: request timed out

A portion of must-gather also failed with the same error (the rest of the must-gather output is linked under Additional info):
21:41:36 - MainThread - ocs_ci.ocs.utils - ERROR - Failed during must gather logs! Error: Error during execution of command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig adm must-gather --image=quay.io/openshift/origin-must-gather --dest-dir=/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1581540953/deployment_ocs_logs/ocp_must_gather.
Error is error: gather did not start for pod must-gather-p58g7: etcdserver: request timed out
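
Both failures above carry the same server-side message, so one way to tell a transient etcd timeout from a persistent one is to retry the command only when that exact message appears. The following is a minimal sketch, not part of ocs-ci: the helper name `run_with_retry`, the retry count, and the delay are all assumptions chosen for illustration. Retrying does not address a slow or overloaded platform; it only shows whether the timeout persists.

```python
import subprocess
import time

# Message copied from the errors in this report.
ETCD_TIMEOUT = "etcdserver: request timed out"


def run_with_retry(cmd, attempts=3, delay=5.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying only when stderr contains the etcd timeout message.

    Hypothetical helper for illustration. `runner` and `sleep` are
    injectable so the logic can be exercised without a cluster.
    """
    last = None
    used = 0
    for attempt in range(1, attempts + 1):
        used = attempt
        last = runner(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last
        if ETCD_TIMEOUT not in (last.stderr or ""):
            break  # a different failure; retrying is unlikely to help
        if attempt < attempts:
            sleep(delay)
    message = (last.stderr or "").strip()
    raise RuntimeError(f"command failed after {used} attempt(s): {message}")


# Example call, using the namespace and pod name from this report:
# run_with_retry(["oc", "-n", "openshift-image-registry",
#                 "rsh", "image-registry-5f49c458bf-76zgh", "mount"])
```

If every attempt fails with the same message, the timeout is not a one-off and the etcd members or the underlying platform need to be examined.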




Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-olm-operator:4.2.2-rc4
OCP 4.3


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Installation fails because of this, which stops us from testing further.


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?
untested


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an OCS cluster on VMware (jenkins: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4543//console)
2. Validate the PVC is mounted on the registry pod
3.


Actual results:
The rsh command times out.

Expected results:
Verification that the PVC is mounted on the registry pod succeeds.

Additional info:
Link to must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cs33-t4a/jnk-vu1cs33-t4a_20200212T204539/logs/failed_testcase_ocs_logs_1581540953/deployment_ocs_logs/

Comment 2 Yaniv Kaul 2020-02-14 00:00:00 UTC

- Are you sure your VMware environment is reasonably performing? Looks like etcd is timing out?
- Severity?

Comment 3 Coady LaCroix 2020-02-18 19:14:30 UTC

(In reply to Yaniv Kaul from comment #2)
> - Are you sure your VMware environment is reasonably performing? Looks like
> etcd is timing out?


I'm not really sure how to answer this. Are there specific requirements I can check our environment against with regard to etcd timing out?

Comment 4 Vijay Avuthu 2020-02-19 07:11:19 UTC

This cluster has 256 GB of memory and 8.99 TB of storage (vSAN), and it is connected at 1 Gbit/s.

Comment 6 Yaniv Kaul 2020-03-04 09:42:56 UTC

(In reply to Vijay Avuthu from comment #4)
> This cluster has 256 GB of memory and 8.99 TB of storage (vSAN), and it is
> connected at 1 Gbit/s.

1Gb is VERY VERY slow.

Comment 9 Michael Adam 2020-05-06 21:30:22 UTC

Has there been any further insight on this?


Anyway, it's not 4.4 material...

Comment 10 Yaniv Kaul 2020-06-24 15:00:18 UTC

Closing. If you can reproduce on a reasonable* platform, please re-open.

* reasonable:
- well-performing (10 Gb network, for a start)
- not overloaded (by other workloads)

It's not always easy to understand what's going on with the underlying platform, but it's a must in these cases. Not much we can do if the setup is either under-performing or overloaded.
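
As a rough first look at whether a platform is under-performing, one option is to measure how long small synchronous writes take on the disk that would back etcd. This is a minimal sketch under stated assumptions: the function names, the sample count, and the block size are hypothetical, and the 10 ms figure in the comment is the value commonly cited in upstream etcd hardware guidance for 99th-percentile WAL fsync latency, not a threshold taken from this bug. It does not replace etcd's own metrics or a proper disk benchmark.

```python
import math
import os
import tempfile
import time


def fsync_latencies_ms(directory, samples=200, block_size=2048):
    """Write small blocks, fsync after each, and return latencies in ms."""
    payload = b"\0" * block_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write to stable storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumption: the current directory sits on the disk of interest.
    results = fsync_latencies_ms(".")
    p99 = percentile(results, 99)
    # Upstream etcd guidance commonly cites p99 fsync latency under 10 ms.
    print(f"p99 fsync latency: {p99:.2f} ms")
```

A high value here points at the storage layer, but a low value does not rule out network latency between etcd members or contention from other workloads.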
