Bug 1802764

Summary: etcdserver: request timed out during rsh image-registry mount on vmware
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Coady LaCroix <clacroix>
Component: installation Assignee: Sébastien Han <shan>
Status: CLOSED NOTABUG QA Contact: Coady LaCroix <clacroix>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.2 CC: ebenahar, madam, ocs-bugs, ratamir, shmohan, vavuthu
Target Milestone: --- Keywords: Automation
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-06-24 15:00:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Coady LaCroix 2020-02-13 19:37:20 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

During OCS installation, we validate that the PVC is mounted on the registry pod with the following command, which times out. This appears to happen only when installing on VMware.

E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-image-registry --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig rsh image-registry-5f49c458bf-76zgh mount.
E           Error is Error from server: etcdserver: request timed out

A portion of must-gather also failed with the same error (the rest of the must-gather output is linked under Additional info):
21:41:36 - MainThread - ocs_ci.ocs.utils - ERROR - Failed during must gather logs! Error: Error during execution of command: oc --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig adm must-gather --image=quay.io/openshift/origin-must-gather --dest-dir=/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1581540953/deployment_ocs_logs/ocp_must_gather.
Error is error: gather did not start for pod must-gather-p58g7: etcdserver: request timed out
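
For reference, a minimal sketch of how a validation step like this could tolerate a transient etcd timeout by retrying only when the error text matches the one above. This is not the ocs-ci implementation: `run_with_retry`, its parameters, and the retry counts are hypothetical names and values chosen for illustration.

```python
import subprocess
import time

# Error text taken from the log lines above.
ETCD_TIMEOUT = "etcdserver: request timed out"


def run_with_retry(cmd, retries=3, delay=10.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd and retry only when stderr reports the etcd timeout.

    Hypothetical helper for illustration, not part of ocs-ci.
    Returns stdout on success and raises RuntimeError otherwise.
    """
    last_err = ""
    for attempt in range(1, retries + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        last_err = result.stderr.strip()
        if ETCD_TIMEOUT not in last_err:
            break  # a different failure: retrying will not help
        if attempt < retries:
            sleep(delay)  # give etcd time to recover before the next try
    raise RuntimeError(f"command failed: {' '.join(cmd)}: {last_err}")
```

It could be called with the command from the log, for example `run_with_retry(["oc", "-n", "openshift-image-registry", "rsh", "image-registry-5f49c458bf-76zgh", "mount"])`. A retry only hides the symptom; if etcd keeps timing out, the underlying platform still needs to be looked at.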




Version of all relevant components (if applicable):
quay.io/rhceph-dev/ocs-olm-operator:4.2.2-rc4
OCP 4.3


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Installation fails because of this, which blocks further testing.


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1


Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?
untested


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an OCS cluster on VMware (Jenkins: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4543//console)
2. Validate the PVC is mounted on the registry pod


Actual results:
The rsh command times out with "etcdserver: request timed out".

Expected results:
Verification that the PVC is mounted on the registry pod succeeds.

Additional info:
Link to must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-vu1cs33-t4a/jnk-vu1cs33-t4a_20200212T204539/logs/failed_testcase_ocs_logs_1581540953/deployment_ocs_logs/

Comment 2 Yaniv Kaul 2020-02-14 00:00:00 UTC
- Are you sure your VMware environment is reasonably performing? Looks like etcd is timing out?
- Severity?

Comment 3 Coady LaCroix 2020-02-18 19:14:30 UTC
(In reply to Yaniv Kaul from comment #2)
> - Are you sure your VMware environment is reasonably performing? Looks like
> etcd is timing out?


I'm not really sure how to answer this. Are there specific requirements, related to etcd timeouts, that I can check our environment against?

Comment 4 Vijay Avuthu 2020-02-19 07:11:19 UTC
This cluster has 256 GB of memory and 8.99 TB of storage (VSAN). It's connected at 1 Gbit/s.

Comment 6 Yaniv Kaul 2020-03-04 09:42:56 UTC
(In reply to Vijay Avuthu from comment #4)
> This cluster has 256 GB of memory and 8.99 TB of storage (VSAN). It's
> connected at 1 Gbit/s.

1Gb is VERY VERY slow.

Comment 9 Michael Adam 2020-05-06 21:30:22 UTC
Has there been any further insight on this?


Anyway it's not 4.4 material...

Comment 10 Yaniv Kaul 2020-06-24 15:00:18 UTC
Closing. If you can reproduce on a reasonable* platform, please re-open.

* reasonable:
- well performing (10 Gb network, for a start)
- not overloaded (by other workloads)

It's not always easy to understand what's going on with the underlying platform, but it's a must in these cases. Not much we can do if the setup is either under-performing or overloaded.
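
One way to make "reasonable" checkable before a deployment run is a small pre-flight gate. The sketch below is only an illustration: it assumes you already have the link speed of the hosts and a set of disk fsync latency samples from the etcd disks, and the function names are hypothetical. The 10 Gbit/s floor comes from the criteria above; the 10 ms p99 fsync bound is an assumption taken from upstream etcd guidance, not something stated in this bug. It does not cover the "not overloaded" criterion.

```python
import math

MIN_LINK_GBITS = 10.0     # from the criteria above (10 Gb network)
MAX_P99_FSYNC_MS = 10.0   # assumption: upstream etcd guidance for WAL fsync p99


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def platform_problems(link_gbits, fsync_ms_samples):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if link_gbits < MIN_LINK_GBITS:
        problems.append(
            f"network is {link_gbits:g} Gbit/s, want >= {MIN_LINK_GBITS:g}")
    p99 = percentile(fsync_ms_samples, 99)
    if p99 > MAX_P99_FSYNC_MS:
        problems.append(
            f"p99 fsync latency is {p99:g} ms, want <= {MAX_P99_FSYNC_MS:g}")
    return problems
```

With the 1 Gbit/s link reported in comment 4, `platform_problems(1.0, samples)` would flag the network regardless of the disk numbers.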