Bug 1915953

Summary:	Must-gather takes hours to complete if the OCS cluster is not fully deployed, delay seen in ceph command collection step
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Neha Berry <nberry>
Component:	must-gather	Assignee:	Pulkit Kundra <pkundra>
Status:	CLOSED ERRATA	QA Contact:	Oded <oviner>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.7	CC:	ebenahar, muagarwa, nobody+410372, ocs-bugs, pkundra, sabose
Target Milestone:	---	Keywords:	AutomationBackLog
Target Release:	OCS 4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.7.0-731.ci	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-05-19 09:18:01 UTC	Type:	Bug

Description Neha Berry 2021-01-13 19:33:43 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
===================================================================
There can be situations where the OCS install is not fully complete, e.g. an OSD did not come up or the Ceph pods were not completely up because of some issue.

For such a problematic setup, if must-gather is initiated, it takes hours to complete.

The long response time is seen during the ceph command collection step, with each command taking too long to finish.

Note: As the cluster is not in good shape, most of the commands probably won't even work. Even when I run ceph commands in the toolbox pod, most of them take too long to complete. In such cases, however, must-gather should time out the command and move on.

As seen from the snippet below, each of the following commands took too long before collection progressed further.

[must-gather-nfswb] POD 2021-01-13 19:13:20 (16.4 MB/s) - 'jq' saved [497799/497799]
[must-gather-nfswb] POD 
[must-gather-nfswb] POD collecting command output for: ceph auth list
[must-gather-nfswb] POD collecting command output for: ceph balancer dump
[must-gather-nfswb] POD collecting command output for: ceph balancer pool ls
[must-gather-nfswb] POD collecting command output for: ceph balancer status
[must-gather-nfswb] POD collecting command output for: ceph config dump
[must-gather-nfswb] POD collecting command output for: ceph config-key ls
[must-gather-nfswb] POD collecting command output for: ceph crash ls

Setup/issue when this was observed: Bug 1915445

Version of all relevant components (if applicable):
======================================================
OCP = 4.7.0-0.nightly-2021-01-12-203716
OCS = ocs-operator.v4.7.0-230.ci

Must-gather: date --utc; time oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7 | tee terminal-must-gather2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
============================================================
Yes, log collection takes 5-6 hours to complete

Is there any workaround available to the best of your knowledge?
====================================================

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
============================================================
4

Is this issue reproducible?
===============================
Yes, reproduced 3 times already

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:
=============================================================
Not sure

Steps to Reproduce:
======================
1. Install OCP 4.7 on vmware
2. Install OCS 4.7 operator and then click on Create Storage Cluster
3. In the configure section, enable cluster-wide encryption and add the KMS details from the external Vault server.
4. Click Create on the Review and Create page
5. If you hit Bug 1915202, edit the configmap below to add [VAULT_SKIP_VERIFY: "true"]
6. Check whether the install succeeds; it is seen that OSD creation still fails due to KMS-related permission denied issues
7. The noobaa-db-pg-0 PVC stays in pending state
8. Start must-gather log collection:

#oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7


Actual results:
====================

The must-gather collection on a cluster in bad shape takes hours to complete.

The issue is seen during ceph command collection.

Expected results:
======================
The must-gather (MG) collection should not take so long. There should also be an option to skip some log collections if they take too long.
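
One way the expected behaviour could look is sketched below. This is only an illustration, not the actual ocs-must-gather code or the fix that shipped: the names run_with_timeout, collect_all and CMD_TIMEOUT, and the 120-second default, are assumptions made up for this example. It relies only on timeout(1) from GNU coreutils, which exits with status 124 when the time limit is hit.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: bound the run time of every collection command.
# CMD_TIMEOUT, run_with_timeout and collect_all are illustrative names,
# not taken from the ocs-must-gather scripts.
CMD_TIMEOUT="${CMD_TIMEOUT:-120}"   # seconds allowed per command (assumed value)

run_with_timeout() {
    local outfile="$1"; shift
    local rc=0
    # timeout(1) kills the command and exits with 124 once the limit is reached.
    timeout "${CMD_TIMEOUT}" "$@" >"${outfile}" 2>&1 || rc=$?
    if [ "${rc}" -eq 124 ]; then
        echo "timed out after ${CMD_TIMEOUT}s: $*" >>"${outfile}"
    fi
    return "${rc}"
}

collect_all() {
    local outdir="$1"; shift
    local cmd
    mkdir -p "${outdir}"
    for cmd in "$@"; do
        echo "collecting command output for: ${cmd}"
        # "ceph auth list" is written to ceph_auth_list; word splitting of
        # ${cmd} is intentional. A failure or timeout does not stop the loop.
        run_with_timeout "${outdir}/${cmd// /_}" ${cmd} || true
    done
}
```

With this sketch, a call such as collect_all /tmp/ceph-out "ceph auth list" "ceph balancer status" would spend at most CMD_TIMEOUT seconds on each command and then continue with the next one, so a hung command costs a bounded amount of time instead of stalling the whole collection.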




Additional info:
========================

Actual ceph status
-----------------------

sh-4.4# ceph -s
  cluster:
    id:     592ce459-4246-46e6-83bf-f1254ff491f2
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            Reduced data availability: 176 pgs inactive
            OSD count 0 < osd_pool_default_size 3
            clock skew detected on mon.b, mon.c
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 6h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:creating} 1 up:standby-replay
    osd: 0 osds: 0 up, 0 in
 
  data:
    pools:   10 pools, 176 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             176 unknown
 



POD status
--------------

=======PODS ======
NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
csi-cephfsplugin-d2hrm                                            3/3     Running   0          6h32m   10.1.161.53    compute-0   <none>           <none>
csi-cephfsplugin-provisioner-69786bcc49-9d92p                     6/6     Running   4          6h32m   10.129.2.13    compute-1   <none>           <none>
csi-cephfsplugin-provisioner-69786bcc49-mgw4n                     6/6     Running   0          6h32m   10.131.0.104   compute-2   <none>           <none>
csi-cephfsplugin-qt76h                                            3/3     Running   0          6h32m   10.1.160.137   compute-1   <none>           <none>
csi-cephfsplugin-v6kff                                            3/3     Running   0          6h32m   10.1.160.30    compute-2   <none>           <none>
csi-rbdplugin-4z9k9                                               3/3     Running   0          6h32m   10.1.161.53    compute-0   <none>           <none>
csi-rbdplugin-bf8qp                                               3/3     Running   0          6h32m   10.1.160.137   compute-1   <none>           <none>
csi-rbdplugin-provisioner-5c46b445bb-h24mr                        6/6     Running   2          6h32m   10.128.2.19    compute-0   <none>           <none>
csi-rbdplugin-provisioner-5c46b445bb-smhg5                        6/6     Running   0          6h32m   10.129.2.12    compute-1   <none>           <none>
csi-rbdplugin-r8kh8                                               3/3     Running   0          6h32m   10.1.160.30    compute-2   <none>           <none>
must-gather-btvth-helper                                          1/1     Running   0          5h40m   10.128.2.58    compute-0   <none>           <none>
must-gather-nfswb-helper                                          1/1     Running   0          11m     10.128.2.184   compute-0   <none>           <none>
must-gather-rhkqd-helper                                          1/1     Running   0          4h27m   10.128.2.89    compute-0   <none>           <none>
noobaa-core-0                                                     1/1     Running   0          6h30m   10.128.2.21    compute-0   <none>           <none>
noobaa-db-pg-0                                                    0/1     Pending   0          6h30m   <none>         <none>      <none>           <none>
noobaa-operator-56c5f65769-fx4c5                                  1/1     Running   0          10h     10.128.2.16    compute-0   <none>           <none>
ocs-metrics-exporter-5889875657-hxb8n                             1/1     Running   0          10h     10.128.2.17    compute-0   <none>           <none>
ocs-operator-66867c8876-pw6hl                                     1/1     Running   3          10h     10.128.2.14    compute-0   <none>           <none>
rook-ceph-crashcollector-compute-0-75bc74c444-7nqd6               1/1     Running   0          6h31m   10.128.2.22    compute-0   <none>           <none>
rook-ceph-crashcollector-compute-1-5cdfff6cd7-m9trt               1/1     Running   0          6h31m   10.129.2.18    compute-1   <none>           <none>
rook-ceph-crashcollector-compute-2-99bd58b-kc52f                  1/1     Running   0          6h31m   10.131.0.107   compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-96878744k86gn   1/1     Running   0          6h30m   10.131.0.108   compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-88ffdd45cgvdn   1/1     Running   0          6h30m   10.129.2.17    compute-1   <none>           <none>
rook-ceph-mgr-a-55dc894dfb-wtt2k                                  1/1     Running   0          6h30m   10.129.2.15    compute-1   <none>           <none>
rook-ceph-mon-a-864d575676-nnkkq                                  1/1     Running   0          6h31m   10.129.2.14    compute-1   <none>           <none>
rook-ceph-mon-b-84bd947b59-5pmlk                                  1/1     Running   0          6h31m   10.128.2.20    compute-0   <none>           <none>
rook-ceph-mon-c-5d58bb8454-dd2xl                                  1/1     Running   0          6h31m   10.131.0.106   compute-2   <none>           <none>
rook-ceph-operator-54596895fc-fbhxr                               1/1     Running   0          10h     10.128.2.15    compute-0   <none>           <none>
rook-ceph-tools-69d7bccb5f-lvqv6                                  1/1     Running   0          2m12s   10.1.161.53    compute-0   <none>           <none>

Comment 9 Persona non grata 2021-02-10 06:56:28 UTC

I faced a similar issue: must-gather took a lot of time during the ceph command collection steps.

[must-gather-cw76p] OUT gather logs unavailable: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
[must-gather-cw76p] OUT waiting for gather to complete
[must-gather-cw76p] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-tkzd6 deleted
[must-gather      ] OUT namespace/openshift-must-gather-2vmj2 deleted
error: gather never finished for pod must-gather-cw76p: timed out waiting for the condition

In the end, the command oc adm must-gather --image="quay.io/rhceph-dev/ocs-must-gather:latest-4.7" failed with the above error.

Comment 12 Persona non grata 2021-03-02 09:45:14 UTC

Tested on ocs-operator.v4.7.0-278.ci

NAME                                                              READY   STATUS                  RESTARTS   AGE
csi-cephfsplugin-955fl                                            3/3     Running                 0          178m
csi-cephfsplugin-provisioner-5f84f94c57-2vcft                     6/6     Running                 0          178m
csi-cephfsplugin-provisioner-5f84f94c57-p29vb                     6/6     Running                 3          178m
csi-cephfsplugin-qv9qf                                            3/3     Running                 0          178m
csi-cephfsplugin-rlphb                                            3/3     Running                 0          178m
csi-rbdplugin-7td22                                               3/3     Running                 0          178m
csi-rbdplugin-8cmhv                                               3/3     Running                 0          178m
csi-rbdplugin-jqqx7                                               3/3     Running                 0          178m
csi-rbdplugin-provisioner-68bd88fb68-lzjvn                        6/6     Running                 4          178m
csi-rbdplugin-provisioner-68bd88fb68-qrn7t                        6/6     Running                 0          178m
noobaa-core-0                                                     1/1     Running                 0          175m
noobaa-db-pg-0                                                    0/1     Pending                 0          175m
noobaa-operator-6fb598688b-vxx8d                                  1/1     Running                 0          3h3m
ocs-metrics-exporter-64967ddb76-nxfck                             1/1     Running                 0          3h3m
ocs-operator-6fd8ccdcf5-vmrdf                                     1/1     Running                 1          3h3m
rook-ceph-crashcollector-compute-0-8474776685-2c56z               1/1     Running                 0          177m
rook-ceph-crashcollector-compute-1-5f7f757894-s4h9n               1/1     Running                 0          176m
rook-ceph-crashcollector-compute-2-758fc7df9-656w5                1/1     Running                 0          177m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c4d5547c72srj   2/2     Running                 0          174m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6dffb4749d4wp   2/2     Running                 0          174m
rook-ceph-mgr-a-679dff6dbd-8xbk5                                  2/2     Running                 0          176m
rook-ceph-mon-a-b856bdcb-jn4hk                                    2/2     Running                 0          177m
rook-ceph-mon-b-f6545f4fb-kb8rg                                   2/2     Running                 0          177m
rook-ceph-mon-c-59dd86bf4d-kzpw4                                  2/2     Running                 0          176m
rook-ceph-operator-7778fb54f9-5hfmw                               1/1     Running                 0          3h3m
rook-ceph-osd-0-598d454d8b-v2rz6                                  0/2     Init:1/9                0          57m
rook-ceph-osd-1-f88db587f-drp8t                                   0/2     Init:CrashLoopBackOff   34         154m
rook-ceph-osd-2-869799c9f6-tndl8                                  0/2     Init:CrashLoopBackOff   32         143m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0bwwhj-hvnxc      0/1     Completed               0          176m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0xr9tm-556wl      0/1     Completed               0          176m
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0l696c-r7tpx      0/1     Completed               0          176m
rook-ceph-tools-5c5f779f59-9w7gb                                  1/1     Running                 0          76m


============================
[root@compute-2 /]# ceph -s 
  cluster:
    id:     03ed1f0c-6b32-40f6-979a-ca1412f9ef05
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            Reduced data availability: 176 pgs inactive
 
  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:creating} 1 up:standby-replay
    osd: 3 osds: 0 up, 0 in
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 176 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             176 unknown

================================
Total time taken to collect logs (start and end timestamps, roughly 6 minutes apart):

2021-03-02 09:10:40.355834012 +0000 UTC m=+0.409726534
2021-03-02 09:16:47.297227082 +0000 UTC m=+367.351119644


Gather debug log: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz1915953/gather-debug.log

Moving the bug to verified

Comment 15 errata-xmlrpc 2021-05-19 09:18:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041