Description of problem (please be as detailed as possible and provide log snippets):
==================================================================
There can be situations when an OCS install is not fully complete, e.g. the OSDs did not come up or the ceph pods are not completely up due to some issue. On such a problematic setup, if must-gather is initiated, it takes hours to complete. The long response time is seen during the ceph command collection, with each command taking too long to finish.

Note: as the cluster is not in good shape, most of the commands most probably will not even work. Even when I run ceph commands in the toolbox pod, most of them take too long to complete. In such cases must-gather should somehow time the command out and progress. As seen in the snippet below, each of the following commands was taking too long to progress further.

[must-gather-nfswb] POD 2021-01-13 19:13:20 (16.4 MB/s) - 'jq' saved [497799/497799]
[must-gather-nfswb] POD
[must-gather-nfswb] POD collecting command output for: ceph auth list
[must-gather-nfswb] POD collecting command output for: ceph balancer dump
[must-gather-nfswb] POD collecting command output for: ceph balancer pool ls
[must-gather-nfswb] POD collecting command output for: ceph balancer status
[must-gather-nfswb] POD collecting command output for: ceph config dump
[must-gather-nfswb] POD collecting command output for: ceph config-key ls
[must-gather-nfswb] POD collecting command output for: ceph crash ls

Setup/issue when this was observed: Bug 1915445

Version of all relevant components (if applicable):
======================================================
OCP = 4.7.0-0.nightly-2021-01-12-203716
OCS = ocs-operator.v4.7.0-230.ci
Must-gather: date --utc; time oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7 | tee terminal-must-gather2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
============================================================
Yes, log collection takes 5-6 hours to complete.

Is there any workaround available to the best of your knowledge?
====================================================

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
============================================================
4

Can this issue be reproduced?
===============================
Reproduced 3 times already.

Can this issue reproduce from the UI?
If this is a regression, please provide more details to justify this:
=============================================================
Not sure

Steps to Reproduce:
===================================================================
1. Install OCP 4.7 on VMware.
2. Install the OCS 4.7 operator and then click Create Storage Cluster.
3. In the configure section, enable cluster-wide encryption and add the KMS details from an external Vault server.
4. Click Create on the Review and Create page.
5. If you hit Bug 1915202, edit the KMS connection configmap to add [VAULT_SKIP_VERIFY: "true"] (see the sketch after this list).
6. The install may appear to succeed, but OSD creation still fails due to KMS-related permission denied issues.
7. The noobaa-db-pg-0 PVC stays in Pending state.
8. Start must-gather log collection:
   # oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7
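A minimal sketch of the configmap edit in step 5, assuming the Vault connection settings live in the ocs-kms-connection-details configmap in the openshift-storage namespace (the configmap name is an assumption for illustration, not confirmed in this report):

   # configmap name assumed; adjust to whatever holds the KMS connection details on your cluster
   oc -n openshift-storage patch configmap ocs-kms-connection-details \
       --type merge -p '{"data":{"VAULT_SKIP_VERIFY":"true"}}'

The same change can also be made interactively with oc edit configmap.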
Actual results:
====================
The must-gather collection on a cluster in bad shape takes hours to complete. The issue is seen during the ceph command collection.

Expected results:
======================
The must-gather collection should not take this long. There should also be an option to skip some log collections if they take too long.

Additional info:
========================
Actual ceph status
-----------------------
sh-4.4# ceph -s
  cluster:
    id:     592ce459-4246-46e6-83bf-f1254ff491f2
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            Reduced data availability: 176 pgs inactive
            OSD count 0 < osd_pool_default_size 3
            clock skew detected on mon.b, mon.c

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 6h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:creating} 1 up:standby-replay
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   10 pools, 176 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             176 unknown

POD status
--------------
======= PODS ======
NAME  READY  STATUS  RESTARTS  AGE  IP  NODE  NOMINATED NODE  READINESS GATES
csi-cephfsplugin-d2hrm  3/3  Running  0  6h32m  10.1.161.53  compute-0  <none>  <none>
csi-cephfsplugin-provisioner-69786bcc49-9d92p  6/6  Running  4  6h32m  10.129.2.13  compute-1  <none>  <none>
csi-cephfsplugin-provisioner-69786bcc49-mgw4n  6/6  Running  0  6h32m  10.131.0.104  compute-2  <none>  <none>
csi-cephfsplugin-qt76h  3/3  Running  0  6h32m  10.1.160.137  compute-1  <none>  <none>
csi-cephfsplugin-v6kff  3/3  Running  0  6h32m  10.1.160.30  compute-2  <none>  <none>
csi-rbdplugin-4z9k9  3/3  Running  0  6h32m  10.1.161.53  compute-0  <none>  <none>
csi-rbdplugin-bf8qp  3/3  Running  0  6h32m  10.1.160.137  compute-1  <none>  <none>
csi-rbdplugin-provisioner-5c46b445bb-h24mr  6/6  Running  2  6h32m  10.128.2.19  compute-0  <none>  <none>
csi-rbdplugin-provisioner-5c46b445bb-smhg5  6/6  Running  0  6h32m  10.129.2.12  compute-1  <none>  <none>
csi-rbdplugin-r8kh8  3/3  Running  0  6h32m  10.1.160.30  compute-2  <none>  <none>
must-gather-btvth-helper  1/1  Running  0  5h40m  10.128.2.58  compute-0  <none>  <none>
must-gather-nfswb-helper  1/1  Running  0  11m  10.128.2.184  compute-0  <none>  <none>
must-gather-rhkqd-helper  1/1  Running  0  4h27m  10.128.2.89  compute-0  <none>  <none>
noobaa-core-0  1/1  Running  0  6h30m  10.128.2.21  compute-0  <none>  <none>
noobaa-db-pg-0  0/1  Pending  0  6h30m  <none>  <none>  <none>  <none>
noobaa-operator-56c5f65769-fx4c5  1/1  Running  0  10h  10.128.2.16  compute-0  <none>  <none>
ocs-metrics-exporter-5889875657-hxb8n  1/1  Running  0  10h  10.128.2.17  compute-0  <none>  <none>
ocs-operator-66867c8876-pw6hl  1/1  Running  3  10h  10.128.2.14  compute-0  <none>  <none>
rook-ceph-crashcollector-compute-0-75bc74c444-7nqd6  1/1  Running  0  6h31m  10.128.2.22  compute-0  <none>  <none>
rook-ceph-crashcollector-compute-1-5cdfff6cd7-m9trt  1/1  Running  0  6h31m  10.129.2.18  compute-1  <none>  <none>
rook-ceph-crashcollector-compute-2-99bd58b-kc52f  1/1  Running  0  6h31m  10.131.0.107  compute-2  <none>  <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-96878744k86gn  1/1  Running  0  6h30m  10.131.0.108  compute-2  <none>  <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-88ffdd45cgvdn  1/1  Running  0  6h30m  10.129.2.17  compute-1  <none>  <none>
rook-ceph-mgr-a-55dc894dfb-wtt2k  1/1  Running  0  6h30m  10.129.2.15  compute-1  <none>  <none>
rook-ceph-mon-a-864d575676-nnkkq  1/1  Running  0  6h31m  10.129.2.14  compute-1  <none>  <none>
rook-ceph-mon-b-84bd947b59-5pmlk  1/1  Running  0  6h31m  10.128.2.20  compute-0  <none>  <none>
rook-ceph-mon-c-5d58bb8454-dd2xl  1/1  Running  0  6h31m  10.131.0.106  compute-2  <none>  <none>
rook-ceph-operator-54596895fc-fbhxr  1/1  Running  0  10h  10.128.2.15  compute-0  <none>  <none>
rook-ceph-tools-69d7bccb5f-lvqv6  1/1  Running  0  2m12s  10.1.161.53  compute-0  <none>  <none>
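As suggested under Expected results, the gather script could bound each ceph command with a timeout instead of waiting indefinitely on an unhealthy cluster. A minimal sketch of the idea in shell (the command list, the 120-second limit, and the output directory are illustrative only, not the actual must-gather implementation):

   mkdir -p must_gather_commands
   for cmd in "ceph auth list" "ceph balancer status" "ceph config dump" "ceph crash ls"; do
       echo "collecting command output for: ${cmd}"
       # GNU timeout kills the command if it does not finish within 120 seconds,
       # so a single unresponsive ceph daemon cannot stall the whole collection.
       timeout 120 ${cmd} > "must_gather_commands/${cmd// /_}" 2>&1 \
           || echo "skipped: '${cmd}' did not finish within 120s"
   done

The same idea works interactively against the toolbox pod shown in the listing above, e.g. oc -n openshift-storage rsh deploy/rook-ceph-tools timeout 60 ceph -s, so a hung command returns control of the terminal after at most a minute.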
I faced a similar issue, with a lot of time spent in the ceph command collection steps.

[must-gather-cw76p] OUT gather logs unavailable: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
[must-gather-cw76p] OUT waiting for gather to complete
[must-gather-cw76p] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-tkzd6 deleted
[must-gather      ] OUT namespace/openshift-must-gather-2vmj2 deleted
error: gather never finished for pod must-gather-cw76p: timed out waiting for the condition

In the end, the command oc adm must-gather --image="quay.io/rhceph-dev/ocs-must-gather:latest-4.7" failed with the error above.
Tested on ocs-operator.v4.7.0-278.ci

NAME  READY  STATUS  RESTARTS  AGE
csi-cephfsplugin-955fl  3/3  Running  0  178m
csi-cephfsplugin-provisioner-5f84f94c57-2vcft  6/6  Running  0  178m
csi-cephfsplugin-provisioner-5f84f94c57-p29vb  6/6  Running  3  178m
csi-cephfsplugin-qv9qf  3/3  Running  0  178m
csi-cephfsplugin-rlphb  3/3  Running  0  178m
csi-rbdplugin-7td22  3/3  Running  0  178m
csi-rbdplugin-8cmhv  3/3  Running  0  178m
csi-rbdplugin-jqqx7  3/3  Running  0  178m
csi-rbdplugin-provisioner-68bd88fb68-lzjvn  6/6  Running  4  178m
csi-rbdplugin-provisioner-68bd88fb68-qrn7t  6/6  Running  0  178m
noobaa-core-0  1/1  Running  0  175m
noobaa-db-pg-0  0/1  Pending  0  175m
noobaa-operator-6fb598688b-vxx8d  1/1  Running  0  3h3m
ocs-metrics-exporter-64967ddb76-nxfck  1/1  Running  0  3h3m
ocs-operator-6fd8ccdcf5-vmrdf  1/1  Running  1  3h3m
rook-ceph-crashcollector-compute-0-8474776685-2c56z  1/1  Running  0  177m
rook-ceph-crashcollector-compute-1-5f7f757894-s4h9n  1/1  Running  0  176m
rook-ceph-crashcollector-compute-2-758fc7df9-656w5  1/1  Running  0  177m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-c4d5547c72srj  2/2  Running  0  174m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6dffb4749d4wp  2/2  Running  0  174m
rook-ceph-mgr-a-679dff6dbd-8xbk5  2/2  Running  0  176m
rook-ceph-mon-a-b856bdcb-jn4hk  2/2  Running  0  177m
rook-ceph-mon-b-f6545f4fb-kb8rg  2/2  Running  0  177m
rook-ceph-mon-c-59dd86bf4d-kzpw4  2/2  Running  0  176m
rook-ceph-operator-7778fb54f9-5hfmw  1/1  Running  0  3h3m
rook-ceph-osd-0-598d454d8b-v2rz6  0/2  Init:1/9  0  57m
rook-ceph-osd-1-f88db587f-drp8t  0/2  Init:CrashLoopBackOff  34  154m
rook-ceph-osd-2-869799c9f6-tndl8  0/2  Init:CrashLoopBackOff  32  143m
rook-ceph-osd-prepare-ocs-deviceset-thin-0-data-0bwwhj-hvnxc  0/1  Completed  0  176m
rook-ceph-osd-prepare-ocs-deviceset-thin-1-data-0xr9tm-556wl  0/1  Completed  0  176m
rook-ceph-osd-prepare-ocs-deviceset-thin-2-data-0l696c-r7tpx  0/1  Completed  0  176m
rook-ceph-tools-5c5f779f59-9w7gb  1/1  Running  0  76m
============================
[root@compute-2 /]# ceph -s
  cluster:
    id:     03ed1f0c-6b32-40f6-979a-ca1412f9ef05
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            Reduced data availability: 176 pgs inactive

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:creating} 1 up:standby-replay
    osd: 3 osds: 0 up, 0 in

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             176 unknown
================================

Total time taken to collect the logs (about 6 minutes):
2021-03-02 09:10:40.355834012 +0000 UTC m=+0.409726534
2021-03-02 09:16:47.297227082 +0000 UTC m=+367.351119644

Gather debug log: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz1915953/gather-debug.log

Moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041