Description of problem (please be detailed as possible and provide log snippets):
Collecting must-gather logs shows:
/usr/bin/gather_ceph_resources: line 341: jq: command not found

Version of all relevant components (if applicable):
OCP version:- 4.13.0-0.nightly-2023-05-22-040653
ODF version:- 4.13.0-203
CEPH version:- ceph version 17.2.6-50.el9cp (c202ddb5589554af0ce43432ff07cd7ce8f35243) quincy (stable)
ACM version:- 2.8.0-180
SUBMARINER version:- v0.15.0
VOLSYNC version:- volsync-product.v0.7.1

oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather@sha256:10071ddc29383af01d60eadfa4d6f2bd631cfd4c06fcdf7efdb655a84b13a4f1

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue reproducible?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run must-gather over ODF cluster
2.
3.

Actual results:
[must-gather-gctbc] POD 2023-05-24T14:10:31.584612326Z collecting snapshot info for cephFS subvolumes
[must-gather-gctbc] POD 2023-05-24T14:10:31.586366724Z /usr/bin/gather_ceph_resources: line 341: jq: command not found

Expected results:

Additional info:
I tried creating a pod with the must-gather image given above, and I don't see the jq package in it:
# jq -r
bash: jq: command not found
Since OCS 4.8, "jq" has been added downstream as part of the build; see http://pkgs.devel.redhat.com/cgit/containers/rook-ceph/commit/?h=ocs-4.8-rhel-8
Boris, has something changed in 4.13?
I don't see any change here around jq really. I tried comparing rook-ceph 4.13 and 4.9, both have jq binary in them, same version (1.6) and in the exact same location (/usr/bin/jq). I can confirm that there was no jq binary in ocs-must-gather in e.g. ODF 4.9 either so no change there either. My guess would be that the script is not running the jq binary in the rook-ceph pod anymore for some reason? It could be somehow related to the rhceph image using ubi-minimal as a base nowadays maybe?
I was looking at the script and it looks like I'm right. The error is coming from this line:

subvolgrp_names=$(timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "${ceph_command}"| jq --raw-output '.[].name')

The escaping is wrong there: the pipe to jq sits outside the string passed to bash -c, so it is trying to run jq in the must-gather container, and it is not available there. The line should look like this instead:

subvolgrp_names=$(timeout 120 oc -n "${ns}" exec "${HOSTNAME}"-helper -- bash -c "${ceph_command} | jq --raw-output '.[].name'")
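To illustrate why the position of the pipe matters, here is a minimal sketch that can be run locally without a cluster. It is not taken from gather_ceph_resources: run_in_helper is a hypothetical stand-in for `oc exec ... -- bash -c "<cmd>"`, and where_am_i is a hypothetical stand-in for jq that only reports which side it ran on.

```shell
#!/usr/bin/env bash
# Sketch only: simulates "local" (must-gather) vs "remote" (helper pod)
# execution with an environment variable instead of a real `oc exec`.
unset WHERE

# Hypothetical stand-in for `oc exec <pod> -- bash -c "<cmd>"`:
# runs the command string in a child shell marked as the helper pod.
run_in_helper() {
    WHERE=helper-pod bash -c "$1"
}

# Hypothetical stand-in for jq: consumes stdin like a filter and
# prints where it was executed.
where_am_i() {
    cat >/dev/null
    echo "${WHERE:-must-gather}"
}
export -f where_am_i   # make the function visible to the child bash

ceph_command="echo '[]'"

# Pipe OUTSIDE the quoted string: the filter runs in the calling shell.
outside=$(run_in_helper "${ceph_command}" | where_am_i)

# Pipe INSIDE the quoted string: the filter runs in the child shell.
inside=$(run_in_helper "${ceph_command} | where_am_i")

echo "pipe outside quotes -> filter ran in: ${outside}"
echo "pipe inside quotes  -> filter ran in: ${inside}"
```

With the pipe outside the quotes the filter runs in the calling shell (the must-gather side); with the pipe inside the quotes it runs in the child shell (the helper side), which is where jq is actually installed.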
Thanks Boris! Yati, please send a patch asap.
Added the link to the patch, will update once merged.
This has existed since 4.12 (commit https://github.com/red-hat-storage/ocs-operator/commit/b58ba9b8a8d6f5220842e44c210a6b42f2a6466a). Yati, please clone this bug to 4.12 as well; we need to fix it there too. I don't know why this was never discovered.
Yeah sure, will do that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742