Created attachment 1797154 [details]
secret rrok-ceph-mon.yaml

Description of problem (please be detailed as possible and provide log snippets):
---------------------------------------------------------------------------
Attempted to collect must-gather on an ODF Managed Services cluster; following are the observations:

1. oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7 --dest-dir=ocs-must-gather

[must-gather-w8h6l] POD 2021-07-02T12:31:55.725333050Z creating helper pod
[must-gather-w8h6l] POD 2021-07-02T12:31:56.405245791Z pod/must-gather-w8h6l-helper created
[must-gather-w8h6l] POD 2021-07-02T12:31:56.551127937Z pod/must-gather-w8h6l-helper labeled
[must-gather-w8h6l] POD 2021-07-02T12:31:56.553593213Z waiting for 78 to terminate
[must-gather-w8h6l] POD 2021-07-02T12:36:16.676172942Z error: timed out waiting for the condition on pods/must-gather-w8h6l-helper

POD status
=============
openshift-storage   must-gather-w8h6l-helper   0/1   CreateContainerConfigError   0   22s   10.128.2.92   ip-10-0-220-234.ec2.internal   <none>   <none>

Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       41s                default-scheduler  Successfully assigned openshift-storage/must-gather-w8h6l-helper to ip-10-0-220-234.ec2.internal
  Normal   AddedInterface  39s                multus             Add eth0 [10.128.2.92/23]
  Normal   Pulled          13s (x5 over 39s)  kubelet            Container image "registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:f1a86bf9ed909b8b0efdecc8f43a46c5f8bde1d74a0d219f9ad4abb9a2a75119" already present on machine
  Warning  Failed          13s (x5 over 39s)  kubelet            Error: couldn't find key admin-secret in Secret openshift-storage/rook-ceph-mon

➜ oc get secret | grep mon
rook-ceph-mon            kubernetes.io/rook   4   5h28m
rook-ceph-mons-keyring   kubernetes.io/rook   1   5h28m

Version of all relevant components (if applicable):
---------------------------------------------------------------------------
NAME                      DISPLAY                       VERSION   REPLACES   PHASE
ocs-operator.v4.7.2       OpenShift Container Storage   4.7.2                Succeeded
ocs-osd-deployer.v0.0.1   OCS OSD Deployer              0.0.1                Succeeded

OCP = 4.7.18

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
---------------------------------------------------------------------------
Ceph command outputs are not collected.

Is there any workaround available to the best of your knowledge?
---------------------------------------------------------------------------
Not sure

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
---------------------------------------------------------------------------
3

Is this issue reproducible?
---------------------------------------------------------------------------
Yes

Can this issue be reproduced from the UI?
---------------------------------------------------------------------------
N/A

If this is a regression, please provide more details to justify this:
---------------------------------------------------------------------------
Not sure

Steps to Reproduce:
---------------------------------------------------------------------------
1. Install the OCS Add-on from the OCM UI
2. Start must-gather:
   oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7 --dest-dir=ocs-must-gather

Actual results:
---------------------------------------------------------------------------
The must-gather helper pod fails to get created, and hence the Ceph collections are skipped:

➜ jul2 ls -l ocs-must-gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-1949179411885858ec719ab052868c734b98b49787498a8297f1a4ace0283eae/ceph
total 8
-rw-r--r--. 1 nberry nberry 3336 Jul  2 18:06 event-filter.html
drwxr-xr-x. 1 nberry nberry   34 Jul  2 18:07 namespaces
-rw-r--r--. 1 nberry nberry  549 Jul  2 18:06 timestamp

Expected results:
---------------------------------------------------------------------------
The helper pod should be created, and must-gather should be able to collect Ceph command outputs.

Additional info:
---------------------------------------------------------------------------
2. oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7 | tee terminal-mg.log

This falls back to openshift-must-gather, as rhceph-dev is not added to the pull secret (as expected):

openshift-must-gather-6kzrv   must-gather-dzqwd   0/2   ImagePullBackOff   0   8s   10.129.0.69   ip-10-0-220-238.ec2.internal   <none>   <none>
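For reference, the CreateContainerConfigError can be traced to the key set of the rook-ceph-mon secret. The key names below are an assumption about what a 4.6+ Rook cluster writes (four data keys, matching the "oc get secret" output above); on a live cluster the real list comes from `oc -n openshift-storage get secret rook-ceph-mon -o jsonpath='{.data}'`. A minimal sketch of the check:

```shell
# Assumed key names for a 4.6+ rook-ceph-mon secret (hypothetical; verify
# on a live cluster with:
#   oc -n openshift-storage get secret rook-ceph-mon -o jsonpath='{.data}')
secret_keys='ceph-secret
ceph-username
fsid
mon-secret'

# The pre-4.6 "standard" helper template mounts the admin-secret key, which
# newer Rook releases no longer create, so the kubelet reports
# "couldn't find key admin-secret" and the helper pod never starts.
if printf '%s\n' "$secret_keys" | grep -qx 'admin-secret'; then
  echo "admin-secret present"
else
  echo "admin-secret missing"
fi
```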
We have two templates for must-gather: one for pre-4.6 releases, which uses ROOK_ADMIN_SECRET (called the standard template), and the latest template for 4.6 and above, which uses ROOK_CEPH_USERNAME. The code that chooses the template is pretty straightforward:

if [[ $current_version -ge 460 ]]; then
    apply_latest_helper_pod "$ns" "$operatorImage"
else
    apply_standard_helper_pod "$ns" "$operatorImage"
fi

This issue is happening because the standard template is being used even though the above code is in place. We get the value of $current_version by running this command:

>> oc get csv --no-headers | awk '{print $5}' | tr -dc '0-9'

Now, on this cluster, the following is the result of "oc get csv":

[muagarwa@mudits-workstation ~]$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.345-ee53105   Route Monitor Operator        0.1.345-ee53105   route-monitor-operator.v0.1.343-e469921   Succeeded
[muagarwa@mudits-workstation ~]$

We always expect a single line of output from this command, but because of the multiple lines of output, the value of $current_version becomes 47200103700134553105 :) This is far out of bounds for the integer comparison, so the else branch comes into the picture and we end up applying the standard template.

Neha, if I am not wrong, we will have multiple CSVs whenever multiple operators are installed?

Pulkit, do we still need two templates? They were required for 4.5 and below; now we can remove this logic and keep only the latest template. WDYT? If you agree, I will send a PR to remove the standard template and this logic.

Also, we are hitting the same in BZ #1978541, which is a customer case, and we need to think of a workaround to unblock the customer. IMO, this is a real blocker and should be backported to 4.6.z and 4.7.z too.
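The bogus value can be reproduced outside the cluster by replaying the parsing pipeline against the CSV listing above. The csv_output variable below is just a hard-coded copy of that listing, not a live oc call:

```shell
# Hard-coded copy of "oc get csv --no-headers" from the affected cluster
csv_output='ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.345-ee53105   Route Monitor Operator        0.1.345-ee53105   route-monitor-operator.v0.1.343-e469921   Succeeded'

# The must-gather parsing step: take field 5 of every row, keep digits only.
# Because DISPLAY is multi-word, awk's $5 lands on a different column in each
# row, and tr -dc also strips the newlines, gluing all four rows together.
current_version=$(printf '%s\n' "$csv_output" | awk '{print $5}' | tr -dc '0-9')
echo "$current_version"   # → 47200103700134553105
```

Per row, $5 is 4.7.2, 0.0.1, prometheusoperator.0.37.0, and 0.1.345-ee53105; stripped to digits and concatenated, that is exactly the 47200103700134553105 reported above.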
Can we just run >> oc get csv --no-headers | grep ocs-operator | awk '{print $5}' | tr -dc '0-9' instead? To me, it seems like the simplest fix and the most robust one
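Replaying the proposed command against the same four-row listing confirms it isolates the ocs-operator row (csv_output is again a hard-coded copy of the cluster output, not a live query):

```shell
# Same hard-coded "oc get csv --no-headers" listing as on the affected cluster
csv_output='ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.345-ee53105   Route Monitor Operator        0.1.345-ee53105   route-monitor-operator.v0.1.343-e469921   Succeeded'

# grep drops every row except ocs-operator.v4.7.2, so awk sees exactly one
# line whose fifth field is the OCS version, 4.7.2
current_version=$(printf '%s\n' "$csv_output" | grep ocs-operator | awk '{print $5}' | tr -dc '0-9')
echo "$current_version"   # → 472
```

With a single matching row, the digits-only result is 472, which compares correctly against the 460 threshold in the template-selection code.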
(In reply to Ohad from comment #4)
> Can we just run
> >> oc get csv --no-headers | grep ocs-operator | awk '{print $5}' | tr -dc '0-9'
> instead?
>
> To me, it seems like the simplest fix and the most robust one

Yes, that's the simplest fix, but the point here is that the whole logic of finding the current version and choosing the pod template based on it is no longer required. It was needed only to support pre-4.6 releases.
*** Bug 1978541 has been marked as a duplicate of this bug. ***
Do we have the issue in 4.8 too? Are we backporting it to 4.7 once 4.8 is verified?
(In reply to Neha Berry from comment #7)
> Do we have the issue in 4.8 too?

We have the problem in 4.6/4.7/4.8.

> Are we backporting it to 4.7 once 4.8 is verified?

Yes, I have created a clone for 4.7 and marked it for 4.7.3. We should backport to 4.6 as well, since I am not able to think of any workaround.