Bug 1978663 - must-gather-helper pod fails to come up on ODF Managed Services setup, hence no ceph collection succeeds
Summary: must-gather-helper pod fails to come up on ODF Managed Services setup, hence no ceph collection succeeds
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: must-gather
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: Gobinda Das
QA Contact: Neha Berry
URL:
Whiteboard:
Duplicates: 1978541 (view as bug list)
Depends On:
Blocks: 1979155 1979514
 
Reported: 2021-07-02 12:42 UTC by Neha Berry
Modified: 2022-04-11 11:30 UTC (History)
7 users

Fixed In Version: v4.8.0-444.ci
Doc Type: Bug Fix
Doc Text:
Cause: The must-gather-helper pod chooses the wrong deployment YAML.
Consequence: The must-gather-helper pod fails to come up, as it tries to read a rook secret key that does not exist.
Fix: Ensure the must-gather-helper pod is deployed with the correct deployment YAML.
Result: The pod comes up and runs properly.
Clone Of:
Clones: 1979155 1979514 (view as bug list)
Environment:
Last Closed: 2022-02-09 17:29:23 UTC
Embargoed:


Attachments
secret rrok-ceph-mon.yaml (752 bytes, text/plain)
2021-07-02 12:42 UTC, Neha Berry


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 1255 0 None closed must-gather: remove the standard pod template 2021-07-02 17:20:00 UTC
Github openshift ocs-operator pull 1256 0 None open Bug 1978663: [release-4.8] must-gather: remove the standard pod template 2021-07-02 17:20:00 UTC

Description Neha Berry 2021-07-02 12:42:29 UTC
Created attachment 1797154 [details]
secret rrok-ceph-mon.yaml

Description of problem (please be as detailed as possible and provide log
snippets):
---------------------------------------------------------------------------
Attempted to collect must-gather on an ODF Managed Services cluster and following are the observations:

1. oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7 --dest-dir=ocs-must-gather

[must-gather-w8h6l] POD 2021-07-02T12:31:55.725333050Z creating helper pod
[must-gather-w8h6l] POD 2021-07-02T12:31:56.405245791Z pod/must-gather-w8h6l-helper created
[must-gather-w8h6l] POD 2021-07-02T12:31:56.551127937Z pod/must-gather-w8h6l-helper labeled
[must-gather-w8h6l] POD 2021-07-02T12:31:56.553593213Z waiting for 78 to terminate
[must-gather-w8h6l] POD 2021-07-02T12:36:16.676172942Z error: timed out waiting for the condition on pods/must-gather-w8h6l-helper


POD status
=============
openshift-storage                                  must-gather-w8h6l-helper                                          0/1     CreateContainerConfigError   0          22s     10.128.2.92    ip-10-0-220-234.ec2.internal   <none>           <none>


Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       41s                default-scheduler  Successfully assigned openshift-storage/must-gather-w8h6l-helper to ip-10-0-220-234.ec2.internal
  Normal   AddedInterface  39s                multus             Add eth0 [10.128.2.92/23]
  Normal   Pulled          13s (x5 over 39s)  kubelet            Container image "registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:f1a86bf9ed909b8b0efdecc8f43a46c5f8bde1d74a0d219f9ad4abb9a2a75119" already present on machine
  Warning  Failed          13s (x5 over 39s)  kubelet            Error: couldn't find key admin-secret in Secret openshift-storage/rook-ceph-mon

➜  oc get secret |grep mon
rook-ceph-mon                                               kubernetes.io/rook                    4      5h28m
rook-ceph-mons-keyring                                      kubernetes.io/rook                    1      5h28m
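The CreateContainerConfigError with "couldn't find key admin-secret" means the helper pod template references a secret key that the rook-ceph-mon secret does not carry. A quick way to list the keys a secret actually holds is sketched below; the JSON snippet is hypothetical sample data (key names assumed for a 4.6+ rook-ceph-mon secret, values redacted), while on a live cluster you would fetch it with `oc get secret rook-ceph-mon -n openshift-storage -o jsonpath='{.data}'`:

```shell
# Hypothetical 'data' section of the 4.6+ rook-ceph-mon secret (key names
# assumed for illustration; values redacted). On a live cluster:
#   oc get secret rook-ceph-mon -n openshift-storage -o jsonpath='{.data}'
secret_data='{"ceph-secret":"***","ceph-username":"***","fsid":"***","mon-secret":"***"}'

# List the key names present in the secret.
keys=$(printf '%s\n' "$secret_data" | grep -o '"[a-z-]*":' | tr -d '":')
echo "$keys"

# The pre-4.6 "standard" helper template wants admin-secret, which is absent,
# hence the CreateContainerConfigError above.
case "$keys" in
  *admin-secret*) echo "admin-secret present" ;;
  *)              echo "admin-secret missing" ;;   # -> admin-secret missing
esac
```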



Version of all relevant components (if applicable):
---------------------------------------------------------------------------
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded

OCP = 4.7.18

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
---------------------------------------------------------------------------
Ceph commands are not collected

Is there any workaround available to the best of your knowledge?
---------------------------------------------------------------------------
Not sure

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
---------------------------------------------------------------------------
3


Is this issue reproducible?
---------------------------------------------------------------------------
Yes

Can this issue be reproduced from the UI?
---------------------------------------------------------------------------
N/A

If this is a regression, please provide more details to justify this:
---------------------------------------------------------------------------
Not sure

Steps to Reproduce:
---------------------------------------------------------------------------
1. Install OCS Add-on from OCM UI
2. Start must-gather

oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7 --dest-dir=ocs-must-gather




Actual results:
---------------------------------------------------------------------------
must-gather-helper pod fails to get created, and hence the Ceph collections are skipped

➜  jul2 ls -l ocs-must-gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-1949179411885858ec719ab052868c734b98b49787498a8297f1a4ace0283eae/ceph
total 8
-rw-r--r--. 1 nberry nberry 3336 Jul  2 18:06 event-filter.html
drwxr-xr-x. 1 nberry nberry   34 Jul  2 18:07 namespaces
-rw-r--r--. 1 nberry nberry  549 Jul  2 18:06 timestamp



Expected results:
---------------------------------------------------------------------------
The helper pod should be created, and the must-gather collection should be able to collect Ceph command outputs


Additional info:
---------------------------------------------------------------------------

2. oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7 | tee terminal-mg.log
This falls back to openshift-must-gather, as rhceph-dev is not added to the pull secret (as expected)


openshift-must-gather-6kzrv                        must-gather-dzqwd                                                 0/2     ImagePullBackOff   0          8s      10.129.0.69    ip-10-0-220-238.ec2.internal   <none>           <none>

Comment 3 Mudit Agarwal 2021-07-02 14:26:13 UTC
We have two templates for must-gather: one for pre-4.6 releases, where ROOK_ADMIN_SECRET is used (called the standard template), and the latest template for 4.6 and above, which uses ROOK_CEPH_USERNAME. The code that chooses the template is pretty straightforward:

 if [[ $current_version -ge 460 ]]; then
     apply_latest_helper_pod "$ns" "$operatorImage"
 else
     apply_standard_helper_pod "$ns" "$operatorImage"
 fi

This issue is happening because the standard template is being used even though the above code is in place.

We get the value of $current_version by running this command:
>> oc get csv --no-headers | awk '{print $5}' | tr -dc '0-9'

Now, in this cluster following is the result of "oc get csv"

[muagarwa@mudits-workstation ~]$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.345-ee53105   Route Monitor Operator        0.1.345-ee53105   route-monitor-operator.v0.1.343-e469921   Succeeded
[muagarwa@mudits-workstation ~]$

We always expect a single line of output from this command, but because of the multiple lines of output, the value of $current_version ends up as 47200103700134553105   :)
This is far out of bounds for bash arithmetic, so the else branch comes into play and we end up applying the standard template.
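The failure can be reproduced without a cluster by feeding the pipeline the four CSV rows above as canned input (a sketch; the sample text is copied from the `oc get csv` output in this comment):

```shell
# Canned `oc get csv --no-headers` output from this cluster (no cluster needed).
csv_output='ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.345-ee53105   Route Monitor Operator        0.1.345-ee53105   route-monitor-operator.v0.1.343-e469921   Succeeded'

# The original pipeline keeps column 5 of EVERY row and then strips all
# non-digits, concatenating version fragments from all four CSVs.
current_version=$(printf '%s\n' "$csv_output" | awk '{print $5}' | tr -dc '0-9')
echo "$current_version"    # -> 47200103700134553105

# The 20-digit value overflows bash's 64-bit signed arithmetic, so the -ge
# comparison fails and the pre-4.6 "standard" template is chosen.
if [[ $current_version -ge 460 ]]; then
    template="latest"
else
    template="standard"
fi
echo "template chosen: $template"    # -> template chosen: standard
```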

Neha, if I am not wrong, we will have multiple CSVs whenever multiple operators are installed?

Pulkit, do we still need two templates? They were only required for 4.5 and below; we can now remove this logic and keep the latest template only.
WDYT? If you agree, I will send a PR to remove the standard template and this logic.

Also, we are hitting the same issue in BZ #1978541, which is a customer case, so we need to think of a workaround to unblock the customer.
IMO, this is a real blocker and should be backported to 4.6.z and 4.7.z too.

Comment 4 Ohad 2021-07-02 19:56:47 UTC
Can we just run 
>> oc get csv --no-headers | grep ocs-operator | awk '{print $5}' | tr -dc '0-9'
instead?

To me, it seems like the simplest fix and the most robust one
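On canned sample rows (copied from the `oc get csv` output earlier in this bug), that extra grep narrows the pipeline to the ocs-operator row alone, a sketch:

```shell
# Two sample rows of `oc get csv --no-headers` output (canned data, no cluster).
csv_output='ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded'

# grep keeps only the ocs-operator row ("ocs-osd-deployer" does not match),
# so a single version string survives the digit filter.
current_version=$(printf '%s\n' "$csv_output" | grep ocs-operator | awk '{print $5}' | tr -dc '0-9')
echo "$current_version"   # -> 472
```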

Comment 5 Mudit Agarwal 2021-07-03 14:25:50 UTC
(In reply to Ohad from comment #4)
> Can we just run 
> >> oc get csv --no-headers | grep ocs-operator | awk '{print $5}' | tr -dc '0-9'
> instead?
> 
> To me, it seems like the simplest fix and the most robust one

Yes, that's the simplest fix, but the point here is that the whole logic of finding the current version and choosing the pod template based on it is no longer required.
It was only needed to support pre-4.6 releases.

Comment 6 Mudit Agarwal 2021-07-05 06:40:03 UTC
*** Bug 1978541 has been marked as a duplicate of this bug. ***

Comment 7 Neha Berry 2021-07-05 06:59:12 UTC
Do we have this issue in 4.8 too?
Are we backporting it to 4.7 once 4.8 is verified?

Comment 10 Mudit Agarwal 2021-07-05 10:32:54 UTC
(In reply to Neha Berry from comment #7)
> Do we have the issue in 4.8 too?

We have the problem in 4.6/4.7/4.8

> Are we backporting it to 4.7 once 4.8 is verified
Yes, I have created a clone for 4.7 and marked it for 4.7.3. 
We should backport to 4.6 also, as I am not able to think of any workaround.

