Bug 1978663

Summary: must-gather-helper pod fails to come up on ODF Managed Services setup, hence no ceph collection succeeds
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Neha Berry <nberry>
Component: must-gather Assignee: Gobinda Das <godas>
Status: CLOSED CURRENTRELEASE QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.7CC: csharpe, muagarwa, ocs-bugs, omitrani, rcyriac, sabose, tdesala
Target Milestone: ---   
Target Release: OCS 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v4.8.0-444.ci Doc Type: Bug Fix
Doc Text:
Cause: The must-gather-helper pod chooses the wrong deployment YAML. Consequence: The must-gather-helper pod fails to come up because it tries to read an invalid rook secret. Fix: Ensure the must-gather-helper pod is deployed with the correct deployment YAML. Result: The pod comes up and runs properly.
Story Points: ---
Clone Of:
: 1979155 1979514 (view as bug list) Environment:
Last Closed: 2022-02-09 17:29:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1979155, 1979514    
Attachments:
Description Flags
secret rrok-ceph-mon.yaml none

Description Neha Berry 2021-07-02 12:42:29 UTC
Created attachment 1797154 [details]
secret rrok-ceph-mon.yaml

Description of problem (please be as detailed as possible and provide log
snippets):
---------------------------------------------------------------------------
Attempted to collect must-gather on an ODF Managed Services cluster; the observations follow:

1. oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7 --dest-dir=ocs-must-gather

[must-gather-w8h6l] POD 2021-07-02T12:31:55.725333050Z creating helper pod
[must-gather-w8h6l] POD 2021-07-02T12:31:56.405245791Z pod/must-gather-w8h6l-helper created
[must-gather-w8h6l] POD 2021-07-02T12:31:56.551127937Z pod/must-gather-w8h6l-helper labeled
[must-gather-w8h6l] POD 2021-07-02T12:31:56.553593213Z waiting for 78 to terminate
[must-gather-w8h6l] POD 2021-07-02T12:36:16.676172942Z error: timed out waiting for the condition on pods/must-gather-w8h6l-helper


POD status
=============
openshift-storage                                  must-gather-w8h6l-helper                                          0/1     CreateContainerConfigError   0          22s     10.128.2.92    ip-10-0-220-234.ec2.internal   <none>           <none>


Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       41s                default-scheduler  Successfully assigned openshift-storage/must-gather-w8h6l-helper to ip-10-0-220-234.ec2.internal
  Normal   AddedInterface  39s                multus             Add eth0 [10.128.2.92/23]
  Normal   Pulled          13s (x5 over 39s)  kubelet            Container image "registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:f1a86bf9ed909b8b0efdecc8f43a46c5f8bde1d74a0d219f9ad4abb9a2a75119" already present on machine
  Warning  Failed          13s (x5 over 39s)  kubelet            Error: couldn't find key admin-secret in Secret openshift-storage/rook-ceph-mon

➜  oc get secret |grep mon
rook-ceph-mon                                               kubernetes.io/rook                    4      5h28m
rook-ceph-mons-keyring                                      kubernetes.io/rook                    1      5h28m
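For context, the CreateContainerConfigError above can be reproduced in miniature: the helper pod's template references an admin-secret key that the 4.6+ rook-ceph-mon secret no longer carries. The key names below are assumptions based on the fix description, not taken from this cluster, and the lookup is simulated locally (no cluster needed):

```shell
# Simulated key lookup. On a 4.6+ cluster the rook-ceph-mon secret is
# assumed to carry these keys instead of the pre-4.6 admin-secret key.
secret_keys='ceph-secret ceph-username fsid mon-secret'

# The helper pod's env var references key "admin-secret", which the
# newer secret no longer contains -- hence CreateContainerConfigError.
case " $secret_keys " in
  *" admin-secret "*) result="present" ;;
  *)                  result="missing" ;;
esac
echo "admin-secret: $result"   # -> admin-secret: missing
```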



Version of all relevant components (if applicable):
---------------------------------------------------------------------------
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded

OCP = 4.7.18

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
---------------------------------------------------------------------------
Ceph commands are not collected

Is there any workaround available to the best of your knowledge?
---------------------------------------------------------------------------
Not sure

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
---------------------------------------------------------------------------
3


Can this issue reproducible?
---------------------------------------------------------------------------
Yes

Can this issue reproduce from the UI?
---------------------------------------------------------------------------
N/A

If this is a regression, please provide more details to justify this:
---------------------------------------------------------------------------
Not sure

Steps to Reproduce:
---------------------------------------------------------------------------
1. Install OCS Add-on from OCM UI
2. Start must-gather

oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7 --dest-dir=ocs-must-gather




Actual results:
---------------------------------------------------------------------------
must-gather-helper pod fails to get created, and hence the ceph collections are skipped

➜  jul2 ls -l ocs-must-gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-1949179411885858ec719ab052868c734b98b49787498a8297f1a4ace0283eae/ceph
total 8
-rw-r--r--. 1 nberry nberry 3336 Jul  2 18:06 event-filter.html
drwxr-xr-x. 1 nberry nberry   34 Jul  2 18:07 namespaces
-rw-r--r--. 1 nberry nberry  549 Jul  2 18:06 timestamp



Expected results:
---------------------------------------------------------------------------
The helper pod should be created and the must-gather run should be able to collect ceph command outputs


Additional info:
---------------------------------------------------------------------------

2. oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.7|tee terminal-mg.log
This falls back to openshift-must-gather, as rhceph-dev is not added to the pull-secret (as expected)


openshift-must-gather-6kzrv                        must-gather-dzqwd                                                 0/2     ImagePullBackOff   0          8s      10.129.0.69    ip-10-0-220-238.ec2.internal   <none>           <none>

Comment 3 Mudit Agarwal 2021-07-02 14:26:13 UTC
We have two templates for must-gather: one for pre-4.6 releases, which uses ROOK_ADMIN_SECRET (the "standard" template), and the latest template for 4.6 and above, which uses ROOK_CEPH_USERNAME. The code that chooses between them is pretty straightforward:

    if [[ $current_version -ge 460 ]]; then
        apply_latest_helper_pod "$ns" "$operatorImage"
    else
        apply_standard_helper_pod "$ns" "$operatorImage"
    fi

This issue is happening because the standard template is being used even though the above code is in place.

We get the value of $current_version by running this command
>> oc get csv --no-headers | awk '{print $5}' | tr -dc '0-9'

Now, in this cluster following is the result of "oc get csv"

[muagarwa@mudits-workstation ~]$ oc get csv
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
ocs-operator.v4.7.2                       OpenShift Container Storage   4.7.2                                                       Succeeded
ocs-osd-deployer.v0.0.1                   OCS OSD Deployer              0.0.1                                                       Succeeded
prometheusoperator.0.47.0                 Prometheus Operator           0.47.0            prometheusoperator.0.37.0                 Succeeded
route-monitor-operator.v0.1.345-ee53105   Route Monitor Operator        0.1.345-ee53105   route-monitor-operator.v0.1.343-e469921   Succeeded
[muagarwa@mudits-workstation ~]$

We always expect a single line of output from this command, but because of the multiple lines of output, the value of $current_version becomes 47200103700134553105   :)
This is far out of range for the integer comparison, so the else branch is taken and we end up applying the standard template.
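The concatenation can be reproduced offline against sample "oc get csv --no-headers" output (the column layout NAME DISPLAY VERSION REPLACES PHASE is assumed; names and versions below are illustrative, no cluster required):

```shell
# Sample two-operator output; the first line has an empty REPLACES
# column, the second does not, so $5 differs in meaning per line.
csv_output='ocs-operator.v4.7.2 OpenShift Container Storage 4.7.2 Succeeded
prometheusoperator.0.47.0 Prometheus Operator 0.47.0 prometheusoperator.0.37.0 Succeeded'

# Original pipeline: $5 is taken from EVERY line, then tr strips all
# non-digits (including newlines), gluing the versions into one number.
buggy=$(printf '%s\n' "$csv_output" | awk '{print $5}' | tr -dc '0-9')
echo "$buggy"   # -> 4720370
```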

Neha, if I am not wrong, we will have multiple CSVs whenever multiple operators are installed?

Pulkit, do we still need two templates? They were only required for 4.5 and below; we can now remove this logic and keep only the latest template.
WDYT? If you agree, I will send a PR to remove the standard template and this logic.

Also, we are hitting the same issue in BZ #1978541, which is a customer case, and we need to think of a workaround to unblock the customer.
IMO, this is a real blocker and should be backported to 4.6.z and 4.7.z too.

Comment 4 Ohad 2021-07-02 19:56:47 UTC
Can we just run 
>> oc get csv --no-headers | grep ocs-operator | awk '{print $5}' | tr -dc '0-9'
instead?

To me, it seems like the simplest fix and the most robust one
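The suggested filter can be checked offline against the same kind of multi-line "oc get csv --no-headers" sample (names and versions are illustrative; no cluster needed):

```shell
# Sample multi-operator output, as seen on the affected cluster.
csv_output='ocs-operator.v4.7.2 OpenShift Container Storage 4.7.2 Succeeded
prometheusoperator.0.47.0 Prometheus Operator 0.47.0 prometheusoperator.0.37.0 Succeeded'

# grep narrows the output to the single ocs-operator line before the
# digit filter runs, so only that operator's version survives.
fixed=$(printf '%s\n' "$csv_output" | grep ocs-operator | awk '{print $5}' | tr -dc '0-9')
echo "$fixed"   # -> 472
```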

Comment 5 Mudit Agarwal 2021-07-03 14:25:50 UTC
(In reply to Ohad from comment #4)
> Can we just run 
> >> oc get csv --no-headers | grep ocs-operator | awk '{print $5}' | tr -dc '0-9'
> instead?
> 
> To me, it seems like the simplest fix and the most robust one

Yes, that's the simplest fix, but the point here is that the whole logic of finding the current version and choosing the pod template based on it is no longer required.
It was only needed to support pre-4.6 releases.

Comment 6 Mudit Agarwal 2021-07-05 06:40:03 UTC
*** Bug 1978541 has been marked as a duplicate of this bug. ***

Comment 7 Neha Berry 2021-07-05 06:59:12 UTC
Do we have the issue in 4.8 too?
Are we backporting it to 4.7 once 4.8 is verified?

Comment 10 Mudit Agarwal 2021-07-05 10:32:54 UTC
(In reply to Neha Berry from comment #7)
> Do we have the issue in 4.8 too?

We have the problem in 4.6/4.7/4.8

> Are we backporting it to 4.7 once 4.8 is verified
Yes, I have created a clone for 4.7 and marked it for 4.7.3.
We should backport to 4.6 as well, since I am not able to think of any workaround.