Bug 2182881

Summary: VMware LSO - ceph osd df command is stuck with no output after Performance suite run
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation    Reporter: Yuli Persky <ypersky>
Component: ceph    Assignee: Nitzan Mordechai <nmordech>
ceph sub component: RADOS    QA Contact: Elad <ebenahar>
Status: NEW ---    Docs Contact:
Severity: unspecified
Priority: unspecified    CC: bniver, jopinto, kramdoss, muagarwa, nojha, odf-bz-bot, rperiyas, rzarzyns, sostapov
Version: 4.12    Keywords: Automation, Performance
Target Milestone: ---    Flags: muagarwa: needinfo? (ypersky)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed:    Type: Bug
Regression: ---    Mount Type: ---
Documentation: ---    CRM:
Verified Versions:    Category: ---
oVirt Team: ---    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---    Target Upstream Version:
Embargoed:

Description Yuli Persky 2023-03-29 21:08:59 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

The ceph osd df command is stuck with no output (it times out after 600 sec) after a Performance suite run.

At the same time, all the pods are up and running.

At the same time, ceph health looks fine:

[ypersky@ypersky ocs-ci]$ oc rsh rook-ceph-tools-565ffdb78c-94479
sh-4.4$ ceph status
  cluster:
    id:     d09a6fea-757b-4dd9-a574-629b102cd74e
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 7d), 3 in (since 7d)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 353 pgs
    objects: 41.22k objects, 158 GiB
    usage:   709 GiB used, 3.7 TiB / 4.4 TiB avail
    pgs:     353 active+clean
 
sh-4.4$ 

sh-4.4$ ceph osd df
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1558, in send_command
    cluster.mgr_command, cmd, inbuf, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1494, in run_in_thread
    raise Exception("timed out")
Exception: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1657, in json_command
    inbuf, timeout, verbose)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1504, in send_command_retry
    return send_command(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1608, in send_command
    raise RuntimeError('"{0}": exception {1}'.format(cmd, e))
RuntimeError: "{"prefix": "osd df", "target": ["mon-mgr", ""]}": exception timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ceph", line 1310, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1237, in main
    verbose)
  File "/usr/bin/ceph", line 631, in new_style_command
    ret, outbuf, outs = do_command(parsed_args, target, cmdargs, sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 582, in do_command
    argdict=valid_dict, inbuf=inbuf, verbose=verbose)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1661, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': 'osd df', 'target': ('mon-mgr', '')}": exception "{"prefix": "osd df", "target": ["mon-mgr", ""]}": exception timed out
sh-4.4$ 
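For context on where the 600 sec comes from: osd df is serviced by the active mgr (the "mon-mgr" target in the traceback), and ceph_argparse waits for the reply in a worker thread, raising Exception("timed out") when no answer arrives in time. A simplified sketch of that mechanism, not the actual Ceph code:

```python
import threading

def run_in_thread(target, *args, timeout=600):
    # Simplified sketch of ceph_argparse.run_in_thread: execute the
    # mon/mgr call in a worker thread and give up after `timeout` sec.
    result = {}

    def worker():
        result["value"] = target(*args)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # The active mgr never answered within the timeout window,
        # which is exactly what the traceback above shows.
        raise Exception("timed out")
    return result["value"]
```

So the CLI itself is behaving as designed; the question is why the active mgr never answers the osd df request while ceph status (a mon command) still works.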



Version of all relevant components (if applicable):

       OCS versions
        ==============

                NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
                mcg-operator.v4.12.1              NooBaa Operator               4.12.1    mcg-operator.v4.12.0              Succeeded
                ocs-operator.v4.12.1              OpenShift Container Storage   4.12.1    ocs-operator.v4.12.0              Succeeded
                odf-csi-addons-operator.v4.12.1   CSI Addons                    4.12.1    odf-csi-addons-operator.v4.12.0   Succeeded
                odf-operator.v4.12.1              OpenShift Data Foundation     4.12.1    odf-operator.v4.12.0              Succeeded
                
                ODF (OCS) build :                     full_version: 4.12.1-19
                
        Rook versions
        ===============

                rook: v4.12.1-0.f4e99907f9b9f05a190303465f61d12d5d24cace
                go: go1.18.9
                
        Ceph versions
        ===============

                ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)
                


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes, the ceph osd df command does not return any output.


Is there any workaround available to the best of your knowledge?
No 


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3


Is this issue reproducible?

Not sure. 


Can this issue be reproduced from the UI?

No.


If this is a regression, please provide more details to justify this:



Steps to Reproduce:
1. Deploy a 4.12 VMware LSO cluster
2. Run the Performance suite (marker: performance)
3. Log in to the rook-ceph-tools pod
4. Run the ceph osd df command
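When checking for this from a test harness, a shorter client-side timeout avoids blocking the whole run for the full 600 sec when the mgr is wedged. A hypothetical helper (the function name and defaults are illustrative; this is not ocs-ci code):

```python
import shutil
import subprocess

def ceph_osd_df(timeout=60):
    # Run "ceph osd df" with a client-side timeout so a wedged mgr
    # (as in this bug) does not block the caller for 600 sec.
    # Returns the command output, or None if the ceph CLI is
    # unavailable or the command failed/timed out.
    if shutil.which("ceph") is None:
        return None  # not inside the rook-ceph-tools pod
    try:
        proc = subprocess.run(
            ["ceph", "osd", "df"],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
        return proc.stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
```

A None return would then flag the cluster as unhealthy much earlier than the 600 sec CLI default seen above.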


Actual results:

The command is stuck with no output; after 600 sec the following error appears:

(The traceback is identical to the one in the Description above: Exception: timed out in ceph_argparse.run_in_thread, surfaced by the ceph CLI as RuntimeError: "{'prefix': 'osd df', 'target': ('mon-mgr', '')}": exception timed out.)


Expected results:

The ceph osd df command should return the per-OSD utilization table without timing out.


Additional info:

Relevant Jenkins job : 

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/22396/

The setup still exists.
Cluster name: ypersky-lso12a
kubeconfig:  http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso12a/ypersky-lso12a_20230321T235552/openshift-cluster-dir/auth/kubeconfig

Must gather log link will be uploaded in the next comment.

Comment 3 Yuli Persky 2023-03-29 21:34:03 UTC
Please note that must gather logs are located at : 
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182881/

Comment 4 Yuli Persky 2023-03-29 21:35:19 UTC
Also, please note that the cluster setup still exists.
Cluster name :ypersky-lso12a 
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso12a/ypersky-lso12a_20230321T235552/openshift-cluster-dir/auth/kubeconfig

Comment 5 Yuli Persky 2023-03-30 10:12:02 UTC
Tests that ran on the cluster prior to the start of the described problem:

tests.e2e.performance.csi_tests.test_bulk_pod_attachtime_performance (Passed)
tests.e2e.performance.csi_tests.test_pod_attachtime (Passed)
tests.e2e.performance.csi_tests.test_pod_reattachtime (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_clone_performance (Failed because of a problem in the test code, already fixed)

The first test during which the problem appeared is: tests.e2e.performance.csi_tests.test_pvc_creation_deletion_performance

Comment 8 Yuli Persky 2023-04-17 20:11:11 UTC
I do not have any logs other than the must-gather logs: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182881/
Also, must gather probably did not collect the osd logs due to: https://bugzilla.redhat.com/show_bug.cgi?id=2168849