Description of problem (please be as detailed as possible and provide log snippets):

The ceph osd df command hangs with no output and eventually fails with a timeout error after 600 seconds, following a Performance suite run. At the same time all the pods are up and running, and ceph health looks fine:

[ypersky@ypersky ocs-ci]$ oc rsh rook-ceph-tools-565ffdb78c-94479
sh-4.4$ ceph status
  cluster:
    id:     d09a6fea-757b-4dd9-a574-629b102cd74e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 7d), 3 in (since 7d)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 353 pgs
    objects: 41.22k objects, 158 GiB
    usage:   709 GiB used, 3.7 TiB / 4.4 TiB avail
    pgs:     353 active+clean

sh-4.4$ ceph osd df
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1558, in send_command
    cluster.mgr_command, cmd, inbuf, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1494, in run_in_thread
    raise Exception("timed out")
Exception: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1657, in json_command
    inbuf, timeout, verbose)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1504, in send_command_retry
    return send_command(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1608, in send_command
    raise RuntimeError('"{0}": exception {1}'.format(cmd, e))
RuntimeError: "{"prefix": "osd df", "target": ["mon-mgr", ""]}": exception timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ceph", line 1310, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1237, in main
    verbose)
  File "/usr/bin/ceph", line 631, in new_style_command
    ret, outbuf, outs = do_command(parsed_args, target, cmdargs, sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 582, in do_command
    argdict=valid_dict, inbuf=inbuf, verbose=verbose)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1661, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': 'osd df', 'target': ('mon-mgr', '')}": exception "{"prefix": "osd df", "target": ["mon-mgr", ""]}": exception timed out
sh-4.4$

Version of all relevant components (if applicable):

OCS versions
==============
NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
mcg-operator.v4.12.1              NooBaa Operator               4.12.1    mcg-operator.v4.12.0              Succeeded
ocs-operator.v4.12.1              OpenShift Container Storage   4.12.1    ocs-operator.v4.12.0              Succeeded
odf-csi-addons-operator.v4.12.1   CSI Addons                    4.12.1    odf-csi-addons-operator.v4.12.0   Succeeded
odf-operator.v4.12.1              OpenShift Data Foundation     4.12.1    odf-operator.v4.12.0              Succeeded

ODF (OCS) build: full_version: 4.12.1-19

Rook versions
===============
rook: v4.12.1-0.f4e99907f9b9f05a190303465f61d12d5d24cace
go: go1.18.9

Ceph versions
===============
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the ceph osd df command does not return any output.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
Not sure.

Can this issue be reproduced from the UI?
No.
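For reference, the traceback above shows the CLI timing out while forwarding the command to the mgr ("target": ["mon-mgr", ""]), while mon-served commands such as ceph status still respond. Below is a minimal sketch of read-only checks that could help narrow down whether the active mgr is the stuck component; it assumes the default openshift-storage namespace and the standard Rook labels/deployment names, which may differ on a given cluster, and has not been verified on this setup:

    # Inside the tools pod: these queries are answered by the mons, so they
    # should still return even if the active mgr is wedged.
    sh-4.4$ ceph mgr stat     # shows the active mgr and whether it is available
    sh-4.4$ ceph mgr dump     # full MgrMap, including standbys and enabled modules

    # From a workstation with the kubeconfig: inspect the active mgr pod directly.
    # Namespace, label, and deployment name are the Rook/ODF defaults (assumption).
    $ oc -n openshift-storage get pods -l app=rook-ceph-mgr
    $ oc -n openshift-storage logs deploy/rook-ceph-mgr-a --tail=200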
If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a 4.12 VMware LSO cluster
2. Run the Performance suite (marker: performance)
3. Log in to the rook-ceph-tools pod
4. Run the ceph osd df command (the commands for steps 3-4 are sketched at the end of this comment)

Actual results:
The command is stuck with no output; after 600 seconds it fails with the same "timed out" traceback shown in the description above.

Expected results:
ceph osd df should return output.

Additional info:
Relevant Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/22396/

The setup still exists.
Cluster name: ypersky-lso12a
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso12a/ypersky-lso12a_20230321T235552/openshift-cluster-dir/auth/kubeconfig

Must gather log link will be uploaded in the next comment.
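For completeness, steps 3-4 correspond roughly to the commands below. The tools pod name is the one from this cluster and will differ elsewhere; the run-ci invocation for step 2 is only an illustrative ocs-ci example (the actual run was driven by the Jenkins job linked above), and the <local-cluster-dir> placeholder is hypothetical:

    # Step 2 (illustrative only; the real run was launched by the Jenkins job)
    $ run-ci -m performance --cluster-name ypersky-lso12a --cluster-path <local-cluster-dir> tests/

    # Steps 3-4
    $ oc -n openshift-storage rsh rook-ceph-tools-565ffdb78c-94479
    sh-4.4$ ceph osd df    # hangs with no output, fails with "timed out" after ~600 s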
Please note that the must-gather logs are located at: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182881/
Also please note that the cluster setup still exists.
Cluster name: ypersky-lso12a
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso12a/ypersky-lso12a_20230321T235552/openshift-cluster-dir/auth/kubeconfig
Tests that ran on the cluster prior to the start of the described problem:

tests.e2e.performance.csi_tests.test_bulk_pod_attachtime_performance (Passed)
tests.e2e.performance.csi_tests.test_pod_attachtime (Passed)
tests.e2e.performance.csi_tests.test_pod_reattachtime (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_clone_performance (Failed because of a problem in the test code, already fixed)

The first test during which the problem appeared is:
tests.e2e.performance.csi_tests.test_pvc_creation_deletion_performance
I do not have any logs other than the must-gather logs: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182881/

Also, the must-gather probably did not collect OSD logs due to: https://bugzilla.redhat.com/show_bug.cgi?id=2168849
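If the missing OSD (and mgr) logs turn out to be needed, one possible stopgap while the cluster is still up is to pull them directly from the pods with oc. This is only a sketch and assumes the default openshift-storage namespace and the standard Rook labels/deployment names:

    # List the OSD deployments, then dump their logs (names are the Rook defaults)
    $ oc -n openshift-storage get deploy -l app=rook-ceph-osd
    $ oc -n openshift-storage logs deploy/rook-ceph-osd-0 --all-containers > osd-0.log
    $ oc -n openshift-storage logs deploy/rook-ceph-mgr-a --all-containers > mgr-a.log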
@Mudit - I did not see this on the latest 4.13 runs (neither on AWS nor on VMware LSO). However, I will keep an eye on all the executions.
I also did not see this in 4.14 runs (however, 4.14 has many other problems while running Performance tests).