Description of problem (please be as detailed as possible and provide log snippets):

The ceph osd df command hangs with no output and eventually fails with a timeout error after 600 seconds, following a Performance suite run. At the same time all the pods are up and running, and ceph health looks fine:

[ypersky@ypersky ocs-ci]$ oc rsh rook-ceph-tools-565ffdb78c-94479
sh-4.4$ ceph status
  cluster:
    id:     d09a6fea-757b-4dd9-a574-629b102cd74e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 7d)
    mgr: a(active, since 7d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 7d), 3 in (since 7d)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 353 pgs
    objects: 41.22k objects, 158 GiB
    usage:   709 GiB used, 3.7 TiB / 4.4 TiB avail
    pgs:     353 active+clean

sh-4.4$ ceph osd df
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1558, in send_command
    cluster.mgr_command, cmd, inbuf, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1494, in run_in_thread
    raise Exception("timed out")
Exception: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1657, in json_command
    inbuf, timeout, verbose)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1504, in send_command_retry
    return send_command(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1608, in send_command
    raise RuntimeError('"{0}": exception {1}'.format(cmd, e))
RuntimeError: "{"prefix": "osd df", "target": ["mon-mgr", ""]}": exception timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ceph", line 1310, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1237, in main
    verbose)
  File "/usr/bin/ceph", line 631, in new_style_command
    ret, outbuf, outs = do_command(parsed_args, target, cmdargs, sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 582, in do_command
    argdict=valid_dict, inbuf=inbuf, verbose=verbose)
  File "/usr/lib/python3.6/site-packages/ceph_argparse.py", line 1661, in json_command
    raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': 'osd df', 'target': ('mon-mgr', '')}": exception "{"prefix": "osd df", "target": ["mon-mgr", ""]}": exception timed out
sh-4.4$

Version of all relevant components (if applicable):

OCS versions
==============
NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
mcg-operator.v4.12.1              NooBaa Operator               4.12.1    mcg-operator.v4.12.0              Succeeded
ocs-operator.v4.12.1              OpenShift Container Storage   4.12.1    ocs-operator.v4.12.0              Succeeded
odf-csi-addons-operator.v4.12.1   CSI Addons                    4.12.1    odf-csi-addons-operator.v4.12.0   Succeeded
odf-operator.v4.12.1              OpenShift Data Foundation     4.12.1    odf-operator.v4.12.0              Succeeded

ODF (OCS) build: full_version: 4.12.1-19

Rook versions
===============
rook: v4.12.1-0.f4e99907f9b9f05a190303465f61d12d5d24cace
go: go1.18.9

Ceph versions
===============
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the ceph osd df command does not return any output.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?
Not sure.

Can this issue be reproduced from the UI?
No.
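For reference, the traceback above shows the CLI timing out while forwarding the command to the mgr ("target": ["mon-mgr", ""]), while mon-served commands such as ceph status still respond. Below is a minimal sketch of read-only checks that could help narrow down whether the active mgr is the stuck component; it assumes the default openshift-storage namespace and the standard Rook labels/deployment names, which may differ on a given cluster, and has not been verified on this setup:

    # Inside the tools pod: these queries are answered by the mons, so they
    # should still return even if the active mgr is wedged.
    sh-4.4$ ceph mgr stat     # shows the active mgr and whether it is available
    sh-4.4$ ceph mgr dump     # full MgrMap, including standbys and enabled modules

    # From a workstation with the kubeconfig: inspect the active mgr pod directly.
    # Namespace, label, and deployment name are the Rook/ODF defaults (assumption).
    $ oc -n openshift-storage get pods -l app=rook-ceph-mgr
    $ oc -n openshift-storage logs deploy/rook-ceph-mgr-a --tail=200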
If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a 4.12 VMware LSO cluster
2. Run the Performance suite (marker: performance)
3. Log in to the rook-ceph-tools pod
4. Run the ceph osd df command (the commands for steps 3-4 are sketched at the end of this comment)

Actual results:
The command is stuck with no output; after 600 seconds it fails with the same "timed out" traceback shown in the description above.

Expected results:
ceph osd df should return output.

Additional info:
Relevant Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/22396/

The setup still exists.
Cluster name: ypersky-lso12a
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso12a/ypersky-lso12a_20230321T235552/openshift-cluster-dir/auth/kubeconfig

Must gather log link will be uploaded in the next comment.
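For completeness, steps 3-4 correspond roughly to the commands below. The tools pod name is the one from this cluster and will differ elsewhere; the run-ci invocation for step 2 is only an illustrative ocs-ci example (the actual run was driven by the Jenkins job linked above), and the <local-cluster-dir> placeholder is hypothetical:

    # Step 2 (illustrative only; the real run was launched by the Jenkins job)
    $ run-ci -m performance --cluster-name ypersky-lso12a --cluster-path <local-cluster-dir> tests/

    # Steps 3-4
    $ oc -n openshift-storage rsh rook-ceph-tools-565ffdb78c-94479
    sh-4.4$ ceph osd df    # hangs with no output, fails with "timed out" after ~600 s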
Please note that the must-gather logs are located at: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182881/
Also please note that the cluster setup still exists.
Cluster name: ypersky-lso12a
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso12a/ypersky-lso12a_20230321T235552/openshift-cluster-dir/auth/kubeconfig
Tests that ran on the cluster prior to the start of the described problem:

tests.e2e.performance.csi_tests.test_bulk_pod_attachtime_performance (Passed)
tests.e2e.performance.csi_tests.test_pod_attachtime (Passed)
tests.e2e.performance.csi_tests.test_pod_reattachtime (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_clone_performance (Failed because of a problem in the test code, already fixed)

The first test during which the problem appeared is:
tests.e2e.performance.csi_tests.test_pvc_creation_deletion_performance
I do not have any logs other than the must-gather logs: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182881/

Also, the must-gather probably did not collect OSD logs due to: https://bugzilla.redhat.com/show_bug.cgi?id=2168849
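If the missing OSD (and mgr) logs turn out to be needed, one possible stopgap while the cluster is still up is to pull them directly from the pods with oc. This is only a sketch and assumes the default openshift-storage namespace and the standard Rook labels/deployment names:

    # List the OSD deployments, then dump their logs (names are the Rook defaults)
    $ oc -n openshift-storage get deploy -l app=rook-ceph-osd
    $ oc -n openshift-storage logs deploy/rook-ceph-osd-0 --all-containers > osd-0.log
    $ oc -n openshift-storage logs deploy/rook-ceph-mgr-a --all-containers > mgr-a.log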
@Mudit - I did not see this on the latest 4.13 runs (neither on AWS nor on VMware LSO). However, I will keep an eye on all the executions.
I also did not see this in 4.14 runs (however, 4.14 has many other problems while running Performance tests).