Bug 2007566

Summary: [IBM Z] ceph osd heap profiler fails with "not using tcmalloc" error
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Abdul Kandathil (IBM) <akandath>
Component: ceph Assignee: Scott Ostapovicz <sostapov>
Status: CLOSED WORKSFORME QA Contact: Raz Tamir <ratamir>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.9 CC: bhubbard, bniver, madam, muagarwa, ocs-bugs, odf-bz-bot, pbalogh
Target Milestone: ---   
Target Release: ---   
Hardware: s390x   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-14 02:33:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Abdul Kandathil (IBM) 2021-09-24 09:41:28 UTC
Description of problem (please be detailed as possible and provide log
snippets):
The ocs-ci test "tests/manage/z_cluster/test_osd_heap_profile.py::TestOSDHeapProfile::test_osd_heap_profile" fails with "could not issue heap profiler command -- not using tcmalloc!"

Error:

E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage rsh rook-ceph-tools-65f5c5798c-zm4t7 ceph tell osd.0 heap start_profiler.
E           Error is Error ENOTSUP: could not issue heap profiler command -- not using tcmalloc!
E           command terminated with exit code 95


Version of all relevant components (if applicable):
ocs 4.9 (tested with 4.9.0-154.ci)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
yes

Can this issue reproduce from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install ocp cluster
2. Deploy ODF along with LSO
3. Execute the ocs-ci test or issue the command "oc -n openshift-storage rsh rook-ceph-tools-65f5c5798c-zm4t7 ceph tell osd.2 heap start_profiler"
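
The manual command in step 3 can also be scripted. The following is a minimal sketch, assuming `oc` is on the PATH and already logged in to the cluster. The helper names `heap_profiler_argv` and `start_heap_profiler` are hypothetical and are not part of ocs-ci, and the toolbox pod name differs per cluster.

```python
import subprocess


def heap_profiler_argv(toolbox_pod, osd_id, namespace="openshift-storage"):
    """Build the argv for `ceph tell osd.<id> heap start_profiler` via `oc rsh`."""
    return [
        "oc", "-n", namespace, "rsh", toolbox_pod,
        "ceph", "tell", f"osd.{osd_id}", "heap", "start_profiler",
    ]


def start_heap_profiler(toolbox_pod, osd_id):
    """Run the command and return (exit code, stdout, stderr) without raising."""
    result = subprocess.run(
        heap_profiler_argv(toolbox_pod, osd_id),
        capture_output=True, text=True, check=False,
    )
    return result.returncode, result.stdout, result.stderr
```

Returning the exit code and both output streams, instead of raising, keeps the full error text available for inspection.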


Actual results:

```
E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage rsh rook-ceph-tools-65f5c5798c-zm4t7 ceph tell osd.2 heap start_profiler.
E           Error is Error ENOTSUP: could not issue heap profiler command -- not using tcmalloc!
E           command terminated with exit code 95
```

Expected results:

The command executes without any errors.

Additional info:
Test & must-gather logs : https://drive.google.com/file/d/1Ya-vU3cPD9hnNfBdw71TNSUH_ZzOFIRX/view?usp=sharing

Comment 3 Scott Ostapovicz 2021-09-28 13:48:54 UTC
This looks like a CI build problem to me.

Comment 4 Mudit Agarwal 2021-10-06 08:21:36 UTC
Petr, can you please check whether this is a CI issue?

Comment 5 Petr Balogh 2021-10-06 11:12:29 UTC
Hello Abdul, is this issue consistently reproducible? 

Is this error coming from the `oc rsh` command itself, or is it really the output returned by:

ceph tell osd.2 heap start_profiler

from the toolbox pod?


If it's consistently reproducible, can you please RSH into the toolbox pod and run the command locally there?


This is the first time I have seen this error, so I'm not sure what the problem is. But if this output comes from the command itself (ceph tell osd.2 heap start_profiler), and the command is valid, it doesn't look like an OCS-CI issue.

If it's returned by the oc command, then it may be an OCP issue with running RSH on the pod, which could be a temporary glitch or a bug; I'm not sure.

Comment 6 Brad Hubbard 2021-10-11 23:00:26 UTC
I'd suggest you check whether the 'z' build disables tcmalloc. If so, this error is totally expected.

https://github.com/ceph/ceph/blob/29bda6fd2aabcb37cf1c46a6edddf004d28bb164/src/osd/OSD.cc#L11509-L11513
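
To tell this expected reply apart from other failures, the captured output can be matched against the message text. This is a rough sketch only: the marker strings are copied from the error quoted in this report, and the function name `is_tcmalloc_disabled_error` is hypothetical.

```python
# Text taken from the error quoted in this report.
TCMALLOC_MARKER = "not using tcmalloc"


def is_tcmalloc_disabled_error(output):
    """Return True if the text carries the OSD's 'not using tcmalloc' reply."""
    return "ENOTSUP" in output and TCMALLOC_MARKER in output
```

A match suggests the reply came from the Ceph side rather than from `oc` itself.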

Comment 7 Abdul Kandathil (IBM) 2021-10-12 10:36:00 UTC
With the newer version (odf 4.9.0-164.ci), I am not able to reproduce this issue. 

sh-4.4$ ceph tell osd.0 heap start_profiler
osd.0 started profiler
sh-4.4$

Comment 8 Brad Hubbard 2021-10-14 00:48:54 UTC
I think what probably happened here is that the original ceph build you tested for 4.9.0-154.ci had tcmalloc disabled (I remember hearing something about this happening on some earlier builds), whereas the ceph build for 4.9.0-164.ci has tcmalloc enabled.
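
One way to compare two builds is to look at the shared libraries the OSD binary is linked against, for example by running `ldd` on it inside an OSD pod. The sketch below only parses such output. The function name `links_tcmalloc` is hypothetical, the sample text is illustrative and not taken from this cluster, and a statically linked allocator would not show up in `ldd` output at all.

```python
def links_tcmalloc(ldd_output):
    """Return True if any library named in `ldd` output is a tcmalloc library."""
    for line in ldd_output.splitlines():
        fields = line.split()
        # The first field of each ldd line is the library name.
        if fields and "tcmalloc" in fields[0]:
            return True
    return False


# Illustrative ldd-style text (an assumption, not real output from this bug).
SAMPLE_LDD = (
    "\tlibtcmalloc.so.4 => /lib64/libtcmalloc.so.4 (0x000003ff00000000)\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x000003ff10000000)\n"
)
print(links_tcmalloc(SAMPLE_LDD))
```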

Comment 9 Mudit Agarwal 2021-10-14 02:33:16 UTC
Please reopen if this still exists.