Bug 2010226

Summary: [IBM Z] ocs-ci tier2 test fails due to ineffective ROOK_LOG_LEVEL log level change
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Abdul Kandathil (IBM) <akandath>
Component: ocs-operator
Assignee: Jose A. Rivera <jrivera>
Status: CLOSED NOTABUG
QA Contact: Raz Tamir <ratamir>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.9
CC: brgardne, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov
Target Milestone: ---
Target Release: ---
Hardware: s390x
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-11 08:31:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments: logs (flags: none)

Description Abdul Kandathil (IBM) 2021-10-04 09:11:56 UTC
Created attachment 1828874 [details]
logs

Description of problem (please be as detailed as possible and provide log
snippets):
ocs-ci test "tests/manage/z_cluster/test_rook_ceph_operator_log_type.py::TestRookCephOperatorLogType::test_rook_ceph_operator_log_type" is failing due to an ineffective log level change.

E           ValueError: OSD INFO Log does not exist or DEBUG Log exist on INFO mode

Version of all relevant components (if applicable):

odf 4.9.0-164.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install an OCP cluster
2. Deploy ODF along with LSO
3. Execute the ocs-ci test:
"tests/manage/z_cluster/test_rook_ceph_operator_log_type.py::TestRookCephOperatorLogType::test_rook_ceph_operator_log_type"


Actual results:


Expected results:


Additional info:

Comment 2 Blaine Gardner 2021-10-04 15:24:40 UTC
There is no ocs must-gather to show whether the cluster itself is healthy or not.
Additionally, I can't really be sure what the mentioned test is looking for without a link to the test's code.

However, the log level code in Rook has not changed in release-4.9. My best guess at the purpose of this test is that it is looking for the existence of Rook Ceph Operator logs at a particular log level. Given that I am sure the logging code has not changed and is not bugged, I think the bug can exist in 3 places (a quick check is sketched after this list):
1. the ocs-operator (ODF operator) component is not setting the log level requested by the test
2. the test is not setting the log level properly
3. the test is not validating the log level properly
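
For what it's worth, one quick way to narrow down which of these it is would be to check where the requested level actually lands. A minimal sketch using the kubernetes Python client, assuming the upstream Rook names (rook-ceph-operator-config ConfigMap, rook-ceph-operator Deployment) and the openshift-storage namespace; the actual ODF names may differ:

    # Sketch only: see where the requested ROOK_LOG_LEVEL actually ends up.
    # The ConfigMap/Deployment names below are the upstream Rook defaults.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()
    ns = "openshift-storage"

    # Level as set in the operator ConfigMap (takes precedence in upstream Rook).
    cm = core.read_namespaced_config_map("rook-ceph-operator-config", ns)
    print("ConfigMap ROOK_LOG_LEVEL:", (cm.data or {}).get("ROOK_LOG_LEVEL"))

    # Level as set directly on the operator Deployment, if any.
    dep = apps.read_namespaced_deployment("rook-ceph-operator", ns)
    for env in dep.spec.template.spec.containers[0].env or []:
        if env.name == "ROOK_LOG_LEVEL":
            print("Deployment env ROOK_LOG_LEVEL:", env.value)

If the value the test requested never shows up in either place, the problem is on the ocs-operator side (or in how the test sets it); if it does show up but the operator log doesn't reflect it, the problem is in the test's validation.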

I am putting this bug onto ocs-operator.

For @akandath, please attach an OCS must-gather so that this issue can be debugged. As it stands, the attachment on this BZ does not have enough information to determine the cause.

Comment 3 Abdul Kandathil (IBM) 2021-10-05 06:21:40 UTC
Please find the must-gather logs at this Google Drive link: https://drive.google.com/file/d/1Fh_NBaI8ouaHreMA4vb2XFAp8qts_gGW/view?usp=sharing

Comment 4 Mudit Agarwal 2021-11-09 13:42:19 UTC
Not a 4.9 blocker, moving it out

Comment 5 Blaine Gardner 2021-11-09 20:13:57 UTC
The Rook operator log contains messages at the Error, Warning, Info, and Debug levels. There is nothing wrong with the behavior of ROOK_LOG_LEVEL.

I also don't see the reported text "ValueError: OSD INFO Log does not exist or DEBUG Log exist on INFO mode" in the Rook operator log, nor the logs from the failing test, in the must-gather provided.

The best assumptions I can make are...
1. the test may not be related to ROOK_LOG_LEVEL, which controls the Rook operator's log level
2. the test is checking something more specific than its broad name suggests (and thus I don't have the actual problem statement)
3. the test is not using the right heuristic to determine the log level of the Rook operator pod
4. the test is confusing ROOK_LOG_LEVEL with Ceph OSD log levels (see the sketch after this list)
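
On point 4 in particular: ROOK_LOG_LEVEL and the Ceph OSD debug levels are two separate knobs, so a test that toggles one and then checks the other can never pass. A rough illustration of how each would be inspected, assuming the openshift-storage namespace, the app=rook-ceph-operator pod label, and an app=rook-ceph-tools toolbox pod (all assumptions about the environment, not facts from this BZ):

    # Sketch only: the Rook operator's verbosity and Ceph's OSD debug level
    # are configured and inspected independently of each other.
    from kubernetes import client, config
    from kubernetes.stream import stream

    config.load_kube_config()
    core = client.CoreV1Api()
    ns = "openshift-storage"

    # Rook operator side: ROOK_LOG_LEVEL governs the operator pod's own log.
    op_pod = core.list_namespaced_pod(ns, label_selector="app=rook-ceph-operator").items[0]
    print(core.read_namespaced_pod_log(op_pod.metadata.name, ns, tail_lines=20))

    # Ceph OSD side: the OSD debug level lives in Ceph's own configuration,
    # queried here through the toolbox pod (assumed to exist).
    tools_pod = core.list_namespaced_pod(ns, label_selector="app=rook-ceph-tools").items[0]
    osd_debug = stream(
        core.connect_get_namespaced_pod_exec,
        tools_pod.metadata.name, ns,
        command=["ceph", "config", "get", "osd", "debug_osd"],
        stderr=True, stdin=False, stdout=True, tty=False,
    )
    print("OSD debug_osd:", osd_debug.strip())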

@muagarwa please advise on next step(s). Should I investigate the integration test itself to see what is happening? If so, @akandath will have to link me to the source code of the test.

Comment 6 Abdul Kandathil (IBM) 2021-11-10 14:13:47 UTC
@brgardne, 
source code : https://github.com/red-hat-storage/ocs-ci

Test : tests/manage/z_cluster/test_rook_ceph_operator_log_type.py::TestRookCephOperatorLogType::test_rook_ceph_operator_log_type

Comment 7 Blaine Gardner 2021-11-10 23:06:44 UTC
I don't think this test is valid. To be honest, I'm not sure what it is trying to test. It changes ROOK_LOG_LEVEL and then deletes an OSD pod, but deleting an OSD pod has no effect on what is logged in the Rook operator pod: the Kubernetes Deployment will simply restart the OSD pod, potentially without the Rook operator ever being aware that anything changed.

Instead, this test should change the Rook log level, then restart the OPERATOR pod, wait a little while for it to run, and then verify that Debug logs are present if desired (or missing if not desired). A rough sketch of that flow is below.
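
This is a sketch only (I don't know ocs-ci's helpers, and the ConfigMap/label names are the upstream Rook defaults, so they may need adjusting for the actual ODF deployment):

    # Sketch of the suggested flow: raise ROOK_LOG_LEVEL, restart the
    # OPERATOR pod, wait, then check the operator log for debug output.
    import time
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    ns = "openshift-storage"

    # 1. Request DEBUG logging from the Rook operator.
    core.patch_namespaced_config_map(
        "rook-ceph-operator-config", ns, {"data": {"ROOK_LOG_LEVEL": "DEBUG"}}
    )

    # 2. Restart the OPERATOR pod (not an OSD pod) so it picks up the new level.
    for pod in core.list_namespaced_pod(ns, label_selector="app=rook-ceph-operator").items:
        core.delete_namespaced_pod(pod.metadata.name, ns)

    # 3. Give the replacement pod a little while to start and do some work.
    time.sleep(120)

    # 4. Verify that debug-level lines (" D | " in Rook's log format) are present.
    new_pod = core.list_namespaced_pod(ns, label_selector="app=rook-ceph-operator").items[0]
    log = core.read_namespaced_pod_log(new_pod.metadata.name, ns)
    assert " D | " in log, "expected debug-level lines in the Rook operator log"

The reverse check (set INFO and assert no " D | " lines appear afterwards) would cover the other half of the test.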

@muagarwa please advise as to whether this BZ should be moved, closed, or some other process.

Comment 8 Mudit Agarwal 2021-11-11 08:31:05 UTC
This should be closed, and Abdul can open an issue at https://github.com/red-hat-storage/ocs-ci/issues