Bug 2010408
| Summary: | [IBM Z] ocs-ci test related to test_mon_log_trimming failing due to timeout | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Abdul Kandathil (IBM) <akandath> |
| Component: | ceph | Assignee: | Scott Ostapovicz <sostapov> |
| Status: | CLOSED WORKSFORME | QA Contact: | Raz Tamir <ratamir> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | bniver, madam, muagarwa, ocs-bugs, odf-bz-bot, pdhange, ybenshim |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-14 16:20:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | test_mon_log_trimming.py (attachment 1829271) | | |
Description
Abdul Kandathil (IBM)
2021-10-04 15:41:50 UTC
@akandath Not sure what the success criteria for TestMonLogTrimming::test_mon_log_trimming is here or why it is timing out. Would it be possible to attach test_mon_log_trimming.py to the BZ? Based on the logs provided, we are slowing down the paxos trimming before the test starts and restoring the default values at the end of the test:

```
2021-09-24 14:41:01,157 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.setup.56 - Selected mon pod is: rook-ceph-mon-c-77ffcc9d8d-5bv2j
2021-09-24 14:41:01,158 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.setup.57 - Setting values: paxos_service_trim_min=10, paxos_service_trim_max=100 and osd_op_complaint_time=0.000001
2021-09-24 14:41:05,791 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.test_mon_log_trimming.162 - Initial db size: 61000K
2021-09-24 14:41:34,252 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.109 - Monitoring mon-c db size: 61000K
...
2021-09-24 14:51:33,973 - Thread-9 - INFO - tests.manage.z_cluster.test_mon_log_trimming.restart_osd_pod.130 - Deleting osd pod rook-ceph-osd-1-7446cb749b-t5rd5. Deletion #191
2021-09-24 14:51:34,327 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.109 - Monitoring mon-c db size: 66000K
2021-09-24 14:51:34,332 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.120 - Number of trims performed: 7
2021-09-24 14:51:36,454 - Thread-9 - INFO - tests.manage.z_cluster.test_mon_log_trimming.restart_osd_pod.136 - Number of osd deletions: 191
2021-09-24 14:56:50,359 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.finalizer.82 - Setting default values for paxos_service_trim_min(250), paxos_service_trim_max(500) and osd_op_complaint_time(30.0)
```

hi pdhange,
may I know which information I need to share? Or am I missing permission to view comments?

(In reply to Abdul Kandathil (IBM) from comment #3)
> hi pdhange,
> may I know which information I need to share? Or am I missing permission to
> view comments?

My comment was private and only visible within Red Hat. Can you provide more information on comment #2?

Created attachment 1829271 [details]
test_mon_log_trimming.py
Attached the requested script
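
For reference, the values from the setup and finalizer log lines above can also be applied by hand from the rook-ceph toolbox when reproducing the trimming behaviour outside ocs-ci. A minimal sketch, not the ocs-ci fixture itself, assuming the default openshift-storage namespace and the usual app=rook-ceph-tools label on the toolbox pod:

```python
# Sketch: tune and restore the mon/OSD options used by the test via the
# rook-ceph toolbox pod. Assumptions: openshift-storage namespace and a
# toolbox pod labelled app=rook-ceph-tools (neither is stated in this BZ).
import subprocess

NAMESPACE = "openshift-storage"


def toolbox_ceph(*args):
    """Run a `ceph ...` command inside the rook-ceph toolbox pod."""
    pod = subprocess.run(
        ["oc", "-n", NAMESPACE, "get", "pod", "-l", "app=rook-ceph-tools",
         "-o", "jsonpath={.items[0].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return subprocess.run(
        ["oc", "-n", NAMESPACE, "rsh", pod, "ceph", *args],
        check=True, capture_output=True, text=True,
    ).stdout


# Values set by the test before it starts (from the setup log line)...
toolbox_ceph("config", "set", "mon", "paxos_service_trim_min", "10")
toolbox_ceph("config", "set", "mon", "paxos_service_trim_max", "100")
toolbox_ceph("config", "set", "osd", "osd_op_complaint_time", "0.000001")

# ...and the defaults restored by the finalizer (from the finalizer log line).
toolbox_ceph("config", "set", "mon", "paxos_service_trim_min", "250")
toolbox_ceph("config", "set", "mon", "paxos_service_trim_max", "500")
toolbox_ceph("config", "set", "osd", "osd_op_complaint_time", "30.0")
```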
The monstore is getting trimmed as per your test case logs:

```
$ cat ./ocs-ci-logs-1632487201/tests/manage/z_cluster/test_mon_log_trimming.py/TestMonLogTrimming/test_mon_log_trimming/logs | grep -e trims -e trimmed -e "Initial db size"
2021-09-24 14:41:05,791 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.test_mon_log_trimming.162 - Initial db size: 61000K
2021-09-24 14:44:05,497 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 65000 to 34000. Trim #1
2021-09-24 14:44:17,975 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 70000 to 64000. Trim #2
2021-09-24 14:44:48,328 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 94000 to 62000. Trim #3
2021-09-24 14:46:21,652 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 62000 to 29000. Trim #4
2021-09-24 14:48:31,939 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 64000 to 33000. Trim #5
2021-09-24 14:49:18,425 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 68000 to 65000. Trim #6
2021-09-24 14:50:44,781 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 65000 to 31000. Trim #7
2021-09-24 14:51:34,332 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.120 - Number of trims performed: 7
```

The TimeoutExpired was for the fio run and is not related to Ceph. You need to determine why fio is taking more than 480 seconds and exceeding the 600-second timeout set for subprocess.run (https://github.com/red-hat-storage/ocs-ci/blob/de85058255b6a50c65889b8077da326be28bdcc7/ocs_ci/utility/utils.py#L489). It may be associated with the cephfs PV mounted on the /var/lib/www/html/ directory (PVC pvc-test-15e284767b4e417fb533856d37feaec) or with slow IOs on the cephfs mount path.
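
To make the relationship between those two numbers concrete, here is a minimal sketch (not the ocs-ci wrapper itself) of how the 480-second fio runtime sits inside the 600-second subprocess timeout; the command shown is an abridged version of the failing invocation logged below:

```python
# Sketch of the timeout mechanics: fio itself is asked to run for 480 s,
# while the whole `oc rsh ... fio ...` invocation is killed after a hard
# 600 s limit, leaving ~120 s of slack for rsh startup, I/O completion,
# fsync-on-close and the JSON result dump.
import subprocess

FIO_RUNTIME = 480          # --runtime passed to fio in the failing command
SUBPROCESS_TIMEOUT = 600   # timeout passed to subprocess.run by the test framework

cmd = [
    "oc", "-n", "namespace-test-2d515ce05d0241ad9d16a3794",
    "rsh", "pod-test-cephfs-c54436c68cae4c06ae0f71a5",
    "fio", "--name=fio-rand-readwrite",
    "--filename=/var/lib/www/html/fio-rand-readwrite",
    "--readwrite=randrw", "--time_based=1", f"--runtime={FIO_RUNTIME}",
    "--output-format=json",
]

try:
    subprocess.run(cmd, timeout=SUBPROCESS_TIMEOUT, capture_output=True, check=True)
except subprocess.TimeoutExpired:
    # This is the failure seen in this BZ: the fio job plus its overhead did
    # not finish within the slack, so the process was killed before fio could
    # report any results.
    print(f"fio exceeded the {SUBPROCESS_TIMEOUT} s hard limit")
```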
```
14:51:33 - MainThread - ocs_ci.ocs.resources.pod - ERROR - Found Exception: Command '['oc', '-n', 'namespace-test-2d515ce05d0241ad9d16a3794', 'rsh', 'pod-test-cephfs-c54436c68cae4c06ae0f71a5', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=480', '--size=100M', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
Traceback (most recent call last):
  File "/root/ocs-ci/ocs_ci/ocs/resources/pod.py", line 139, in get_fio_results
    result = self.fio_thread.result(timeout)
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/ocs-ci/ocs_ci/utility/workloads/fio.py", line 90, in run
    return io_pod.exec_cmd_on_pod(fio_cmd, out_yaml_format=False, timeout=timeout)
  File "/root/ocs-ci/ocs_ci/ocs/resources/pod.py", line 172, in exec_cmd_on_pod
    return self.ocp.exec_oc_cmd(
  File "/root/ocs-ci/ocs_ci/ocs/ocp.py", line 144, in exec_oc_cmd
    out = run_cmd(
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 456, in run_cmd
    completed_process = exec_cmd(cmd, secrets, timeout, ignore_error, **kwargs)
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 489, in exec_cmd
    completed_process = subprocess.run(
  File "/usr/lib64/python3.8/subprocess.py", line 491, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib64/python3.8/subprocess.py", line 1024, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib64/python3.8/subprocess.py", line 1867, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib64/python3.8/subprocess.py", line 1068, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '['oc', '-n', 'namespace-test-2d515ce05d0241ad9d16a3794', 'rsh', 'pod-test-cephfs-c54436c68cae4c06ae0f71a5', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=480', '--size=100M', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
```

Hi,
Just giving my two cents regarding this bz. This test runs these things in parallel:
- fio in the background for 480s (runtime)
- Deletes a randomly picked OSD pod in a loop
- Checks that the mon db gets trimmed (by its size)

I ran this test a few times using a cluster deployed on IBM Cloud. All the executions went well and passed. From my POV, the issue is in fio, which was supposed to run for 480s and not reach the 600s timeout, as Prashant mentioned in comment #6.

Seems like an environment issue; please reopen if this is consistently reproducible.
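
If this does reproduce, one way to confirm whether the slowdown is in the I/O path rather than in the test logic is to re-run the same fio job by hand and time it. A hedged sketch; the namespace and pod names are the ones from the failing run above and will differ on any other cluster:

```python
# Debugging aid (not part of the test): re-run the exact fio job from the
# failing command and measure its wall-clock time. A result well above the
# requested 480 s points at slow I/O on the cephfs mount (or oc/rsh overhead)
# rather than at the mon trimming logic under test.
import subprocess
import time

NAMESPACE = "namespace-test-2d515ce05d0241ad9d16a3794"  # from the failing run
POD = "pod-test-cephfs-c54436c68cae4c06ae0f71a5"        # from the failing run

fio_cmd = [
    "oc", "-n", NAMESPACE, "rsh", POD,
    "fio", "--name=fio-rand-readwrite",
    "--filename=/var/lib/www/html/fio-rand-readwrite",
    "--readwrite=randrw", "--bs=4K", "--direct=1", "--numjobs=1",
    "--time_based=1", "--runtime=480", "--size=100M", "--iodepth=4",
    "--invalidate=1", "--fsync_on_close=1", "--rwmixread=75",
    "--ioengine=libaio", "--rate=1m", "--rate_process=poisson",
    "--output-format=json",
]

start = time.monotonic()
result = subprocess.run(fio_cmd, capture_output=True, text=True)
elapsed = time.monotonic() - start
print(f"fio wall-clock time: {elapsed:.0f} s, exit code: {result.returncode}")
```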