Description of problem (please be as detailed as possible and provide log snippets):

The Tier2 test "tests/manage/z_cluster/test_mon_log_trimming.py::TestMonLogTrimming::test_mon_log_trimming" is failing with a TimeoutExpired exception.

Version of all relevant components (if applicable):
ODF 4.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? yes

Can this issue be reproduced from the UI? no

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an OCP cluster
2. Deploy ODF along with LSO
3. Execute the ocs-ci test "tests/manage/z_cluster/test_mon_log_trimming.py::TestMonLogTrimming::test_mon_log_trimming"

Actual results:

Expected results:

Additional info:
Logs, including must-gather, are shared on Google Drive:
https://drive.google.com/file/d/1M-k70kvIw3TQXNw9kOImdh2R7XCjS1ln/view?usp=sharing
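For reference, step 3 points at this pytest node ID; a minimal sketch of invoking it directly is below. This is illustrative only: ocs-ci tests are normally launched through the project's own wrapper together with a cluster configuration, not with bare pytest.

import pytest

# Illustrative only: supply the cluster/config options your environment
# needs; this just shows which test ID the bug is about.
pytest.main([
    "tests/manage/z_cluster/test_mon_log_trimming.py"
    "::TestMonLogTrimming::test_mon_log_trimming",
])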
@akandath I am not sure what the success criteria for TestMonLogTrimming::test_mon_log_trimming is here, or why it is timing out. Would it be possible to attach test_mon_log_trimming.py to the BZ?

Based on the logs provided, we are slowing down the paxos trimming before the test starts and restoring the default values at the end of the test:

2021-09-24 14:41:01,157 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.setup.56 - Selected mon pod is: rook-ceph-mon-c-77ffcc9d8d-5bv2j
2021-09-24 14:41:01,158 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.setup.57 - Setting values: paxos_service_trim_min=10, paxos_service_trim_max=100 and osd_op_complaint_time=0.000001
2021-09-24 14:41:05,791 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.test_mon_log_trimming.162 - Initial db size: 61000K
2021-09-24 14:41:34,252 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.109 - Monitoring mon-c db size: 61000K
...
...
2021-09-24 14:51:33,973 - Thread-9 - INFO - tests.manage.z_cluster.test_mon_log_trimming.restart_osd_pod.130 - Deleting osd pod rook-ceph-osd-1-7446cb749b-t5rd5. Deletion #191
2021-09-24 14:51:34,327 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.109 - Monitoring mon-c db size: 66000K
2021-09-24 14:51:34,332 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.120 - Number of trims performed: 7
2021-09-24 14:51:36,454 - Thread-9 - INFO - tests.manage.z_cluster.test_mon_log_trimming.restart_osd_pod.136 - Number of osd deletions: 191
2021-09-24 14:56:50,359 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.finalizer.82 - Setting default values for paxos_service_trim_min(250), paxos_service_trim_max(500) and osd_op_complaint_time(30.0)
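For anyone reproducing this outside the test, here is a rough sketch of the kind of mon tuning described in that setup log. It is not the test's actual implementation: the toolbox pod name is a placeholder, and the test may apply the options to a specific daemon or via a different mechanism.

import subprocess

# Placeholder toolbox pod name; look it up with
# `oc -n openshift-storage get pods -l app=rook-ceph-tools`.
NAMESPACE = "openshift-storage"
TOOLS_POD = "rook-ceph-tools-xxxxx"

def ceph_config_set(who, option, value):
    """Set a ceph config option through the rook-ceph toolbox pod."""
    cmd = ["oc", "-n", NAMESPACE, "rsh", TOOLS_POD,
           "ceph", "config", "set", who, option, str(value)]
    subprocess.run(cmd, check=True, timeout=60)

# Values applied before the test (per the setup log above)...
ceph_config_set("mon", "paxos_service_trim_min", 10)
ceph_config_set("mon", "paxos_service_trim_max", 100)
ceph_config_set("osd", "osd_op_complaint_time", 0.000001)

# ...and the defaults restored by the finalizer:
# ceph_config_set("mon", "paxos_service_trim_min", 250)
# ceph_config_set("mon", "paxos_service_trim_max", 500)
# ceph_config_set("osd", "osd_op_complaint_time", 30.0)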
Hi pdhange,
May I know which information I need to share? Or am I missing permission to view the comments?
(In reply to Abdul Kandathil (IBM) from comment #3)
> hi pdhange,
> may I know which information I need to share? Or am I missing permission to
> view comments?

My comment was private and only visible within Red Hat. Can you provide more information on comment #2?
Created attachment 1829271 [details]
test_mon_log_trimming.py

Attached the requested script.
The monstore is getting trimmed as per your test-case logs:

$ cat ./ocs-ci-logs-1632487201/tests/manage/z_cluster/test_mon_log_trimming.py/TestMonLogTrimming/test_mon_log_trimming/logs|grep -e trims -e trimmed -e "Initial db size"
2021-09-24 14:41:05,791 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.test_mon_log_trimming.162 - Initial db size: 61000K
2021-09-24 14:44:05,497 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 65000 to 34000. Trim #1
2021-09-24 14:44:17,975 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 70000 to 64000. Trim #2
2021-09-24 14:44:48,328 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 94000 to 62000. Trim #3
2021-09-24 14:46:21,652 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 62000 to 29000. Trim #4
2021-09-24 14:48:31,939 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 64000 to 33000. Trim #5
2021-09-24 14:49:18,425 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 68000 to 65000. Trim #6
2021-09-24 14:50:44,781 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 65000 to 31000. Trim #7
2021-09-24 14:51:34,332 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.120 - Number of trims performed: 7

The TimeoutExpired was for the fio run and is not related to Ceph. You need to determine why fio is taking more than its requested 480-second runtime and exceeding the 600-second timeout set for subprocess.run (https://github.com/red-hat-storage/ocs-ci/blob/de85058255b6a50c65889b8077da326be28bdcc7/ocs_ci/utility/utils.py#L489). It may be associated with the cephfs PV mounted on the /var/lib/www/html/ directory (PVC pvc-test-15e284767b4e417fb533856d37feaec) or with slow IOs on the cephfs mount path.
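For context, that 600-second cap is the timeout ocs-ci passes down to subprocess.run; once the deadline passes, subprocess.run raises TimeoutExpired regardless of whether the command would eventually finish. A trivial standalone illustration of that behaviour (nothing Ceph- or fio-specific):

import subprocess

# subprocess.run() raises TimeoutExpired as soon as the deadline passes,
# even though the child command would have completed a little later.
try:
    subprocess.run(["sleep", "5"], timeout=2)
except subprocess.TimeoutExpired as err:
    print(f"timed out after {err.timeout} seconds: {err.cmd}")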
14:51:33 - MainThread - ocs_ci.ocs.resources.pod - ERROR - Found Exception: Command '['oc', '-n', 'namespace-test-2d515ce05d0241ad9d16a3794', 'rsh', 'pod-test-cephfs-c54436c68cae4c06ae0f71a5', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=480', '--size=100M', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds

Traceback (most recent call last):
  File "/root/ocs-ci/ocs_ci/ocs/resources/pod.py", line 139, in get_fio_results
    result = self.fio_thread.result(timeout)
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/ocs-ci/ocs_ci/utility/workloads/fio.py", line 90, in run
    return io_pod.exec_cmd_on_pod(fio_cmd, out_yaml_format=False, timeout=timeout)
  File "/root/ocs-ci/ocs_ci/ocs/resources/pod.py", line 172, in exec_cmd_on_pod
    return self.ocp.exec_oc_cmd(
  File "/root/ocs-ci/ocs_ci/ocs/ocp.py", line 144, in exec_oc_cmd
    out = run_cmd(
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 456, in run_cmd
    completed_process = exec_cmd(cmd, secrets, timeout, ignore_error, **kwargs)
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 489, in exec_cmd
    completed_process = subprocess.run(
  File "/usr/lib64/python3.8/subprocess.py", line 491, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib64/python3.8/subprocess.py", line 1024, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib64/python3.8/subprocess.py", line 1867, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib64/python3.8/subprocess.py", line 1068, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '['oc', '-n', 'namespace-test-2d515ce05d0241ad9d16a3794', 'rsh', 'pod-test-cephfs-c54436c68cae4c06ae0f71a5', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=480', '--size=100M', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
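One way to narrow down whether the slowness is in fio/CephFS rather than in the test harness is to rerun the exact command from the traceback without a timeout and measure how long it really takes. This is only a sketch; the namespace and pod name are taken from the failed run and will differ on a fresh cluster:

import subprocess
import time

# Names copied from the failing command above; adjust for your cluster.
NAMESPACE = "namespace-test-2d515ce05d0241ad9d16a3794"
POD = "pod-test-cephfs-c54436c68cae4c06ae0f71a5"

fio_cmd = [
    "oc", "-n", NAMESPACE, "rsh", POD,
    "fio", "--name=fio-rand-readwrite",
    "--filename=/var/lib/www/html/fio-rand-readwrite",
    "--readwrite=randrw", "--bs=4K", "--direct=1", "--numjobs=1",
    "--time_based=1", "--runtime=480", "--size=100M", "--iodepth=4",
    "--invalidate=1", "--fsync_on_close=1", "--rwmixread=75",
    "--ioengine=libaio", "--rate=1m", "--rate_process=poisson",
    "--output-format=json",
]

start = time.time()
# No timeout on purpose: the point is to see how long the job really takes.
subprocess.run(fio_cmd, capture_output=True, text=True)
print(f"fio wall-clock time: {time.time() - start:.0f}s (requested runtime: 480s)")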
Hi,

Just giving my two cents regarding this BZ.

This test runs these things in parallel:
- fio in the background for 480s (runtime)
- Deletes a randomly picked OSD pod in a loop
- Checks that the mon db gets trimmed (by its size)

I ran this test a few times on a cluster deployed on IBM Cloud. All the executions went well and passed. From my POV, the issue is in fio, which was supposed to run for 480s and should not have reached the 600s timeout, as Prashant mentioned in comment #6. A simplified skeleton of that flow is shown below.
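The skeleton is illustrative only, not the actual ocs-ci code; the worker bodies are intentionally left empty:

from concurrent.futures import ThreadPoolExecutor

def run_fio():            # fio on the CephFS PVC with --runtime=480
    ...

def restart_osd_pods():   # delete a randomly picked OSD pod in a loop
    ...

def watch_mon_db_trim():  # sample the mon db size and count the trims
    ...

# The three activities run concurrently; the test only passes if all of
# them finish cleanly. A slow fio run surfaces as TimeoutExpired when the
# worker's result is collected, as in the traceback above.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f) for f in (run_fio, restart_osd_pods, watch_mon_db_trim)]
    for fut in futures:
        fut.result(timeout=600)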
This seems like an environment issue; please reopen if it is consistently reproducible.