Bug 2010408
| Summary: | [IBM Z] ocs-ci test related to test_mon_log_trimming failing due to timeout | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Abdul Kandathil (IBM) <akandath> |
| Component: | ceph | Assignee: | Scott Ostapovicz <sostapov> |
| Status: | CLOSED WORKSFORME | QA Contact: | Raz Tamir <ratamir> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | bniver, madam, muagarwa, ocs-bugs, odf-bz-bot, pdhange, ybenshim |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-14 16:20:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | test_mon_log_trimming.py (attachment 1829271) | | |
Description
Abdul Kandathil (IBM)
2021-10-04 15:41:50 UTC
@akandath Not sure what the success criteria for TestMonLogTrimming::test_mon_log_trimming is here or why it is timing out. Would it be possible to attach test_mon_log_trimming.py to the BZ? Based on the logs provided, we are slowing down the paxos trimming before the test starts and restoring the default values at the end of the test:

```
2021-09-24 14:41:01,157 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.setup.56 - Selected mon pod is: rook-ceph-mon-c-77ffcc9d8d-5bv2j
2021-09-24 14:41:01,158 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.setup.57 - Setting values: paxos_service_trim_min=10, paxos_service_trim_max=100 and osd_op_complaint_time=0.000001
2021-09-24 14:41:05,791 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.test_mon_log_trimming.162 - Initial db size: 61000K
2021-09-24 14:41:34,252 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.109 - Monitoring mon-c db size: 61000K
...
2021-09-24 14:51:33,973 - Thread-9 - INFO - tests.manage.z_cluster.test_mon_log_trimming.restart_osd_pod.130 - Deleting osd pod rook-ceph-osd-1-7446cb749b-t5rd5. Deletion #191
2021-09-24 14:51:34,327 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.109 - Monitoring mon-c db size: 66000K
2021-09-24 14:51:34,332 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.120 - Number of trims performed: 7
2021-09-24 14:51:36,454 - Thread-9 - INFO - tests.manage.z_cluster.test_mon_log_trimming.restart_osd_pod.136 - Number of osd deletions: 191
2021-09-24 14:56:50,359 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.finalizer.82 - Setting default values for paxos_service_trim_min(250), paxos_service_trim_max(500) and osd_op_complaint_time(30.0)
```

hi pdhange,
may I know which information I need to share? Or am I missing permission to view comments?

(In reply to Abdul Kandathil (IBM) from comment #3)
> hi pdhange,
> may I know which information I need to share? Or am I missing permission to
> view comments?

My comment was private and only visible within Red Hat. Can you provide more information on comment #2?

Created attachment 1829271 [details]
test_mon_log_trimming.py
Attached the requested script
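
For reference, the values from the setup and finalizer log lines above can also be applied by hand from the rook-ceph toolbox when reproducing the trimming behaviour outside ocs-ci. A minimal sketch, not the ocs-ci fixture itself, assuming the default openshift-storage namespace and the usual app=rook-ceph-tools label on the toolbox pod:

```python
# Sketch: tune and restore the mon/OSD options used by the test via the
# rook-ceph toolbox pod. Assumptions: openshift-storage namespace and a
# toolbox pod labelled app=rook-ceph-tools (neither is stated in this BZ).
import subprocess

NAMESPACE = "openshift-storage"


def toolbox_ceph(*args):
    """Run a `ceph ...` command inside the rook-ceph toolbox pod."""
    pod = subprocess.run(
        ["oc", "-n", NAMESPACE, "get", "pod", "-l", "app=rook-ceph-tools",
         "-o", "jsonpath={.items[0].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return subprocess.run(
        ["oc", "-n", NAMESPACE, "rsh", pod, "ceph", *args],
        check=True, capture_output=True, text=True,
    ).stdout


# Values set by the test before it starts (from the setup log line)...
toolbox_ceph("config", "set", "mon", "paxos_service_trim_min", "10")
toolbox_ceph("config", "set", "mon", "paxos_service_trim_max", "100")
toolbox_ceph("config", "set", "osd", "osd_op_complaint_time", "0.000001")

# ...and the defaults restored by the finalizer (from the finalizer log line).
toolbox_ceph("config", "set", "mon", "paxos_service_trim_min", "250")
toolbox_ceph("config", "set", "mon", "paxos_service_trim_max", "500")
toolbox_ceph("config", "set", "osd", "osd_op_complaint_time", "30.0")
```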
The monstore is getting trimmed as per your test case logs:

```
$ cat ./ocs-ci-logs-1632487201/tests/manage/z_cluster/test_mon_log_trimming.py/TestMonLogTrimming/test_mon_log_trimming/logs | grep -e trims -e trimmed -e "Initial db size"
2021-09-24 14:41:05,791 - MainThread - INFO - tests.manage.z_cluster.test_mon_log_trimming.test_mon_log_trimming.162 - Initial db size: 61000K
2021-09-24 14:44:05,497 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 65000 to 34000. Trim #1
2021-09-24 14:44:17,975 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 70000 to 64000. Trim #2
2021-09-24 14:44:48,328 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 94000 to 62000. Trim #3
2021-09-24 14:46:21,652 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 62000 to 29000. Trim #4
2021-09-24 14:48:31,939 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 64000 to 33000. Trim #5
2021-09-24 14:49:18,425 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 68000 to 65000. Trim #6
2021-09-24 14:50:44,781 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.115 - Mon db trimmed from 65000 to 31000. Trim #7
2021-09-24 14:51:34,332 - Thread-8 - INFO - tests.manage.z_cluster.test_mon_log_trimming.check_mon_db_trim.120 - Number of trims performed: 7
```

The TimeoutExpired was for the fio run and is not related to Ceph. You need to determine why fio is taking more than 480 seconds and exceeding the 600-second timeout set for subprocess.run (https://github.com/red-hat-storage/ocs-ci/blob/de85058255b6a50c65889b8077da326be28bdcc7/ocs_ci/utility/utils.py#L489). It may be associated with the cephfs PV mounted on the /var/lib/www/html/ directory (PVC pvc-test-15e284767b4e417fb533856d37feaec) or with slow IOs on the cephfs mount path.
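
To make the relationship between those two numbers concrete, here is a minimal sketch (not the ocs-ci wrapper itself) of how the 480-second fio runtime sits inside the 600-second subprocess timeout; the command shown is an abridged version of the failing invocation logged below:

```python
# Sketch of the timeout mechanics: fio itself is asked to run for 480 s,
# while the whole `oc rsh ... fio ...` invocation is killed after a hard
# 600 s limit, leaving ~120 s of slack for rsh startup, I/O completion,
# fsync-on-close and the JSON result dump.
import subprocess

FIO_RUNTIME = 480          # --runtime passed to fio in the failing command
SUBPROCESS_TIMEOUT = 600   # timeout passed to subprocess.run by the test framework

cmd = [
    "oc", "-n", "namespace-test-2d515ce05d0241ad9d16a3794",
    "rsh", "pod-test-cephfs-c54436c68cae4c06ae0f71a5",
    "fio", "--name=fio-rand-readwrite",
    "--filename=/var/lib/www/html/fio-rand-readwrite",
    "--readwrite=randrw", "--time_based=1", f"--runtime={FIO_RUNTIME}",
    "--output-format=json",
]

try:
    subprocess.run(cmd, timeout=SUBPROCESS_TIMEOUT, capture_output=True, check=True)
except subprocess.TimeoutExpired:
    # This is the failure seen in this BZ: the fio job plus its overhead did
    # not finish within the slack, so the process was killed before fio could
    # report any results.
    print(f"fio exceeded the {SUBPROCESS_TIMEOUT} s hard limit")
```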
```
14:51:33 - MainThread - ocs_ci.ocs.resources.pod - ERROR - Found Exception: Command '['oc', '-n', 'namespace-test-2d515ce05d0241ad9d16a3794', 'rsh', 'pod-test-cephfs-c54436c68cae4c06ae0f71a5', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=480', '--size=100M', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
Traceback (most recent call last):
  File "/root/ocs-ci/ocs_ci/ocs/resources/pod.py", line 139, in get_fio_results
    result = self.fio_thread.result(timeout)
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib64/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib64/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/ocs-ci/ocs_ci/utility/workloads/fio.py", line 90, in run
    return io_pod.exec_cmd_on_pod(fio_cmd, out_yaml_format=False, timeout=timeout)
  File "/root/ocs-ci/ocs_ci/ocs/resources/pod.py", line 172, in exec_cmd_on_pod
    return self.ocp.exec_oc_cmd(
  File "/root/ocs-ci/ocs_ci/ocs/ocp.py", line 144, in exec_oc_cmd
    out = run_cmd(
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 456, in run_cmd
    completed_process = exec_cmd(cmd, secrets, timeout, ignore_error, **kwargs)
  File "/root/ocs-ci/ocs_ci/utility/utils.py", line 489, in exec_cmd
    completed_process = subprocess.run(
  File "/usr/lib64/python3.8/subprocess.py", line 491, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib64/python3.8/subprocess.py", line 1024, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib64/python3.8/subprocess.py", line 1867, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib64/python3.8/subprocess.py", line 1068, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '['oc', '-n', 'namespace-test-2d515ce05d0241ad9d16a3794', 'rsh', 'pod-test-cephfs-c54436c68cae4c06ae0f71a5', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=480', '--size=100M', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
```

Hi,
Just giving my two cents regarding this bz. This test runs these things in parallel:
- fio in the background for 480s (runtime)
- Deletes a randomly picked OSD pod in a loop
- Checks that the mon db gets trimmed (by its size)

I ran this test a few times using a cluster deployed on IBM Cloud. All the executions went well and passed. From my POV, the issue is in fio, which was supposed to run for 480s and not reach the 600s timeout, as Prashant mentioned in comment #6.

Seems like an environment issue; please reopen if this is consistently reproducible.
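
If this does reproduce, one way to confirm whether the slowdown is in the I/O path rather than in the test logic is to re-run the same fio job by hand and time it. A hedged sketch; the namespace and pod names are the ones from the failing run above and will differ on any other cluster:

```python
# Debugging aid (not part of the test): re-run the exact fio job from the
# failing command and measure its wall-clock time. A result well above the
# requested 480 s points at slow I/O on the cephfs mount (or oc/rsh overhead)
# rather than at the mon trimming logic under test.
import subprocess
import time

NAMESPACE = "namespace-test-2d515ce05d0241ad9d16a3794"  # from the failing run
POD = "pod-test-cephfs-c54436c68cae4c06ae0f71a5"        # from the failing run

fio_cmd = [
    "oc", "-n", NAMESPACE, "rsh", POD,
    "fio", "--name=fio-rand-readwrite",
    "--filename=/var/lib/www/html/fio-rand-readwrite",
    "--readwrite=randrw", "--bs=4K", "--direct=1", "--numjobs=1",
    "--time_based=1", "--runtime=480", "--size=100M", "--iodepth=4",
    "--invalidate=1", "--fsync_on_close=1", "--rwmixread=75",
    "--ioengine=libaio", "--rate=1m", "--rate_process=poisson",
    "--output-format=json",
]

start = time.monotonic()
result = subprocess.run(fio_cmd, capture_output=True, text=True)
elapsed = time.monotonic() - start
print(f"fio wall-clock time: {elapsed:.0f} s, exit code: {result.returncode}")
```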