Created attachment 1928759 [details]
RBD test logs

Description of problem (please be detailed as possible and provide log snippets):
On executing the performance test test_pvc_multiple_snapshot_performance[CephBlockPool-512], the test runs for a few hours and then fails with the timeout error 'TimeoutError: Snapshot was not created on time' while creating snapshot #452.

Cluster Configuration:
Platform: Baremetal
OCP version: 4.11 (installed via UPI)
Node details: 3 masters and 3 workers
ODF version: 4.11
PV count: 12
OSD count: 3
Drive details: Each worker has 2 slower drives (NVMe) and 1 faster drive (Optane)

Version of all relevant components (if applicable):
OCP: 4.11, ODF: 4.11

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Steps to Reproduce:
1. Create an OCP 4.11 cluster on baremetal servers via the UPI method
2. Install ODF 4.11 and ensure Ceph is healthy with 3 OSDs
3. Install the OCS-CI repo and run the performance test test_pvc_multiple_snapshot_performance[CephBlockPool-512]
4. Command line used:
run-ci --cluster-name ocs-storagecluster --cluster-path /root/ocpcluster/ tests2/ tests2/e2e/performance/csi_tests/test_pvc_multi_snapshot_performance.py::TestPvcMultiSnapshotPerformance::test_pvc_multiple_snapshot_performance[CephBlockPool-512] 2>&1 | tee /tmp/perf_multi_snap_rbd_logs.txt
The log file is attached.

Actual results:
After running for a few hours, the test fails with the timeout error 'TimeoutError: Snapshot was not created on time' while creating snapshot #452. The creation time starts increasing from snapshot #260 onward; snapshot #452 took more than 600 secs to create, and hence the test failed.

Expected results:
The test should pass without issues.

Additional info:
Ensured that no other load/activity was running while the performance test was in progress. The bastion node on which the test was run and the other nodes in the cluster were left undisturbed for the duration of the test.
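To illustrate the failure mode described above, here is a minimal sketch of a poll-with-timeout loop. It is not the OCS-CI implementation: the function name wait_for_snapshot, the is_ready callback, and the 5-second polling interval are all hypothetical, and the 600-second default is only inferred from the "more than 600 secs" observation in this report.

```python
import time


def wait_for_snapshot(is_ready, timeout=600.0, interval=5.0, sleep=time.sleep):
    """Poll is_ready() until it returns True or `timeout` seconds pass.

    Hypothetical helper for illustration only; `is_ready` stands in for
    whatever check reports that the snapshot is ready to use.
    Returns the elapsed time in seconds on success.
    """
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            # Same message as the error seen in the attached log.
            raise TimeoutError("Snapshot was not created on time")
        sleep(interval)
```

With a loop of this shape, a snapshot whose creation time gradually drifts past the timeout fails the whole run, even though every earlier snapshot succeeded.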
Created attachment 1928761 [details] default config yaml used
Created attachment 1929045 [details] ocp_odf_version
The OCP must-gather log has been collected and placed in the following Dropbox location:
https://www.dropbox.com/s/szd0gy1e6490q9p/ocp_must-gather.tar.gz?dl=0

ODF must-gather log link:
https://www.dropbox.com/s/4tkn1ms2jiuezxz/odf_must-gather.tar.gz?dl=0

Attached the OCP and ODF version screenshot, and also the creation times of all the snapshots in a separate file.
Created attachment 1929046 [details] snapshots creation time
As I see the results, the problem is not the failure to create snapshot #452 (due to the timeout) but the fact that, starting from snapshot number ~260, the snapshot creation times grow continuously, up to more than 18 secs.
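One way to pin down where the growth starts in the attached creation-time file is sketched below. This is only an assumption about how the analysis could be done: the function name find_slowdown_onset, the baseline window of 50 snapshots, the 2x factor, and the run length of 5 are all hypothetical choices, and the input is assumed to be a plain list of creation times in seconds, in snapshot order.

```python
from statistics import median


def find_slowdown_onset(times, baseline_count=50, factor=2.0, run_length=5):
    """Return the 1-based snapshot number at which creation times first stay
    above factor * (median of the first baseline_count times) for run_length
    consecutive snapshots, or None if that never happens.

    All thresholds are illustrative defaults, not values from the test suite.
    """
    if len(times) < baseline_count + run_length:
        return None
    threshold = factor * median(times[:baseline_count])
    streak = 0
    for i in range(baseline_count, len(times)):
        if times[i] > threshold:
            streak += 1
            if streak == run_length:
                # i is the 0-based index of the last snapshot in the streak;
                # convert the streak's first snapshot to a 1-based number.
                return i - run_length + 2
        else:
            streak = 0
    return None
```

Requiring several consecutive slow snapshots avoids flagging a single outlier as the start of the degradation.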
Please note that we do not see this behaviour on AWS. See the AWS comparison (4.12.0-145 vs 4.11.0-137) on the Perf Dashboard (choose the PVC Multiple Snapshots Creation test results):
http://10.0.78.167:8080/index.php?version1=20&build1=89&platform1=1&az_topology1=1&test_name%5B%5D=1&test_name%5B%5D=2&test_name%5B%5D=3&test_name%5B%5D=4&test_name%5B%5D=6&test_name%5B%5D=8&test_name%5B%5D=9&test_name%5B%5D=10&test_name%5B%5D=11&test_name%5B%5D=15&test_name%5B%5D=16&test_name%5B%5D=17&test_name%5B%5D=18&test_name%5B%5D=20&test_name%5B%5D=21&test_name%5B%5D=23&version2=26&build2=101&platform2=11&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options