Description of problem (please be as detailed as possible and provide log snippets):

Running the teardown after the test tests/manage/z_cluster/test_ceph_pg_log_dups_trim.py::TestCephPgLogDupsTrimming::test_ceph_pg_log_dups_trim found that after the PVC was removed, the PVs were not removed automatically even though their persistentVolumeReclaimPolicy is Delete.

Logs, start line:
2022-12-02 23:50:31,900 - MainThread - INFO - ocs_ci.ocs.ocp.wait_for_delete.777 - PersistentVolumeClaim pvc-test-924955ab992a4c3a9214a5b9ba596a1 got deleted successfully

Full log:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-003bu1cni30-t2/j-003bu1cni30-t2_20221202T102339/logs/ocs-ci-logs-1669980540/by_outcome/failed/tests/manage/z_cluster/test_ceph_pg_log_dups_trim.py/TestCephPgLogDupsTrimming/test_ceph_pg_log_dups_trim/logs

Version of all relevant components (if applicable):

The test failed on BAREMETAL-UPI-1AZ-RHCOS-NVME-INTEL-COMPACT-MODE-3M-0W and vSphere7-DC-CP_VC1-upi_1az_rhcos_vsan_3m_3w, on both ODF 4.11 and 4.12.

---------------
OC version:
Client Version: 4.12.0-202208031327
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-12-05-194559
Kubernetes Version: v1.25.2+5533733

OCS version:
ocs-operator.v4.12.0-128.stable   OpenShift Container Storage   4.12.0-128.stable   Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-12-05-194559   True        False         14d     Error while reconciling 4.12.0-0.nightly-2022-12-05-194559: the cluster operator monitoring is not available

Rook version:
rook: v4.12.0-0.1457e0daa9d2d838b687e1703e78d723c40ea8c7
go: go1.18.7

Ceph version:
ceph version 16.2.10-75.el8cp (68de1f204d3c34ec62bd59fae7a9814accf1ff25) pacific (stable)
---------------

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes, the issue is consistent.

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1) Deployed an ODF cluster.
2) Selected the default "ocs-storagecluster-cephblockpool" for injecting dups.
3) Identified the PG into which we want to inject dups.
4) Bring the OSD daemon down so that the ceph-objectstore-tool (COT) commands can be run:
a) oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
b) oc scale deployment ocs-operator --replicas=0 -n openshift-storage
c) oc rsh -n openshift-storage `oc get po -l app=rook-ceph-tools -oname` ceph
   ceph> osd set noout
   noout is set
   ceph> osd set pause
   pauserd,pausewr is set
d) oc get deployment rook-ceph-osd-$i -oyaml > osd_$i.yaml
e) oc patch deployment/rook-ceph-osd-$i -n openshift-storage -p '{"spec": {"template": {"spec":{"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args":[]}]}}}}'
f) Wait till the OSD pod reaches the Running state.

5) rsh to the OSD pod and inject corrupted dups into the PG via COT:
CEPH_ARGS='--no_mon_config --osd_pg_log_dups_tracked=999999999999' ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$i/ --pgid 1.13 --op pg-log-inject-dups --file text.json

Content of the text.json file:
[{"reqid": "client.4177.0:0", "version": "111'999999999", "user_version": "0", "generate": "100", "return_code": "0"},]

6) Verify the dups injection via COT:
sh-4.4# CEPH_ARGS='--no_mon_config --osd_pg_log_dups_tracked=999999999999' ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$i/ --op log --pgid 1.13 > /var/log/ceph/pg_log_1-13.txt
yum install jq -y
sh-4.4# cat /var/log/ceph/pg_log_1-13.txt | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'
1766
1405  -> dups count

7) Perform the injections on all the remaining OSDs - repeat steps 4 to 6 for each remaining OSD.

8) Bring the OSD daemon up and unset the pause and noout flags:
oc replace --force -f osd_$i.yaml
oc scale deployment rook-ceph-operator --replicas=1 -n openshift-storage
oc scale deployment ocs-operator --replicas=1 -n openshift-storage
oc rsh -n openshift-storage `oc get po -l app=rook-ceph-tools -oname` ceph
osd unset noout
osd unset pause

9) Start IO; the dups keep accumulating on top of the corrupt ones that were injected.

10) Bring the OSD daemon down again so that the ceph-objectstore-tool commands can be run:
a) oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
b) oc scale deployment ocs-operator --replicas=0 -n openshift-storage
c) oc rsh -n openshift-storage `oc get po -l app=rook-ceph-tools -oname` ceph
   ceph> osd set noout
   noout is set
   ceph> osd set pause
   pauserd,pausewr is set
d) oc get deployment rook-ceph-osd-$i -oyaml > osd_$i.yaml
e) oc patch deployment/rook-ceph-osd-$i -n openshift-storage -p '{"spec": {"template": {"spec":{"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args":[]}]}}}}'
f) Wait till the OSD pod reaches the Running state.

11) Check for the expected dups message in the OSD logs:
expected_log = (
    "num of dups exceeded 6000. You can be hit by THE DUPS BUG "
    "https://tracker.ceph.com/issues/53729. Consider ceph-objectstore-tool --op trim-pg-log-dups"
)

12) The expected log message should be generated for all the OSD pods.

13) Verify that the dups are trimmed using the COT command:
sh-4.4# CEPH_ARGS='--no_mon_config --osd_pg_log_dups_tracked=999999999999' ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$i/ --op log --pgid 1.13 > /var/log/ceph/pg_log_1-13.txt
yum install jq -y
sh-4.4# cat /var/log/ceph/pg_log_1-13.txt | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'
All the OSDs' pg log dups are trimmed to the default tracked value: 3000

Actual results:
The PV fails to be removed. The PV is left Released / not bound even though persistentVolumeReclaimPolicy is Delete.

Expected results:
All PVs should be deleted automatically after their bound PVCs are deleted.

Additional info:
must-gather logs: https://drive.google.com/file/d/13iyVV0-FoTNK-E0Ds8AyRwQucga3f_8b/view?usp=share_link
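A quick way to spot the leftovers described under Actual results is to list PVs whose reclaim policy is Delete but which are no longer Bound. This is a minimal sketch run from the client side, not part of the test itself; it only assumes oc and jq are available on the workstation:

# List PVs with reclaim policy Delete that are not Bound anymore
# (after the PVC is gone, these should have been removed automatically).
oc get pv -o json | jq -r '
  .items[]
  | select(.spec.persistentVolumeReclaimPolicy == "Delete" and .status.phase != "Bound")
  | "\(.metadata.name)\t\(.status.phase)\t\(.spec.claimRef.name // "-")"'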
I can confirm that `oc cp` does not work. As far as I understand, it is legitimate that `tar` is no longer part of the ceph pod due to the implementation of https://issues.redhat.com/browse/RHSTOR-3411. That said, `oc cp` now fails even before the leftovers check (with 4.12.0-0.nightly-2023-02-18-121434), so it has to fail earlier in the body of the test, where we copy the file to the pod. This failure did not happen in the originally logged bug; with 4.12.0-0.nightly-2023-02-18-121434 it happens in every place the test uses it. @rar, can I help with any additional info?
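Since `oc cp` relies on `tar` being present in the target container, one possible workaround is to stream the file over stdin with `oc exec` instead. This is only a sketch of that idea, with the label selector and destination path assumed for illustration; it is not necessarily the change that was made to the test:

# Copy text.json into an OSD pod without tar, by streaming it over stdin.
# The label selector and /tmp destination are assumptions for illustration.
POD=$(oc get po -n openshift-storage -l app=rook-ceph-osd -o name | head -n1)
oc exec -i -n openshift-storage "$POD" -- sh -c 'cat > /tmp/text.json' < text.json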
Test has been updated (oc cp and yum usage replaced). The test now passes successfully.

OC version:
Client Version: 4.12.0-202208031327
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2023-02-18-121434
Kubernetes Version: v1.25.4+a34b9e9

OCS version:
ocs-operator.v4.12.1   OpenShift Container Storage   4.12.1   ocs-operator.v4.12.0   Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2023-02-18-121434   True        False         2d10h   Cluster version is 4.12.0-0.nightly-2023-02-18-121434

Rook version:
rook: v4.12.1-0.f4e99907f9b9f05a190303465f61d12d5d24cace
go: go1.18.7

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)
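One way to avoid the in-pod `yum install jq` from steps 6 and 13 is to dump the pg log inside the pod, pull the dump out, and count the entries locally. This is only a sketch of that approach with placeholder pod/path names; it is not necessarily how the test was actually updated:

# Dump the pg log in the OSD pod (as in steps 6/13), then count log and dups
# entries with a local jq instead of installing jq inside the pod.
# POD, the PG id and the dump path are placeholders for illustration.
POD=$(oc get po -n openshift-storage -l app=rook-ceph-osd -o name | head -n1)
oc exec -n openshift-storage "$POD" -- cat /var/log/ceph/pg_log_1-13.txt > pg_log_1-13.txt
jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)' pg_log_1-13.txt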
(In reply to Daniel Osypenko from comment #20)
> Test has been updated (oc cp and yum usage replaced). The test now passes
> successfully.

Thanks, closing this BZ as NOTABUG.