Bug 2155500 - dups.size logging causes PV leftovers
Summary: dups.size logging causes PV leftovers
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Rakshith
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-21 11:09 UTC by Daniel Osypenko
Modified: 2023-08-09 16:37 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-24 05:00:28 UTC
Embargoed:



Description Daniel Osypenko 2022-12-21 11:09:29 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

During teardown after the test tests/manage/z_cluster/test_ceph_pg_log_dups_trim.py::TestCephPgLogDupsTrimming::test_ceph_pg_log_dups_trim, it was found that after the PVCs were deleted, the corresponding PVs were not removed automatically, even though their persistentVolumeReclaimPolicy is Delete.
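For reference, a minimal way to confirm such a leftover from the CLI (standard oc commands; <pv-name> is a placeholder):

   # PVs left behind by deleted PVCs show up as Released instead of being removed
   oc get pv | grep -v Bound

   # Check the reclaim policy and current phase of a suspect PV
   oc get pv <pv-name> -o jsonpath='{.spec.persistentVolumeReclaimPolicy}{"  "}{.status.phase}{"\n"}'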


Logs, start line: 
2022-12-02 23:50:31,900 - MainThread - INFO - ocs_ci.ocs.ocp.wait_for_delete.777 - PersistentVolumeClaim pvc-test-924955ab992a4c3a9214a5b9ba596a1 got deleted successfully

Full log: 
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-003bu1cni30-t2/j-003bu1cni30-t2_20221202T102339/logs/ocs-ci-logs-1669980540/by_outcome/failed/tests/manage/z_cluster/test_ceph_pg_log_dups_trim.py/TestCephPgLogDupsTrimming/test_ceph_pg_log_dups_trim/logs



Version of all relevant components (if applicable):

Test failed on
BAREMETAL-UPI-1AZ-RHCOS-NVME-INTEL-COMPACT-MODE-3M-0W
and
vSphere7-DC-CP_VC1-upi_1az_rhcos_vsan_3m_3w
on both ODF 4.11 and 4.12
---------------
OC version:
Client Version: 4.12.0-202208031327
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-12-05-194559
Kubernetes Version: v1.25.2+5533733

OCS version:
ocs-operator.v4.12.0-128.stable              OpenShift Container Storage   4.12.0-128.stable              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-12-05-194559   True        False         14d     Error while reconciling 4.12.0-0.nightly-2022-12-05-194559: the cluster operator monitoring is not available

Rook version:
rook: v4.12.0-0.1457e0daa9d2d838b687e1703e78d723c40ea8c7
go: go1.18.7

Ceph version:
ceph version 16.2.10-75.el8cp (68de1f204d3c34ec62bd59fae7a9814accf1ff25) pacific (stable)
---------------

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
yes, the issue is consistent

Can this issue be reproduced from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1) Deployed an ODF cluster
2) Selected the default "ocs-storagecluster-cephblockpool" for injecting dups	
3) Identified the PG where we want to inject dups.	
4) Bring the OSD daemon down in order to run the ceph-objectstore-tool commands:
    a) oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
    b) oc scale deployment ocs-operator --replicas=0 -n openshift-storage
    c) oc rsh -n openshift-storage `oc get po -l app=rook-ceph-tools -oname` ceph
       ceph> osd set noout
             noout is set
       ceph> osd set pause
         pauserd,pausewr is set
    d) oc get deployment rook-ceph-osd-$i -oyaml > osd_$i.yaml
    e) oc patch deployment/rook-ceph-osd-$i -n openshift-storage -p '{"spec": {"template": {"spec":{"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args":[]}]}}}}'
    f) Wait until the OSD pod reaches the Running state
5) rsh to the OSD pod and inject corrupted dups into the PG via COT:
   CEPH_ARGS='--no_mon_config --osd_pg_log_dups_tracked=999999999999' ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$i/ --pgid 1.13 --op pg-log-inject-dups --file text.json

   Contents of text.json:
   [{"reqid": "client.4177.0:0", "version": "111'999999999", "user_version": "0", "generate": "100", "return_code": "0"},]
6) Verify dups injection via COT:
   sh-4.4# CEPH_ARGS='--no_mon_config --osd_pg_log_dups_tracked=999999999999' ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$i/ --op log --pgid 1.13 > /var/log/ceph/pg_log_1-13.txt
   yum install jq -y
   sh-4.4# cat /var/log/ceph/pg_log_1-13.txt | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'
   1766
   1405 -> dups count
7) Perform the injections on all the remaining OSDs - Repeat steps from 4 to 6 for the remaining OSDs
8) Bring the OSD daemon up and unset the pause and noout flags.
   oc replace --force -f osd_$i.yaml
   oc scale deployment rook-ceph-operator --replicas=1 -n openshift-storage
   oc scale deployment ocs-operator --replicas=1 -n openshift-storage
   oc rsh -n openshift-storage `oc get po -l app=rook-ceph-tools -oname` ceph
   osd unset noout
   osd unset pause	
9) Now start IO; the dups accumulate on top of the corrupted ones that were injected.
10) Bring the OSD daemon down in order to run the ceph-objectstore-tool commands:
    a) oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
    b) oc scale deployment ocs-operator --replicas=0 -n openshift-storage
    c) oc rsh -n openshift-storage `oc get po -l app=rook-ceph-tools -oname` ceph
       ceph> osd set noout
             noout is set
       ceph> osd set pause
         pauserd,pausewr is set
    d) oc get deployment rook-ceph-osd-$i -oyaml > osd_$i.yaml
    e) oc patch deployment/rook-ceph-osd-$i -n openshift-storage -p '{"spec": {"template": {"spec":{"containers": [{"name": "osd", "command": ["sleep", "infinity"], "args":[]}]}}}}'
    f) Wait until the OSD pod reaches the Running state
11) Check for the expected dups message in the OSD logs; the expected log message should be generated for all the OSD pods (see the log-check sketch after this list):
        expected_log = (
            "num of dups exceeded 6000. You can be hit by THE DUPS BUG "
            "https://tracker.ceph.com/issues/53729. Consider ceph-objectstore-tool --op trim-pg-log-dups"
        )
12) Verify that the dups are trimmed using the COT command:
   sh-4.4# CEPH_ARGS='--no_mon_config --osd_pg_log_dups_tracked=999999999999' ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$i/ --op log --pgid 1.13 > /var/log/ceph/pg_log_1-13.txt
   yum install jq -y
   sh-4.4# cat /var/log/ceph/pg_log_1-13.txt | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'
   All of the OSDs' pg log dups are trimmed down to the default tracked value: 3000
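For reference, a minimal sketch of the log check mentioned in step 11 (this is not the exact test code; the app=rook-ceph-osd label is assumed from a standard ODF/Rook deployment, while the openshift-storage namespace and the osd container name are the ones used in the steps above):

   # Count occurrences of the dups warning in each OSD pod's log
   for pod in $(oc get pods -n openshift-storage -l app=rook-ceph-osd -o name); do
       echo "== $pod =="
       oc logs -n openshift-storage "$pod" -c osd | grep -c "num of dups exceeded" || true
   done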


Actual results:
PV removal fails. The PV is left in the Released state (not bound) even though persistentVolumeReclaimPolicy is Delete.

Expected results:
All PVs should be deleted automatically after the bound PVC is deleted.
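When a PV with a Delete reclaim policy is left behind, two places usually worth checking are the PV's finalizers and the RBD CSI provisioner logs. A diagnostic sketch (the csi-rbdplugin-provisioner deployment and csi-provisioner container names are assumed from a standard ODF install; <pv-name> is a placeholder):

   # A stuck PV typically still carries the kubernetes.io/pv-protection finalizer
   oc get pv <pv-name> -o jsonpath='{.status.phase}{"  "}{.metadata.finalizers}{"\n"}'

   # Look for DeleteVolume activity or errors for that volume in the RBD provisioner
   oc logs -n openshift-storage deploy/csi-rbdplugin-provisioner -c csi-provisioner | grep -i <pv-name>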


Additional info:
must-gather logs: https://drive.google.com/file/d/13iyVV0-FoTNK-E0Ds8AyRwQucga3f_8b/view?usp=share_link

Comment 19 Daniel Osypenko 2023-02-20 17:59:38 UTC
I can confirm that `oc cp` does not work. As far as I understand, it is expected that `tar` is no longer part of the ceph pod after the implementation of https://issues.redhat.com/browse/RHSTOR-3411. That said, `oc cp` now fails even before the leftovers check (with 4.12.0-0.nightly-2023-02-18-121434), so the test should have failed earlier, in the part of the test body where we copy a file to the pod. That failure did not happen in the run logged in this bug, but with
4.12.0-0.nightly-2023-02-18-121434 it now happens at every place the test copies files.
@rar can I help with any additional info?
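For context, `oc cp` depends on `tar` being present in the target container, so removing tar from the ceph images breaks it. A common tar-free workaround is to stream the file through `oc exec` (a sketch only, not necessarily the change that was made to the test; <osd-pod> and the paths are placeholders):

   # Copy a local file into a pod without relying on tar inside the container
   oc exec -i -n openshift-storage <osd-pod> -- sh -c 'cat > /tmp/text.json' < text.json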

Comment 20 Daniel Osypenko 2023-02-21 21:11:39 UTC
The test has been updated (the oc cp and yum steps were replaced). The test now passes successfully.


OC version:
Client Version: 4.12.0-202208031327
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2023-02-18-121434
Kubernetes Version: v1.25.4+a34b9e9

OCS version:
ocs-operator.v4.12.1              OpenShift Container Storage   4.12.1    ocs-operator.v4.12.0              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2023-02-18-121434   True        False         2d10h   Cluster version is 4.12.0-0.nightly-2023-02-18-121434

Rook version:
rook: v4.12.1-0.f4e99907f9b9f05a190303465f61d12d5d24cace
go: go1.18.7

Ceph version:
ceph version 16.2.10-94.el8cp (48ce8ed67474ea50f10c019b9445be7f49749d23) pacific (stable)

Comment 21 Rakshith 2023-02-24 05:00:28 UTC
(In reply to Daniel Osypenko from comment #20)
> Test has been updated (oc cp, yum replaced). Test Passes successfully 

Thanks, 
closing this BZ as NOTABUG

