Bug 2028757
| Summary: | [RBD][Encryption] Data integrity check failed on cloned PVC | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Jilju Joy <jijoy> |
| Component: | csi-driver | Assignee: | yati padia <ypadia> |
| Status: | CLOSED NOTABUG | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.9 | CC: | cblum, madam, mmuench, muagarwa, ndevos, ocs-bugs, odf-bz-bot, ypadia |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-01-06 11:25:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hi Jilju,
I have a question before we can get to the root cause of the mismatched md5sum values.
How are the read and write operations done here?
Sometimes just running md5sum may not be enough. In block mode, it is important to use direct I/O so that the I/O bypasses the cache.
On checking the test case, I don't see direct I/O being used for the write operation (I may be wrong).
Something like `dd if=/dev/rbdblock iflag=direct bs=4M | md5sum` may help.
Here, we need to make sure that the writes are completely flushed to the cluster before the clone is created. If the caches are not flushed, the md5sum of the clone is calculated over an incomplete set of writes (excluding the unflushed writes still in the cache), resulting in a mismatched md5sum value.
Also, on checking the logs, we see that the unmap (NodeUnstageVolume) was done 10 minutes after the creation of the clone, which means the writes were most likely not flushed.
Logs with timestamp of cloning:
```
2021-11-25T08:55:20.916652229Z I1125 08:55:20.916609 1 connection.go:183] GRPC call: /csi.v1.Controller/CreateVolume
2021-11-25T08:55:20.916887317Z I1125 08:55:20.916702 1 connection.go:184] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-bf1945ac-794b-466c-a983-ce98f12f5e2e","parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-bf1945ac-794b-466c-a983-ce98f12f5e2e","csi.storage.k8s.io/pvc/name":"clone-pvc-test-7074f875a85e4b-400a9faf87","csi.storage.k8s.io/pvc/namespace":"namespace-test-51aeaa179720451e97fca455a","encrypted":"true","encryptionKMSID":"vault-test-af2553dd7e594e659b543976457a1","imageFeatures":"layering","imageFormat":"2","pool":"ocs-storagecluster-cephblockpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Block":{}},"access_mode":{"mode":1}}],"volume_content_source":{"Type":{"Volume":{"volume_id":"0001-0011-openshift-storage-0000000000000001-108d3b2a-4dcd-11ec-aa1e-0a580a810221"}}}}
```
Logs of NodeUnstageVolume:
```
2021-11-25T09:06:50.410287166Z I1125 09:06:50.410260 19603 utils.go:177] ID: 5180 Req-ID: 0001-0011-openshift-storage-0000000000000001-108d3b2a-4dcd-11ec-aa1e-0a580a810221 GRPC call: /csi.v1.Node/NodeUnstageVolume
2021-11-25T09:06:50.410375913Z I1125 09:06:50.410363 19603 utils.go:181] ID: 5180 Req-ID: 0001-0011-openshift-storage-0000000000000001-108d3b2a-4dcd-11ec-aa1e-0a580a810221 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/pvc-02f315dd-19c2-4e99-a825-bc0cb1b9b199","volume_id":"0001-0011-openshift-storage-0000000000000001-108d3b2a-4dcd-11ec-aa1e-0a580a810221"}
```
The writes can be flushed either by using direct I/O or by unmapping the rbd device (rbd unmap). Is either of these done to flush the writes before the clone is created?
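To make the intent concrete, a minimal sketch of both options, assuming the block PVC is exposed inside the pod as /dev/rbdblock (the device path and flags are illustrative, not taken from the test):

```
# Option 1: read the device with direct I/O so the checksum bypasses the page cache
dd if=/dev/rbdblock iflag=direct bs=4M status=none | md5sum

# Flush any buffered writes for the block device before the clone is created
sync
blockdev --flushbufs /dev/rbdblock

# Option 2: unmap the image so all in-flight writes reach the cluster
# (this is effectively what NodeUnstageVolume does when the pod is removed)
rbd unmap /dev/rbdblock
```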
Is the PV attached (mounted) to a Pod for this test or is it just "created" without attaching it? Mounting the PV can change things in the file system header, which would change the checksum.

I can see that the test used to pass in the past. For instance, in this run:
OCS4-9-Downstream-OCP4-9-IBM-Cloud-managed-1AZ-RHEL-3W-tier1 (BUILD ID: 4.9.0-249.ci RUN ID: 1638210117)
The run is from Tue, Nov 30, 9:55 AM (6 days ago).

(In reply to Elad from comment #4)
> I can see that the test used to pass in the past. For instance, in this run:
> OCS4-9-Downstream-OCP4-9-IBM-Cloud-managed-1AZ-RHEL-3W-tier1 (BUILD ID: 4.9.0-249.ci RUN ID: 1638210117)
>
> The run is from Tue, Nov 30, 9:55 AM (6 days ago)

Jilju mentioned in comment #0:
> How reproducible:
> Not easily reproducible. Less than 1/12.

That means 11 out of 12 times the test does not return an error. There is a strong suggestion that the testing is racy. There is a possibility of unflushed cached data when the clone is created; flushing the caches before creating the clone would likely result in more stable test results. Yati explained this quite well in comment #2.

(In reply to yati padia from comment #2)
> Hi Jilju,
>
> I have a question before we can get to the root cause of the mismatched md5sum values.
> How are the read and write operations done here?

fio is used in the test case for I/O.

fio --name=fio-rand-write --filename=/dev/rbdblock --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=500M --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json

> Here, we need to make sure that the writes are completely flushed to the cluster before the clone is created.

Should we add some additional steps after the I/O to perform this before creating the clone?

(In reply to Chris Blum from comment #3)
> Is the PV attached (mounted) to a Pod for this test or is it just "created" without attaching it? Mounting the PV can change things in the file system header, which would change the checksum

Yes, it is attached to a pod. This issue is not seen usually. There are other tests which verify data integrity using md5sum after taking a clone and restoring snapshots. Those tests are passing, which means the change in checksum after attaching to a pod might be a race condition. If there is a better alternative to verify the data integrity, we can change the test accordingly.

(In reply to Jilju Joy from comment #6)
> fio --name=fio-rand-write --filename=/dev/rbdblock --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=500M --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json

Does direct=0 represent the direct-IO operation?

> > Here, we need to make sure that the writes are completely flushed to the cluster before the clone is created.
> Should we add some additional steps after the I/O to perform this before creating the clone?

You can try using the dd command as mentioned in comment 2.

(In reply to yati padia from comment #8)
> (In reply to Jilju Joy from comment #6)
> > fio --name=fio-rand-write --filename=/dev/rbdblock --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=500M --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
>
> Does direct=0 represent the direct-IO operation?

Indeed, enable direct I/O with this option (direct=1). Or maybe fio offers an option to flush caches once it is finished?
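fio does have options along these lines. A minimal sketch of the write job with direct I/O enabled and an explicit flush at the end of the run; the end_fsync and fsync_on_close flags here are illustrative suggestions, not the parameters the test actually adopted:

```
# Same write job as in comment #6, but with direct I/O enabled and the data
# fsync'ed when the job finishes and when the file is closed.
fio --name=fio-rand-write --filename=/dev/rbdblock --rw=randwrite --bs=4K \
    --direct=1 --end_fsync=1 --fsync_on_close=1 \
    --numjobs=1 --time_based=1 --runtime=60 --size=500M \
    --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson \
    --output-format=json
```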
(In reply to Jilju Joy from comment #7)
> Yes, it is attached to a pod. This issue is not seen usually. There are other tests which verify data integrity using md5sum after taking a clone and restoring snapshots. Those tests are passing, which means the change in checksum after attaching to a pod might be a race condition. If there is a better alternative to verify the data integrity, we can change the test accordingly.

Detaching from the Pod (and waiting for the NodeUnstageVolume CSI operation to finish) is an alternative to flushing cached writes.

Added direct=1 in the fio command for the test, as suggested by Yati in our discussion. We will check if the issue is reproducible in upcoming runs.

btw - ripsaw, our benchmark operator, supports running fio and has the option for dropping caches:
https://github.com/cloud-bulldozer/benchmark-operator/blob/master/docs/fio_distributed.md

drop_cache_kernel: (optional, default false) If set to True, kernel cache will be dropped on the labeled hosts before each sample. See here for how to set up kernel cache dropping
drop_cache_rook_ceph: (optional, default false) The IP address of the pod hosting the Rook-Ceph cache drop URL -- See here for how to set up Ceph OSD cache dropping

Technical Note: If you are running kube/openshift on VMs, make sure the disk image or volume is preallocated.

Executed the test case 20 times after updating the fio parameters. All 20 were successful. We will closely watch the production runs of the test case.

Moving it out from 4.9 based on the above comment.

Closing this bug as there is nothing to be done from our side. Assuming https://bugzilla.redhat.com/show_bug.cgi?id=2028757#c10 to be the proposed solution.
Description of problem:

The test case given below failed due to an md5sum mismatch in the cloned PVC. This test case verifies the PVC-to-PVC clone feature for encrypted RBD Block VolumeMode PVCs.

tests/manage/pv_services/pvc_clone/test_encrypted_rbd_pvc_clone.py::TestEncryptedRbdClone::test_pvc_to_pvc_clone[v2]

Test case error:
```
    current_md5sum = cal_md5sum(pod_obj, file_name, block)
    logger.info(f"Original md5sum of file: {original_md5sum}")
    logger.info(f"Current md5sum of file: {current_md5sum}")
>   assert current_md5sum == original_md5sum, "Data corruption found"
E   AssertionError: Data corruption found

ocs_ci/ocs/resources/pod.py:785: AssertionError
```

The test case creates two RBD PVCs with volume mode Block. The access mode of one PVC is RWO, and the second PVC uses RWX. Data corruption was found for the one with access mode RWO (both parent and cloned PVC). The test case failure is very inconsistent, but this bug is still being raised because the issue is about data corruption. The test case failed only twice due to this error: once in an internal mode cluster and once in an external mode cluster. The details given below are from the failure in the internal mode cluster.

Test run: ocs-ci results for OCS4-9-Downstream-OCP4-9-VSPHERE6-UPI-FIPS-1AZ-RHCOS-VSAN-3M-3W-tier1 (BUILD ID: 4.9.0-249.ci RUN ID: 1637806889)

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-106vuf1cs33-t1/j-106vuf1cs33-t1_20211125T012557/logs/failed_testcase_ocs_logs_1637806889/test_pvc_to_pvc_clone%5bv2%5d_ocs_logs/

Test case: tests/manage/pv_services/pvc_clone/test_encrypted_rbd_pvc_clone.py::TestEncryptedRbdClone::test_pvc_to_pvc_clone[v2]

Test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-106vuf1cs33-t1/j-106vuf1cs33-t1_20211125T012557/logs/ocs-ci-logs-1637806889/tests/manage/pv_services/pvc_clone/test_encrypted_rbd_pvc_clone.py/TestEncryptedRbdClone/test_pvc_to_pvc_clone-v2/

Cloned PVC name: clone-pvc-test-7074f875a85e4b-400a9faf87
Parent PVC name: pvc-test-7074f875a85e4bf0b6705ee11d6c0ab

Relevant logs from the test case:
```
2021-11-25 14:25:16 08:55:16 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-51aeaa179720451e97fca455a rsh pod-test-rbd-45de36cd6f8d4ec6ac48eb73bcb bash -c "md5sum /dev/rbdblock"
2021-11-25 14:25:21 08:55:20 - MainThread - ocs_ci.ocs.resources.pod - INFO - md5sum of file /dev/rbdblock: becd4348581bc6d93510cf01b65e2d54
.........
.........
2021-11-25 14:25:53 08:55:53 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n namespace-test-51aeaa179720451e97fca455a rsh pod-test-rbd-4e8811f3701f4a6a872391c70fe bash -c "md5sum /dev/rbdblock"
2021-11-25 14:26:02 08:56:02 - MainThread - ocs_ci.ocs.resources.pod - INFO - md5sum of file /dev/rbdblock: 325a4597930ed0935d39a94567f2b902
2021-11-25 14:26:02 08:56:02 - MainThread - ocs_ci.ocs.resources.pod - INFO - Original md5sum of file: becd4348581bc6d93510cf01b65e2d54
2021-11-25 14:26:02 08:56:02 - MainThread - ocs_ci.ocs.resources.pod - INFO - Current md5sum of file: 325a4597930ed0935d39a94567f2b902
```

Version-Release number of selected component (if applicable):
ODF 4.9.0-249.ci
OCP 4.9.0-0.nightly-2021-11-24-185059
Ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)

How reproducible:
Not easily reproducible. Less than 1/12.

Steps to Reproduce:
The issue was never seen in manual tests.
The only known reproducer is running the ocs-ci test case given below. The failure is very inconsistent.

tests/manage/pv_services/pvc_clone/test_encrypted_rbd_pvc_clone.py::TestEncryptedRbdClone::test_pvc_to_pvc_clone[v2]

Actual results:
md5sum mismatch between the cloned and the parent PVC.

Expected results:
The md5sum should be the same in the parent and the cloned PVC.

Additional info:
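For context, a minimal sketch of the kind of check the test performs, using direct-I/O reads; the namespace and pod names are placeholders, not the actual test code:

```
# Checksum the parent PVC's block device from the pod it is attached to
oc -n <test-namespace> rsh <parent-pod> \
    bash -c "dd if=/dev/rbdblock iflag=direct bs=4M status=none | md5sum"

# After the clone is created and attached, checksum the cloned device the same way
oc -n <test-namespace> rsh <clone-pod> \
    bash -c "dd if=/dev/rbdblock iflag=direct bs=4M status=none | md5sum"

# The two md5sums are expected to match; a mismatch is reported as data corruption
```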