Bug 2292619 - [IBM Power] [ODF 4.16] When running IO from one pod to a RWX block mode PVC, the data is not synced on other pods which have the same PVC mounted [NEEDINFO]
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.16
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
: ---
Assignee: Niels de Vos
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-06-17 04:42 UTC by Pooja Soni
Modified: 2024-11-07 06:19 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
mrajanna: needinfo? (posoni)
ndevos: needinfo? (posoni)


Description Pooja Soni 2024-06-17 04:42:40 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
tests/functional/pv/pv_services/test_rbd_rwx_pvc.py::TestRbdBlockPvc::test_rbd_block_rwx_pvc is failing on IBM Power with ODF 4.16. The error is:
            # Verify md5sum has changed after IO
            log.info("Verify md5sum has changed after IO. Verify from all pods.")
            for pod_obj in self.pod_objs:
                # Find md5sum
                pod_obj.md5sum_after_io = pod_obj.exec_sh_cmd_on_pod(
                    command=f"dd iflag=direct if={pod_obj.get_storage_path(storage_type='block')} | md5sum"
                )
>               assert pod_obj.md5sum_after_io != md5sum_value_initial, (
                    f"md5sum obtained from the pod {pod_obj.name} has not changed after IO. "
                    f"IO was run from pod {io_pod.name}"
                )
E               AssertionError: md5sum obtained from the pod pod-test-rbd-21b0670da23f483a8f5619730ea has not changed after IO. IO was run from pod pod-test-rbd-27928049ec6045608dbfe88c953
E               assert 'f1c9645dbc14efddc7d8a322685f26eb  -\n' != 'f1c9645dbc14efddc7d8a322685f26eb  -\n'
E                +  where 'f1c9645dbc14efddc7d8a322685f26eb  -\n' = <ocs_ci.ocs.resources.pod.Pod object at 0x7ffe9b2787c0>.md5sum_after_io

tests/functional/pv/pv_services/test_rbd_rwx_pvc.py:92: AssertionError

Version of all relevant components (if applicable):
# oc version
Client Version: 4.16.0-rc.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.0-rc.4
Kubernetes Version: v1.29.5+f6419fb

ODF : 4.16.0

This issue is seen starting with ODF 4.16.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue be reproduced? Yes, the issue is consistent and occurs every time the test case is run.


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 5 Santosh Pillai 2024-06-17 06:04:47 UTC
Hi
- What is `test_rbd_block_rwx_pvc`?
- What are the steps to reproduce?
- What is the expected and actual behavior?

Comment 8 Santosh Pillai 2024-06-17 06:07:07 UTC
The comment above was posted multiple times due to network issues on my end.

Comment 9 Jilju Joy 2024-06-17 16:50:36 UTC
Automated test case steps:

1. Create RBD PVC of volume mode Block and access mode RWX.
2. Create 1 pod on each worker node to consume the same PVC.
3. Verify the md5sum from the pods and ensure that the values obtained from all pods are the same before starting the I/O.
Example command to obtain the md5sum where "/dev/rbdblock" is the devicePath: 

oc -n namespace-test-3379d12f369a4da0b6d20ad40 exec pod-test-rbd-d0f6155489e149e6b33ad93caf0 -- bash -c "dd iflag=direct if=/dev/rbdblock | md5sum"

4. Write data using fio from one of the pods.

Example:
oc -n namespace-test-3379d12f369a4da0b6d20ad40 rsh pod-test-rbd-d0f6155489e149e6b33ad93caf0 fio --name=fio-rand-write --filename=/dev/rbdblock --rw=randwrite --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=5 --size=5M --ioengine=libaio --invalidate=0 --iodepth=4 --rate=1m --rate_process=poisson --end_fsync=1 --output-format=json

5. Verify that md5sum changed on all pods (command in step 3).


Expected result:

In step 5, the md5sum obtained from every pod should differ from the one obtained in step 3, and the new value should be the same on all pods.


Actual result:

The md5sum changed only on the pod from which fio was executed. The other pods still report the md5sum obtained before fio.
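
The check in steps 3 and 5 can be sketched outside the test framework as follows. This is only a minimal illustration, not the ocs-ci implementation: the helper names (`md5sum_command`, `run_oc`, `collect_md5sums`, `stale_pods`) are hypothetical, and the device path is taken from the example above. It assumes a logged-in `oc` client when the default runner is used.

```python
import subprocess
from typing import Callable, Dict, List

DEVICE_PATH = "/dev/rbdblock"  # devicePath from the example in this comment


def md5sum_command(namespace: str, pod: str, device: str = DEVICE_PATH) -> List[str]:
    """Build the `oc exec` argv that reads the device with direct I/O and hashes it."""
    return [
        "oc", "-n", namespace, "exec", pod, "--",
        "bash", "-c", f"dd iflag=direct if={device} | md5sum",
    ]


def run_oc(argv: List[str]) -> str:
    """Run the command and return its stdout (needs a logged-in `oc` client)."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def collect_md5sums(
    namespace: str, pods: List[str], runner: Callable[[List[str]], str] = run_oc
) -> Dict[str, str]:
    """Return {pod: digest}, keeping only the digest field of the md5sum output."""
    return {pod: runner(md5sum_command(namespace, pod)).split()[0] for pod in pods}


def stale_pods(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Pods whose digest is unchanged between the two snapshots."""
    return sorted(pod for pod in after if after[pod] == before.get(pod))
```

Calling `collect_md5sums` once before fio and once after, then passing both results to `stale_pods`, would list the pods that still see the old data. With the behavior described under "Actual result", that list would contain every pod except the one that ran fio.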

Comment 13 Jilju Joy 2024-06-18 09:06:42 UTC
(In reply to Niels de Vos from comment #12)
> (In reply to Madhu Rajanna from comment #11)
> > >oc -n namespace-test-3379d12f369a4da0b6d20ad40 rsh pod-test-rbd-d0f6155489e149e6b33ad93caf0 fio --name=fio-rand-write --filename=/dev/rbdblock --rw=randwrite --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=5 --size=5M --ioengine=libaio --invalidate=0 --iodepth=4 --rate=1m --rate_process=poisson --end_fsync=1 --output-format=json
> > 
> > @Niels, do you see any problem with the fio command that could cause this
> > checksum mismatch?
> 
> I do not immediately see an issue with that command, the important options
> seem to be set ("direct" and/or "end_fsync").
> 
> @Pooja, can you confirm that the md5sum is different on the pod where fio
> was executed? That should confirm the fio command did write to the block
> device.
> 
On the pod from which fio was run, the md5sum changed.
> 
> The actual output of the failure reported by the test should ideally be
> improved. When comparing two checksums, both checksums should be logged.
> This output makes it impossible to see what the 2nd checksum is:
> 
> E               AssertionError: md5sum obtained from the pod
> pod-test-rbd-21b0670da23f483a8f5619730ea has not changed after IO. IO was
> run from pod pod-test-rbd-27928049ec6045608dbfe88c953
> E               assert 'f1c9645dbc14efddc7d8a322685f26eb  -\n' !=
> 'f1c9645dbc14efddc7d8a322685f26eb  -\n'
> E                +  where 'f1c9645dbc14efddc7d8a322685f26eb  -\n' =
> <ocs_ci.ocs.resources.pod.Pod object at 0x7ffe9b2787c0>.md5sum_after_io
> tests/functional/pv/pv_services/test_rbd_rwx_pvc.py:92: AssertionError
> 
Yes, the previous (before fio) and current (after fio) values are not logged in the message, but they are visible in the error raised by the assert statement:
assert 'f1c9645dbc14efddc7d8a322685f26eb  -\n' !=
 'f1c9645dbc14efddc7d8a322685f26eb
Both values are the same.
> 
> Or, could this be a bug that should be redirected to the test-case, as
> comparing a checksum against <ocs_ci.ocs.resources.pod.Pod object at
> 0x7ffe9b2787c0>.md5sum_after_io will never be true?
Debugging on the same cluster may be helpful, because this issue is not observed in our regression runs on other platforms.
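
The suggestion above to log both checksums could look roughly like the sketch below. The helper name `assert_md5sum_changed` is hypothetical and not part of ocs-ci. The message text follows the assertion quoted in the description, with both digests appended.

```python
def assert_md5sum_changed(
    pod_name: str, io_pod_name: str, md5_before: str, md5_after: str
) -> None:
    """Fail with a message that shows both digests when the checksum did not change."""
    before = md5_before.split()[0]  # drop the trailing "  -\n" emitted by md5sum
    after = md5_after.split()[0]
    assert after != before, (
        f"md5sum obtained from the pod {pod_name} has not changed after IO. "
        f"IO was run from pod {io_pod_name}. "
        f"before IO: {before}, after IO: {after}"
    )
```

As far as I can tell, the `+  where ... = <ocs_ci.ocs.resources.pod.Pod object ...>.md5sum_after_io` line in the trace is pytest's assertion introspection reporting where the left-hand value came from. The comparison itself appears to be between two strings, not between a string and the Pod object.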

Comment 17 Sunil Kumar Acharya 2024-06-21 06:05:23 UTC
I think, as @Niels said, it is probably a bug in the test case.

Could we please get this verified?
