Bug 1973317 - libceph: read_partial_message and bad crc/signature errors
Summary: libceph: read_partial_message and bad crc/signature errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Ilya Dryomov
QA Contact: Elad
URL:
Whiteboard:
Depends On: 2024725 2051525
Blocks: 2109455 2190519
 
Reported: 2021-06-17 15:47 UTC by Jenifer Abrams
Modified: 2023-08-09 16:37 UTC
CC List: 28 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-24 13:48:17 UTC
Embargoed:


Attachments
Node SDN drops (272.93 KB, image/png), 2021-06-17 15:52 UTC, Jenifer Abrams
example script to update parameters in a PV (1.26 KB, text/plain), 2022-07-25 11:29 UTC, Niels de Vos


Links:
Red Hat Product Errata RHSA-2022:6156 (last updated 2022-08-24 13:48:51 UTC)

Description Jenifer Abrams 2021-06-17 15:47:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

We are performance testing the new OCS rootdisk OSD functionality using primaryAffinity changes, more background here: 
https://issues.redhat.com/browse/CNV-9885?focusedCommentId=16329205&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16329205
and BZ1924946

We are starting to see some general "slow" behavior that takes a long time to recover; for example, snapshots have taken hours to complete.


Version of all relevant components (if applicable):
# oc version
Client Version: 4.7.9
Server Version: 4.7.9
Kubernetes Version: v1.20.0+7d0a2b2

Red Hat Enterprise Linux CoreOS 47.83.202104250838-0

# oc get csv -A
NAMESPACE                              NAME                                           DISPLAY                       VERSION                 REPLACES                                       PHASE
openshift-cnv                          kubevirt-hyperconverged-operator.v2.6.5        OpenShift Virtualization      2.6.5                   kubevirt-hyperconverged-operator.v2.6.4        Succeeded
openshift-local-storage                local-storage-operator.4.7.0-202105210300.p0   Local Storage                 4.7.0-202105210300.p0   local-storage-operator.4.7.0-202104250659.p0   Succeeded
openshift-operator-lifecycle-manager   packageserver                                  Package Server                0.17.0                                                                 Succeeded
openshift-storage                      ocs-operator.v4.8.0-398.ci                     OpenShift Container Storage   4.8.0-398.ci                                                           Succeeded

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, it seems we may have to reinstall to get a more usable cluster for perf & scale testing.


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
4


Is this issue reproducible?
Yes, it is repeating after multiple days, even when there is not much load on the cluster.


Can this issue be reproduced from the UI?
No


If this is a regression, please provide more details to justify this:
Not clear.

Steps to Reproduce:
1. Start 100s of DV clones and see that many hang waiting on a snapshot (see BZ1972264)
2. Check node dmesgs and see many crc / read_partial errors (one way to collect these across nodes is sketched below)
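
A sketch of one way to count these errors across all worker nodes (it assumes the standard worker node-role label and that "oc debug node/..." access is available):

# Count libceph crc errors in each worker node's dmesg via oc debug (no SSH needed).
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "== ${node} =="
  oc debug "${node}" -- chroot /host sh -c 'dmesg | grep -c "bad crc/signature"' 2>/dev/null
done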

Actual results:
Snapshots can take hours to complete or never complete, and nodes report many OSD errors.

Expected results:
Snapshots complete quickly and the node logs are not filled with OSD errors.

Additional info:

Example node dmesg errors:

[Thu Jun 17 15:06:28 2021] libceph: read_partial_message 00000000b741a8c0 data crc 1432058981 != exp. 1838065674
[Thu Jun 17 15:06:28 2021] libceph: read_partial_message 00000000e0bb0742 data crc 313509134 != exp. 3101144424
[Thu Jun 17 15:06:28 2021] libceph: osd91 (1)192.168.231.152:6801 bad crc/signature
[Thu Jun 17 15:06:28 2021] libceph: read_partial_message 00000000bca064d2 data crc 422122972 != exp. 259329067
[Thu Jun 17 15:06:28 2021] libceph: osd28 (1)192.168.231.79:6801 bad crc/signature
[Thu Jun 17 15:06:28 2021] libceph: read_partial_message 00000000a0f96a48 data crc 1292406045 != exp. 574600921
[Thu Jun 17 15:06:28 2021] libceph: osd4 (1)192.168.231.104:6801 bad crc/signature
[Thu Jun 17 15:06:28 2021] libceph: read_partial_message 000000009a3481bd data crc 2961361773 != exp. 3208433361
[Thu Jun 17 15:06:28 2021] libceph: osd86 (1)192.168.231.146:6801 bad crc/signature
[Thu Jun 17 15:06:28 2021] libceph: osd62 (1)192.168.231.122:6801 bad crc/signature
[Thu Jun 17 15:06:28 2021] libceph: osd47 (1)192.168.231.91:6801 bad crc/signature
[Thu Jun 17 15:06:28 2021] libceph: read_partial_message 00000000b55213f1 data crc 2103443437 != exp. 412102605
[Thu Jun 17 15:06:28 2021] libceph: read_partial_message 000000003bff3ada data crc 813328526 != exp. 3202008766
[Thu Jun 17 15:06:28 2021] libceph: osd47 (1)192.168.231.91:6801 bad crc/signature
[Thu Jun 17 15:06:28 2021] libceph: osd1 (1)192.168.231.99:6801 bad crc/signature

Comment 2 Jenifer Abrams 2021-06-17 15:51:20 UTC
Some potential clues:
Some of the nodes that have very fresh dmesg errors show significant OSD CPU throttling (note: compression is enabled).

CPU:
container_cpu_cfs_throttled_seconds_total
osd
https-metrics
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod01b14aae_e487_4a77_b2f2_b2b45fdd1026.slice/crio-6966a15b8502c7b55c9905ce9975488122f2caae60f360763b51148294c8fddd.scope
quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509
192.168.222.40:10250
kubelet
/metrics/cadvisor
k8s_osd_rook-ceph-osd-89-77667f57c5-qdnw4_openshift-storage_01b14aae-e487-4a77-b2f2-b2b45fdd1026_0
openshift-storage
worker11
rook-ceph-osd-89-77667f57c5-qdnw4
openshift-monitoring/k8s
kubelet
3797.510377909

container_cpu_cfs_throttled_seconds_total
osd
https-metrics
/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod47dedbfa_38d8_40e0_8580_2ba383735641.slice/crio-8282bd7a526887bd3e14aadf2d5db99d42dd94a9ca919a9c814e5ab2d8511511.scope
quay.io/rhceph-dev/rhceph@sha256:d2e99edf733960244256ad82e761acffe9f09e76749bb769469b4b929b25c509
192.168.222.39:10250
kubelet
/metrics/cadvisor
k8s_osd_rook-ceph-osd-16-69f6c698d5-fc8wk_openshift-storage_47dedbfa-38d8-40e0-8580-2ba383735641_0
openshift-storage
worker10
rook-ceph-osd-16-69f6c698d5-fc8wk
openshift-monitoring/k8s
kubelet
2444.631962122
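
For reference, the numbers above are samples of the container_cpu_cfs_throttled_seconds_total counter. One way to pull the same data from the in-cluster monitoring stack (a sketch, assuming the default thanos-querier route in openshift-monitoring and that the logged-in user is allowed to query cluster metrics):

# Query cumulative CFS throttling for the OSD containers in openshift-storage.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=container_cpu_cfs_throttled_seconds_total{namespace="openshift-storage",container="osd"}' \
  "https://${HOST}/api/v1/query"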

Comment 3 Jenifer Abrams 2021-06-17 15:52:54 UTC
Created attachment 1791857 [details]
Node SDN drops

There are also some network drops across the cluster, especially in the SDN.

Comment 4 Orit Wasserman 2021-06-17 17:00:03 UTC
(In reply to Jenifer Abrams from comment #2)
> Some potential clues:
> Some of the nodes that have very fresh dmesg errors show significant OSD CPU
> throttling (note: compression is enabled).
> [...]

This can explain the slowness but not the bad crc errors.
The high CPU usage may be related to the use of compression.

Comment 5 Orit Wasserman 2021-06-17 17:02:55 UTC
Ilya,
Moving here as it is not related to OSD affinity.
Can the "bad crc" be related to compression?
I didn't find any similar issue upstream.

Comment 11 Sébastien Han 2021-06-22 08:34:15 UTC
If we believe there is a networking problem we should move to OCP and the network team for further investigation. OCS is just consuming the networking functionality. Nothing more.
Jenifer, can you maybe change the title to something more networking-oriented and move it to OCP?

Thanks.

Comment 13 Jose A. Rivera 2021-06-29 16:17:21 UTC
Looking this over, it seems we are still doing an RCA. That said, as best I can tell, this has nothing to do with ocs-operator at this time. If it turns out we need some sort of high-level configuration change, then I guess it would apply, but it could just as easily be a change in automation in Rook or a bug fix in Ceph. As such, moving this to unclassified and pushing it out to ODF 4.9.

Comment 14 Jenifer Abrams 2021-06-29 18:05:28 UTC
We saw these messages again recently. They appear in all nodes' dmesg logs, and all nodes recovered at around the same time; it has not happened again in the last few hours. Around this time there was ~500-pod churn from VMs starting and stopping, and there are some net drops in the dashboards, but not that many.

If there is a network drop on a couple of nodes, would we expect to see read_partial_message errors across ALL nodes of the cluster?

On all nodes the messages stop around 16:00 or so:

[Tue Jun 29 15:58:57 2021] libceph: read_partial_message 00000000ef8f94e8 data crc 4082478763 != exp. 2475296746
[Tue Jun 29 15:58:57 2021] libceph: osd41 (1)192.168.231.72:6801 bad crc/signature
[Tue Jun 29 15:58:58 2021] libceph: read_partial_message 00000000ef8f94e8 data crc 4278525306 != exp. 2586139736
[Tue Jun 29 15:58:58 2021] libceph: read_partial_message 0000000086dcc32a data crc 995629959 != exp. 1090424828
[Tue Jun 29 15:58:58 2021] libceph: osd71 (1)192.168.231.128:6801 bad crc/signature
[Tue Jun 29 15:58:58 2021] libceph: osd82 (1)192.168.231.139:6801 bad crc/signature
[Tue Jun 29 16:05:14 2021] k6t-eth0: port 2(tap0) entered disabled state
[Tue Jun 29 16:05:15 2021] device vethbde32f3f left promiscuous mode
[Tue Jun 29 16:05:15 2021] k6t-eth0: port 2(tap0) entered disabled state
[Tue Jun 29 16:05:16 2021] device vethb1fe70c5 left promiscuous mode
[Tue Jun 29 16:05:16 2021] k6t-eth0: port 2(tap0) entered disabled state
[Tue Jun 29 16:05:16 2021] device veth55046c2e left promiscuous mode
[Tue Jun 29 16:05:17 2021] device veth77d44497 left promiscuous mode

[Tue Jun 29 16:01:55 2021] libceph: read_partial_message 00000000ee20a483 data crc 3452712560 != exp. 1420416917
[Tue Jun 29 16:01:55 2021] libceph: osd26 (1)192.168.231.76:6801 bad crc/signature
[Tue Jun 29 16:01:58 2021] libceph: read_partial_message 000000004aa8af3a data crc 1332188897 != exp. 1246963503
[Tue Jun 29 16:01:58 2021] libceph: read_partial_message 0000000009e73126 data crc 4047557739 != exp. 1083099952
[Tue Jun 29 16:01:58 2021] libceph: osd47 (1)192.168.231.90:6801 bad crc/signature
[Tue Jun 29 16:01:58 2021] libceph: osd83 (1)192.168.231.140:6801 bad crc/signature
[Tue Jun 29 16:04:59 2021] k6t-eth0: port 2(tap0) entered disabled state
[Tue Jun 29 16:05:00 2021] device veth4270a71b left promiscuous mode
[Tue Jun 29 16:05:00 2021] k6t-eth0: port 2(tap0) entered disabled state
[Tue Jun 29 16:05:01 2021] device vethb7d52873 left promiscuous mode
[Tue Jun 29 16:05:01 2021] k6t-eth0: port 2(tap0) entered disabled state

[Tue Jun 29 15:59:06 2021] libceph: osd46 (1)192.168.231.71:6801 bad crc/signature
[Tue Jun 29 15:59:06 2021] libceph: osd44 (1)192.168.231.82:6801 bad crc/signature
[Tue Jun 29 15:59:38 2021] libceph: read_partial_message 00000000a4cd3c20 data crc 4190995167 != exp. 1246963503
[Tue Jun 29 15:59:38 2021] libceph: read_partial_message 00000000230511e5 data crc 3469798536 != exp. 1420416917
[Tue Jun 29 15:59:38 2021] libceph: osd37 (1)192.168.231.63:6801 bad crc/signature
[Tue Jun 29 15:59:38 2021] libceph: osd40 (1)192.168.231.68:6801 bad crc/signature
[Tue Jun 29 16:03:04 2021] k6t-eth0: port 2(tap0) entered disabled state
[Tue Jun 29 16:03:06 2021] device veth10b1bea9 left promiscuous mode
[Tue Jun 29 16:03:14 2021] k6t-eth0: port 2(tap0) entered disabled state
[Tue Jun 29 16:03:15 2021] device vethc5bc8670 left promiscuous mode
 
[etc...]

Comment 19 Yaniv Kaul 2021-07-18 07:55:49 UTC
Can we see how many packet drops we are getting? Just look at the physical NICs' counters. If there's a good correlation, at least it can explain the issue. I'm not happy about it, but it would explain it.
If there aren't packet drops at the network layer, then we have more digging to do; for example, provide more cores to the OSDs to ensure they are not saturated.
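
One quick way to check (a sketch; <node-name> is a placeholder, and the interface name ens1f0 is taken from the outputs below):

# Per-interface drop counters straight from the node, via oc debug (no SSH needed).
oc debug node/<node-name> -- chroot /host ip -s link show dev ens1f0

# Driver-level counters (rx_missed, rx_discards, etc.), where the NIC driver exposes them:
oc debug node/<node-name> -- chroot /host ethtool -S ens1f0 | grep -iE 'drop|miss|discard'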

Comment 20 Jenifer Abrams 2021-07-20 16:22:54 UTC
We reinstalled OCS without multus this time. On a node that was recently drained and rebooted, Boaz ran more I/O load, and I see the following:

[core@worker02 ~]$ uptime
 16:17:09 up 23:02,  1 user,  load average: 0.85, 1.04, 0.86

SDN iface:
ens1f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.222.31  netmask 255.255.255.0  broadcast 192.168.222.255
        inet6 fe80::edc6:a78:32b7:4731  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:7a:bc:06  txqueuelen 1000  (Ethernet)
        RX packets 892469775  bytes 1211058653041 (1.1 TiB)
        RX errors 0  dropped 40245  overruns 0  frame 0
        TX packets 738633287  bytes 968075984652 (901.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Still getting some crc errors:
[...]
[70214.237494] libceph: osd71 (1)10.128.10.174:6801 bad crc/signature
[70214.246411] libceph: osd40 (1)10.129.5.122:6801 bad crc/signature
[70231.172542] libceph: read_partial_message 0000000098781edb data crc 100499156 != exp. 622943667
[70231.172563] libceph: read_partial_message 00000000fcfd4655 data crc 879440080 != exp. 2738169811
[70231.181446] libceph: osd45 (1)10.130.9.119:6801 bad crc/signature
[70231.190077] libceph: osd73 (1)10.129.5.124:6801 bad crc/signature


Error counts:
[core@worker02 ~]$ dmesg | grep crc | wc -l
1374
[core@worker02 ~]$ dmesg | grep "crc/signature" | wc -l
687
[core@worker02 ~]$ dmesg | grep "read_partial" | wc -l
687

However, we have been on OCP 4.7.9; we are going to upgrade to the latest 4.8 to confirm the behavior.

Comment 21 Orit Wasserman 2021-07-26 12:37:48 UTC
@Jenifer do you still see drops in the SDN?

Comment 22 Jenifer Abrams 2021-07-27 14:29:07 UTC
Yes, drops and crc errors continue when heavy I/O load is applied. This is now OCP 4.8.0 + OCS 4.8.0-rc4 without multus, for example:

[core@worker15 ~]$ uptime
 14:27:09 up 6 days, 12:42,  1 user,  load average: 2.94, 1.93, 1.17

[core@worker15 ~]$ dmesg | grep "crc/signature" | wc -l
3062
[core@worker15 ~]$ dmesg | grep "read_partial" | wc -l
3062

ens1f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.222.44  netmask 255.255.255.0  broadcast 192.168.222.255
        inet6 fe80::5cff:38a8:f710:1f  prefixlen 64  scopeid 0x20<link>
        ether 00:25:90:5f:5f:f6  txqueuelen 1000  (Ethernet)
        RX packets 23952404963  bytes 31069217590176 (28.2 TiB)
        RX errors 0  dropped 178964  overruns 0  frame 0
        TX packets 23021407815  bytes 29612868047231 (26.9 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Comment 23 Jenifer Abrams 2021-07-29 20:23:44 UTC
Digging through some older data, this does appear to be a side effect of overwhelming the network with the I/O workload; I found similar behavior on a smaller cluster in a different lab without rootdisk OSDs, etc.

High net drops on the SDN iface (23-day uptime):
2: ens7f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:c4:f5:e0 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9702 addrgenmode none numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535 portid 3cfdfec4f5e0 
    RX: bytes  packets  errors  dropped overrun mcast   
    327544833605281 242323675897 0       8176055 0       85708694 
    TX: bytes  packets  errors  dropped carrier collsns 
    328676126145208 242940305394 0       0       0       0  

[2024438.049041] libceph: read_partial_message 00000000354306df data crc 1915878661 != exp. 476088224
[2024438.049044] libceph: osd0 (1)10.131.1.218:6801 bad crc/signature
[2024438.049050] libceph: read_partial_message 0000000004a82b97 data crc 2547421837 != exp. 2622913009
[2024438.049052] libceph: osd16 (1)10.130.1.249:6801 bad crc/signature
[2024438.059777] libceph: read_partial_message 0000000050a1747d data crc 1466885076 != exp. 3640044599
[2024438.059783] libceph: read_partial_message 0000000012a377c2 data crc 3523841465 != exp. 2906553516
[2024438.059786] libceph: osd19 (1)10.130.1.250:6801 bad crc/signature
[2024438.130341] libceph: osd22 (1)10.130.1.251:6801 bad crc/signature

Also, just to note: the original snapshot behavior described in the first comment is no longer reproducible; it is covered in BZ1976936.

Comment 33 Ben England 2021-08-16 12:42:11 UTC
in response to Vadim Rosenfeld's question: "Is there any way to log the IO sequence (write/flush/read) to make sure that commands are coming in the right order?"

Ceph Messenger is layered upon TCP/IP in OCS (OpenShift Container Storage), and TCP transport guarantees message delivery in order of transmission. For block storage, a block device is a sequence of 4-MiB RADOS objects, each within its assigned placement group. So there may be multiple Ceph RADOS "objects" involved in completing a write that crosses 4-MiB boundaries, and hence multiple TCP connections involved with that write. It would be useful to compare Windows VM and Linux VM block traces to see if some writes cross 4-MiB boundaries with Windows but not with Linux, or if there are other alignment differences; scripting could be used to search for this. From the Ceph RBD perspective, AFAIK it does not know that we're dealing with a Windows VM; it is just another block device that happens to be in use by a Windows VM.

To Dave Gilbert's questions: this is a "hyperconverged" cluster, so Ceph is scattered across the hosts, including the one hosting the VM, and the other statements are correct: Windows VM, qemu using /dev/rbdX devices, kernel RBD. There have been problems in the past where running a VM on the same host as an OSD induced some different behavior (I think when we used XFS instead of ext4, which is part of why we switched from XFS to ext4 for OCS PVs, as I recall?); I can't remember which bz/tracker this was. Can we shut down the Ceph OSDs on some node X, then boot the VM on node X and see if the problem still occurs?
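
As a concrete illustration of the 4-MiB boundary point (a sketch, not taken from this cluster's data; it just applies the default 4 MiB RBD object size):

# Which 4-MiB RADOS objects does a block-layer write touch?
OBJ_SIZE=$((4 * 1024 * 1024))   # default RBD object size
OFF=4190208                     # example offset: 4 MiB - 4 KiB
LEN=16384                       # example length: 16 KiB
FIRST=$((OFF / OBJ_SIZE))
LAST=$(( (OFF + LEN - 1) / OBJ_SIZE ))
echo "write [${OFF}, $((OFF + LEN))) touches objects ${FIRST}..${LAST}"
# Prints "touches objects 0..1": the write crosses a 4-MiB boundary, so it is
# split across two RADOS objects and potentially two different OSD connections.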

Comment 50 Michey Mehta 2021-08-23 15:23:40 UTC
Thanks Stefan!

Looking at the Windows guest using some of the Sysinternals tools, I have a possible explanation for why Windows is doing this target buffer overlap (not that it really matters). Most of these read target buffer overlaps happen during system boot, although some happen well after boot; most of the time I see this being done by svchost.exe to memory-map DLL files for services which have not yet been mapped. To speed up loading just the initial working set pages for these services, Windows maintains a list of page offsets initially accessed by each executable in C:\Windows\prefetch. To load this initial working set it uses the odd scatter-gather target buffer pattern we see, where it ends up also reading the data for pages it does not yet need in memory and diverts all of those as-yet-unneeded pages to a dummy target page (probably trying to keep the memory footprint low by not cluttering it with unneeded pages).

Comment 70 Ilya Dryomov 2021-09-29 11:05:11 UTC
In addition to https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85)?redirectedfrom=MSDN, there is an MVP blog article on this from 2005: https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/. It suggests that the dummy page concept was introduced in Windows XP (2001) and started being heavily used with what is referred to as "prefetch-style clustering" in Windows Vista (2007). Here is a slide from "The Memory Manager in Windows Server 2003 and Windows Vista" deck:

Windows Vista – I/O Section Access Improvements

- Pervasive prefetch-style clustering for all types of page faults and system cache read ahead
- Major benefits over previous clustering
  - Infinite size read ahead instead of 64k max
  - Dummy page usage
    - So a single large I/O is always issued regardless of valid pages encountered in the cluster
  - Pages for the I/O are put in transition (not valid) 
  - No VA space is required
    - If the pages are not subsequently referenced, no working set trim and TLB flush is needed either
- Further emphasizes that driver writers must be aware that MDL pages can have their contents change!

Comment 83 Ben England 2021-10-19 20:33:37 UTC
does this fix only impact kernel RBD?  If not, is an equivalent fix needed for librbd (i.e. openstack) with Windows VMs there?  I made comment 70 public instead of private because it seems like a really useful explanation of the problem with nothing that shouldn't be public knowledge.

Comment 84 Yaniv Kaul 2021-11-08 10:12:37 UTC
(In reply to Ben England from comment #83)
> does this fix only impact kernel RBD?  If not, is an equivalent fix needed
> for librbd (i.e. openstack) with Windows VMs there?  I made comment 70
> public instead of private because it seems like a really useful explanation
> of the problem with nothing that shouldn't be public knowledge.

Ilya?

Comment 109 Mudit Agarwal 2022-05-24 05:37:01 UTC
This can be moved to ON_QA; please provide qa_ack.

Comment 110 Elad 2022-06-21 08:19:46 UTC
Since ODF QE doesn't have the means to test this scenario, it is going to be verified based on regression testing results only.

Comment 114 Elad 2022-07-19 14:49:42 UTC
Moving to VERIFIED based on regression testing using 4.11.0-113.
For reference - ocs-ci results for OCS4-11-Downstream-OCP4-11-RHV-IPI-1AZ-RHCOS-3M-3W-tier1 (BUILD ID: 4.11.0-113 RUN ID: 1657953608)

Comment 117 Niels de Vos 2022-07-25 11:29:03 UTC
Created attachment 1899166 [details]
example script to update parameters in a PV

Bug 2109455 will be used to explain the need for "rxbounce" in the documentation, and how a StorageClass needs to be created/modified so that the option is enabled.

Attached is an example script that shows the steps to change the parameters of a PV. The example uses a simple Pod workload. When VMs are used, the deletion/recreation of the Pod needs to be replaced by stopping/starting the VM.
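
For illustration, a minimal sketch of where the option goes for new volumes (this assumes the deployed ceph-csi RBD provisioner honours a "mapOptions" StorageClass parameter, that the exact mapOptions syntax matches the installed ceph-csi version, and that ocs-storagecluster-ceph-rbd is the StorageClass being copied; existing PVs still need their parameters updated as in the attached script):

# Clone the existing RBD StorageClass under a new name and add the krbd rxbounce map option.
oc get sc ocs-storagecluster-ceph-rbd -o yaml \
  | sed 's/name: ocs-storagecluster-ceph-rbd$/name: ocs-storagecluster-ceph-rbd-rxbounce/' \
  > rbd-rxbounce-sc.yaml
# Edit rbd-rxbounce-sc.yaml: under "parameters:", add
#   mapOptions: "krbd:rxbounce"
# and drop server-generated metadata (uid, resourceVersion, creationTimestamp), then:
oc apply -f rbd-rxbounce-sc.yaml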

Comment 121 errata-xmlrpc 2022-08-24 13:48:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

