Bug 1934092 - Ceph iSCSI: fallocate(PUNCH_HOLE) succeeds, storage is not zeroed if length not aligned
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: iSCSI
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.2z2
Assignee: Xiubo Li
QA Contact: Gopi
URL:
Whiteboard:
Depends On:
Blocks: 1933983
 
Reported: 2021-03-02 13:51 UTC by Nir Soffer
Modified: 2021-06-15 17:14 UTC
CC: 7 users

Fixed In Version: tcmu-runner-1.5.2-4.el8cp, tcmu-runner-1.5.2-4.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-15 17:13:43 UTC
Embargoed:


Attachments
Trace from qemu-img convert (1.60 MB, application/x-xz)
2021-03-02 13:51 UTC, Nir Soffer
Output of "btrace /dev/sdar" (31.82 KB, text/plain)
2021-03-03 16:01 UTC, Nir Soffer
tcpdump for sg_write_same --unmap --num=128 (24.78 KB, application/x-xz)
2021-03-03 17:21 UTC, Nir Soffer
tcpdump for sg_write_same --unmap --num=127 (63.76 KB, application/x-xz)
2021-03-03 17:21 UTC, Nir Soffer


Links
Github open-iscsi/tcmu-runner issue 650 (open): "WRITESAME w/ UNMAP flag is not the same as UNMAP" (last updated 2021-03-03 18:57:09 UTC)
Red Hat Product Errata RHSA-2021:2445 (last updated 2021-06-15 17:14:08 UTC)

Description Nir Soffer 2021-03-02 13:51:56 UTC
Created attachment 1760204 [details]
Trace from qemu-img convert

Description of problem:

Using fallocate(PUNCH_HOLE) with a Ceph iSCSI device succeeds, but storage is
not zeroed if the length of the request is not aligned (to 1m?).

Applications using fallocate() expect that the range will be zeroed after
the call, as promised by fallocate(2):

       Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
       in mode deallocates space (i.e., creates a hole) in the byte range
       starting at offset and continuing for len bytes. Within the specified
       range, partial filesystem blocks are zeroed, and whole filesystem
       blocks are removed from the file. After a successful call, subsequent
       reads from this range will return zeroes.

       The FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE in
       mode; in other words, even when punching off the end of the file, the
       file size (as reported by stat(2)) does not change.

       Not all filesystems support FALLOC_FL_PUNCH_HOLE; if a filesystem
       doesn't support the operation, an error is returned. The operation is
       supported on at least the following filesystems:

       *  XFS (since Linux 2.6.38)

       *  ext4 (since Linux 3.0)

       *  Btrfs (since Linux 3.7)

       *  tmpfs(5) (since Linux 3.5)

If the call is not supported, the application expects the call to fail with:

       EOPNOTSUPP
              The filesystem containing the file referred to by fd does not
              support this operation; or the mode is not supported by the
              filesystem containing the file referred to by fd.

The manual does not say explicitly that block devices are supported, but support
for block devices was added 5 years ago in:

commit 25f4c41415e513f0e9fb1f3fce2ce98fcba8d263
Author: Darrick J. Wong <darrick.wong>
Date:   Tue Oct 11 13:51:11 2016 -0700

    block: implement (some of) fallocate for block devices

Looking in kernel v4.18:

    case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
        error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
                         GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
        break;

Following the code, it seems that this translates to REQ_OP_WRITE_ZEROES,
which is finally issued as a WRITE SAME command with the UNMAP (0x08) flag.
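
For reference, the promised contract can be checked with a minimal C program
along these lines (a sketch only; the device path is a placeholder, and the
offset/length match the reproduction below):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char *dev = "/dev/sdX";          /* placeholder device path */
        const off_t off = 1024 * 1024;         /* 1 MiB, as in the repro below */
        const size_t len = 28672;              /* the unaligned length from this report */
        void *buf;

        int fd = open(dev, O_RDWR | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* On block devices PUNCH_HOLE must be ORed with KEEP_SIZE. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len) < 0) {
            perror("fallocate");               /* failing with EOPNOTSUPP would be fine */
            return 1;
        }

        /* O_DIRECT reads need an aligned buffer. */
        if (posix_memalign(&buf, 4096, len) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        if (pread(fd, buf, len, off) != (ssize_t)len) { perror("pread"); return 1; }

        /* After a successful call, reads from the range must return zeroes. */
        for (size_t i = 0; i < len; i++) {
            if (((unsigned char *)buf)[i] != 0) {
                fprintf(stderr, "not zeroed at byte %zu\n", i);
                return 1;                      /* this is what the bug produces */
            }
        }
        puts("range zeroed as expected");
        return 0;
    }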

Version-Release number of selected component (if applicable):
4.2

How reproducible:
100%

Steps to Reproduce:

1. Configure multipath on the client side.

2. Connect to ceph iSCSI gateway, here is an example session on a RHV system:

# iscsiadm -m session | grep ceph
tcp: [1] 10.46.12.5:3260,1 iqn.2003-01.com.redhat.iscsi-gw:ceph-rhv (non-flash)
tcp: [2] 10.46.12.6:3260,2 iqn.2003-01.com.redhat.iscsi-gw:ceph-rhv (non-flash)

# multipath -ll 36001405ef54578354644821a81a13f4f
36001405ef54578354644821a81a13f4f dm-60 LIO-ORG,TCMU device
size=75G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 16:0:0:2 sdar 66:176 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 17:0:0:2 sdav 66:240 active ready running

3. Create PV:

# pvcreate /dev/mapper/36001405ef54578354644821a81a13f4f
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 4096.
  Physical volume "/dev/mapper/36001405ef54578354644821a81a13f4f" successfully created.

(I'm not sure why we get the errors from other devices, but it does not seem related.)

4. Create VG:

# vgcreate ceph-iscsi-vg /dev/mapper/36001405ef54578354644821a81a13f4f
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 4096.
  Volume group "ceph-iscsi-vg" successfully created

5. Create LV:

# lvcreate --name test --size 58g ceph-iscsi-vg
  Logical volume "test" created.

(Using 58g since this issue was found when copying a 58g image in RHV.)

6. Copy qcow2 image to the volume

# qemu-img info /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b
image: /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b
file format: qcow2
virtual size: 58 GiB (62277025792 bytes)
disk size: 6.86 GiB
cluster_size: 65536
Format specific information:
    compat: 0.10
    compression type: zlib
    refcount bits: 16

# strace -f -tt -T -o convert.strace qemu-img convert -n -f qcow2 -O raw -t none -T none -W /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b /dev/ceph-iscsi-vg/test

Command succeeded, but:

# qemu-img compare /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b /dev/ceph-iscsi-vg/test
Content mismatch at offset 10457088!

Looking in the trace, offset 10457088 was:

890360 14:43:19.875720 fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 10457088, 28672 <unfinished ...>
...
890360 14:43:19.924785 <... fallocate resumed>) = 0 <0.049053>

So we expect to find zeroes in this range, but:

# dd if=/dev/ceph-iscsi-vg/test bs=28672 count=1 skip=10457088 iflag=direct,skip_bytes of=data
1+0 records in
1+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.00149407 s, 19.2 MB/s

# hexdump data | head
0000000 7022 3076 2c22 3220 5d0a 7d0a 7d0a 0a0a
0000010 756f 6274 786f 7b20 690a 2064 203d 7722
0000020 7058 4f79 2d35 7174 4e65 6a2d 494d 2d36
0000030 526a 7030 322d 5977 2d4e 344b 6266 712d
0000040 7733 6767 2247 730a 6174 7574 2073 203d
0000050 225b 4552 4441 2c22 2220 5257 5449 2245
0000060 202c 5622 5349 4249 454c 5d22 660a 616c
0000070 7367 3d20 5b20 0a5d 7263 6165 6974 6e6f
0000080 745f 6d69 2065 203d 3631 3031 3539 3133
0000090 3338 630a 6572 7461 6f69 5f6e 6f68 7473

# head data
"pv0", 2
]
}
}

outbox {
id = "wXpyO5-tqeN-jMI6-jR0p-2wYN-K4fb-q3wggG"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1610953183

This is leftover lvm metadata from a previous user of this LUN.

Actual results:
fallocate() succeeds, range not zeroed.

Expected results:
fallocate() succeeds, range zeroed.

Additional info:

A minimal reproduction:

# dd if=/dev/zero bs=1M count=4 | tr "\0" "x" > /dev/ceph-iscsi-vg/test
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00563928 s, 744 MB/s

# sync

# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

# fallocate --length=28672 --offset=1m --punch-hole /dev/ceph-iscsi-vg/test; echo $?
0

Command succeeded, but:

[root@oncilla04 ~]# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

No change in storage!

If length is aligned to 1M, it works:

# fallocate --length=1m --offset=28672 --punch-hole /dev/ceph-iscsi-vg/test; echo $?
0

# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0010000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

So the issue seems to be a magic length alignment: if the length is not
aligned, the call succeeds without modifying storage.

Applications affected by this:
- fallocate
- qemu-img
- qemu-kvm
- nbdkit file plugin
- nbdcopy

Comment 1 Yaniv Kaul 2021-03-02 14:18:35 UTC
This impacts our ability to support RHV with Ceph's iSCSI.

Comment 3 Jason Dillaman 2021-03-03 01:15:04 UTC
Please provide a blktrace, or directly test using sg_unmap / sg_write_same to send low-level discard / write-same commands to the iSCSI device, to ensure the issue is not at a higher layer of the stack. If LVM is dropping the request, for example, this isn't something that will be addressed by Ceph iSCSI.

Comment 4 Nir Soffer 2021-03-03 15:59:36 UTC
I think this bug should be easy to reproduce in the lab. It is unlikely that
lvm or device mapper modify the request, since the same stack has worked with
other types of iSCSI and FC storage for years.

To check that this is not related to lvm, I also tested at the multipath
level:

# multipath -ll 36001405ef54578354644821a81a13f4f
36001405ef54578354644821a81a13f4f dm-60 LIO-ORG,TCMU device
size=75G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 16:0:0:2 sdar 66:176 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 17:0:0:2 sdav 66:240 active ready running

# dd if=/dev/zero bs=1M count=4 | tr "\0" "x" > /dev/mapper/36001405ef54578354644821a81a13f4f
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00579098 s, 724 MB/s

# sync

# dd if=/dev/mapper/36001405ef54578354644821a81a13f4f bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

# fallocate --length=28672 --offset=1m --punch-hole /dev/mapper/36001405ef54578354644821a81a13f4f; echo $?
0

# dd if=/dev/mapper/36001405ef54578354644821a81a13f4f bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

# fallocate --length=1m --offset=28672 --punch-hole /dev/mapper/36001405ef54578354644821a81a13f4f; echo $?
0

# dd if=/dev/mapper/36001405ef54578354644821a81a13f4f bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0010000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

Reproduced with multipath device, no lvm.

To make sure this is not related to device mapper, I also tested directly
on the scsi device.

I added this drop-in multipath configuration:

# cat /etc/multipath/conf.d/local.conf 
blacklist {
    wwid "36001405ef54578354644821a81a13f4f"
}

# multipathd reconfigure
ok

# multipath -ll | grep 36001405ef54578354644821a81a13f4f
(no output)


Testing again on /dev/sdar:

# dd if=/dev/zero bs=1M count=4 | tr "\0" "x" > /dev/sdar
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00567953 s, 738 MB/s

# sync

# dd if=/dev/sdar bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

# fallocate --length=28672 --offset=1m --punch-hole /dev/sdar; echo $?
0

# dd if=/dev/sdar bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

# fallocate --length=1m --offset=28672 --punch-hole /dev/sdar; echo $?
0

# dd if=/dev/sdar bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0010000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

Reproduced with the scsi device, no multipath.


I collected blktrace using:

# btrace /dev/sdar > btrace.out

While running:

# fallocate --length=28672 --offset=1m --punch-hole /dev/sdar

The output is not very clear to me, but it seems that the relevant parts are:

 66,176  4        1 1266874889.707827220 1781506  Q  WS 2048 + 56 [fallocate]
 66,176  4        2 1266874889.707832006 1781506  G  WS 2048 + 56 [fallocate]
 66,176  4        3 1266874889.707832630 1781506  P   N [fallocate]
 66,176  4        4 1266874889.707833400 1781506 UT   N [fallocate] 1
 66,176  4        5 1266874889.707834094 1781506  I  WS 2048 + 56 [fallocate]
 66,176  4        6 1266874889.707843658   558  D  WS 2048 + 56 [kworker/4:1H]
 66,176  4        7 1266874889.709176764 1774965  D   N 0 [kworker/4:1]
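
Decoding the relevant lines: Q, G, I and D are the queued, get-request,
inserted and issued stages, and "2048 + 56" is the start sector plus the
length in 512-byte sectors, i.e. 2048 * 512 = 1 MiB offset and 56 * 512 =
28672 bytes, exactly matching the fallocate call above. So the request
reaches the device intact, suggesting the problem is further down the stack.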

Comment 5 Nir Soffer 2021-03-03 16:01:51 UTC
Created attachment 1760414 [details]
Output of "btrace /dev/sdar"

Comment 6 Nir Soffer 2021-03-03 16:28:45 UTC
Yet another way to reproduce this:

Does not work:

# sg_write_same --unmap --num=127 /dev/sdar

# dd if=/dev/sdar bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

Works:

# sg_write_same --unmap --num=128 /dev/sdar

# dd if=/dev/sdar bs=1M count=4 iflag=direct status=none | hexdump
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0010000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

So the magic length seems to be 128 sectors (128 * 512 = 65536 bytes, i.e. 64 KiB).

Comment 7 Nir Soffer 2021-03-03 17:19:56 UTC
I captured the traffic to the ceph server when running:

# sg_write_same --unmap --num=127 /dev/sdar

The server received:

Frame 289: 578 bytes on wire (4624 bits), 578 bytes captured (4624 bits)
Ethernet II, Src: Dell_53:5e:e9 (2c:ea:7f:53:5e:e9), Dst: Qumranet_24:18:05 (00:1a:4a:24:18:05)
Internet Protocol Version 4, Src: 10.46.12.124, Dst: 10.46.12.5
Transmission Control Protocol, Src Port: 46558, Dst Port: 3260, Seq: 385, Ack: 1368529, Len: 512
[2 Reassembled TCP Segments (560 bytes): #288(48), #289(512)]
iSCSI (SCSI Command)
Flags: 0xa1, F, W, Attr: Simple
SCSI CDB Write Same(16)
    [LUN: 0x0002]
    [Command Set:Direct Access Device (0x00) ]
    [Response in: 291]
    Opcode: Write Same(16) (0x93)
    Flags: 0x08, UNMAP
    Logical Block Address (LBA): 0000000000000000
    Transfer Length: 127
    ...0 0000 = Group: 0x00
    Control: 0x00
SCSI Payload (Write Same(16) Request Data)
    [LUN: 0x0002]
    [Command Set:Direct Access Device (0x00) ]
    [SBC Opcode: Write Same(16) (0x93)]
    [Request in: 289]
    [Response in: 291]

And responded with:

Frame 291: 114 bytes on wire (912 bits), 114 bytes captured (912 bits)
Ethernet II, Src: Qumranet_24:18:05 (00:1a:4a:24:18:05), Dst: Dell_53:5e:e9 (2c:ea:7f:53:5e:e9)
Internet Protocol Version 4, Src: 10.46.12.5, Dst: 10.46.12.124
Transmission Control Protocol, Src Port: 3260, Dst Port: 46558, Seq: 1368529, Ack: 897, Len: 48
iSCSI (SCSI Response)
Flags: 0x80
SCSI Response (Write Same(16))
    [LUN: 0x0002]
    [Command Set:Direct Access Device (0x00) ]
    [SBC Opcode: Write Same(16) (0x93)]
    [Request in: 289]
    [Time from request: 0.000082000 seconds]
    [Status: Good (0x00)]


# sg_write_same --unmap --num=128 /dev/sdar

The server received:

Frame 199: 578 bytes on wire (4624 bits), 578 bytes captured (4624 bits)
Ethernet II, Src: Dell_53:5e:e9 (2c:ea:7f:53:5e:e9), Dst: Qumranet_24:18:05 (00:1a:4a:24:18:05)
Internet Protocol Version 4, Src: 10.46.12.124, Dst: 10.46.12.5
Transmission Control Protocol, Src Port: 46558, Dst Port: 3260, Seq: 241, Ack: 1024321, Len: 512
[2 Reassembled TCP Segments (560 bytes): #198(48), #199(512)]
iSCSI (SCSI Command)
Flags: 0xa1, F, W, Attr: Simple
SCSI CDB Write Same(16)
    [LUN: 0x0002]
    [Command Set:Direct Access Device (0x00) ]
    [Response in: 201]
    Opcode: Write Same(16) (0x93)
    Flags: 0x08, UNMAP
    Logical Block Address (LBA): 0000000000000000
    Transfer Length: 128
    ...0 0000 = Group: 0x00
    Control: 0x00
SCSI Payload (Write Same(16) Request Data)
    [LUN: 0x0002]
    [Command Set:Direct Access Device (0x00) ]
    [SBC Opcode: Write Same(16) (0x93)]
    [Request in: 199]
    [Response in: 201]

And responded with:

Frame 201: 114 bytes on wire (912 bits), 114 bytes captured (912 bits)
Ethernet II, Src: Qumranet_24:18:05 (00:1a:4a:24:18:05), Dst: Dell_53:5e:e9 (2c:ea:7f:53:5e:e9)
Internet Protocol Version 4, Src: 10.46.12.5, Dst: 10.46.12.124
Transmission Control Protocol, Src Port: 3260, Dst Port: 46558, Seq: 1024321, Ack: 753, Len: 48
iSCSI (SCSI Response)
Flags: 0x80
SCSI Response (Write Same(16))
    [LUN: 0x0002]
    [Command Set:Direct Access Device (0x00) ]
    [SBC Opcode: Write Same(16) (0x93)]
    [Request in: 199]
    [Time from request: 0.004268000 seconds]
    [Status: Good (0x00)]


So the issue seems to be on the ceph iscsi node side.

Comment 8 Nir Soffer 2021-03-03 17:21:14 UTC
Created attachment 1760439 [details]
tcpdump for sg_write_same --unmap --num=128

Comment 9 Nir Soffer 2021-03-03 17:21:51 UTC
Created attachment 1760440 [details]
tcpdump for sg_write_same --unmap --num=127

Comment 10 Jason Dillaman 2021-03-03 17:35:44 UTC
Thank you for the detailed analysis. That 64KiB limit is actually tied to librbd configuration options "rbd_skip_partial_discard" (defaults to true) and "rbd_discard_granularity_bytes" (defaults to 64KiB). Which version of RHCS (librbd1 on your iSCSI gateway specifically) did you test when you encountered this issue? 
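
To connect that to the numbers above: 127 sectors is 127 * 512 = 65024 bytes,
which, starting at LBA 0, does not cover one complete 64 KiB granularity
block, so with rbd_skip_partial_discard=true the whole request can be
skipped; 128 sectors is exactly 65536 bytes, one full block, which is why
that length worked.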

This is similar to BZ1848594 which added a new write-zero API to fix the issue where discards were silently dropped due to the above configuration settings. The write-same API should be tied into this new write-zero API. Previously, the librbd configuration option "rbd_discard_on_zeroed_write_same" (defaults to true) would treat a write-same of all zeroes as a discard, which caused the original issue. 

In the meantime, while I await your confirmation of the librbd1 version, I will attempt to recreate the WRITE SAME issue locally.

Comment 11 Jason Dillaman 2021-03-03 17:47:31 UTC
If the issue only occurs when you send the unmap flag w/ the write-same operation, I think I see the problem. tcmu-runner is taking the WRITE_SAME SCSI command (w/ unmap flag set) and just converting it into an UNMAP SCSI command, which is a hint and not a requirement to zero out the data.
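
The semantic difference matters: in SBC, WRITE SAME with the UNMAP bit set
must leave the range reading back as its (all-zero) data payload, while a
standalone UNMAP is advisory. A hypothetical handler sketch of the wrong and
right mappings (stand-in names, not the actual tcmu-runner code):

    /* Hypothetical sketch; backend_discard/backend_write_zeroes are
     * stand-ins for whatever the handler calls into. */
    #include <stdint.h>

    struct dev;                             /* opaque backend handle */

    int backend_discard(struct dev *d, uint64_t lba, uint32_t n);      /* hint only */
    int backend_write_zeroes(struct dev *d, uint64_t lba, uint32_t n); /* guaranteed zeroes */

    static int handle_write_same_unmap(struct dev *dev, uint64_t lba, uint32_t count)
    {
        /* Wrong: demoting to a discard lets the backend silently drop
         * parts of the range that do not cover a full discard block. */
        /* return backend_discard(dev, lba, count); */

        /* Right: the spec requires the range to read back as zeroes,
         * so a guaranteed write-zeroes primitive must be used. */
        return backend_write_zeroes(dev, lba, count);
    }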

Comment 12 Nir Soffer 2021-03-03 19:36:36 UTC
(In reply to Jason Dillaman from comment #10)
> Which version of RHCS (librbd1 on your iSCSI gateway specifically)

# rpm -qa | egrep 'ceph|tcmu-runner|kernel|librbd' | sort
ceph-base-14.2.11-95.el8cp.x86_64
ceph-common-14.2.11-95.el8cp.x86_64
ceph-iscsi-3.4-3.el8cp.noarch
ceph-osd-14.2.11-95.el8cp.x86_64
ceph-selinux-14.2.11-95.el8cp.x86_64
kernel-4.18.0-240.el8.x86_64
kernel-core-4.18.0-240.el8.x86_64
kernel-modules-4.18.0-240.el8.x86_64
kernel-tools-4.18.0-240.el8.x86_64
kernel-tools-libs-4.18.0-240.el8.x86_64
libcephfs2-14.2.11-95.el8cp.x86_64
librbd1-14.2.11-95.el8cp.x86_64
python3-ceph-argparse-14.2.11-95.el8cp.x86_64
python3-cephfs-14.2.11-95.el8cp.x86_64
tcmu-runner-1.5.2-2.el8cp.x86_64

Comment 13 Nir Soffer 2021-03-03 20:20:45 UTC
I tried the workaround suggested by Jason, and it works:

On the ceph node:

# cat /etc/ceph/ceph.conf 
...

[client]
# For https://bugzilla.redhat.com/1934092
rbd_skip_partial_discard = false

# systemctl restart tcmu-runner

After this change the unaligned write same + unmap works:

# dd if=/dev/zero bs=1M count=4 | tr "\0" "x" > /dev/sdar
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00580197 s, 723 MB/s

# sync

# dd if=/dev/sdar bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

# fallocate --length=28672 --offset=1m --punch-hole /dev/sdar; echo $?
0

# dd if=/dev/sdar bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0100000 0000 0000 0000 0000 0000 0000 0000 0000
*
0107000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

This is not the right way to configure it, since the configuration file is
managed by Ansible; I guess it should be set in some Ansible configuration
file instead.
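
A possibly cleaner alternative, assuming the cluster's centralized
configuration store is in use (untested here), would be something like:

# ceph config set client rbd_skip_partial_discard false

followed by a tcmu-runner restart on each gateway node.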

Comment 22 Gopi 2021-06-08 07:05:28 UTC
Working as expected on the latest version. Moving to verified state.

[root@f10-h21-000-6049p ~]# multipath -ll
36001405f4679cd64317421db9df4fc91 dm-67 LIO-ORG,TCMU device
size=100G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=50 status=active
| `- 15:0:0:0  sdaw    67:0   active ready running
`-+- policy='queue-length 0' prio=10 status=enabled
  `- 16:0:0:0  sdav    66:240 active ready running


[root@f10-h21-000-6049p ~]# fdisk -l
Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
.
.
.
Disk /dev/mapper/36001405f4679cd64317421db9df4fc91: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 524288 bytes


Disk /dev/sdaw: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 524288 bytes


Disk /dev/sdav: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 524288 bytes


[root@f10-h21-000-6049p ~]# vgcreate ceph-iscsi-vg /dev/mapper/36001405f4679cd64317421db9df4fc91
  Volume group "ceph-iscsi-vg" successfully created

[root@f10-h21-000-6049p ~]# lvcreate --name test --size 90g ceph-iscsi-vg
  Logical volume "test" created.

[root@f10-h21-000-6049p ~]# dd if=/dev/zero bs=1M count=4 | tr "\0" "x" > /dev/ceph-iscsi-vg/test
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00518639 s, 809 MB/s

[root@f10-h21-000-6049p ~]# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

[root@f10-h21-000-6049p ~]# fallocate --length=28672 --offset=1m --punch-hole /dev/ceph-iscsi-vg/test; echo $?
0

[root@f10-h21-000-6049p ~]# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0100000 0000 0000 0000 0000 0000 0000 0000 0000
*
0107000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000
[root@f10-h21-000-6049p ~]#

Comment 24 errata-xmlrpc 2021-06-15 17:13:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445

