Bug 1934753

Summary: Ceph iSCSI: fallocate(PUNCH_HOLE) succeeds, storage is not zeroed if length not aligned
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Jason Dillaman <jdillama>
Component: iSCSI
Assignee: Xiubo Li <xiubli>
Status: CLOSED ERRATA
QA Contact: Gopi <gpatta>
Severity: urgent
Priority: unspecified
Version: 5.0
CC: ceph-eng-bugs, ceph-qe-bugs, gpatta, tserlin, vereddy
Target Release: 5.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tcmu-runner-1.5.4-1.el8cp
Last Closed: 2021-08-30 08:28:49 UTC

Description Jason Dillaman 2021-03-03 19:01:24 UTC
This bug was initially created as a copy of Bug #1934092

Description of problem:

Using fallocate(PUNCH_HOLE) on a Ceph iSCSI device succeeds, but the storage is
not zeroed if the length of the request is not aligned (to 1 MiB?).

Applications using fallocate() expect that the range will be zeroed after
the call, as promised by fallocate(2):

       Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
       in mode deallocates space (i.e., creates a  hole)  in  the  byte  range
       starting  at offset and continuing for len bytes.  Within the specified
       range, partial filesystem  blocks  are  zeroed,  and  whole  filesystem
       blocks  are removed from the file.  After a successful call, subsequent
       reads from this range will return zeroes.

       The FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE  in
       mode;  in  other words, even when punching off the end of the file, the
       file size (as reported by stat(2)) does not change.

       Not all  filesystems  support  FALLOC_FL_PUNCH_HOLE;  if  a  filesystem
       doesn't  support the operation, an error is returned.  The operation is
       supported on at least the following filesystems:

       *  XFS (since Linux 2.6.38)

       *  ext4 (since Linux 3.0)

       *  Btrfs (since Linux 3.7)

       *  tmpfs(5) (since Linux 3.5)

If the call is not supported, the application expects it to fail with:

       EOPNOTSUPP
              The filesystem containing the file referred to by  fd  does  not
              support this operation; or the mode is not supported by the
              filesystem containing the file referred to by fd.

The manual does not explicitly say that block devices are supported, but support
for block devices was added in 2016, in:

commit 25f4c41415e513f0e9fb1f3fce2ce98fcba8d263
Author: Darrick J. Wong <darrick.wong>
Date:   Tue Oct 11 13:51:11 2016 -0700

    block: implement (some of) fallocate for block devices

Looking in kernel v4.18:

    case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
        error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
                         GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
        break;

Following the code, it seems that this translates to REQ_OP_WRITE_ZEROES,
which finally issues a WRITE_SAME command with the UNMAP flag (0x8) set.
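
The request that enters this kernel path can be issued from user space. The
following is a minimal sketch, not code from this report: it calls fallocate(2)
through ctypes with the same mode as the case label in the kernel snippet. The
helper name punch_hole is hypothetical, and the 64-bit offset and length
argument types assume a 64-bit Linux system.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Call fallocate(2) with PUNCH_HOLE|KEEP_SIZE on an open descriptor.

    Hypothetical helper. Raises OSError (for example EOPNOTSUPP) on failure.
    """
    # dlopen(NULL): resolve fallocate from the libc already loaded by Python.
    libc = ctypes.CDLL(None, use_errno=True)
    # Assumption: off_t is 64 bits wide (true on 64-bit Linux).
    libc.fallocate.argtypes = [
        ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64,
    ]
    libc.fallocate.restype = ctypes.c_int

    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For a block device, open the device node read-write and pass its descriptor. A
call that returns without raising only shows that the request was accepted; the
range still has to be read back to confirm that it was zeroed.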

Version-Release number of selected component (if applicable):
4.2

How reproducible:
100%

Steps to Reproduce:

1. Configure multipath on the client side.

2. Connect to the Ceph iSCSI gateway. Here is an example session on a RHV system:

# iscsiadm -m session | grep ceph
tcp: [1] 10.46.12.5:3260,1 iqn.2003-01.com.redhat.iscsi-gw:ceph-rhv (non-flash)
tcp: [2] 10.46.12.6:3260,2 iqn.2003-01.com.redhat.iscsi-gw:ceph-rhv (non-flash)

36001405ef54578354644821a81a13f4f dm-60 LIO-ORG,TCMU device
size=75G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 16:0:0:2 sdar 66:176 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 17:0:0:2 sdav 66:240 active ready running

3. Create PV:

# pvcreate /dev/mapper/36001405ef54578354644821a81a13f4f
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 4096.
  Physical volume "/dev/mapper/36001405ef54578354644821a81a13f4f" successfully created.

(I'm not sure why we get the errors from other devices, but it does not seem related.)

4. Create VG:

# vgcreate ceph-iscsi-vg /dev/mapper/36001405ef54578354644821a81a13f4f
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f35 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f34 at 0 length 4096.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 512.
  Error reading device /dev/mapper/3600a098038304479363f4c4870454f48 at 0 length 4096.
  Volume group "ceph-iscsi-vg" successfully created

5. Create LV:

# lvcreate --name test --size 58g ceph-iscsi-vg
  Logical volume "test" created.

(Using 58g since this issue was found when copying a 58g image in RHV.)

6. Copy a qcow2 image to the volume:

# qemu-img info /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b
image: /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b
file format: qcow2
virtual size: 58 GiB (62277025792 bytes)
disk size: 6.86 GiB
cluster_size: 65536
Format specific information:
    compat: 0.10
    compression type: zlib
    refcount bits: 16

# strace -f -tt -T -o convert.strace qemu-img convert -n -f qcow2 -O raw -t none -T none -W /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b /dev/ceph-iscsi-vg/test

Command succeeded, but:

# qemu-img compare /var/tmp/localvm_ee80jvz/images/edf95fe8-29cc-4f85-85ff-de4ad2dfa4f6/c4143adf-1131-4318-8c31-71c299ef085b /dev/ceph-iscsi-vg/test
Content mismatch at offset 10457088!

Looking in the trace, offset 10457088 was:

890360 14:43:19.875720 fallocate(10, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 10457088, 28672 <unfinished ...>
...
890360 14:43:19.924785 <... fallocate resumed>) = 0 <0.049053>

So we expect to find zeroes in this range, but:

# dd if=/dev/ceph-iscsi-vg/test bs=28672 count=1 skip=10457088 iflag=direct,skip_bytes of=data
1+0 records in
1+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.00149407 s, 19.2 MB/s

# hexdump data | head
0000000 7022 3076 2c22 3220 5d0a 7d0a 7d0a 0a0a
0000010 756f 6274 786f 7b20 690a 2064 203d 7722
0000020 7058 4f79 2d35 7174 4e65 6a2d 494d 2d36
0000030 526a 7030 322d 5977 2d4e 344b 6266 712d
0000040 7733 6767 2247 730a 6174 7574 2073 203d
0000050 225b 4552 4441 2c22 2220 5257 5449 2245
0000060 202c 5622 5349 4249 454c 5d22 660a 616c
0000070 7367 3d20 5b20 0a5d 7263 6165 6974 6e6f
0000080 745f 6d69 2065 203d 3631 3031 3539 3133
0000090 3338 630a 6572 7461 6f69 5f6e 6f68 7473

# head data
"pv0", 2
]
}
}

outbox {
id = "wXpyO5-tqeN-jMI6-jR0p-2wYN-K4fb-q3wggG"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
creation_time = 1610953183

This is leftover LVM metadata from a previous user of this LUN.
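
The dd and hexdump check above can also be scripted. This is a minimal sketch;
the function name first_nonzero is hypothetical, and unlike the dd command it
reads through the page cache rather than with O_DIRECT.

```python
def first_nonzero(path, offset, length, chunk=1 << 20):
    """Return the absolute offset of the first non-zero byte in
    [offset, offset + length), or None if the whole range reads as zeroes.

    Hypothetical helper; uses buffered reads, not O_DIRECT.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        pos = offset
        remaining = length
        while remaining > 0:
            buf = f.read(min(chunk, remaining))
            if not buf:
                raise EOFError("range extends past the end of %s" % path)
            # Bytes left after stripping leading zeroes start at the first
            # non-zero byte of this chunk.
            rest = buf.lstrip(b"\0")
            if rest:
                return pos + len(buf) - len(rest)
            pos += len(buf)
            remaining -= len(buf)
    return None
```

After a successful punch of the range from the trace,
first_nonzero("/dev/ceph-iscsi-vg/test", 10457088, 28672) would be expected to
return None.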

Actual results:
fallocate() succeeds, range not zeroed.

Expected results:
fallocate() succeeds, range zeroed.

Additional info:

A minimal reproduction:

# dd if=/dev/zero bs=1M count=4 | tr "\0" "x" > /dev/ceph-iscsi-vg/test
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00563928 s, 744 MB/s

# sync

# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

# fallocate --length=28672 --offset=1m --punch-hole /dev/ceph-iscsi-vg/test; echo $?
0

Command succeeded, but:

[root@oncilla04 ~]# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

No change in storage!

If the length is aligned to 1M, storage is zeroed:

# fallocate --length=1m --offset=28672 --punch-hole /dev/ceph-iscsi-vg/test; echo $?
0

# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0010000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

So the issue seems to be some magic length alignment. If the length is not
aligned, the call succeeds without modifying storage.
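
One possible reading of the two outputs above is that only whole blocks of some
granularity are discarded, while the unaligned head and tail of the request are
skipped instead of being written with zeroes. The sketch below splits a request
into head, aligned body and tail. The 64 KiB granularity is an assumption
inferred from the hexdump (zeroes from 0x10000 to 0x100000 for a request
covering 0x7000 to 0x107000); it is not confirmed anywhere in this report, and
split_range is a hypothetical helper.

```python
# Assumption inferred from the hexdump output, not confirmed by the report.
ASSUMED_GRANULARITY = 64 * 1024


def split_range(offset, length, granularity=ASSUMED_GRANULARITY):
    """Split [offset, offset + length) into (head, body, tail).

    Each part is a (start, end) tuple or None. The body is the largest
    sub-range aligned to the granularity on both ends; head and tail are
    the unaligned remainders that need explicit zero writes.
    """
    start, end = offset, offset + length
    body_start = -(-start // granularity) * granularity  # round up
    body_end = (end // granularity) * granularity        # round down
    if body_start >= body_end:
        # No whole block inside the request: everything must be zeroed
        # explicitly, nothing can simply be discarded.
        return (start, end), None, None
    head = (start, body_start) if start < body_start else None
    tail = (body_end, end) if body_end < end else None
    return head, (body_start, body_end), tail
```

Under that assumption, the 28672-byte requests at offset 1m and at offset
10457088 have no aligned body at all, which would match a call that succeeds
without modifying storage.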

Applications affected by this:
- fallocate
- qemu-img
- qemu-kvm
- nbdkit file plugin
- nbdcopy

Comment 8 Gopi 2021-06-11 06:23:38 UTC
Working as expected. Moving to verified state.

Steps:
[root@magna108 ubuntu]# multipath -ll
.
.
.
Hitachi_HUA722010CLA330_JPW9K0N2098ZLE dm-1 ATA,Hitachi HUA72201
size=932G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 2:0:0:0 sdc 8:32 active ready running
3600140574549df1b7b54743aa612198d dm-6 LIO-ORG,TCMU device
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=50 status=active
| `- 7:0:0:0 sdf 8:80 active ready running
`-+- policy='queue-length 0' prio=10 status=enabled
  `- 6:0:0:0 sde 8:64 active ready running 
[root@magna108 ubuntu]# vgcreate ceph-iscsi-vg /dev/mapper/3600140574549df1b7b54743aa612198d 
WARNING: ext4 signature detected on /dev/mapper/3600140574549df1b7b54743aa612198d at offset 1080. Wipe it? [y/n]: y
  Wiping ext4 signature on /dev/mapper/3600140574549df1b7b54743aa612198d.
  Physical volume "/dev/mapper/3600140574549df1b7b54743aa612198d" successfully created.
  Volume group "ceph-iscsi-vg" successfully created

[root@magna108 ubuntu]# lvcreate --name test --size 9g ceph-iscsi-vg
  Logical volume "test" created.

[root@magna108 ubuntu]# dd if=/dev/zero bs=1M count=4 | tr "\0" "x" > /dev/ceph-iscsi-vg/test
4+0 records in
4+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00857095 s, 489 MB/s

[root@magna108 ubuntu]# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000

[root@magna108 ubuntu]# fallocate --length=28672 --offset=1m --punch-hole /dev/ceph-iscsi-vg/test; echo $?
0

[root@magna108 ubuntu]# dd if=/dev/ceph-iscsi-vg/test bs=1M count=4 iflag=direct status=none | hexdump
0000000 7878 7878 7878 7878 7878 7878 7878 7878
*
0100000 0000 0000 0000 0000 0000 0000 0000 0000
*
0107000 7878 7878 7878 7878 7878 7878 7878 7878
*
0400000
[root@magna108 ubuntu]#
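
The verification output can also be checked mechanically. This is a rough
sketch, assuming the default hexdump format shown above (hex offset, 16-bit
words, `*` for repeated lines, and a final line holding only the end offset);
zero_extents is a hypothetical helper.

```python
def zero_extents(dump):
    """Return a list of (start, end) byte ranges that hexdump shows as zero.

    Each data line covers the bytes from its own offset up to the offset of
    the next printed line, because `*` means the previous line repeats.
    """
    rows = []
    for line in dump.splitlines():
        fields = line.split()
        if not fields or fields[0] == "*":
            continue
        rows.append((int(fields[0], 16), fields[1:]))

    extents = []
    for (start, words), (end, _) in zip(rows, rows[1:]):
        if words and all(set(w) == {"0"} for w in words):
            if extents and extents[-1][1] == start:
                # Merge with the directly preceding zero extent.
                extents[-1] = (extents[-1][0], end)
            else:
                extents.append((start, end))
    return extents
```

For the dump above it yields a single extent from 0x100000 to 0x107000, which
is exactly the 28672 bytes requested at offset 1m.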

Comment 10 errata-xmlrpc 2021-08-30 08:28:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294