Bug 1954025 - Disk errors while scaling up a node with multipathing enabled
Summary: Disk errors while scaling up a node with multipathing enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Jonathan Lebon
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1962850
Blocks:
 
Reported: 2021-04-27 13:13 UTC by Apoorva Jagtap
Modified: 2021-09-29 13:58 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Multipath was enabled too late in the boot process. Consequence: RHCOS would hit I/O errors in some multipath environments. Fix: Multipath is now enabled earlier in the boot process. Result: RHCOS no longer hits I/O errors in some multipath environments.
Clone Of:
Environment:
Last Closed: 2021-07-27 23:04:05 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
GitHub coreos/fedora-coreos-config pull 1011 (open): Add support for multipath on firstboot (last updated 2021-05-13 15:35:57 UTC)
GitHub openshift/os pull 559 (open): Bump fedora-coreos-config to latest rhcos-4.8 commit (last updated 2021-06-01 19:27:07 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:04:22 UTC)

Description Apoorva Jagtap 2021-04-27 13:13:44 UTC
OCP Version at Install Time: v4.7.0
RHCOS Version at Install Time: rhcos-47.83.202102090044-0
Platform: bare metal

What are you trying to do? What is your use case?

- Scaling up worker nodes in an OCP cluster (platform-agnostic install), where the workers are bare-metal nodes (Cisco UCS servers with SAN multipath).

What happened? What went wrong or what did you expect?

- After enabling multipathing, when the machine boots with the coreos kernel parameters, the coreos-installer service fails while initially reading the image and writing it back to the disk, with the errors shown in [1]:
[Edit]: Multipathing is being enabled as part of Day-2 operations (using a machine config) and not via kernel command-line parameters.
[1]
~~~
[ 4905.276845] Buffer I/O error on dev sde, logical block 826485, lost async page write
[ 4905.285754] Buffer I/O error on dev sde, logical block 826486, lost async page write
[ 4905.294674] Buffer I/O error on dev sde, logical block 826487, lost async page write
[ 4905.303622] blk_update_request: I/O error, dev sde, sector 6596160 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.315884] blk_update_request: I/O error, dev sde, sector 6610288 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.328142] blk_update_request: I/O error, dev sde, sector 6614000 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.340401] blk_update_request: I/O error, dev sde, sector 6611952 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.352657] blk_update_request: I/O error, dev sde, sector 6612080 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.364915] blk_update_request: I/O error, dev sde, sector 6612208 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.377171] blk_update_request: I/O error, dev sde, sector 6612336 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.389447] blk_update_request: I/O error, dev sde, sector 6605424 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 4905.401692] blk_update_request: I/O error, dev sde, sector 6605936 op 0x1:(WRITE) flags 0x4800 phys_seg 1 prio class 0
[ 5117.424531] blk_update_request: I/O error, dev sde, sector 6939488 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 
~~~

Attaching the latest serial console logs for reference. The customer has tried multiple methods to boot the node with multipathing enabled, but encounters similar issues each time.

Comment 2 Jonathan Lebon 2021-04-27 14:47:49 UTC
Multipath should not be turned on at install time. It must be enabled via machine config as documented (https://github.com/openshift/openshift-docs/pull/30697 adds more clarity on this point). So you can drop the rd.driver.pre=dm_multipath kernel argument (which AFAICT doesn't actually turn on multipath because multipathd doesn't do anything if /etc/multipath.conf is missing).
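
For reference, a minimal sketch of such a machine config, using the kernel arguments that appear later in this bug (comment 37); the object name and role label here are illustrative, not taken from this bug:

~~~
# Hypothetical day-2 MachineConfig that enables multipath via kernel
# arguments; adjust the role label and name for your cluster.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-enable-multipath
spec:
  kernelArguments:
    - rd.multipath=default
    - root=/dev/disk/by-label/dm-mpath-root
EOF
~~~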

Is /dev/sde the only path in the set? If not, can you try another path instead? Is it possible that /dev/sde is not an active path?
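
(As a sketch, assuming the multipath tools are available on the node, the path topology and per-path state can be inspected with:)

~~~
# Show the multipath topology: path groups, priorities, and per-path state
sudo multipath -ll
# List individual paths with their checker state, if multipathd is running
sudo multipathd show paths
~~~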

Comment 7 Jonathan Lebon 2021-04-29 18:32:37 UTC
Had a chat with Benjamin Marzinski about this issue, and got some feedback:
1. What RHCOS is doing is completely fine (installing in non-multipath mode first and then turning it on later)
2. The errors seem to come from the underlying storage device, so even if multipath *were* enabled, it wouldn't have retried the request on a different path.

So IOW at this point, this seems to have nothing to do with multipath at all and is more related to the SCSI device(s) themselves.

I'm not sure why RHEL would install fine but not RHCOS. I wonder if it just manifests itself more easily on RHCOS because of the heavy I/O from OCP.

Comment 8 Jon Magrini 2021-04-29 18:45:33 UTC
Reviewed the attached case and agree with comment #7. From the system logs, commands are timing out on the storage device:
---
sd 1:0:2:0: [sdc] tag#178 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
sd 1:0:2:0: [sdc] tag#178 CDB: Read(10) 28 00 04 3a 3d 58 00 00 08 00
blk_update_request: I/O error, dev sdc, sector 70925656 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0

Looking further into one of the sosreports attached to the case, I see that device sdc is an active/non-optimized path:

sdc: sysfs prio = 10

3600a098038304241465d4b67502d3455 dm-0 NETAPP,LUN C-Mode
size=200G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:0 sda 8:0  active ready running
| `- 1:0:1:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:2:0 sdc 8:32 active ready running
  `- 1:0:3:0 sdd 8:48 active ready running


I think point #1 is causing the issue: since paths aren't yet properly grouped, it seems I/O is going to a non-optimized path and not getting serviced within the command timeout. Sending the I/O to a non-optimized path may be causing a trespass scenario that delays the completion state.
---
1. What RHCOS is doing is completely fine (installing in non-multipath mode first and then turning it on later)

Comment 9 Jon Magrini 2021-04-29 19:02:34 UTC
Can we have the customer choose one of these two paths at install time:

E: DEVLINKS=/dev/disk/by-path/fc-0x20000025b502009e-0x205000a09841d273-lun-0 /dev/disk/by-id/wwn-0x600a098038304241465d4b67502d3455 /dev/disk/by-id/scsi-3600a098038304241465d4b67502d3455 /dev/disk/by-path/pci-0000:86:00.0-fc-0x205000a09841d273-lun-0
E: DEVNAME=/dev/sda

E: DEVLINKS=/dev/disk/by-path/fc-0x20000025b502009e-0x204e00a09841d273-lun-0 /dev/disk/by-path/pci-0000:86:00.0-fc-0x204e00a09841d273-lun-0 /dev/disk/by-id/scsi-3600a098038304241465d4b67502d3455 /dev/disk/by-id/wwn-0x600a098038304241465d4b67502d3455
E: DEVNAME=/dev/sdb


And *not* one of these (which are the non-optimized paths):

E: DEVLINKS=/dev/disk/by-id/wwn-0x600a098038304241465d4b67502d3455 /dev/disk/by-path/pci-0000:86:00.0-fc-0x205400a09841d273-lun-0 /dev/disk/by-path/fc-0x20000025b502009e-0x205400a09841d273-lun-0 /dev/disk/by-id/scsi-3600a098038304241465d4b67502d3455
E: DEVNAME=/dev/sdc

E: DEVLINKS=/dev/disk/by-id/scsi-3600a098038304241465d4b67502d3455 /dev/disk/by-path/fc-0x20000025b502009e-0x205200a09841d273-lun-0 /dev/disk/by-id/wwn-0x600a098038304241465d4b67502d3455 /dev/disk/by-path/pci-0000:86:00.0-fc-0x205200a09841d273-lun-0
E: DEVNAME=/dev/sdd
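
For context, per-device output like the above can be gathered with udevadm; a sketch, with /dev/sda as an example device node:

~~~
# Print udev properties for a path device; DEVLINKS lists the stable
# by-path/by-id aliases and DEVNAME the kernel device name
udevadm info --query=property /dev/sda | grep -E '^(DEVLINKS|DEVNAME)='
~~~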

Comment 10 Ben Marzinski 2021-04-29 19:27:05 UTC
(In reply to Jonathan Lebon from comment #7)
> Had a chat with Benjamin Marzinski about this issue, and got some feedback:
> 1. What RHCOS is doing is completely fine (installing in non-multipath mode
> first and then turning it on later)
> 2. The errors seem to come from the underlying storage device, so even if
> multipath *were* enabled, it wouldn't have retried the request on a
> different path.

Looking into the kernel code, this error would have been retried on other paths, if multipath was running on top of this device, but it wasn't. Multipath doesn't appear to be configured to run (multipathd does start in the initramfs, but there is no /etc/multipath.conf so it does nothing, and multipathd doesn't appear to get started after the pivot-root). So perhaps the issue is that multipath is not running, and non-optimized path appeared first and was used by the system, instead of the optimized path. In this case, configuring multipath so that it claims this device before the rest of the system starts using it may solve the issue.
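
For illustration, on a plain RHEL system the missing configuration would typically be created with mpathconf so multipathd can claim the device early; a sketch only, since on RHCOS the supported route is a machine config:

~~~
# Generate a default /etc/multipath.conf and enable multipathd
sudo mpathconf --enable --with_multipathd y
# Verify the multipath device has claimed the underlying paths
sudo multipath -ll
~~~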

> So IOW at this point, this seems to have nothing to do with multipath at all
> and is more related to the SCSI device(s) themselves.
> 
> I'm not sure why RHEL would install fine but not RHCOS. I wonder if it just
> manifests itself more easily on RHCOS because of the heavy I/O from OCP.

Comment 11 Jonathan Lebon 2021-04-29 19:55:18 UTC
Thanks all,

At this point it seems like we need to support multipath right from the first boot in order to support it correctly. I'll look at reviving the patchset at https://github.com/coreos/fedora-coreos-config/pull/815 and backporting it to 4.7.

Comment 12 Derrick Ornelas 2021-04-29 20:23:37 UTC
I just finished discussing this with Jon and Jonathan.  Essentially, without multipath enabled the user needs to identify and select the correct path for their storage configuration to install on.  Device names/paths are not guaranteed between reboots, and in this particular instance with this particular SAN configuration sda & sdb are the usable paths, whereas sdc & sdd are not.  (Sorry for the oversimplification.)

The fact that device/path names are not guaranteed actually helps explain why the results from the various tests are so inconsistent/confusing.  Installing to a non-optimized path would result in errors when running coreos-installer.  Installing RHEL 8 most likely worked because multipath support is enabled in anaconda and the initial install target was the multipath device.

Even once a usable [optimized] path is chosen as the install target, upon reboot there is still an issue of which device/path will be used as root during first boot.  There will be a race between all of the underlying paths, and if one of the unusable [non-optimized] paths is chosen the system will fail to boot.  I believe this explains some of the random boot failures that were seen during testing.  


We see a couple of short-term solutions here:

1.  Unshare the non-optimized paths to these systems.  This eliminates the possibility that they will be used as the install and/or boot device.  A node can be provisioned as expected and end up with a "day-2" multipath configuration without hitting the above issues.  At any point after that the non-optimized paths can be shared/exported back to the systems.

2.  Boot the Live installer with 'rd.multipath=default', which enables multipathd in the live install environment (see the sketch below).  Then the multipath device can be used as the install target without worrying about which individual path to use.  Upon reboot/first boot, multipath is not enabled (as expected), so the chance of booting off one of the non-optimized paths is still there.  If that happens, rebooting the system again so that it eventually boots off an optimized path should work.  At that point, first boot should continue as normal.
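
Roughly, option 2 might look like the following; a sketch, where the ISO name, Ignition URL, and target device are placeholders, and `coreos-installer iso kargs modify` assumes a reasonably recent coreos-installer:

~~~
# Embed the karg in the live ISO so multipathd runs in the install environment
# (alternatively, append rd.multipath=default by hand at the boot menu)
coreos-installer iso kargs modify --append rd.multipath=default rhcos-live.x86_64.iso
# From the booted live environment, install to the assembled multipath device
sudo coreos-installer install /dev/mapper/mpatha --ignition-url=<url>
~~~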


Longer term, as Jonathan said, we'll need to look at adding support for installing onto a multipath device from the start.

Comment 21 Jonathan Lebon 2021-05-26 15:35:55 UTC
Hi Jatan,

We're working on fixing multipath support in RHCOS. Would the customers be open to testing a 4.8 bootimage scratch build?

Comment 33 Jonathan Lebon 2021-06-07 14:18:45 UTC
Shouldn't have moved this to MODIFIED. Moving back to POST. The bot will pick it up once it's referenced in the bootimage bump PR.

Comment 34 Micah Abbott 2021-06-28 16:00:19 UTC
The associated boot image bump was merged and the BZ was verified.  This can be included as part of OCP 4.8 GA.

Comment 37 Michael Nguyen 2021-07-03 03:43:04 UTC
Verified on RHCOS 48.84.202106091622-0

Enabled multipath day-1 with ISO

qemu-img create -f qcow2 -o nocow=on /tmp/disk765592878 20G
qemu-nbd --format qcow2 --cache unsafe --discard unmap --socket /tmp/disk765592878.socket --share 2 /tmp/disk765592878 &
qemu-system-x86_64 -machine accel=kvm -cpu host -m 4096 -smp 1 \
  -object rng-random,filename=/dev/urandom,id=rng0 \
  -device virtio-rng-pci,rng=rng0 \
  -nographic \
  -drive file=/var/lib/libvirt/images/rhah/rhcos-48.84.202106091622-0-live.x86_64.iso,format=raw,if=none,readonly=on,id=installiso \
  -device ide-cd,drive=installiso,bootindex=3 \
  -device virtio-scsi-pci,id=scsi_mpath10 \
  -device scsi-hd,bus=scsi_mpath10.0,drive=mpath10,vendor=NVME,product=VirtualMultipath,wwn=17603150149934919455,serial=primary-disk,bootindex=1 \
  -drive if=none,id=mpath10,format=raw,file=nbd:unix:/tmp/disk765592878.socket,media=disk,auto-read-only=off,cache=unsafe \
  -device virtio-scsi-pci,id=scsi_mpath11 \
  -device scsi-hd,bus=scsi_mpath11.0,drive=mpath11,vendor=NVME,product=VirtualMultipath,wwn=17603150149934919455,serial=primary-disk,bootindex=2 \
  -drive if=none,id=mpath11,format=raw,file=nbd:unix:/tmp/disk765592878.socket,media=disk,auto-read-only=off,cache=unsafe

boot into live-iso
sudo coreos-installer install /dev/sda --ignition-url=<url> --append-karg rd.multipath=default --append-karg root=/dev/disk/by-label/dm-mpath-root
systemctl reboot
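
After the reboot, something like the following should confirm the kargs took effect and that the root filesystem sits on the multipath device; a sketch, not taken from the verification run above:

~~~
# Both kargs should be present on the booted system
grep -o 'rd.multipath=default' /proc/cmdline
grep -o 'root=/dev/disk/by-label/dm-mpath-root' /proc/cmdline
# The real root (/sysroot on RHCOS) should be backed by a device-mapper device
findmnt -no SOURCE /sysroot
sudo multipath -ll
~~~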

Comment 38 Michael Nguyen 2021-07-03 03:44:05 UTC
RHCOS 48.84.202106091622-0 is the boot image for registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-07-02-143616

Comment 40 errata-xmlrpc 2021-07-27 23:04:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

