Hi: since kernel 5.3 the driver has changed its behavior, so the disk order may now be unpredictable. https://lore.kernel.org/lkml/59eedd28-25d4-7899-7c3c-89fe7fdd4b43@acm.org/t/#m6d134a012823377bb2ce91ea2350e8be9200ff91

SUSE added a kernel parameter for this (https://www.suse.com/support/kb/doc/?id=000018449), but the SUSE parameter does not work under RHEL 9. I tried a similar parameter, "scsi_mod.scan=sync". It achieves roughly 90% stability on one 3-disk system, but on another 16-disk system that parameter is completely useless.

We need a consistent disk order to apply commands like "smartctl -l scterc,70,70 /dev/sda". It is also convenient when a disk fails and needs to be hot-swapped; otherwise we have to check /dev/disk/by-path to confirm the disk's physical location.
Are you able to use the persistent names in /dev/disk/by-id/xxxx or /dev/disk/by-path/xxx for your application? Generally speaking we do not (and never have) guaranteed disk probe ordering, i.e. sda, sdb, etc. The reason is that it may work in some simple environments, e.g. local disks where the drivers are probed in the same order, but it does not always work, e.g. for SAN-attached devices. Can you provide log files from the 16-disk system showing (A) the desired order and (B) a boot where a different order resulted even though "scsi_mod.scan=sync" was used?
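For what it's worth, the kind of command you mention can be pointed at a persistent name instead of a probe-order name. A minimal sketch (the WWN below is a made-up placeholder, and the by-path name is only an example of the slot-based names):

# by-id names survive reboots and re-enumeration, so the command always
# reaches the same physical disk regardless of which sdX letter it received:
smartctl -l scterc,70,70 /dev/disk/by-id/wwn-0x5000c500a1b2c3d4

# A by-path symlink can also be resolved to whatever sdX name it got this boot:
readlink -f /dev/disk/by-path/pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy12-lun-0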
Hi: /dev/disk/by-id and /dev/disk/by-path are not good for daily use. For example, with the software RAID below, they would produce an unreadable list without the simple disk names. Even the Anaconda installer needs these simple disk names.

[root@love-2 by-path]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid5 sdf1[1] sdg1[3] sdj1[6] sdc1[0] sdk1[8] sdh1[4] sdp1[5] sdi1[7]
      27348205248 blocks super 1.2 level 5, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/30 pages [0KB], 65536KB chunk

md3 : active raid5 sdl1[0] sdn1[2] sdq1[5] sdm1[1] sdo1[3]
      31255572480 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 3/59 pages [12KB], 65536KB chunk

md1 : active raid6 sda3[4] sdb3[1] sdd3[2] sde3[3]
      7813154432 blocks super 1.2 level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 4/30 pages [16KB], 65536KB chunk

md0 : active raid1 sda2[4] sdd2[2] sdb2[1] sde2[3]
      308160 blocks super 1.0 [4/4] [UUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

Below is the disk order with the parameter "scsi_mod.scan=sync". This is a 32-bay enclosure with two SAS expanders (front and back) and a single SAS IO card; 16 disks are installed at the moment. Most systems get consistent names if the disk names (sda, sdb, ...) follow the physical order (e.g. the by-path name order). That was the logic under RHEL 7/8.

[root@love-2 by-path]# ls -l
total 0
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy12-lun-0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy12-lun-0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy12-lun-0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy12-lun-0-part3 -> ../../sda3
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy13-lun-0 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy13-lun-0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy13-lun-0-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e4f53f-phy13-lun-0-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy12-lun-0 -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy12-lun-0-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy12-lun-0-part2 -> ../../sdd2
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy12-lun-0-part3 -> ../../sdd3
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy13-lun-0 -> ../../sde
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy13-lun-0-part1 -> ../../sde1
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy13-lun-0-part2 -> ../../sde2
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy13-lun-0-part3 -> ../../sde3
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy14-lun-0 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy14-lun-0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy15-lun-0 -> ../../sdf
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy15-lun-0-part1 -> ../../sdf1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy16-lun-0 -> ../../sdg
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy16-lun-0-part1 -> ../../sdg1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy17-lun-0 -> ../../sdh
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy17-lun-0-part1 -> ../../sdh1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy18-lun-0 -> ../../sdi
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy18-lun-0-part1 -> ../../sdi1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy19-lun-0 -> ../../sdj
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy19-lun-0-part1 -> ../../sdj1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy20-lun-0 -> ../../sdp
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy20-lun-0-part1 -> ../../sdp1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy21-lun-0 -> ../../sdk
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy21-lun-0-part1 -> ../../sdk1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy22-lun-0 -> ../../sdl
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy22-lun-0-part1 -> ../../sdl1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy23-lun-0 -> ../../sdm
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy23-lun-0-part1 -> ../../sdm1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy24-lun-0 -> ../../sdn
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy24-lun-0-part1 -> ../../sdn1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy25-lun-0 -> ../../sdo
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy25-lun-0-part1 -> ../../sdo1
lrwxrwxrwx 1 root root 9 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy26-lun-0 -> ../../sdq
lrwxrwxrwx 1 root root 10 Jan 13 00:30 pci-0000:03:00.0-sas-exp0x5003048017e9c33f-phy26-lun-0-part1 -> ../../sdq1
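Finding which physical slot a RAID member such as sdf currently occupies means cross-referencing the listing above every time, e.g. with something like:

# Which by-path slot does the current sdf belong to?
ls -l /dev/disk/by-path/ | grep 'sdf$'

which is cumbersome, especially once a disk has already failed and needs to be located for hot-swap.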
Hi: will the new kernel parameter "<modulename>.async_probe" help in this situation? I saw this post: https://lore.kernel.org/lkml/Yr9yCMsB1HJ1NEuF.org/T/ I don't know whether "sd.async_probe=0" would make the disk order consistent again, assuming a newer kernel supports the parameter.
Unfortunately I believe that parameter can only be used to force async probing for modules that did not specify it in the driver template. If the driver (like sd) specifies async probing, the parameter will not override that. I'll check to make sure, but from code inspection that appears to be the case. The intent of the kernel developer community was to eventually make everything async. I'm looking into whether I can get away with adding an sd_mod parameter, but this would likely not be accepted upstream, which means I would need to justify a RHEL-only change which we would have to carry forward in future versions.
static bool driver_allows_async_probing(struct device_driver *drv)
{
	switch (drv->probe_type) {
	case PROBE_PREFER_ASYNCHRONOUS:	/* <== sd_template.probe_type is PROBE_PREFER_ASYNCHRONOUS,
					 * so this returns true and the code does not even check
					 * the module parameter */
		return true;

	case PROBE_FORCE_SYNCHRONOUS:
		return false;

	default:
		if (cmdline_requested_async_probing(drv->name))
			return true;

		if (module_requested_async_probing(drv->owner))
			return true;

		return false;
	}
}

static struct scsi_driver sd_template = {
	.gendrv = {
		.name		= "sd",
		.owner		= THIS_MODULE,
		.probe		= sd_probe,
		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,	/* <=== */
		.remove		= sd_remove,
		.shutdown	= sd_shutdown,
		.pm		= &sd_pm_ops,
	},
	.rescan			= sd_rescan,
	.init_command		= sd_init_command,
	.uninit_command		= sd_uninit_command,
	.done			= sd_done,
	.eh_action		= sd_eh_action,
	.eh_reset		= sd_eh_reset,
};
Something like this seems to work. It makes the entire sd probe path synchronous though, not just the first portion with the minor # allocation. Unlike earlier kernels there would be no overlap of all the INQUIRY, READ CAPACITY, etc. commands that have to be issued for each device. Might be OK for a small number of local devices though.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 441e73c7265c..b78ab120903d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -130,6 +130,15 @@ static const char *sd_cache_types[] = {
 	"write back, no read (daft)"
 };
 
+static const char *sd_probe_types[] = { "async", "sync" };
+
+static char sd_probe_type[6] = "async";
+module_param_string(probe, sd_probe_type, sizeof(sd_probe_type),
+		    S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(probe, "async or sync. Setting to 'sync' disables asynchronous "
+		 "device number assignments (sda, sdb, ...).");
+
 static void sd_set_flush_flag(struct scsi_disk *sdkp)
 {
 	bool wc = false, fua = false;
@@ -3842,6 +3850,8 @@ static int __init init_sd(void)
 		goto err_out_cache;
 	}
 
+	if (!strcmp(sd_probe_type, "sync"))
+		sd_template.gendrv.probe_type = PROBE_FORCE_SYNCHRONOUS;
 	err = scsi_register_driver(&sd_template.gendrv);
 	if (err)
 		goto err_out_driver;
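On a kernel carrying this patch, the parameter could be set either on the kernel command line or through modprobe configuration. A sketch, assuming the patched sd_mod above (which exposes a "probe" parameter) and that sd is built as a module:

# Persist the option on the kernel command line for all installed kernels:
grubby --update-kernel=ALL --args="sd_mod.probe=sync"

# Or set it via modprobe.d and rebuild the initramfs so it takes effect
# during early boot (dracut normally picks up /etc/modprobe.d):
echo "options sd_mod probe=sync" > /etc/modprobe.d/sd_mod.conf
dracut -f

# After a reboot, confirm the setting took effect:
cat /sys/module/sd_mod/parameters/probe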
Thanks a lot for your effort! I wonder if people no longer need simple device names at all. I can live with "eth0" becoming "eno1" or even "enp1s0", but "sda" becoming "pci-0000:07:00.0-sas-phy0-lun-0" seems too much. We still need a simple, consistent name when doing things like software RAID or SMART error monitoring.
I'm asking to increase the priority here because this behavior results in a very dangerous situation IMHO. Please see what happened in my case, and I'm quite sure I'm not alone: https://lists.centos.org/pipermail/centos/2023-March/896737.html

The resulting devices, in the case of disks behind an HPE Smart Array controller, look something like this, but change with every reboot:

/dev/disk/by-id/scsi-0HP_LOGICAL_VOLUME_00000000 -> ../../sda
/dev/disk/by-id/scsi-0HP_LOGICAL_VOLUME_00000000-part1 -> ../../sdb1
/dev/disk/by-id/scsi-0HP_LOGICAL_VOLUME_00000000-part2 -> ../../sdb2

Imagine what happens if you want to wipe the two partitions on /dev/disk/by-id/scsi-0HP_LOGICAL_VOLUME_00000000... That's really a critical thing which should not be possible.

I'm still not exactly sure where the issue comes from. It may be that 'sg3_utils' and 'dracut' are also involved here. At least I found these bugs in them:

sg3_utils: the file /usr/lib/udev/rules.d/65-scsi-cciss_id.rules calls 'cciss_id', but the program it calls is not shipped in the RPM. The spec file patch below should fix this:

--- sg3_utils.spec.orig	2022-06-15 14:03:29.000000000 +0200
+++ sg3_utils.spec	2023-03-01 10:38:54.691384321 +0100
@@ -102,6 +102,7 @@
 # need to run after 62-multipath.rules
 install -p -m 644 scripts/58-scsi-sg3_symlink.rules $RPM_BUILD_ROOT%{_udevrulesdir}/63-scsi-sg3_symlink.rules
 install -p -m 644 scripts/59-scsi-cciss_id.rules $RPM_BUILD_ROOT%{_udevrulesdir}/65-scsi-cciss_id.rules
+install -p -m 755 scripts/cciss_id $RPM_BUILD_ROOT%{_udevlibdir}
 install -p -m 644 scripts/59-fc-wwpn-id.rules $RPM_BUILD_ROOT%{_udevrulesdir}/63-fc-wwpn-id.rules
 install -p -m 755 scripts/fc_wwpn_id $RPM_BUILD_ROOT%{_udevlibdir}
@@ -113,6 +114,7 @@
 %{_udevrulesdir}/63-scsi-sg3_symlink.rules
 %{_udevrulesdir}/63-fc-wwpn-id.rules
 %{_udevrulesdir}/65-scsi-cciss_id.rules
+%{_udevlibdir}/cciss_id
 %{_udevrulesdir}/40-usb-blacklist.rules
 %{_udevlibdir}/fc_wwpn_id

dracut: the file /usr/lib/dracut/modules.d/95udev-rules/module-setup.sh makes use of the following udev rules:

55-scsi-sg3_id.rules
58-scsi-sg3_symlink.rules

But these files are renamed in EL9 to:

61-scsi-sg3_id.rules
63-scsi-sg3_symlink.rules

I hope some of my input is helpful to fix the issue. Unfortunately the server I've used to test things won't be available for more tests soon.

Regards,
Simon
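If anyone wants to verify the two packaging problems described above, a few quick checks should confirm them on an EL9 host (a sketch using the package and file names from the comment above):

# Does the sg3_utils package ship the cciss_id helper that the rule calls?
rpm -ql sg3_utils | grep cciss
grep -n cciss_id /usr/lib/udev/rules.d/65-scsi-cciss_id.rules

# Which sg3 rule file names does the dracut module reference, versus what is actually installed?
grep -n scsi-sg3 /usr/lib/dracut/modules.d/95udev-rules/module-setup.sh
ls /usr/lib/udev/rules.d/ | grep scsi-sg3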
I discussed this issue with the other upstream Linux SCSI maintainers during the Linux Foundation LSF/MM conference last week. James and Martin will not accept a kernel patch to allow the sd device probing to return to its prior synchronous behavior. James' suggestion was, as I expected, to use udev to provide some naming consistency (his example was to do what the network devices do, and make the naming persist after first boot). I pointed out that this did not solve the issue of newly added devices e.g. with scsi_add_device() calls from mpt3sas but there was no agreement that the kernel should be changed. I am pursuing a RHEL-only change for RHEL 9 now. However, that may not be accepted either, since the upstream kernel maintainers do not agree and the RHEL kernel team is trying to minimize upstream deviations.
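For reference, the udev-based approach James suggested might look something like the sketch below: pin a stable alias to a disk via a persistent attribute, so scripts never depend on the boot-time sdX letter. The WWN value and alias name here are made-up placeholders:

# Hypothetical rule: create /dev/disk/local/bay03 for the disk with this WWN on every boot.
cat > /etc/udev/rules.d/99-local-disk-alias.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_WWN}=="0x5000c500a1b2c3d4", SYMLINK+="disk/local/bay03"
EOF

# Apply without rebooting:
udevadm control --reload
udevadm trigger --subsystem-match=block

This gives stable handles for scripting, but as noted it does not restore the old sdX ordering itself.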
Thanks again! I hope RHEL can get this done like SUSE did. I have seen discussions among Arch and Debian users who also complain about this behavior, but they don't have a solution to overcome it either.
Still awaiting internal kernel team review/acceptance of RHEL-only patch.
There are several other reports of this issue, and we are in the process of merging the module parameter described in comment #6 above into RHEL 9. What I would like to do is open this BZ up to all other interested parties; is that acceptable?
Hi: Thanks a lot for the effort! Please share the information as you like. I was afraid that the patch wouldn't be accepted and that I would need to set up the disk names manually. Thanks for the great news.
I rebooted and tested 10 times; the disk order stays consistent every time.

[root@storageqe-102 ~]# uname -r
5.14.0-340.2819_935944297.el9.x86_64
[root@storageqe-102 ~]# cat /sys/module/sd_mod/parameters/probe
sync
[root@storageqe-102 ~]# (cd /dev/disk/by-path && ls -l | grep /s)
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-1 -> ../../sda
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-1.0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jul 21 00:32 pci-0000:00:17.0-ata-1.0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jul 21 00:32 pci-0000:00:17.0-ata-1.0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Jul 21 00:32 pci-0000:00:17.0-ata-1.0-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Jul 21 00:32 pci-0000:00:17.0-ata-1-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jul 21 00:32 pci-0000:00:17.0-ata-1-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Jul 21 00:32 pci-0000:00:17.0-ata-1-part3 -> ../../sda3
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-2 -> ../../sdb
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-2.0 -> ../../sdb
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-3 -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-3.0 -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-4 -> ../../sdd
lrwxrwxrwx 1 root root 9 Jul 21 00:32 pci-0000:00:17.0-ata-4.0 -> ../../sdd
[root@storageqe-102 ~]#
Rebooted and tested 10 times; the disk order stays consistent every time.

[root@storageqe-103 ~]# uname -r
5.14.0-344.el9.x86_64
[root@storageqe-103 ~]#
[root@storageqe-103 ~]# cat /sys/module/sd_mod/parameters/probe
sync
[root@storageqe-103 ~]#
[root@storageqe-103 ~]# (cd /dev/disk/by-path && ls -l | grep /s)
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-5 -> ../../sda
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-5.0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jul 30 23:12 pci-0000:00:17.0-ata-5.0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jul 30 23:12 pci-0000:00:17.0-ata-5.0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Jul 30 23:12 pci-0000:00:17.0-ata-5-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jul 30 23:12 pci-0000:00:17.0-ata-5-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-6 -> ../../sdb
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-6.0 -> ../../sdb
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-7 -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-7.0 -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-8 -> ../../sdd
lrwxrwxrwx 1 root root 9 Jul 30 23:12 pci-0000:00:17.0-ata-8.0 -> ../../sdd
[root@storageqe-103 ~]#