Bug 1310264

Summary: [RHEL7] raid1 udev/mdadm timeouts at boot with very large mdadm devices

Product: Red Hat Enterprise Linux 7
Component: kernel
Kernel sub component: Multiple Devices (MD)
Version: 7.2
Hardware: All
OS: Linux
Reporter: John Pittman <jpittman>
Assignee: XiaoNi <xni>
QA Contact: ChanghuiZhong <czhong>
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Keywords: Reopened
Target Milestone: rc
Target Release: 7.3
Doc Type: Bug Fix
Type: Bug
CC: alex.wang, bubrown, cww, guazhang, harald, heinzm, jmagrini, jshortt, loberman, ncroxon, nweddle, rune, tcleveng, xni, yizhan
Bug Blocks: 1298243, 1550088
Last Closed: 2020-06-25 03:28:34 UTC

Description John Pittman 2016-02-19 22:20:34 UTC
Created attachment 1128673 [details]
journalctl part 1

Description of problem:
With very large mdadm devices, the array is not assembled quickly enough at boot, so the device unit times out and udev settle fails.

Version-Release number of selected component (if applicable):

3.10.0-327.4.5.el7.x86_64

How reproducible:
Create a very large mdadm device.  At boot, from what I can tell, assembly takes too long and the device unit times out.

Steps to Reproduce:
1. Create an mdadm device
   - My example: ~400TB w/ 74 legs
2. Format w/ XFS and add a line to fstab
3. Allow the array to assemble, then reboot (a rough command sketch follows below)
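
A rough sketch of those steps as commands (device names, the member list, and the mount point here are placeholders, not the customer's exact layout):

# step 1: create a large raid6 array across many legs (74 partitions in this case)
mdadm --create /dev/md1 --level=6 --raid-devices=74 /dev/sd[b-z]1 ...   # list all 74 member partitions
# step 2: format with XFS and add an fstab entry
mkfs.xfs /dev/md1
echo '/dev/md1 /data xfs defaults 1 2' >> /etc/fstab
# step 3: let assembly/resync settle, then reboot
cat /proc/mdstat
reboot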

Actual results:

Boot fails with:

Feb 11 06:50:41 host systemd[1]: dev-md1.device changed dead -> tentative
Feb 11 06:50:41 host systemd[1]: Installed new job dev-md1.device/start as 2456
Feb 11 06:53:42 host systemd[1]: Job dev-md1.device/start timed out.
Feb 11 06:53:42 host systemd[1]: Job dev-md1.device/start finished, result=timeout
Feb 11 06:53:42 host systemd[1]: Timed out waiting for device /dev/md1.
Feb 11 06:53:42 host systemd[1]: Job dev-md1.device/start failed with result 'timeout'.

(1 of 2) A start job is running for dev-md1.device (51s / 1min 30s)
(1 of 2) A start job is running for dev-md1.device (51s / 1min 30s)
(2 of 2) A start job is running for udev Wait for Complete Device Initialization (49s / 2min 57s)
(2 of 2) A start job is running for udev Wait for Complete Device Initialization (49s / 2min 57s)
(2 of 2) A start job is running for udev Wait for Complete Device Initialization (50s / 2min 57s)
A start job is running for udev Wait for Complete Device Initialization (1min 57s / 2min 57s)
Failed to start udev Wait for Complete Device Initialization.
See 'systemctl status systemd-udev-settle.service' for details.
         Starting Activation of DM RAID sets...
 Started Activation of DM RAID sets.
Reached target Local File Systems.

UNIT                          LOAD      ACTIVE      SUB         DESCRIPTION
dev-md1.device                loaded    activating  tentative   /dev/md1
systemd-udev-settle.service   loaded    failed      failed      udev Wait for Complete Device Initialization

Expected results:

We should wait for the md device to complete initialization.  I found https://bugzilla.redhat.com/show_bug.cgi?id=1208680; I was going to ask there whether this issue could be resolved by the possible fix mentioned in that bug, but I did not want to crowd it in case I was incorrect.

Additional info:

The workaround for the issue is to add nofail to the fstab entry for the md device, then mount it manually post boot.  The disks are JBOD: Seagate ST6000NM0034 drives behind a Fusion MPT SAS host adapter.
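
For illustration, the workaround fstab entry looks roughly like the first line below (mount point and filesystem are placeholders, not the customer's actual values); raising the device timeout with x-systemd.device-timeout= would be an alternative we have not tested here:

# boot proceeds even if /dev/md1 is not yet assembled; mount manually afterwards
/dev/md1  /data  xfs  defaults,nofail  1 2
# alternative (untested here): give dev-md1.device more time instead of skipping it
/dev/md1  /data  xfs  defaults,x-systemd.device-timeout=10min  1 2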

I've had a bit of a hard time fully isolating things in the logs, so I will attach the full logs to the bug.  I am also still waiting on /proc/mdstat output and array_state at the moment of failure; I will update with those when I get them.

Comment 4 Jes Sorensen 2016-02-19 22:37:48 UTC
74 legs?

Second, please provide some _reasonable_ log files. The current files are > 300MB and go
back 5 days - please reduce them to something that can actually be analyzed.

Last, you talk about failure - did the array synchronize prior to the reboot, or
is it still in the process of being synchronized?

It sounds like systemd kills the script during startup because it takes too
long. How long does it take if you assemble the array on the command line?

Jes

Comment 5 John Pittman 2016-02-19 22:45:09 UTC
Thanks for looking into this Jes; I will get the information you asked for asap.

I also wanted to mention that the customer has a second environment with a 44-disk raid6 array showing the same symptoms.  I tried to recreate the issue with 45 disks but was not successful; I assume due to disk size.

For clarity, here is /proc/mdstat from the 74-disk array after boot.

md1 : active raid6 sdbl1[28] sdbn1[30] sdbr1[34] sdbu1[37] sdbd1[20] sdce1[48] sdau1[10] sdax1[13] sdaz1[15] sdbj1[26] sdbq1[33] sdbg1[23] sdbs1[35] sdca1[44] sdby1[41] sdv1[69] sds1[66] sdbf1[22] sdaw1[12] sdcc1[46] sdcf1[49] sdbx1[40] sdbi1[25] sdaq1[6] sdbk1[27] sdar1[7] sdap1[5] sdcb1[45] sdbt1[36] sdav1[11] sday1[14] sdbv1[38] sdbw1[39] sdao1[4] sdy1[72] sdbe1[21] sdba1[17] sdcg1[50] sdat1[9] sdal1[1] sdcd1[47] sdas1[8] sdbz1[42] sdam1[2] sdw1[70] sdbm1[29] sdbp1[32] sdbh1[24] sdbc1[19] sdbo1[31] sdx1[71] sdu1[68] sdbb1[18] sdt1[67] sdan1[3] sdak1[0] sdr1[65] sdz1[73] sdq1[64] sdp1[63] sdk1[58] sdj1[57] sdi1[56] sdh1[55] sdg1[54] sde1[52] sdf1[53] sdd1[51] sdo1[62] sdm1[60] sdl1[59] sdn1[61] sdc1[43] sdb1[16]
      421948071936 blocks super 1.2 level 6, 512k chunk, algorithm 2 [74/74] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU]
      bitmap: 20/44 pages [80KB], 65536KB chunk

John

Comment 7 John Pittman 2016-02-23 20:18:29 UTC
Hi Jes,

I attached a smaller log file containing only the messages from Feb 13th.  The customer is working within their next maintenance window to gather the additional data you asked for.  I will update as soon as I get the next set of data.

John

Comment 35 guazhang@redhat.com 2018-01-26 03:46:10 UTC
Hello

I reinstalled the OS and got the same test result as before.

test steps:

[root@storageqe-09 ~]# lsblk
NAME                        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                           8:0    0 931.5G  0 disk 
├─sda1                        8:1    0     1G  0 part /boot
└─sda2                        8:2    0 930.5G  0 part 
  ├─rhel_storageqe--09-root 253:0    0    50G  0 lvm  /
  ├─rhel_storageqe--09-swap 253:1    0   7.8G  0 lvm  [SWAP]
  └─rhel_storageqe--09-home 253:2    0 872.7G  0 lvm  /home
├─sdg1                        8:97   0   1.8T  0 part 
└─sdg2                        8:98   0   1.8T  0 part 
sdh                           8:112  0   3.7T  0 disk 


[root@storageqe-09 ~]# mdadm -CR /dev/md0 -l1 -n2 /dev/sdg[12] --assume-clean
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.


[root@storageqe-09 ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdg2[1] sdg1[0]
      1953376256 blocks super 1.2 [2/2] [UU]
      bitmap: 0/15 pages [0KB], 65536KB chunk



mkfs.ext4 /dev/md0 

mount /dev/md0 /mnt/bug

sdg                           8:96   0   3.7T  0 disk  
├─sdg1                        8:97   0   1.8T  0 part  
│ └─md0                       9:0    0   1.8T  0 raid1 /mnt/bug
└─sdg2                        8:98   0   1.8T  0 part  
  └─md0                       9:0    0   1.8T  0 raid1 /mnt/bug


vim /etc/fstab

/dev/md0 /mnt/bug ext4 defaults 1 2

reboot 

[ TIME ] Timed out waiting for device dev-md0.device.
[DEPEND] Dependency failed for /mnt/bug.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Relabel all filesystems, if necessary.
[DEPEND] Dependency failed for Migrate local... structure to the new structure.
[DEPEND] Dependency failed for File System Check on /dev/md0.


The system drops into emergency mode ("Welcome to emergency mode! ... Give root password for maintenance").

├─sdg1                        8:97   0   1.8T  0 part  
│ └─md127                     9:127  0   1.8T  0 raid1 
└─sdg2                        8:98   0   1.8T  0 part  
  └─md127                     9:127  0   1.8T  0 raid1 
sdh                           8:112  0   3.7T  0 disk  
[root@localhost ~]# cat /proc/mdstat 
Personalities : [raid1] 
md127 : active (auto-read-only) raid1 sdg1[0] sdg2[1]
      1953376256 blocks super 1.2 [2/2] [UU]
      bitmap: 0/15 pages [0KB], 65536KB chunk
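
(From the emergency shell the array can be brought back under its configured name by hand; a rough sketch, not part of the captured session:)

# stop the auto-assembled array, reassemble it as md0, and mount it
mdadm --stop /dev/md127
mdadm --assemble /dev/md0 /dev/sdg1 /dev/sdg2
mount /dev/md0 /mnt/bug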

Comment 36 loberman 2018-01-26 03:54:14 UTC
Awesome job Guazhang
Thank you.
It's reproduced.

Xiao, as you can see, this is now very easy to reproduce.
Please can we now make progress?
Perhaps Nigel Croxon can also help.

Regards
Laurence

Comment 46 Jon Magrini 2018-03-13 21:23:26 UTC
I am able to reproduce this issue at will: 

# mdadm --examine --scan --verbose
ARRAY /dev/md/0  level=raid1 metadata=1.2 num-devices=2 UUID=91d394e5:23b21c73:0715f5f2:8e9a5607 name=0
   devices=/dev/sdd,/dev/sdc

# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdd[1] sdc[0]
      1953383488 blocks super 1.2 [2/2] [UU]
      [======>..............]  resync = 31.3% (613192512/1953383488) finish=172.4min speed=129488K/sec
      bitmap: 11/15 pages [44KB], 65536KB chunk


Adding either the mdraid uuid or /dev/md0 to fstab results in the same boot failure: 

#UUID=91d394e5-23b2-1c73-0715-f5f28e9a5607 /test xfs defaults 1 2
/dev/md0 /test xfs defaults 1 2

[ TIME ] Timed out waiting for device dev-md0.device.
[DEPEND] Dependency failed for /test.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for File Systems Check on /dev/md0.


After further debugging, the root cause appears to be contention and/or a race condition between the dmraid and mdraid components.  The key to the reproducer is that the raid devices need to be attached to an Intel RST controller.

00:1f.2 RAID bus controller: Intel Corporation C600/X79 series chipset SATA RAID Controller (rev 05)

pci-0000:00:11.4-ata-3.0 -> ../../sdc
pci-0000:00:11.4-ata-4.0 -> ../../sdd

# dmraid -b
/dev/sdd:   3907029168 total, "WD-WCC4M0267803"
/dev/sdc:   3907029168 total, "WD-WCC4M0262885"

By default both the dmraid and mdraid components are enabled; it appears dmraid fires first and activates the raid with an imsm device name and uuid, which then results in the fstab and dependency failures.  (A quick check for both components in the initramfs is sketched below.)
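
(One way to confirm both components are present in the initramfs; lsinitrd ships with dracut, and the commands below are illustrative rather than captured from this system:)

# list dracut modules included in the current initramfs
lsinitrd | grep -iE 'dmraid|mdraid'
# show the raid sets dmraid discovers and would claim at boot
dmraid -r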

The solution is to disable the dmraid component by adding rd.dm=0 to GRUB_CMDLINE_LINUX in /etc/default/grub:

GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg_root/root rd.luks=0 rd.dm=0 rd.lvm.lv=vg_root/swap rhgb quiet"

Rebuild the grub2 configuration (BIOS or UEFI path, as appropriate):

# grub2-mkconfig -o /boot/grub2/grub.cfg
# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

Reboot.  The system now boots successfully with either mdadm's device name (/dev/md0) or the uuid in fstab.
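
(After rebooting, a quick way to verify the change took effect and the array came up under its md name; commands only, output omitted:)

# rd.dm=0 should now appear on the kernel command line
cat /proc/cmdline
# the array should be assembled by mdraid as /dev/md0
cat /proc/mdstat
lsblk /dev/md0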

Comment 48 loberman 2019-04-09 13:13:44 UTC
Hello, asking Jon and John as they have been more recently active in this case.

Regards
Laurence

Comment 49 John Pittman 2019-04-09 16:08:22 UTC
Hi Xiao, from the initial description we see "Started Activation of DM RAID sets.", which makes it seem as if Jon's suspicion in comment 46 is correct.  Also, regarding the rc.local edit: that was a workaround Laurence provided to the customer, so it would not negate Jon's suspicion, if I understand correctly.  What we thought was due to large, numerous devices may instead be due to the dmraid component.  The customer has long been unresponsive.  If you have a test package, we can likely test it on Jon's system.

Comment 51 Jon Magrini 2019-04-10 15:02:17 UTC
Regardless of the mdadm test package, I can no longer reproduce this on the latest kernel, even using various versions of mdadm and systemd.  The largest devices I have available are 2TB each, which were used below.

[root@storage-test ~]# uname -r
3.10.0-957.10.1.el7.x86_64

[root@storage-test ~]# rpm -qa |grep mdadm
mdadm-4.1-rc1_2.el7.x86_64

[root@storage-test ~]# lsscsi
[3:0:0:0]    disk    ATA      Samsung SSD 850  2B6Q  /dev/sda 
[4:0:0:0]    disk    ATA      WDC WD20EZRX-19D 0A81  /dev/sdb 
[5:0:0:0]    disk    ATA      WDC WD20EZRX-19D 0A81  /dev/sdc 

[root@storage-test ~]# dmraid -b
/dev/sdc:   3907029168 total, "WD-WCC4M0375096"
/dev/sdb:   3907029168 total, "WD-WCC4M0258187"
/dev/sda:    500118192 total, "S251NXAH345920N"


[root@storage-test ~]# mdadm -C /dev/md0 --metadata=1.2 --level raid1 -n 2 /dev/sd[b-c]
mdadm: array /dev/md0 started.

[root@storage-test ~]# mkfs.xfs /dev/md0
[root@storage-test ~]# mount /dev/md0 /test

Tested mounting by dev name and uuid: 

UUID=7debbf22-0c1d-450c-9569-920ab0357c22 /test xfs defaults 1 2
/dev/md0 /test xfs defaults 1 2

[root@storage-test ~]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-957.10.1.el7.x86_64 root=/dev/mapper/vg_root-root ro crashkernel=auto rd.lvm.lv=vg_root/root rd.lvm.lv=vg_root/swap rhgb quiet LANG=en_US.UTF-8

[root@storage-test ~]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=/dev/mapper/vg_root-root ro crashkernel=auto rd.lvm.lv=vg_root/root rd.lvm.lv=vg_root/swap rhgb quiet


[root@storage-test ~]# yum history
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
ID     | Command line             | Date and time    | Action(s)      | Altered
-------------------------------------------------------------------------------
    61 | update systemd           | 2019-04-10 09:46 | Update         |   10   
    60 | update mdadm             | 2019-04-10 09:40 | Update         |    1   
    59 | update kernel            | 2019-04-10 09:39 | E, I, U        |   14 EE

[root@storage-test ~]# yum history undo 61
[root@storage-test ~]# yum history undo 60

4.0-13.el7.x86_64    @anaconda/7.5
4.1-rc1_2.el7.x86_64 @rhel-7-server-rpms


The one thing that differs between the original reproducer and the latest run is that the following is now observed: 
---
Apr 10 10:16:33 storage-test systemd: Stopped Timer to wait for more drives before activating degraded array md0..

This looks to be logged from the mdadm-last-resort unit.
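
(For reference, the units behind that message can be inspected like this, assuming the md0 instance name; not captured from this host:)

# show the timer/service that activate a degraded array as a last resort
systemctl cat mdadm-last-resort@md0.timer mdadm-last-resort@md0.service
systemctl status mdadm-last-resort@md0.timer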


My raid is still resyncing, which is taking painfully long, but I will continue testing once it completes and see if the behavior changes.  I will also downgrade udev.

[root@storage-test ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdc[1] sdb[0]
      1953382464 blocks super 1.2 [2/2] [UU]
      [====>................]  resync = 20.0% (391244352/1953382464) finish=201.8min speed=128992K/sec
      bitmap: 13/15 pages [52KB], 65536KB chunk

unused devices: <none>