Bug 1455679

Summary:

Install to lvm on top of mdadm raid1 takes a great deal of time

Product:

Red Hat Enterprise Linux 7

Reporter:

John Pittman <jpittman>

Component:

kernel

Assignee:

Nigel Croxon <ncroxon>

kernel sub component:

Multiple Devices (MD)

QA Contact:

guazhang <guazhang>

Status:

CLOSED CURRENTRELEASE

Docs Contact:

Severity:

urgent

Priority:

high

CC:

agk, bjarolim, cww, guazhang, harald, jpittman, jshortt, kwalker, loberman, mkadmiel, mmatsuya, ncroxon, prajnoha, xni

Version:

7.3

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-03-08 19:20:37 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1420851, 1442258, 1477664

Attachments:

Description	Flags
loberman CEE lab install log	none
udev_debug.log	none

Description John Pittman 2017-05-25 18:53:56 UTC

Description of problem:

Install to lvm on top of mdadm raid1 takes a great deal of time.  
Version-Release number of selected component (if applicable):

Full install takes about an hour:
=================================
05:16:09,273 INFO anaconda: /sbin/anaconda 21.48.22.93-1
~ 59 mins
06:15:42,140 INFO anaconda: Running kickstart %%post script(s)

Storage setup is a large chunk of this:
=======================================
05:16:34,504 INFO anaconda: Setting up the installation environment
~ 46 mins
06:02:44,836 INFO anaconda: Running pre-installation scripts

Storage layout
==============

* Only 2 1TB devices on system.  Used as a 2 leg RAID1 mdadm device with LVM on top.

[root@host]# cat proc/mdstat 
Personalities : [raid1] 
md3 : active raid1 sda5[0] sdb5[1]
      12574720 blocks super 1.2 [2/2] [UU]
      
md1 : active raid1 sda3[0] sdb3[1]
      12574720 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid1 sda2[0] sdb2[1]
      841490432 blocks super 1.2 [2/2] [UU]
      bitmap: 4/7 pages [16KB], 65536KB chunk

md0 : active raid1 sda1[0] sdb1[1]
      524224 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

  PV                    VG   Fmt  Attr PSize   PFree DevSize PV UUID                                PMdaFree  PMdaSize  #PMda #PMdaUse 1st PE 
  /dev/md2              vg00 lvm2 a--  802.50g    0  802.51g mRWS6w-wadG-DbzV-tUMN-OuBj-P40B-SFG9JW        0   1020.00k     1        1   1.00m

  LV          VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert LV Tags Devices        
  home        vg00 -wi-ao----   8.00g                                                             /dev/md2(512)  
  log         vg00 -wi-ao---- 690.50g                                                             /dev/md2(1280) 
  mgtservices vg00 -wi-ao----  16.00g                                                             /dev/md2(0)    
  opt         vg00 -wi-ao----  16.00g                                                             /dev/md2(23376)
  tmp         vg00 -wi-ao----  24.00g                                                             /dev/md2(23888)
  usr         vg00 -wi-ao----  32.00g                                                             /dev/md2(24656)
  var         vg00 -wi-ao----  16.00g                                                             /dev/md2(768)  


How reproducible:

Normal install 

Steps to Reproduce:

We have not been able to reproduce internally yet; only seen in customer environment.  HP blades with smartarry controller.
We are still attempting up reproduce, and will update if we are able.

Actual results:

System takes a great deal of time to install.  

Expected results:

Install should take about half the time.

Additional info:

We added a 'time pvs' command to the %post script in anaconda.  The command took about 35 seconds to complete.  For this reason, I am opening the bug against md.  Multiple slowdowns can be seen throughout the storage layout process. We're not sure yet why this is happening, so please feel free to change the component if needed.

Once we know if we can reproduce internally, we will grab further debug logs.

Comment 5 loberman 2017-05-25 21:44:11 UTC

Sent to customer via case update.

Hello

I have a dl360g7 in which I placed 2 2TB drives.
Within the SmartArray BIOS, I created two virtual drives each on a single physical disk.

The total install took around 28 minutes, I am using an ILO virtual cdrom image.

I dont see the stalls you are seeing but I dont have a blade with the internal SAS drives via the embedded P220.
My system has a P410 and is a standalone server.

Given that we have not been able to recreate this, it would appear this is maybe not a generic issue. 

Please can you try this on non-blade server and see if you still see the issue.

I will attach my install log.

Supporting Data
---------------
I used the following ks.cfg slightly modified that John gave me.

# Install OS instead of upgrade
install
 
# Use text mode install and cdrom
text
cdrom
 
# Disable firewall
firewall --disabled
 
# Keyboard layout
keyboard --vckeymap=us --xlayouts=''
 
# System language
lang en_US
 
# Network information
network --bootproto=dhcp
 
# Reboot after installation
reboot
 
# Root password
rootpw Storage1!
 
# Do not configure the X Window System
skipx
 
# System timezone
timezone America/New_York --isUtc
 
# Ignore sdc
ignoredisk --drives=sdc
 
# System bootloader configuration
bootloader --append=" crashkernel=auto" --location=mbr
 
# Clear the Master Boot Record
# zerombr
 
# Partition clearing information
clearpart --all --initlabel
 
# Disk partitioning information
part raid.225 --ondisk=sda --size=512
part raid.231 --ondisk=sdb --size=512
part raid.261 --ondisk=sda --size=12288
part raid.255 --ondisk=sdb --size=12288
part raid.249 --ondisk=sda --size=12288
part raid.267 --ondisk=sdb --size=12288
part raid.237 --ondisk=sda --size=800000
part raid.243 --ondisk=sdb --size=800000
 
raid pv.285 --device=2 --fstype="lvmpv" --level=RAID1 raid.237 raid.243
raid / --device=1 --fstype="ext4" --level=RAID1 --mkfsoptions="-E nodiscard" raid.249 raid.255
raid /boot --device=0 --fstype="ext4" --level=RAID1 --mkfsoptions="-E nodiscard" raid.225 raid.231
raid swap --device=3 --fstype="swap" --level=RAID1 raid.261 raid.267
 
volgroup vg00 --pesize=32768 pv.285
 
logvol /usr  --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=usr --vgname=vg00
logvol /tmp  --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=tmp --vgname=vg00
logvol /opt  --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=opt --vgname=vg00
logvol /var/log  --fstype="ext4" --grow --size=1 --mkfsoptions="-E nodiscard" --name=log --vgname=vg00
logvol /var  --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=var --vgname=vg00
logvol /home  --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=home --vgname=vg00
logvol /opt/mgtservices  --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=mgtservices --vgname=vg00
 
%packages
@Base --nodefaults
@Core
kexec-tools
sos
%end


Thanks
Laurence Oberman

Comment 6 loberman 2017-05-25 21:44:47 UTC

Created attachment 1282418 [details]
loberman CEE lab install log

Comment 18 John Pittman 2017-06-21 12:38:25 UTC

Xiao,

I think we should wait until a proper release as there is a customer attached to this bug.  In CEE our workflow is to keep the case open until the fix is delivered, and to track that fix via the open bug associated.

John

Comment 19 guazhang@redhat.com 2017-08-03 02:23:42 UTC

Hello

Don't reproduce this bug in QE test environment, so request OtherQA. Could someone help to check if customer can help to provide test result?
and QE will do sanity tests here.

thanks
guazhang

Comment 20 John Pittman 2017-08-03 12:37:41 UTC

Hi guazhang, I sent ISO that Xiao provided and it alleviated the issue.  So the customer has already tested and the results were good.

Comment 21 guazhang@redhat.com 2017-08-03 12:46:48 UTC

(In reply to John Pittman from comment #20)
> Hi guazhang, I sent ISO that Xiao provided and it alleviated the issue.  So
> the customer has already tested and the results were good.

Hello

Do you think if need customer test the fixed version kernel ?

Comment 22 John Pittman 2017-08-03 12:54:06 UTC

I don't believe so, as long as the identified fix is in the release, then we will just send them the release.

Comment 23 guazhang@redhat.com 2017-08-04 00:31:16 UTC

(In reply to John Pittman from comment #22)
> I don't believe so, as long as the identified fix is in the release, then we
> will just send them the release.

Thanks, QE will do sanity test here.

Comment 25 John Pittman 2017-08-23 20:20:27 UTC

Hello Xiao,

The customer has been using 7.4 now for installations, and while things are better, they are still facing some slowness.  The latest installation took about 15 minutes to create the partitions and filesystems.  This is much better than what we saw before at 40 minutes, but still slow.

12:19:41,598 INFO anaconda: Created disklabel on /dev/sdb
12:19:41,859 INFO anaconda: Creating mdmember on /dev/sdb1
12:19:41,935 INFO anaconda: Created mdmember on /dev/sdb1
snip....
12:31:55,554 INFO anaconda: Created mdmember on /dev/sda3
12:32:12,034 INFO anaconda: Creating ext4 on /dev/md/boot
12:33:31,794 INFO anaconda: Created ext4 on /dev/md/boot

Also, the customer has complained that post installation, after reboot, the application install is extremely slow.  It seems this slowness is because the md device is still syncing.  I have recommended that they increase the sync speed in hopes that the syncing will get done earlier, avoiding interfering with the application install.

The defaults on my 7.4 system are:

dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000

Recommended increase to:

dev.raid.speed_limit_max=800000
dev.raid.speed_limit_min=150000

Post installation raid sync speed:
Personalities : [raid1] 
md124 : active raid1 sda5[0] sdb5[1]
      864751616 blocks super 1.2 [2/2] [UU]
      [=================>...]  resync = 88.9% (769159616/864751616) finish=12.8min speed=124428K/sec
      bitmap: 3/7 pages [12KB], 65536KB chunk

md125 : active raid1 sdb2[1] sda2[0]
      3905536 blocks super 1.2 [2/2] [UU]
      
md126 : active raid1 sdb1[1] sda1[0]
      9764864 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 sda3[0] sdb3[1]
      488128 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

* Are we to expect slowness in writing/reading to and from the disk during a resync operation?
  - I would think the resync should perform asynchronous to other processes
  - If yes, is there anything we can suggest here to help other than wait until the resync completes?

* I will upload a gathering of logs from the latest installation.  In anaconda.log and journalctl.log, you can see the time frame taken for partitioning and filesystem creation.  A good deal of time spent in udev settle as before.
  - For this environment and disk size, should we expect this type of performance? Surely not, as on my test system with a similar setup I have seen it take around 1 minute.
  - Could the slowness they are seeing with application install have the same cause as the slowness with the partition/fs creation? Could this be due to the resync occurring at the time?

Thanks for any help.

Comment 28 XiaoNi 2017-08-25 07:00:52 UTC

(In reply to John Pittman from comment #25)
> Hello Xiao,
> 
> The customer has been using 7.4 now for installations, and while things are
> better, they are still facing some slowness.  The latest installation took
> about 15 minutes to create the partitions and filesystems.  This is much
> better than what we saw before at 40 minutes, but still slow.

Hi John
Hmm you say it's still slow. Is it the reason that the speed installation of SLES is faster than RHEL? 
How much time does it cost with SLES? In comment9, Laurence said "a couple of minutes". Does it mean that it costs less than 10 minutes? 

> 
> * Are we to expect slowness in writing/reading to and from the disk during a
> resync operation?
>   - I would think the resync should perform asynchronous to other processes
>   - If yes, is there anything we can suggest here to help other than wait
> until the resync completes?

Yes the I/O should be slow when there is resync/recovery/reshape. Because they are not ssds, so it needs to find address by disk head. Because the I/O and resync don't write the same address. It takes more seek time.

We can also give advice to reduce speed_limit_max when customer want to install application. After installing application it should reset speed_limit_max again.

> 
> * I will upload a gathering of logs from the latest installation.  In
> anaconda.log and journalctl.log, you can see the time frame taken for
> partitioning and filesystem creation.  A good deal of time spent in udev
> settle as before.

You mean udev settle costs much time? Is it a problem of udev?

>   - For this environment and disk size, should we expect this type of
> performance? Surely not, as on my test system with a similar setup I have
> seen it take around 1 minute.
>   - Could the slowness they are seeing with application install have the
> same cause as the slowness with the partition/fs creation? Could this be due
> to the resync occurring at the time?

If the resync finished, does it take long time to install application? 

Thanks
Xiao

Comment 29 John Pittman 2017-08-28 16:52:50 UTC

Xiao,

Thanks for your continued help. We sent the snapshot ISO to the customer to test again, just to make sure there was no error between the versions.  They said performance was better for the snapshot.  We don't expect any clear difference, but we sent it anyways.

The comment about SLES that Laurence made was one that the customer relayed to us.  It was essentially their reason for opening the case.  I don't know the actual value of how long it took the SLES system, but I can ask.  It's likely not very relevant to our tests as the SLES version was in line with RHEL6 and not our testbed, RHEL7.  The customer may be willing to test with a more up to date SLES, for comparison, but this is likely a rabbit hole.

For reference, the latest sosreport showed partitioning take 13min 50secs.

12:19:41,538 INFO anaconda: Creating disklabel on /dev/sdb
12:33:31,794 INFO anaconda: Created ext4 on /dev/md/boot

This is on an installation that took 26min 56secs.  The customer claimed very recently that one installation took upwards of an hour, as they were when we opened this bug.  I'm assuming this was from the resync activity and I think/hope it's a one-off;  thanks for the recommendation on decreasing sync speed, will send that along to see if it helps.  In the sosreport that took an hour, I found partitioning took about 19 mins.

To show you what I mean about udev, we'll take an example of a creation that took longer than it should.

12:28:34,630 INFO anaconda: Creating ext4 on /dev/mapper/rhel-var
12:29:07,831 INFO anaconda: Created ext4 on /dev/mapper/rhel-var
12:29:35,360 INFO anaconda: Creating ext4 on /dev/mapper/rhel-var_log

On this, we see that it takes over 30 seconds to create a filesystem on /dev/mapper/rhel-var, then it takes roughly another 30 seconds to start creating/dev/mapper/rhel-var_log.  This is on a filesystem that is only 1.82G large.

LV              VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert LV Tags Devices          
var             rhel -wi-ao----   1.82g                                                             /dev/md124(45727)

Checking /var/log/anaconda/journal.log for further information on exactly what was happening during filesystem creation on /dev/mapper/rhel-var.  (Unsure of what happened to the timestamps on this, please ignore as the behavior is still shown)

Aug 25 16:28:34 localhost anaconda[1558]: Creating ext4 on /dev/mapper/rhel-var
Aug 25 16:28:34 localhost blivet[1558]:           LVMLogicalVolumeDevice.setup: rhel-var ; status: True ; controllable: True ; orig: False ;
Aug 25 16:28:34 localhost blivet[1558]:           LVMLogicalVolumeDevice.setup: rhel-var ; status: True ; controllable: True ; orig: False ;
Aug 25 16:28:34 localhost blivet[1558]:            Ext4FS.create: device: /dev/mapper/rhel-var ; status: False ; type: ext4 ;
Aug 25 16:28:34 localhost blivet[1558]:            Ext4FS.doFormat: device: /dev/mapper/rhel-var ; mountpoint: /var ; type: ext4 ;
Aug 25 16:28:34 localhost program[1558]: Running... mke2fs -t ext4 /dev/mapper/rhel-var
Aug 25 16:28:38 localhost program[1558]: mke2fs 1.42.9 (28-Dec-2013)
Aug 25 16:28:38 localhost program[1558]: Filesystem label=
Aug 25 16:28:38 localhost program[1558]: OS type: Linux
Aug 25 16:28:38 localhost program[1558]: Block size=4096 (log=2)
Aug 25 16:28:38 localhost program[1558]: Fragment size=4096 (log=2)
Aug 25 16:28:38 localhost program[1558]: Stride=64 blocks, Stripe width=64 blocks
Aug 25 16:28:38 localhost program[1558]: 119760 inodes, 478208 blocks
Aug 25 16:28:38 localhost program[1558]: 23910 blocks (5.00%) reserved for the super user
Aug 25 16:28:38 localhost program[1558]: First data block=0
Aug 25 16:28:38 localhost program[1558]: Maximum filesystem blocks=490733568
Aug 25 16:28:38 localhost program[1558]: 15 block groups
Aug 25 16:28:38 localhost program[1558]: 32768 blocks per group, 32768 fragments per group
Aug 25 16:28:38 localhost program[1558]: 7984 inodes per group
Aug 25 16:28:38 localhost program[1558]: Superblock backups stored on blocks:
Aug 25 16:28:38 localhost program[1558]:         32768, 98304, 163840, 229376, 294912
Aug 25 16:28:38 localhost program[1558]: 
Aug 25 16:28:38 localhost program[1558]: [49B blob data]
Aug 25 16:28:38 localhost program[1558]: [46B blob data]
Aug 25 16:28:38 localhost program[1558]: Creating journal (8192 blocks): done
Aug 25 16:28:38 localhost program[1558]: [113B blob data]
Aug 25 16:28:38 localhost program[1558]: 
Aug 25 16:28:38 localhost program[1558]: Return code: 0
****[mkfs takes 4 seconds to return]
Aug 25 16:28:38 localhost blivet[1558]:             Ext4FS.notifyKernel: device: /dev/mapper/rhel-var ; type: ext4 ;
Aug 25 16:28:38 localhost blivet[1558]: notifying kernel of 'change' event on device /sys/class/block/dm-7
Aug 25 16:28:38 localhost program[1558]: Running... e2label /dev/mapper/rhel-var
Aug 25 16:28:38 localhost program[1558]: Return code: 0
Aug 25 16:28:38 localhost blivet[1558]:              Ext4FS.notifyKernel: device: /dev/mapper/rhel-var ; type: ext4 ;
Aug 25 16:28:38 localhost blivet[1558]: notifying kernel of 'change' event on device /sys/class/block/dm-7
Aug 25 16:28:38 localhost program[1558]: Running... udevadm settle --timeout=300
****[udevadm settle takes 59 seconds to return]
Aug 25 16:29:07 localhost program[1558]: Return code: 0
Aug 25 16:29:07 localhost anaconda[1558]: Created ext4 on /dev/mapper/rhel-var
Aug 25 16:29:07 localhost program[1558]: Running... udevadm settle --timeout=300
Aug 25 16:29:07 localhost program[1558]: Return code: 0
Aug 25 16:29:07 localhost blivet[1558]: executing action: [8275] create device lvmlv rhel-var_log (id 8273)
Aug 25 16:29:07 localhost blivet[1558]:           LVMLogicalVolumeDevice.create: rhel-var_log ; status: False ;
Aug 25 16:29:07 localhost blivet[1558]:               LVMLogicalVolumeDevice.setupParents: kids: 0 ; name: rhel-var_log ; orig: False ;
Aug 25 16:29:07 localhost blivet[1558]:                LVMVolumeGroupDevice.setup: rhel ; status: True ; controllable: True ; orig: False ;
Aug 25 16:29:07 localhost program[1558]: Running... lvm vgs --noheadings --nosuffix --nameprefixes --unquoted --units m -o uuid,size,free,extent_size,extent_count,free_count,pv_count rhel --config  devices { preferred_names=["^/dev/mapper/", "^/dev/md/", "^/dev/sd"] } global {locking_type=4}
Aug 25 16:29:07 localhost program[1558]:   LVM2_VG_UUID=wxiYcn-BwwC-wQ22-Jrh6-p1n4-b5z6-lpk3EI LVM2_VG_SIZE=844480.00 LVM2_VG_FREE=659704.00 LVM2_VG_EXTENT_SIZE=4.00 LVM2_VG_EXTENT_COUNT=211120 LVM2_VG_FREE_COUNT=164926 LVM2_PV_COUNT=1
Aug 25 16:29:07 localhost program[1558]: Return code: 0
Aug 25 16:29:07 localhost blivet[1558]:            LVMLogicalVolumeDevice._create: rhel-var_log ; status: False ;
Aug 25 16:29:07 localhost program[1558]: Running... lvm lvcreate -L 3774m -n var_log -y rhel --config  devices { preferred_names=["^/dev/mapper/", "^/dev/md/", "^/dev/sd"] }
****[lvm lv creation takes 14 seconds to kick off]
Aug 25 16:29:21 localhost program[1558]:   Rounding up size to full physical extent <3.69 GiB
Aug 25 16:29:21 localhost program[1558]:   Logical volume "var_log" created.
Aug 25 16:29:21 localhost program[1558]: Return code: 0
Aug 25 16:29:21 localhost blivet[1558]:             LVMLogicalVolumeDevice.setup: rhel-var_log ; status: True ; controllable: True ; orig: False ;
Aug 25 16:29:21 localhost blivet[1558]:             LVMLogicalVolumeDevice.updateSysfsPath: rhel-var_log ; status: True ;
Aug 25 16:29:21 localhost blivet[1558]: rhel-var_log sysfsPath set to /sys/devices/virtual/block/dm-8
Aug 25 16:29:21 localhost program[1558]: Running... udevadm settle --timeout=300
****[14 more seconds for settle]
Aug 25 16:29:35 localhost program[1558]: Return code: 0
Aug 25 16:29:35 localhost blivet[1558]:               LVMLogicalVolumeDevice.readCurrentSize: path: /dev/mapper/rhel-var_log ; sysfsPath: /sys/devices/virtual/block/dm-8 ; exists: True ;
Aug 25 16:29:35 localhost blivet[1558]: updated rhel-var_log size to 3776 MiB (3776 MiB)
Aug 25 16:29:35 localhost program[1558]: Running... udevadm settle --timeout=300
Aug 25 16:29:35 localhost program[1558]: Return code: 0
Aug 25 16:29:35 localhost blivet[1558]: executing action: [8616] create format ext4 filesystem mounted at /var/log on lvmlv rhel-var_log (id 8273)
Aug 25 16:29:35 localhost anaconda[1558]: Creating ext4 on /dev/mapper/rhel-var_log

As you can see, there are very specific spots where we stop moving forward, and not due to the size of the filesystem, but rather seemingly due to udev or the raid sync.  I will mention adding udev debug as a part of troubleshooting going forward.

Also, I will ask exacly what happens when resync finishes in terms of application install performance.  Will report back here.

Comment 30 John Pittman 2017-08-28 17:06:23 UTC

Edit: initial udev settle only took about 30 seconds; did the math incorrectly :)

Comment 31 John Pittman 2017-09-04 18:50:04 UTC

Xiao,

Sorry for the wait.  Reducing the sync_speed_max fixed the issue.  I did some tests in the lab and it increased the speed of installation on my system by around 30%.  The customer implemented the decrease through a %pre script in kickstart and the change was drastic.  Quote from the customer:  "The overall OS installation time as I calculated was 16 minutes, starting from the PXE stage till the login prompt".  That's one-half to one-third the time of previous installs; currently, seemingly all caused by the latency produced by the resync.  Is this what you expect?  I would expect some perf hit, but of this magnitude?

Comment 32 XiaoNi 2017-09-05 03:00:17 UTC

(In reply to John Pittman from comment #29)
> 
> The comment about SLES that Laurence made was one that the customer relayed
> to us.  It was essentially their reason for opening the case.  I don't know
> the actual value of how long it took the SLES system, but I can ask.  It's
> likely not very relevant to our tests as the SLES version was in line with
> RHEL6 and not our testbed, RHEL7.  The customer may be willing to test with
> a more up to date SLES, for comparison, but this is likely a rabbit hole.

I don't know about this. Yes we don't need to compare RHEL7 with SLES.
> 
> For reference, the latest sosreport showed partitioning take 13min 50secs.
> 
> 12:19:41,538 INFO anaconda: Creating disklabel on /dev/sdb
> 12:33:31,794 INFO anaconda: Created ext4 on /dev/md/boot
> 
> This is on an installation that took 26min 56secs.  The customer claimed
> very recently that one installation took upwards of an hour, as they were
> when we opened this bug.  I'm assuming this was from the resync activity and
> I think/hope it's a one-off;  thanks for the recommendation on decreasing
> sync speed, will send that along to see if it helps.  In the sosreport that
> took an hour, I found partitioning took about 19 mins.

I'm a little confused. It's RHEL7.3 that took more than one hour to install OS, right? It becomes better with RHEL7.4 which has the patches from https://bugzilla.redhat.com/show_bug.cgi?id=1455679#c16. So it should be less than one hour.


> Aug 25 16:28:38 localhost blivet[1558]: notifying kernel of 'change' event
> on device /sys/class/block/dm-7
> Aug 25 16:28:38 localhost program[1558]: Running... udevadm settle
> --timeout=300
> ****[udevadm settle takes 59 seconds to return]

What's the device dm-7? Sorry, I'm not familiar with anaconda and blivet. If I installed a new system, where can I see this kind of logs?


> Aug 25 16:29:21 localhost blivet[1558]:            
> LVMLogicalVolumeDevice.setup: rhel-var_log ; status: True ; controllable:
> True ; orig: False ;
> Aug 25 16:29:21 localhost blivet[1558]:            
> LVMLogicalVolumeDevice.updateSysfsPath: rhel-var_log ; status: True ;
> Aug 25 16:29:21 localhost blivet[1558]: rhel-var_log sysfsPath set to
> /sys/devices/virtual/block/dm-8
> Aug 25 16:29:21 localhost program[1558]: Running... udevadm settle
> --timeout=300
> ****[14 more seconds for settle]
> Aug 25 16:29:35 localhost program[1558]: Return code: 0

Why does udev take more seconds for settle? 

> 
> As you can see, there are very specific spots where we stop moving forward,
> and not due to the size of the filesystem, but rather seemingly due to udev
> or the raid sync.  I will mention adding udev debug as a part of
> troubleshooting going forward.

Hmm why do you think it's due to raid sync. 

Thanks
Xiao

Comment 33 XiaoNi 2017-09-05 03:02:23 UTC

(In reply to John Pittman from comment #31)
> Xiao,
> 
> Sorry for the wait.  Reducing the sync_speed_max fixed the issue.  I did
> some tests in the lab and it increased the speed of installation on my
> system by around 30%.  The customer implemented the decrease through a %pre
> script in kickstart and the change was drastic.  Quote from the customer: 
> "The overall OS installation time as I calculated was 16 minutes, starting
> from the PXE stage till the login prompt".  That's one-half to one-third the
> time of previous installs; currently, seemingly all caused by the latency
> produced by the resync.  Is this what you expect?  I would expect some perf
> hit, but of this magnitude?

Yes, it's my expect. Resync speed gets down can let the performance of other programmes get better. So the problem is fixed?

Sorry, what's the meaning of "some perf hit, but of this magnitude"

Thanks
Xiao

Comment 34 John Pittman 2017-09-05 12:27:33 UTC

Hi Xiao,

Yes the issue was eliminated by decreasing the sync_speed_max from a %pre script.

Comment 36 John Pittman 2017-09-06 20:31:30 UTC

Xiao,

I ran performance tests on a non-root mdadm device during sync, and did not see the performance impact that we see during installation.  The system seemed to handle the I/O and resync well.  There was very little difference between 1x-32x threaded buffered write I/O during a resync vs not. The resync speed never got above 80k, which is much lower than the max that we saw during install.  We were getting up to 180k during install (if I recall correctly).  This was unexpected for me, I understand that there was less I/O during install, probably allowing the resync to reach a higher maximum speed, but I don't understand how sync can obviously slow down install but does not slow down I/O in the live environment.  Is there some sort of prio given to a root device sync that may get in the way?

Also, the customer set their sync_speed_max at 45K sync_speed_max.  This provided them with a good balance of install speed and sync speed.  If the impact on installation is expected (with my tests, now, I have no idea why it would be), can we implement some sort of decrease in sync_speed_max, only during install?

Comment 37 John Pittman 2017-09-07 18:43:19 UTC

One more question, why is it that sometimes the resync speed sometimes flaps up and down wildly, then other times it is fairly consistent?  Is this due to he consistency of the I/O & CPU load?

Comment 39 XiaoNi 2017-09-08 02:10:47 UTC

(In reply to John Pittman from comment #36)
> Xiao,
> 
> I ran performance tests on a non-root mdadm device during sync, and did not
> see the performance impact that we see during installation.  The system
> seemed to handle the I/O and resync well.  There was very little difference
> between 1x-32x threaded buffered write I/O during a resync vs not. The
> resync speed never got above 80k, which is much lower than the max that we
> saw during install.  We were getting up to 180k during install (if I recall
> correctly).  This was unexpected for me, I understand that there was less
> I/O during install, probably allowing the resync to reach a higher maximum
> speed, but I don't understand how sync can obviously slow down install but
> does not slow down I/O in the live environment.  Is there some sort of prio
> given to a root device sync that may get in the way?

It's a general problem. Resync/recovery can decrease the speed of I/O. In comment32 I have some questions. It looks like we can know who takes most time in the log in anaconda. In the log, it looks like udev takes most time. Maybe it's the key we should look into.
> 
> Also, the customer set their sync_speed_max at 45K sync_speed_max.  This
> provided them with a good balance of install speed and sync speed.  If the
> impact on installation is expected (with my tests, now, I have no idea why
> it would be), can we implement some sort of decrease in sync_speed_max, only
> during install?

It's not a good solution for this problem. 

Thanks
Xiao

Comment 40 XiaoNi 2017-09-08 02:14:57 UTC

(In reply to John Pittman from comment #37)
> One more question, why is it that sometimes the resync speed sometimes flaps
> up and down wildly, then other times it is fairly consistent?  Is this due
> to he consistency of the I/O & CPU load?

How much is the difference?

Comment 44 John Pittman 2017-09-08 20:15:03 UTC

Created attachment 1323894 [details]
udev_debug.log

Attached is an example of something that takes too long.  When adding up many events like this, we end up with a very long install.  /dev/md0 is only 512M but ext4 format takes 35 seconds, whereas after install, formatting takes .520s (resync is not running in this case after install).  We even see that the mke2fs program not officially starting until over 20 seconds after the command is called.

Sep 08 12:30:06 localhost.localdomain program[1631]: Running... mke2fs -t ext4 /dev/md/0
....
Sep 08 12:30:28 localhost.localdomain program[1631]: mke2fs 1.42.9 (28-Dec-2013)

Comment 45 John Pittman 2017-09-08 20:40:07 UTC

Lots of time spent in 'Validate module index' for comment 44

Comment 52 Harald Hoyer 2017-10-09 14:03:25 UTC

What I see in attachment 1323894 [details] :

Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: inotify event: 8 for /dev/md0
Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: device /dev/md0 closed, synthesising 'change'


Sep 08 12:30:28 localhost.localdomain blivet[1631]:             Ext4FS.notifyKernel: device: /dev/md/0 ; type: ext4 ;
Sep 08 12:30:28 localhost.localdomain blivet[1631]: notifying kernel of 'change' event on device /sys/class/block/md0

What means that for md0 two udev change events are created for one action. One from systemd-udevd itsself via the watch rules and one by blivet.

Either blivet should turn off the udev watch rules or leave the job to udev for the change events.

This doubled raid rescans from udev.


Regarding "too much time spent in udevsettle":

If you have to wait a long time for udev to settle, then something fishy is going on in the udev rules and a helper application might take more time there than expected. Further debugging of the helper is needed.

If a udev event is processed, this is done via a worker (with a PID and a seq id). You can see how long this worker is in action via the debug log.

Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: seq 5912 queued, 'change' 'block'
Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: seq 5912 forked new worker [3641]
Sep 08 12:30:28 localhost.localdomain systemd-udevd[3641]: seq 5912 running
[…]
Sep 08 12:30:31 localhost.localdomain systemd-udevd[3641]: seq 5912 processed with 0

So, this particular "change" uevent for /dev/md0 took 3 seconds.

Seems like a lot of time was spent in blkid.

Seems like "e2label /dev/md/0" is running in parallel to udev running "blkid", which probably bites.

So, either there should be a udevsettle after the mke2fs before running e2label, or as I said before, the udev watch rules could be turned off (with blivet emitting the change events anyway).

You should contact the blivet/anaconda team about that.

Comment 69 Harald Hoyer 2017-11-08 13:25:18 UTC

so, what happened e.g. for sda from 16:34:45,907 - 16:34:59,484: 

16:34:44 program[2943]: Running... udevadm settle --timeout=300
16:34:44 program[2943]: Return code: 0
16:34:44 blivet[2943]: executing action: [257] create device partition sda2 (id 255)
16:34:44 blivet[2943]:           PartitionDevice.create: sda2 ; status: False ;
16:34:44 blivet[2943]:             PartitionDevice.setupParents: kids: 1 ; name: sda2 ; orig: False ;
16:34:44 blivet[2943]:              DiskDevice.setup: sda ; status: True ; controllable: True ; orig: False ;
16:34:44 blivet[2943]:              DiskLabel.setup: device: /dev/sda ; status: False ; type: disklabel ;
16:34:44 blivet[2943]:            PartitionDevice._create: sda2 ; status: False ;
16:34:44 blivet[2943]:             PartitionDevice._wipe: sda2 ; status: False ;
16:34:44 program[2943]: Running... dd if=/dev/zero of=/dev/sda bs=512 seek=421529600 count=2048
16:34:45 program[2943]: 2048+0 records in
16:34:45 program[2943]: 2048+0 records out
16:34:45 program[2943]: 1048576 bytes (1.0 MB) copied, 1.21934 s, 860 kB/s
16:34:45 program[2943]: Return code: 0
16:34:45 program[2943]: Running... udevadm settle --timeout=300
16:34:59 program[2943]: Return code: 0
16:34:59 blivet[2943]:             DiskLabel.commit: device: /dev/sda ; numparts: 2 ;

Note: these timestamps are taken from the journal.log, where timestamps are created upon processing the journal log, which does not guarantee any ordering or microsecond resolution.

My _guess_ is, that while blivet is creating the partition, or "dd" is running (why on earth dd???? on /dev/sda *sigh* one more open/close which triggers a change uevent), the old uevent is running blkid. This blkid might get confused with an incomplete partition table.

Comment 70 Peter Rajnoha 2017-11-08 14:05:34 UTC

(Just a note to the enormous trust and confidence with the settle here - the udevadm settle can't even guarantee that the synthesized uevents have been issued at all at the time we call the settle - the uevent might not even be triggered yet as it needs the inotify event to be received by udevd first which then issues the synthetic uevent for this inotify event for the device's "close" operation... We can't rely on the udevadm  settle here.)

Comment 71 David Lehman 2017-11-15 18:14:35 UTC

(In reply to Harald Hoyer from comment #69)
> My _guess_ is, that while blivet is creating the partition, or "dd" is
> running (why on earth dd???? on /dev/sda *sigh* one more open/close which
> triggers a change uevent), the old uevent is running blkid. This blkid might
> get confused with an incomplete partition table.

That is writing zeros to the beginning of the disk region the partition will occupy after we tell parted to commit the new partition to disk.

Comment 72 David Lehman 2017-11-15 18:27:37 UTC

(In reply to Peter Rajnoha from comment #70)
> ... We can't rely on the udevadm  settle here.)

What other option do we have?

Comment 73 David Lehman 2017-11-15 19:45:18 UTC

Before we get carried away with minor optimizations, let's keep in mind that this process generally works fine. I just did a test install of 7.4 using qemu with two 40GiB virtio disks creating four raid1 arrays for /boot, /, swap, and lvm, all filesystems ext4. The storage configuration took a total of one minute. So the number of disks, the number of md arrays, and the filesystem type are all the same. What's different? The size of the partitions/arrays and the hardware. The problem is not related to the number of devices being managed nor the basic methods used by the installer, so let's move on from that theory, please.

Comment 79 Peter Rajnoha 2017-11-28 12:06:01 UTC

(In reply to David Lehman from comment #72)
> (In reply to Peter Rajnoha from comment #70)
> > ... We can't rely on the udevadm  settle here.)
> 
> What other option do we have?

Unfortunately, none at the moment with current resources and older kernels that I can think of quickly.

What we need here is for "dd" (or any other tool which causes the WATCH udev rule to trigger) to issue the uevent itself and wait for it to get processed. Then, when "dd" (or any other other tool) exits, we're sure that the synthetic uevent has been processed.

I understand that it's not that easy to change each existing tool that may have changed anything to issue a uevent itself and wait for it instead of relying on the WATCH udev rule...

What we can do instead with recent kernels (version >= 4.13) that support extended synthetic uevent interface (see also: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-uevent) is to create a wrapper that executes a command and at its end, it generates synthetic uevent with exact UUID and waits for the uevent via udev monitor by monitoring for the UUID. This way, we know the uevent is processed already when the wrapper exits and at the same time, we don't rely on the WATCH udev rule which is racy.

(For future, it should be the command itself that generates and waits for the uevent, but as intermediary solution, we could create the wrapper.)

Otherwise, I've no better idea how to resolve this.

Comment 98 Nigel Croxon 2018-01-16 16:15:52 UTC

RHEL7.4 has been released as of kernel-3.10.0-693.el7.x86_64.
Some of the test kernels given out for this BZ come after 7.4 was released.
If you want changes to 7.4, you need to request a 7.4-zstream release.

There has been numerous fixes that have gone into 7.5, but as to the fix
to this bz, we don't have a specific commit id that fixes it.
Other bugzillas that have reported resync issues have been resolved with 
the 7.5 kernel.  On this issue, we still need a local reproducer.

We do have the latest 7.5 kernel and userspace mdadm program available.
mdadm:
http://download-node-02.eng.bos.redhat.com/brewroot/packages/mdadm/4.0/10.el7/

Kernel:
http://people.redhat.com/ncroxon/rhel7/.bz1455679_nokia/kernel-3.10.0-830.el7bz1455679.x86_64.rpm

-Nigel

Comment 118 John Pittman 2018-03-08 19:20:37 UTC

I think we can go ahead and close this.  Case has been closed.  I will close it as 'currentrelease' since the earlier update to the code did help the customers performance a great deal.  Any other issues should be remedied by decreasing max sync speed.