Bug 1455679
| Field | Value |
|---|---|
| Summary | Install to lvm on top of mdadm raid1 takes a great deal of time |
| Product | Red Hat Enterprise Linux 7 |
| Component | kernel |
| Kernel sub component | Multiple Devices (MD) |
| Version | 7.3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | high |
| Reporter | John Pittman <jpittman> |
| Assignee | Nigel Croxon <ncroxon> |
| QA Contact | guazhang <guazhang> |
| CC | agk, bjarolim, cww, guazhang, harald, jpittman, jshortt, kwalker, loberman, mkadmiel, mmatsuya, ncroxon, prajnoha, xni |
| Target Milestone | rc |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2018-03-08 19:20:37 UTC |
| Bug Blocks | 1420851, 1442258, 1477664 |
Description
John Pittman
2017-05-25 18:53:56 UTC
Sent to customer via case update.

Hello,

I have a dl360g7 in which I placed two 2TB drives. Within the SmartArray BIOS, I created two virtual drives, each on a single physical disk. The total install took around 28 minutes, using an iLO virtual CD-ROM image. I don't see the stalls you are seeing, but I don't have a blade with the internal SAS drives via the embedded P220; my system has a P410 and is a standalone server. Given that we have not been able to recreate this, it would appear this is maybe not a generic issue. Can you please try this on a non-blade server and see if you still see the issue? I will attach my install log.

Supporting Data
---------------

I used the following ks.cfg, slightly modified, that John gave me.

```
# Install OS instead of upgrade
install
# Use text mode install and cdrom
text
cdrom
# Disable firewall
firewall --disabled
# Keyboard layout
keyboard --vckeymap=us --xlayouts=''
# System language
lang en_US
# Network information
network --bootproto=dhcp
# Reboot after installation
reboot
# Root password
rootpw Storage1!
# Do not configure the X Window System
skipx
# System timezone
timezone America/New_York --isUtc
# Ignore sdc
ignoredisk --drives=sdc
# System bootloader configuration
bootloader --append=" crashkernel=auto" --location=mbr
# Clear the Master Boot Record
# zerombr
# Partition clearing information
clearpart --all --initlabel
# Disk partitioning information
part raid.225 --ondisk=sda --size=512
part raid.231 --ondisk=sdb --size=512
part raid.261 --ondisk=sda --size=12288
part raid.255 --ondisk=sdb --size=12288
part raid.249 --ondisk=sda --size=12288
part raid.267 --ondisk=sdb --size=12288
part raid.237 --ondisk=sda --size=800000
part raid.243 --ondisk=sdb --size=800000
raid pv.285 --device=2 --fstype="lvmpv" --level=RAID1 raid.237 raid.243
raid / --device=1 --fstype="ext4" --level=RAID1 --mkfsoptions="-E nodiscard" raid.249 raid.255
raid /boot --device=0 --fstype="ext4" --level=RAID1 --mkfsoptions="-E nodiscard" raid.225 raid.231
raid swap --device=3 --fstype="swap" --level=RAID1 raid.261 raid.267
volgroup vg00 --pesize=32768 pv.285
logvol /usr --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=usr --vgname=vg00
logvol /tmp --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=tmp --vgname=vg00
logvol /opt --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=opt --vgname=vg00
logvol /var/log --fstype="ext4" --grow --size=1 --mkfsoptions="-E nodiscard" --name=log --vgname=vg00
logvol /var --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=var --vgname=vg00
logvol /home --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=home --vgname=vg00
logvol /opt/mgtservices --fstype="ext4" --size=50000 --mkfsoptions="-E nodiscard" --name=mgtservices --vgname=vg00

%packages
@Base --nodefaults
@Core
kexec-tools
sos
%end
```

Thanks,
Laurence Oberman

Created attachment 1282418 [details]: loberman CEE lab install log
---

Xiao, I think we should wait until a proper release, as there is a customer attached to this bug. In CEE our workflow is to keep the case open until the fix is delivered, and to track that fix via the associated open bug.

John

---

Hello, we can't reproduce this bug in the QE test environment, so requesting OtherQA. Could someone check whether the customer can provide test results? QE will do sanity tests here.

Thanks,
guazhang

---

Hi guazhang, I sent the ISO that Xiao provided and it alleviated the issue, so the customer has already tested and the results were good.

---

(In reply to John Pittman from comment #20)
> Hi guazhang, I sent the ISO that Xiao provided and it alleviated the issue.

Hello, do you think the customer needs to test the fixed kernel?

---

I don't believe so. As long as the identified fix is in the release, we will just send them the release.

---

(In reply to John Pittman from comment #22)
> I don't believe so. As long as the identified fix is in the release, we
> will just send them the release.

Thanks, QE will do a sanity test here.

---

Hello Xiao,

The customer has been using 7.4 now for installations, and while things are better, they are still facing some slowness. The latest installation took about 15 minutes to create the partitions and filesystems. This is much better than the 40 minutes we saw before, but still slow.

```
12:19:41,598 INFO anaconda: Created disklabel on /dev/sdb
12:19:41,859 INFO anaconda: Creating mdmember on /dev/sdb1
12:19:41,935 INFO anaconda: Created mdmember on /dev/sdb1
snip....
12:31:55,554 INFO anaconda: Created mdmember on /dev/sda3
12:32:12,034 INFO anaconda: Creating ext4 on /dev/md/boot
12:33:31,794 INFO anaconda: Created ext4 on /dev/md/boot
```

Also, the customer has complained that post-installation, after reboot, the application install is extremely slow. It seems this slowness occurs because the md device is still syncing. I have recommended that they increase the sync speed, in hopes that the syncing finishes earlier and avoids interfering with the application install. The defaults on my 7.4 system are:

```
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
```

Recommended increase:

```
dev.raid.speed_limit_max = 800000
dev.raid.speed_limit_min = 150000
```

Post-installation raid sync speed:

```
Personalities : [raid1]
md124 : active raid1 sda5[0] sdb5[1]
      864751616 blocks super 1.2 [2/2] [UU]
      [=================>...]  resync = 88.9% (769159616/864751616) finish=12.8min speed=124428K/sec
      bitmap: 3/7 pages [12KB], 65536KB chunk

md125 : active raid1 sdb2[1] sda2[0]
      3905536 blocks super 1.2 [2/2] [UU]

md126 : active raid1 sdb1[1] sda1[0]
      9764864 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 sda3[0] sdb3[1]
      488128 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
```

* Are we to expect slowness in writing to and reading from the disk during a resync operation?
  - I would think the resync should run asynchronously to other processes.
  - If yes, is there anything we can suggest to help, other than waiting until the resync completes?
* I will upload a gathering of logs from the latest installation. In anaconda.log and journalctl.log you can see the time taken for partitioning and filesystem creation; a good deal of time is spent in udev settle, as before.
  - For this environment and disk size, should we expect this type of performance? Surely not, as on my test system with a similar setup I have seen it take around 1 minute.
  - Could the slowness they are seeing with the application install have the same cause as the slowness with the partition/filesystem creation? Could this be due to the resync occurring at the time?

Thanks for any help.

---

(In reply to John Pittman from comment #25)
> The latest installation took about 15 minutes to create the partitions and
> filesystems. This is much better than the 40 minutes we saw before, but
> still slow.

Hi John,

Hmm, you say it's still slow. Is the point that the SLES installation is faster than RHEL? How much time does it take with SLES? In comment 9, Laurence said "a couple of minutes"; does that mean it takes less than 10 minutes?

> * Are we to expect slowness in writing to and reading from the disk during
>   a resync operation?

Yes, I/O should be slow while there is a resync/recovery/reshape. These are not SSDs, so addresses have to be found by moving the disk head, and because the I/O and the resync don't write to the same addresses, more seek time is needed. We can also advise reducing speed_limit_max while the customer installs the application, and resetting speed_limit_max again afterwards.

> * I will upload a gathering of logs from the latest installation. ... a good
>   deal of time is spent in udev settle, as before.

You mean udev settle takes most of the time? Is it a problem in udev?

> - Could the slowness they are seeing with the application install have the
>   same cause as the slowness with the partition/filesystem creation?

Once the resync has finished, does the application install still take a long time?

Thanks,
Xiao
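For reference, the knobs under discussion are the dev.raid.* sysctls. A minimal sketch of inspecting and adjusting them at runtime, with illustrative values:

```sh
# Show the current md resync throttling limits (KiB/s per device)
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Watch resync progress and the effective speed
cat /proc/mdstat

# Temporarily cap the resync rate so foreground I/O is not starved;
# not persistent across reboot (use /etc/sysctl.d/ for that)
sysctl -w dev.raid.speed_limit_max=45000
```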
---

Xiao,

Thanks for your continued help. We sent the snapshot ISO to the customer to test again, just to make sure there was no error between the versions; they said performance was better with the snapshot. We didn't expect any clear difference, but we sent it anyway.

The comment about SLES that Laurence made was one the customer relayed to us; it was essentially their reason for opening the case. I don't know the actual value of how long it took the SLES system, but I can ask. It's likely not very relevant to our tests, as that SLES version was in line with RHEL 6 and not our testbed, RHEL 7. The customer may be willing to test with a more up-to-date SLES for comparison, but this is likely a rabbit hole.

For reference, the latest sosreport showed partitioning take 13 min 50 sec:

```
12:19:41,538 INFO anaconda: Creating disklabel on /dev/sdb
12:33:31,794 INFO anaconda: Created ext4 on /dev/md/boot
```

This is on an installation that took 26 min 56 sec. The customer claimed very recently that one installation took upwards of an hour, as they were seeing when we opened this bug. I'm assuming this was from the resync activity, and I think/hope it's a one-off; thanks for the recommendation on decreasing the sync speed, I will send that along to see if it helps. In the sosreport from the install that took an hour, I found that partitioning took about 19 minutes.

To show you what I mean about udev, we'll take an example of a creation that took longer than it should:

```
12:28:34,630 INFO anaconda: Creating ext4 on /dev/mapper/rhel-var
12:29:07,831 INFO anaconda: Created ext4 on /dev/mapper/rhel-var
12:29:35,360 INFO anaconda: Creating ext4 on /dev/mapper/rhel-var_log
```

Here we see that it takes over 30 seconds to create a filesystem on /dev/mapper/rhel-var, and then roughly another 30 seconds before creation of /dev/mapper/rhel-var_log starts. This is on a filesystem that is only 1.82G:

```
LV   VG    Attr        LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert LV Tags  Devices
var  rhel  -wi-ao----  1.82g                                                               /dev/md124(45727)
```

Checking /var/log/anaconda/journal.log for further information on exactly what was happening during filesystem creation on /dev/mapper/rhel-var. (Unsure of what happened to the timestamps on this; please ignore, as the behavior is still shown.)

```
Aug 25 16:28:34 localhost anaconda[1558]: Creating ext4 on /dev/mapper/rhel-var
Aug 25 16:28:34 localhost blivet[1558]: LVMLogicalVolumeDevice.setup: rhel-var ; status: True ; controllable: True ; orig: False ;
Aug 25 16:28:34 localhost blivet[1558]: LVMLogicalVolumeDevice.setup: rhel-var ; status: True ; controllable: True ; orig: False ;
Aug 25 16:28:34 localhost blivet[1558]: Ext4FS.create: device: /dev/mapper/rhel-var ; status: False ; type: ext4 ;
Aug 25 16:28:34 localhost blivet[1558]: Ext4FS.doFormat: device: /dev/mapper/rhel-var ; mountpoint: /var ; type: ext4 ;
Aug 25 16:28:34 localhost program[1558]: Running... mke2fs -t ext4 /dev/mapper/rhel-var
Aug 25 16:28:38 localhost program[1558]: mke2fs 1.42.9 (28-Dec-2013)
Aug 25 16:28:38 localhost program[1558]: Filesystem label=
Aug 25 16:28:38 localhost program[1558]: OS type: Linux
Aug 25 16:28:38 localhost program[1558]: Block size=4096 (log=2)
Aug 25 16:28:38 localhost program[1558]: Fragment size=4096 (log=2)
Aug 25 16:28:38 localhost program[1558]: Stride=64 blocks, Stripe width=64 blocks
Aug 25 16:28:38 localhost program[1558]: 119760 inodes, 478208 blocks
Aug 25 16:28:38 localhost program[1558]: 23910 blocks (5.00%) reserved for the super user
Aug 25 16:28:38 localhost program[1558]: First data block=0
Aug 25 16:28:38 localhost program[1558]: Maximum filesystem blocks=490733568
Aug 25 16:28:38 localhost program[1558]: 15 block groups
Aug 25 16:28:38 localhost program[1558]: 32768 blocks per group, 32768 fragments per group
Aug 25 16:28:38 localhost program[1558]: 7984 inodes per group
Aug 25 16:28:38 localhost program[1558]: Superblock backups stored on blocks:
Aug 25 16:28:38 localhost program[1558]: 32768, 98304, 163840, 229376, 294912
Aug 25 16:28:38 localhost program[1558]:
Aug 25 16:28:38 localhost program[1558]: [49B blob data]
Aug 25 16:28:38 localhost program[1558]: [46B blob data]
Aug 25 16:28:38 localhost program[1558]: Creating journal (8192 blocks): done
Aug 25 16:28:38 localhost program[1558]: [113B blob data]
Aug 25 16:28:38 localhost program[1558]:
Aug 25 16:28:38 localhost program[1558]: Return code: 0
****[mkfs takes 4 seconds to return]
Aug 25 16:28:38 localhost blivet[1558]: Ext4FS.notifyKernel: device: /dev/mapper/rhel-var ; type: ext4 ;
Aug 25 16:28:38 localhost blivet[1558]: notifying kernel of 'change' event on device /sys/class/block/dm-7
Aug 25 16:28:38 localhost program[1558]: Running... e2label /dev/mapper/rhel-var
Aug 25 16:28:38 localhost program[1558]: Return code: 0
Aug 25 16:28:38 localhost blivet[1558]: Ext4FS.notifyKernel: device: /dev/mapper/rhel-var ; type: ext4 ;
Aug 25 16:28:38 localhost blivet[1558]: notifying kernel of 'change' event on device /sys/class/block/dm-7
Aug 25 16:28:38 localhost program[1558]: Running... udevadm settle --timeout=300
****[udevadm settle takes 59 seconds to return]
Aug 25 16:29:07 localhost program[1558]: Return code: 0
Aug 25 16:29:07 localhost anaconda[1558]: Created ext4 on /dev/mapper/rhel-var
Aug 25 16:29:07 localhost program[1558]: Running... udevadm settle --timeout=300
Aug 25 16:29:07 localhost program[1558]: Return code: 0
Aug 25 16:29:07 localhost blivet[1558]: executing action: [8275] create device lvmlv rhel-var_log (id 8273)
Aug 25 16:29:07 localhost blivet[1558]: LVMLogicalVolumeDevice.create: rhel-var_log ; status: False ;
Aug 25 16:29:07 localhost blivet[1558]: LVMLogicalVolumeDevice.setupParents: kids: 0 ; name: rhel-var_log ; orig: False ;
Aug 25 16:29:07 localhost blivet[1558]: LVMVolumeGroupDevice.setup: rhel ; status: True ; controllable: True ; orig: False ;
Aug 25 16:29:07 localhost program[1558]: Running... lvm vgs --noheadings --nosuffix --nameprefixes --unquoted --units m -o uuid,size,free,extent_size,extent_count,free_count,pv_count rhel --config devices { preferred_names=["^/dev/mapper/", "^/dev/md/", "^/dev/sd"] } global {locking_type=4}
Aug 25 16:29:07 localhost program[1558]: LVM2_VG_UUID=wxiYcn-BwwC-wQ22-Jrh6-p1n4-b5z6-lpk3EI LVM2_VG_SIZE=844480.00 LVM2_VG_FREE=659704.00 LVM2_VG_EXTENT_SIZE=4.00 LVM2_VG_EXTENT_COUNT=211120 LVM2_VG_FREE_COUNT=164926 LVM2_PV_COUNT=1
Aug 25 16:29:07 localhost program[1558]: Return code: 0
Aug 25 16:29:07 localhost blivet[1558]: LVMLogicalVolumeDevice._create: rhel-var_log ; status: False ;
Aug 25 16:29:07 localhost program[1558]: Running... lvm lvcreate -L 3774m -n var_log -y rhel --config devices { preferred_names=["^/dev/mapper/", "^/dev/md/", "^/dev/sd"] }
****[lvm lv creation takes 14 seconds to kick off]
Aug 25 16:29:21 localhost program[1558]: Rounding up size to full physical extent <3.69 GiB
Aug 25 16:29:21 localhost program[1558]: Logical volume "var_log" created.
Aug 25 16:29:21 localhost program[1558]: Return code: 0
Aug 25 16:29:21 localhost blivet[1558]: LVMLogicalVolumeDevice.setup: rhel-var_log ; status: True ; controllable: True ; orig: False ;
Aug 25 16:29:21 localhost blivet[1558]: LVMLogicalVolumeDevice.updateSysfsPath: rhel-var_log ; status: True ;
Aug 25 16:29:21 localhost blivet[1558]: rhel-var_log sysfsPath set to /sys/devices/virtual/block/dm-8
Aug 25 16:29:21 localhost program[1558]: Running... udevadm settle --timeout=300
****[14 more seconds for settle]
Aug 25 16:29:35 localhost program[1558]: Return code: 0
Aug 25 16:29:35 localhost blivet[1558]: LVMLogicalVolumeDevice.readCurrentSize: path: /dev/mapper/rhel-var_log ; sysfsPath: /sys/devices/virtual/block/dm-8 ; exists: True ;
Aug 25 16:29:35 localhost blivet[1558]: updated rhel-var_log size to 3776 MiB (3776 MiB)
Aug 25 16:29:35 localhost program[1558]: Running... udevadm settle --timeout=300
Aug 25 16:29:35 localhost program[1558]: Return code: 0
Aug 25 16:29:35 localhost blivet[1558]: executing action: [8616] create format ext4 filesystem mounted at /var/log on lvmlv rhel-var_log (id 8273)
Aug 25 16:29:35 localhost anaconda[1558]: Creating ext4 on /dev/mapper/rhel-var_log
```

As you can see, there are very specific spots where we stop moving forward, and not due to the size of the filesystem, but rather seemingly due to udev or the raid sync. I will mention adding udev debug as part of troubleshooting going forward. Also, I will ask exactly what happens when the resync finishes in terms of application install performance, and will report back here.

Edit: the initial udev settle only took about 30 seconds; I did the math incorrectly. :)

---

Xiao,

Sorry for the wait. Reducing sync_speed_max fixed the issue. I did some tests in the lab, and it increased the speed of installation on my system by around 30%. The customer implemented the decrease through a %pre script in kickstart, and the change was drastic. Quote from the customer: "The overall OS installation time as I calculated was 16 minutes, starting from the PXE stage till the login prompt". That's one-half to one-third the time of previous installs, with the remainder seemingly all caused by the latency produced by the resync. Is this what you expect? I would expect some perf hit, but of this magnitude?

---

(In reply to John Pittman from comment #29)
> The customer may be willing to test with a more up-to-date SLES for
> comparison, but this is likely a rabbit hole.

I don't know about this. Yes, we don't need to compare RHEL 7 with SLES.

> This is on an installation that took 26 min 56 sec. The customer claimed
> very recently that one installation took upwards of an hour ...

I'm a little confused. It was RHEL 7.3 that took more than one hour to install the OS, right? It gets better with RHEL 7.4, which has the patches from https://bugzilla.redhat.com/show_bug.cgi?id=1455679#c16, so it should be less than one hour.

> ****[udevadm settle takes 59 seconds to return]

What is the device dm-7? Sorry, I'm not familiar with anaconda and blivet. If I installed a new system, where could I see this kind of log?

> ****[14 more seconds for settle]

Why does udev take those extra seconds to settle?

> As you can see, there are very specific spots where we stop moving forward,
> and not due to the size of the filesystem, but rather seemingly due to udev
> or the raid sync.

Hmm, why do you think it's due to the raid sync?

Thanks,
Xiao

---

(In reply to John Pittman from comment #31)
> Is this what you expect? I would expect some perf hit, but of this
> magnitude?

Yes, that's what I expect; turning the resync speed down lets other programs perform better. So is the problem fixed? Sorry, what is the meaning of "some perf hit, but of this magnitude"?

Thanks,
Xiao

---

Hi Xiao, yes: the issue was eliminated by decreasing sync_speed_max from a %pre script.
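As an illustration of that %pre approach, a sketch of the kickstart snippet (hypothetical, not the customer's actual script; the 45000 KiB/s cap matches the value the customer is reported to settle on in the next comment):

```sh
%pre
# Throttle the md resync while anaconda partitions and formats, so that
# installer I/O is not starved by the rebuild (illustrative value).
echo 45000 > /proc/sys/dev/raid/speed_limit_max
%end
```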
---

Xiao,

I ran performance tests on a non-root mdadm device during a sync, and did not see the performance impact that we see during installation. The system seemed to handle the I/O and the resync well; there was very little difference between 1x-32x threaded buffered write I/O during a resync versus without one. The resync speed never got above 80K, which is much lower than the max we saw during install; we were getting up to 180K during install (if I recall correctly). This was unexpected for me. I understand that there was less I/O during install, probably allowing the resync to reach a higher maximum speed, but I don't understand how the sync can obviously slow down the install yet not slow down I/O in the live environment. Is there some sort of priority given to a root-device sync that may get in the way?

Also, the customer set their sync_speed_max at 45K. This provided them with a good balance of install speed and sync speed. If the impact on installation is expected (after my tests, now, I have no idea why it would be), can we implement some sort of decrease in sync_speed_max only during install?

One more question: why does the resync speed sometimes flap up and down wildly, while other times it is fairly consistent? Is this due to the consistency of the I/O and CPU load?

---

(In reply to John Pittman from comment #36)
> Is there some sort of priority given to a root-device sync that may get in
> the way?

It's a general problem: resync/recovery can decrease the speed of I/O. In comment 32 I had some questions; it looks like we can tell from the anaconda log who takes the most time, and in that log it looks like udev takes most of it. Maybe that's the key thing we should look into.

> If the impact on installation is expected, can we implement some sort of
> decrease in sync_speed_max only during install?

That's not a good solution for this problem.

Thanks,
Xiao

---

(In reply to John Pittman from comment #37)
> One more question: why does the resync speed sometimes flap up and down
> wildly, while other times it is fairly consistent?

How much is the difference?

---

Created attachment 1323894 [details]: udev_debug.log
Attached is an example of something that takes too long; adding up many events like this, we end up with a very long install. /dev/md0 is only 512M, but the ext4 format takes 35 seconds, whereas after install the same format takes 0.520s (no resync is running in the post-install case). We even see that the mke2fs program does not officially start until over 20 seconds after the command is called:
```
Sep 08 12:30:06 localhost.localdomain program[1631]: Running... mke2fs -t ext4 /dev/md/0
....
Sep 08 12:30:28 localhost.localdomain program[1631]: mke2fs 1.42.9 (28-Dec-2013)
```
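A quick way to reproduce that comparison outside the installer, assuming a scratch array whose contents can be destroyed, is simply to time the same format during and after the resync:

```sh
cat /proc/mdstat                 # confirm whether the resync is still running
time mke2fs -t ext4 /dev/md0     # repeat after the resync completes and compare
```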
---

Lots of time spent in 'Validate module index' for comment 44.

---

What I see in attachment 1323894 [details]:
```
Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: inotify event: 8 for /dev/md0
Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: device /dev/md0 closed, synthesising 'change'
Sep 08 12:30:28 localhost.localdomain blivet[1631]: Ext4FS.notifyKernel: device: /dev/md/0 ; type: ext4 ;
Sep 08 12:30:28 localhost.localdomain blivet[1631]: notifying kernel of 'change' event on device /sys/class/block/md0
```
This means that for md0, two udev 'change' events are created for one action: one from systemd-udevd itself via the watch rules, and one from blivet.

Either blivet should turn off the udev watch rules, or it should leave the job of emitting the change events to udev. As it stands, the RAID rescans triggered from udev are doubled.
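For illustration only, the first option could be expressed as a udev override dropping the watch on md devices, so that blivet's explicit notification would be the only 'change' event. This is a hypothetical rule, not something blivet or the installer actually ships:

```
# /etc/udev/rules.d/90-md-nowatch.rules (hypothetical)
# 'nowatch' removes the inotify watch that synthesizes a 'change' uevent
# whenever the device is closed after being opened for writing.
ACTION!="remove", KERNEL=="md*", OPTIONS:="nowatch"
```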
Regarding "too much time spent in udevsettle":
If you have to wait a long time for udev to settle, then something fishy is going on in the udev rules and a helper application might take more time there than expected. Further debugging of the helper is needed.
If a udev event is processed, this is done via a worker (with a PID and a seq id). You can see how long this worker is in action via the debug log.
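For anyone reproducing this, the debug log in question can be captured on a running system roughly like so:

```sh
udevadm control --log-priority=debug   # make udevd log seq/worker details
# ...reproduce the slow operation...
journalctl -u systemd-udevd > udev_debug.log
udevadm control --log-priority=info    # restore the default verbosity
```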
```
Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: seq 5912 queued, 'change' 'block'
Sep 08 12:30:28 localhost.localdomain systemd-udevd[1341]: seq 5912 forked new worker [3641]
Sep 08 12:30:28 localhost.localdomain systemd-udevd[3641]: seq 5912 running
[…]
Sep 08 12:30:31 localhost.localdomain systemd-udevd[3641]: seq 5912 processed with 0
```
So, this particular "change" uevent for /dev/md0 took 3 seconds.
Seems like a lot of time was spent in blkid.
Seems like "e2label /dev/md/0" is running in parallel to udev running "blkid", which probably bites.
So, either there should be a udev settle after the mke2fs and before running e2label, or, as I said before, the udev watch rules could be turned off (with blivet emitting the change events anyway).
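Spelled out as plain commands, the first suggestion amounts to this ordering (a sketch; blivet would do the equivalent internally):

```sh
mke2fs -t ext4 /dev/md/0
udevadm settle --timeout=300   # drain the watch-triggered 'change' event and its blkid
e2label /dev/md/0              # no longer races udev's blkid on the same device
```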
You should contact the blivet/anaconda team about that.
---

So, what happened e.g. for sda from 16:34:45,907 to 16:34:59,484:

```
16:34:44 program[2943]: Running... udevadm settle --timeout=300
16:34:44 program[2943]: Return code: 0
16:34:44 blivet[2943]: executing action: [257] create device partition sda2 (id 255)
16:34:44 blivet[2943]: PartitionDevice.create: sda2 ; status: False ;
16:34:44 blivet[2943]: PartitionDevice.setupParents: kids: 1 ; name: sda2 ; orig: False ;
16:34:44 blivet[2943]: DiskDevice.setup: sda ; status: True ; controllable: True ; orig: False ;
16:34:44 blivet[2943]: DiskLabel.setup: device: /dev/sda ; status: False ; type: disklabel ;
16:34:44 blivet[2943]: PartitionDevice._create: sda2 ; status: False ;
16:34:44 blivet[2943]: PartitionDevice._wipe: sda2 ; status: False ;
16:34:44 program[2943]: Running... dd if=/dev/zero of=/dev/sda bs=512 seek=421529600 count=2048
16:34:45 program[2943]: 2048+0 records in
16:34:45 program[2943]: 2048+0 records out
16:34:45 program[2943]: 1048576 bytes (1.0 MB) copied, 1.21934 s, 860 kB/s
16:34:45 program[2943]: Return code: 0
16:34:45 program[2943]: Running... udevadm settle --timeout=300
16:34:59 program[2943]: Return code: 0
16:34:59 blivet[2943]: DiskLabel.commit: device: /dev/sda ; numparts: 2 ;
```

Note: these timestamps are taken from the journal.log, where timestamps are created upon processing the journal log, which does not guarantee any ordering or microsecond resolution.

My _guess_ is that while blivet is creating the partition, or while "dd" is running (why on earth dd???? on /dev/sda *sigh*, one more open/close which triggers a change uevent), the old uevent is running blkid. This blkid might get confused by an incomplete partition table.

---

(Just a note on the enormous trust and confidence placed in the settle here: udevadm settle can't even guarantee that the synthesized uevents have been issued at all at the time we call the settle. The uevent might not even be triggered yet, as it first needs the inotify event to be received by udevd, which then issues the synthetic uevent for this inotify event for the device's "close" operation... We can't rely on udevadm settle here.)

---

(In reply to Harald Hoyer from comment #69)
> My _guess_ is that while blivet is creating the partition, or while "dd" is
> running (why on earth dd???? on /dev/sda *sigh*) ...

That is writing zeros to the beginning of the disk region the partition will occupy, after we tell parted to commit the new partition to disk.

---

(In reply to Peter Rajnoha from comment #70)
> ... We can't rely on the udevadm settle here.)

What other option do we have?

Before we get carried away with minor optimizations, let's keep in mind that this process generally works fine. I just did a test install of 7.4 using qemu with two 40 GiB virtio disks, creating four raid1 arrays for /boot, /, swap, and lvm, all filesystems ext4. The storage configuration took a total of one minute. So the number of disks, the number of md arrays, and the filesystem type are all the same. What's different? The size of the partitions/arrays and the hardware. The problem is not related to the number of devices being managed, nor to the basic methods used by the installer, so let's move on from that theory, please.

---

(In reply to David Lehman from comment #72)
> > We can't rely on the udevadm settle here.)
> What other option do we have?

Unfortunately, none at the moment that I can think of quickly, with current resources and older kernels. What we need here is for "dd" (or any other tool which causes the WATCH udev rule to trigger) to issue the uevent itself and wait for it to get processed. Then, when "dd" (or any other tool) exits, we're sure that the synthetic uevent has been processed. I understand that it's not easy to change every existing tool that may have changed something to issue a uevent itself and wait for it, instead of relying on the WATCH udev rule.

What we can do instead, with recent kernels (version >= 4.13) that support the extended synthetic uevent interface (see also: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-uevent), is to create a wrapper that executes a command and, at its end, generates a synthetic uevent with an exact UUID and waits for that uevent via udev monitor by watching for the UUID. This way, we know the uevent has already been processed when the wrapper exits, and at the same time we don't rely on the WATCH udev rule, which is racy. (In the future, it should be the command itself that generates and waits for the uevent, but as an intermediary solution, we could create the wrapper.) Otherwise, I've no better idea how to resolve this.
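A rough sketch of the wrapper described above, assuming a >= 4.13 kernel and the "ACTION UUID" write syntax from the sysfs-uevent ABI document linked above, where the tag is visible to listeners as SYNTH_UUID. This is illustrative only, not a shipped tool, and the synchronization is deliberately crude:

```sh
#!/bin/sh
# uevent-wrap DEV CMD...: run CMD, then emit a UUID-tagged synthetic
# 'change' uevent for DEV and block until udev has processed it.
dev=$1; shift
uuid=$(uuidgen)

# Start listening before emitting, so the tagged event cannot be missed.
udevadm monitor --udev --property | grep -q "SYNTH_UUID=$uuid" &
listener=$!
sleep 1    # crude startup synchronization with the monitor

"$@"       # the command that opens/closes $dev (dd, mke2fs, ...)

# Kernel >= 4.13 accepts "ACTION UUID [KEY=VALUE ...]" written here.
echo "change $uuid" > "/sys/class/block/$dev/uevent"

wait "$listener"   # returns once udev has processed our tagged event
```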
---

RHEL 7.4 has been released as of kernel-3.10.0-693.el7.x86_64. Some of the test kernels given out for this BZ came after 7.4 was released; if you want changes in 7.4, you need to request a 7.4 zstream release. There have been numerous fixes that have gone into 7.5, but as for the fix for this BZ, we don't have a specific commit ID that fixes it. Other bugzillas that have reported resync issues have been resolved with the 7.5 kernel. On this issue, we still need a local reproducer. We do have the latest 7.5 kernel and userspace mdadm program available.

mdadm: http://download-node-02.eng.bos.redhat.com/brewroot/packages/mdadm/4.0/10.el7/
Kernel: http://people.redhat.com/ncroxon/rhel7/.bz1455679_nokia/kernel-3.10.0-830.el7bz1455679.x86_64.rpm

-Nigel

---

I think we can go ahead and close this. The case has been closed. I will close it as 'currentrelease', since the earlier update to the code helped the customer's performance a great deal. Any other issues should be remedied by decreasing the max sync speed.