Description of problem:
I have a small Torque cluster (8 nodes) using a shared pool of disks for application storage. Each node also has a private disk for the OS, swap, etc. The shared pool consists of 15 Fibre Channel disks, each with 11 partitions (three primary, eight extended). LVM (lvm2-cluster) is used to stripe the partitions to present 11 LVs. Each node in the cluster has at least one LV marked as active, which it then mounts via fstab. However, each node has /dev entries for all of the physical devices and all of their partitions, so in /dev (and in the /sys tree) well over 100 partitions are visible to each node in this system.

When upgrading the kernel recently, two programs run by mkinitrd apparently hang: nash and grubby. After a recent upgrade of nash, it no longer hangs; grubby still does. By "hangs" I mean it runs for days, consuming nearly 100% of a CPU core. As a result, kernel upgrades never make the /etc/grub.conf changes, and other upgrades, such as with yum, do not complete.

I have performed an strace on the grubby process, and I can see that grubby is walking through the partitions. The sequence for each partition:

- a getdents, followed by a close
- an open of /sys/block/[the physical]/[the partition]/dev, a read returning "8:197\n", another read, a close
- an access to the partition via /dev/[partition]
- an open of /proc/devices, a read returning "Character devices:\n  1 mem\n  4 /d", another read, a close
- an open of /proc/misc, a read returning "229 fuse\n 57 dlm_plock\n 58 dlm...", another read, a close
- an open of /sys/block/[the physical]/[the partition]/slaves -- result is -1, no such file
- an open of /sys/block/[the physical]/[the partition]
- an fcntl64(F_GETFD), then an fcntl64(F_SETFD, FD_CLOEXEC)
- a getdents
- an open of /sys/block/[the physical]/[the partition]/uevent/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/dev/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/subsystem/dev -- result is -1, no such file or directory
- an open of /sys/block/[the physical]/[the partition]/start/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/size/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/stat/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/power/dev -- result is -1, no such file or directory
- an open of /sys/block/[the physical]/[the partition]/holders/dev -- result is -1, no such file or directory
- a getdents, then a close

Every partition scan I have seen looks the same, but each partition takes longer to run this sequence than the one before. The first few complete in a few seconds; by the time it reaches /dev/sdn, each takes several minutes. Eventually, it stops making apparent forward progress.

If I remove lvm2-cluster and reboot, I can get the system to come up with only the local /dev/sda (and /dev/sda1, /dev/sda2). Then I can install the kernel without a hitch: the whole rpm/mkinitrd/nash/grubby chain completes very quickly and makes the appropriate /boot/grub/grub.conf entries. However, if I re-install lvm2-cluster, reboot, and see the 100+ partitions in /dev again, I can no longer do a kernel upgrade.
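The per-partition probe sequence above can be sketched in a few lines. The following is a minimal Python model of what the strace suggests is happening (reading major:minor from the partition's sysfs dev file, then checking for a slaves entry); it is an illustrative sketch, not grubby's actual code, and it runs against a fake sysfs tree so the disk and partition names (sdn/sdn5) are hypothetical:

```python
import os
import tempfile

def probe_partition(sys_block, disk, part):
    """Roughly mimic the sysfs probe seen in the strace: read the
    major:minor pair from <partition>/dev, then check whether a
    'slaves' directory exists. A sketch, not grubby's implementation."""
    base = os.path.join(sys_block, disk, part)
    with open(os.path.join(base, "dev")) as f:
        major, minor = f.read().strip().split(":")
    has_slaves = os.path.isdir(os.path.join(base, "slaves"))
    return int(major), int(minor), has_slaves

# Build a minimal fake sysfs tree so the sketch runs anywhere.
root = tempfile.mkdtemp()
part_dir = os.path.join(root, "sdn", "sdn5")  # hypothetical device names
os.makedirs(part_dir)
with open(os.path.join(part_dir, "dev"), "w") as f:
    f.write("8:197\n")  # major:minor, matching the read seen in the strace

print(probe_partition(root, "sdn", "sdn5"))  # (8, 197, False)
```

Note that each probe in the trace is cheap and identical, yet the per-partition time grows steadily; that pattern suggests the slowdown comes from work grubby accumulates between scans rather than from the sysfs reads themselves.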
A grubby command line which exhibits the problem was (title arguments quoted here so the line is shell-safe):

/sbin/grubby --add-kernel=/boot/vmlinuz-2.6.29.4-167.fc11.i686.PAE --initrd /boot/initrd-2.6.29.4-167.fc11.i686.PAE.img --copy-default --make-default --title "Fedora (2.6.29.4-167.fc11.i686.PAE)" --args="root=/dev/VolGroup00/LogVol00" --remove-kernel="TITLE=Fedora (2.6.29.4-167.fc11.i686.PAE)"

Version-Release number of selected component (if applicable):
grubby-6.0.86-2.fc11.i586

How reproducible:
Every time, on all eight nodes

Steps to Reproduce:
1. yum -y upgrade kernel

Actual results:
grubby runs for days, consuming 100% of a CPU core

Expected results:
grubby should run quickly and modify /boot/grub/grub.conf as appropriate

Additional info:
Available on request
Created attachment 348924 [details]
strace output of grubby

This is an strace of the grubby run. It did eventually complete.
I have attached an strace of a grubby run that finally completed after several days.
This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.