Description of problem:
I have a small Torque cluster (8 nodes) using a shared pool of disks for application storage. Each node also has a private disk for the OS, swap, etc. The shared pool consists of 15 Fibre Channel disks, each with 11 partitions (three primary, eight extended). LVM (lvm2-cluster) is used to stripe the partitions to present 11 LVs. Each node in the cluster has at least one LV marked as active, which it then mounts via fstab. However, each node has /dev entries for all of the physical devices and all of their partitions, so in /dev (and in the /sys tree) well over 100 partitions are visible to each node in this system.

When upgrading the kernel recently, two programs run by mkinitrd apparently hang: nash and grubby. After a recent upgrade of nash, it no longer hangs; grubby still does. By "hangs" I mean it runs for days, consuming nearly 100% of a CPU core. As a result, kernel upgrades never make the /etc/grub.conf changes, and other upgrades, such as with yum, do not complete.

I have performed an strace on the grubby process, and I can see that grubby is walking through the partitions. The sequence for each partition:

- a getdents, followed by a close
- an open of /sys/block/[the physical]/[the partition]/dev, a read returning "8:197\n", another read, a close
- an access to the partition via /dev/[partition]
- an open of /proc/devices, a read returning "Character devices:\n  1 mem\n  4 /d", another read, a close
- an open of /proc/misc, a read returning "229 fuse\n 57 dlm_plock\n 58 dlm...", another read, a close
- an open of /sys/block/[the physical]/[the partition]/slaves -- result is -1, no such file
- an open of /sys/block/[the physical]/[the partition]
- an fcntl64(F_GETFD), then an fcntl64(F_SETFD, FD_CLOEXEC)
- a getdents
- an open of /sys/block/[the physical]/[the partition]/uevent/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/dev/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/subsystem/dev -- result is -1, no such file or directory
- an open of /sys/block/[the physical]/[the partition]/start/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/size/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/stat/dev -- result is -1, not a directory
- an open of /sys/block/[the physical]/[the partition]/power/dev -- result is -1, no such file or directory
- an open of /sys/block/[the physical]/[the partition]/holders/dev -- result is -1, no such file or directory
- a getdents, then a close

Every partition scan I have seen looks the same, but each partition takes longer to run this sequence than the one before. The first few complete in a few seconds; by the time it reaches /dev/sdn, each takes several minutes. Eventually, it stops making apparent forward progress.

If I remove lvm2-cluster and reboot, I can get the system to come up with only the local /dev/sda (and /dev/sda1, /dev/sda2). Then I can install the kernel without a hitch: the whole rpm/mkinitrd/nash/grubby chain completes very quickly and makes the appropriate /boot/grub/grub.conf entries. However, if I re-install lvm2-cluster, reboot, and see the 100+ partitions in /dev again, I can no longer do a kernel upgrade.
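The per-partition probe sequence above can be sketched in a few lines. The following is a minimal Python model of what the strace suggests is happening (reading major:minor from the partition's sysfs dev file, then checking for a slaves entry); it is an illustrative sketch, not grubby's actual code, and it runs against a fake sysfs tree so the disk and partition names (sdn/sdn5) are hypothetical:

```python
import os
import tempfile

def probe_partition(sys_block, disk, part):
    """Roughly mimic the sysfs probe seen in the strace: read the
    major:minor pair from <partition>/dev, then check whether a
    'slaves' directory exists. A sketch, not grubby's implementation."""
    base = os.path.join(sys_block, disk, part)
    with open(os.path.join(base, "dev")) as f:
        major, minor = f.read().strip().split(":")
    has_slaves = os.path.isdir(os.path.join(base, "slaves"))
    return int(major), int(minor), has_slaves

# Build a minimal fake sysfs tree so the sketch runs anywhere.
root = tempfile.mkdtemp()
part_dir = os.path.join(root, "sdn", "sdn5")  # hypothetical device names
os.makedirs(part_dir)
with open(os.path.join(part_dir, "dev"), "w") as f:
    f.write("8:197\n")  # major:minor, matching the read seen in the strace

print(probe_partition(root, "sdn", "sdn5"))  # (8, 197, False)
```

Note that each probe in the trace is cheap and identical, yet the per-partition time grows steadily; that pattern suggests the slowdown comes from work grubby accumulates between scans rather than from the sysfs reads themselves.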
A grubby command line which exhibits the problem was (title arguments quoted here so the line is shell-safe):

/sbin/grubby --add-kernel=/boot/vmlinuz-2.6.29.4-167.fc11.i686.PAE --initrd /boot/initrd-2.6.29.4-167.fc11.i686.PAE.img --copy-default --make-default --title "Fedora (2.6.29.4-167.fc11.i686.PAE)" --args="root=/dev/VolGroup00/LogVol00" --remove-kernel="TITLE=Fedora (2.6.29.4-167.fc11.i686.PAE)"

Version-Release number of selected component (if applicable):
grubby-6.0.86-2.fc11.i586

How reproducible:
Every time, on all eight nodes

Steps to Reproduce:
1. yum -y upgrade kernel

Actual results:
grubby runs for days, consuming 100% of a CPU core

Expected results:
grubby should run quickly and modify /boot/grub/grub.conf as appropriate

Additional info:
Available on request
Created attachment 348924 [details]
strace output of grubby

This is an strace of the grubby run. It did eventually complete.
I have attached an strace of a grubby run that finally completed after several days.
This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.