Bug 121577 - mkinitrd phase of kernel install slow
Summary: mkinitrd phase of kernel install slow
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Pete Zaitcev
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2004-04-23 14:11 UTC by Steve Timm
Modified: 2007-11-30 22:07 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-19 19:27:19 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Steve Timm 2004-04-23 14:11:48 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031030

Description of problem:
Pete Zaitcev created a test kernel for me

The above kernel was created for me by Pete Zaitcev
as a fix to bugzilla number 118436.

I tried to install it with command 

rpm -i kernel-smp-2.4.21-12.usbserial3.1.EL.i686.rpm

The command appeared to hang, until you kill it at the command line
and even then there is an orphan process left around 
in status "D"

root      8718  8716  0 11:55 pts/0    00:01:44 /sbin/nash --force --quiet


I tried to install another kernel rpm with up2date, this time
the current errata kernel kernel-2.4.21-9.0.3.EL.i686.rpm
I get the same problem, another hung nash process.
So this has nothing to do with pete's kernel per se.

But it turns out, if you wait 3-4 hours, then the 
rpm -i does actually complete successfully, just very slowly.



Pete said in E-mail to me:

>This might have nothing to do with the USB. But maybe you ran the
>installation from the (known broken) USB serial console or something.
>
I didn't.

>Please file a bug, attach the output of "lsof -p 8718", and its >kernel
>stack trace. It is caught by doing this. First, you try to get rid of
>unnecessary processes, usually by shutting down extra services before
>reboot. Then, do:
> echo 1 > /proc/sys/kernel/sysrq
> echo t > /proc/sysrq-trigger
>After that, gather dmesg with "dmesg > /tmp/d.out" and, in case that
>wraps, "cp /var/log/messages /tmp/messages".
>
>Verify that an old, known good kernel is default in grub.conf and
>proceed with the reboot and attach the saved files from /tmp
>to the new bug. That would help to fix the installation problem.


lsof from the "nash" process while it was running was:
root@fnpca root]# lsof -p 8718
COMMAND  PID USER   FD   TYPE DEVICE  SIZE     NODE NAME
nash    8718 root  cwd    DIR    8,1  4096        2 /
nash    8718 root  rtd    DIR    8,1  4096        2 /
nash    8718 root  txt    REG    8,1 29972   304863 /sbin/nash
nash    8718 root    0r  FIFO    0,5       11276626 pipe
nash    8718 root    1w  FIFO    0,5       11276627 pipe
nash    8718 root    2u   CHR  136,0              2 /dev/pts/0
nash    8718 root    3r   REG    0,2     0     4114 /proc/partitions
nash    8718 root   19u   CHR  136,0              2 /dev/pts/0


The process actually exited before I could do the rest of 
the steps.  A reboot wouldn't be necessary now, except to 
pick up the new kernel.

In a follow-up E-mail from P. Zaitcev he says:


>Now that I think of it, perhaps USB was a factor. The only place
>which uses nash is when /sbin/new-kernel-pkg asks it to run
>"showlabels". This causes nash to read everything that is present
>in /proc/partitions. If a USB storage was present and something
>was wrong with it, this might take up too 100sec to clear.
>Also, running "showlabels" on NAS fabrics is probably not
>terribly safe. Maybe we should file a bug all the same against
>mkinitrd.

>-- Pete
So here are the contents of /proc/partitions:

[root@fnpca root]# cat /proc/partitions
major minor  #blocks  name     rio rmerge rsect ruse wio wmerge wsect
wuse running use aveq

   8     0   35548160 sda 63822 3708 523002 450540 3665885 508462
33455672 12418631 0 23294160 14046531
   8     1    5120608 sda1 60930 3630 516450 444820 3665120 508137
33446976 11921621 0 23291840 13542801
   8     2    2096262 sda2 54 0 168 410 0 0 0 0 0 70 410
   8     3    2048256 sda3 54 0 168 270 0 0 0 0 0 50 270
   8     4          1 sda4 7 0 14 0 0 0 0 0 0 0 0
   8     5    2048224 sda5 54 0 168 280 0 0 0 0 0 40 280
   8     6    2048224 sda6 54 0 168 250 0 0 0 0 0 50 250
   8     7   22178740 sda7 174 1 722 980 765 325 8696 497430 0 80050
498410
   8    16 1003606016 sdb 166496 3883 1359162 11703700 908010415
449770309 -2021397360 36562928 607 1971035 8817035
   8    17 1003605403 sdb1 166372 3883 1358914 11703600 908010415
449770309 -2021397360 37178248 607 1970965 6037445

So there is no usb storage in /proc/partitions.

This is a very very busy system, it is running a ganglia "gmetad"
process which is collecting information for over 700 nodes,
with its round robin database on the / partition,
so the loading of the system and constant I/O is a factor,
but it is still possible to do most other disk I/O at a
reasonable speed.

System is 8 processor IBM x440 running RHEL AS 3.0.with 
all security updates from up2date as of yesterday.














Version-Release number of selected component (if applicable):
kernel-2.4.21-9.0-3

How reproducible:
Always

Steps to Reproduce:
1.rpm -i (any kernel rpm)  or install a kernel rpm with up2date
2.
3.
    

Actual Results:  Command succeeded eventually but took approximately 4
hours
to return.  Most of time in "nash" subprocess.

Expected Results:  rpm install should have come back relatively fast,
a couple 
of minutes at most.

Additional info:

Comment 1 Pete Zaitcev 2004-04-23 15:26:55 UTC
The "all security updates" is a codeword for 2.4.21-9.0.3.EL here.
Unfortunately, to get USB working Steve has to run a
test kernel I branched off 2.4.21-12.EL.

I suggest looking at the output of iostat -x 5, it measures
the disk queue sizes.


Comment 2 Steve Timm 2004-04-23 18:18:40 UTC
What rpm is iostat in?  I don't seem to have that utility 
installed on my machine?

Also, there is still one more kernel rpm 
I have to install so if anyone wants me to reproduce this problem,
let me know and I can replicate the conditions again.



Comment 3 Pete Zaitcev 2004-04-23 19:06:06 UTC
[zaitcev@niphredil zaitcev]$ rpm -qf /usr/bin/iostat
sysstat-4.0.7-5


Comment 4 Steve Timm 2004-04-23 19:34:08 UTC
[root@fnpca root]# iostat -x 5
Linux 2.4.21-12.usbserial3.1.ELsmp (fnpca.fnal.gov)     04/23/2004

avg-cpu:  %user   %nice    %sys   %idle
           2.73    0.00    6.64   90.62

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
/dev/sda     1.02   1.07  0.53  0.95   11.15   16.10     5.58     8.05
   18.50     0.25   17.00   6.10   0.90
/dev/sda1    0.84   1.06  0.49  0.94   10.71   16.10     5.35     8.05
   18.63     0.25   17.38   6.22   0.90
/dev/sda2    0.00   0.00  0.00  0.00    0.01    0.00     0.01     0.00
   12.92     0.00    3.08   3.08   0.00
/dev/sda3    0.00   0.00  0.00  0.00    0.01    0.00     0.01     0.00
   12.92     0.00    3.85   3.85   0.00
/dev/sda4    0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
    2.00     0.00    1.43   1.43   0.00
/dev/sda5    0.00   0.00  0.00  0.00    0.01    0.00     0.01     0.00
   12.92     0.00    2.31   2.31   0.00
/dev/sda6    0.00   0.00  0.00  0.00    0.01    0.00     0.01     0.00
   12.92     0.00    3.08   3.08   0.00
/dev/sda7    0.01   0.00  0.00  0.00    0.02    0.00     0.01     0.00
   12.08     0.00    4.23   3.46   0.00
/dev/sdb     1.88 192.87  2.06 1903.24   31.17 16769.29    15.59 
8384.65     8.82    26.69    1.40   0.62 117.47
/dev/sdb1    1.87 192.87  2.05 1903.24   31.16 16769.29    15.58 
8384.65     8.82    26.68    1.40   0.62 117.47


As you see, write queue to /dev/sdb (where lives our ganglia repository
among other things, is very heavy.)  /dev/sdb is IBM Serveraid,
16 spindles 2 channels arranged in hardware raid 50.


Comment 5 Jeremy Katz 2004-05-06 22:00:09 UTC
nash is just trying to read labels off of partitions so that it can do
mounting by label.

Comment 6 Steve Timm 2004-06-23 17:18:33 UTC
I am seeing this same behavior happen again in the process of 
running up2date.  The up2date is picking up, among others, 
kernel-2.4.21-15.0.2.EL and kernel-smp-2.4.21-15.0.2.EL

This appears to be the second of two "nash" processes that 
are running.  I began up2date at 16:30 last night. 
It appears that the first nash finished at about 04:04 this 
morning (from the first kernel rpm) and the second one
started then and is still ongoing at 12:18

What is taking nash 8-10 hours to search my disk?
It should not take 8-10 hours to just read the label.




Comment 7 RHEL Program Management 2007-10-19 19:27:19 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet that criteria, it is now being closed.
 
For more information of the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.


Note You need to log in before you can comment on or make changes to this bug.