From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031030 Description of problem: Pete Zaitcev created a test kernel for me The above kernel was created for me by Pete Zaitcev as a fix to bugzilla number 118436. I tried to install it with command rpm -i kernel-smp-2.4.21-12.usbserial3.1.EL.i686.rpm The command appeared to hang, until you kill it at the command line and even then there is an orphan process left around in status "D" root 8718 8716 0 11:55 pts/0 00:01:44 /sbin/nash --force --quiet I tried to install another kernel rpm with up2date, this time the current errata kernel kernel-2.4.21-9.0.3.EL.i686.rpm I get the same problem, another hung nash process. So this has nothing to do with pete's kernel per se. But it turns out, if you wait 3-4 hours, then the rpm -i does actually complete successfully, just very slowly. Pete said in E-mail to me: >This might have nothing to do with the USB. But maybe you ran the >installation from the (known broken) USB serial console or something. > I didn't. >Please file a bug, attach the output of "lsof -p 8718", and its >kernel >stack trace. It is caught by doing this. First, you try to get rid of >unnecessary processes, usually by shutting down extra services before >reboot. Then, do: > echo 1 > /proc/sys/kernel/sysrq > echo t > /proc/sysrq-trigger >After that, gather dmesg with "dmesg > /tmp/d.out" and, in case that >wraps, "cp /var/log/messages /tmp/messages". > >Verify that an old, known good kernel is default in grub.conf and >proceed with the reboot and attach the saved files from /tmp >to the new bug. That would help to fix the installation problem. lsof from the "nash" process while it was running was: root@fnpca root]# lsof -p 8718 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME nash 8718 root cwd DIR 8,1 4096 2 / nash 8718 root rtd DIR 8,1 4096 2 / nash 8718 root txt REG 8,1 29972 304863 /sbin/nash nash 8718 root 0r FIFO 0,5 11276626 pipe nash 8718 root 1w FIFO 0,5 11276627 pipe nash 8718 root 2u CHR 136,0 2 /dev/pts/0 nash 8718 root 3r REG 0,2 0 4114 /proc/partitions nash 8718 root 19u CHR 136,0 2 /dev/pts/0 The process actually exited before I could do the rest of the steps. A reboot wouldn't be necessary now, except to pick up the new kernel. In a follow-up E-mail from P. Zaitcev he says: >Now that I think of it, perhaps USB was a factor. The only place >which uses nash is when /sbin/new-kernel-pkg asks it to run >"showlabels". This causes nash to read everything that is present >in /proc/partitions. If a USB storage was present and something >was wrong with it, this might take up too 100sec to clear. >Also, running "showlabels" on NAS fabrics is probably not >terribly safe. Maybe we should file a bug all the same against >mkinitrd. >-- Pete So here are the contents of /proc/partitions: [root@fnpca root]# cat /proc/partitions major minor #blocks name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq 8 0 35548160 sda 63822 3708 523002 450540 3665885 508462 33455672 12418631 0 23294160 14046531 8 1 5120608 sda1 60930 3630 516450 444820 3665120 508137 33446976 11921621 0 23291840 13542801 8 2 2096262 sda2 54 0 168 410 0 0 0 0 0 70 410 8 3 2048256 sda3 54 0 168 270 0 0 0 0 0 50 270 8 4 1 sda4 7 0 14 0 0 0 0 0 0 0 0 8 5 2048224 sda5 54 0 168 280 0 0 0 0 0 40 280 8 6 2048224 sda6 54 0 168 250 0 0 0 0 0 50 250 8 7 22178740 sda7 174 1 722 980 765 325 8696 497430 0 80050 498410 8 16 1003606016 sdb 166496 3883 1359162 11703700 908010415 449770309 -2021397360 36562928 607 1971035 8817035 8 17 1003605403 sdb1 166372 3883 1358914 11703600 908010415 449770309 -2021397360 37178248 607 1970965 6037445 So there is no usb storage in /proc/partitions. This is a very very busy system, it is running a ganglia "gmetad" process which is collecting information for over 700 nodes, with its round robin database on the / partition, so the loading of the system and constant I/O is a factor, but it is still possible to do most other disk I/O at a reasonable speed. System is 8 processor IBM x440 running RHEL AS 3.0.with all security updates from up2date as of yesterday. Version-Release number of selected component (if applicable): kernel-2.4.21-9.0-3 How reproducible: Always Steps to Reproduce: 1.rpm -i (any kernel rpm) or install a kernel rpm with up2date 2. 3. Actual Results: Command succeeded eventually but took approximately 4 hours to return. Most of time in "nash" subprocess. Expected Results: rpm install should have come back relatively fast, a couple of minutes at most. Additional info:
The "all security updates" is a codeword for 2.4.21-9.0.3.EL here. Unfortunately, to get USB working Steve has to run a test kernel I branched off 2.4.21-12.EL. I suggest looking at the output of iostat -x 5, it measures the disk queue sizes.
What rpm is iostat in? I don't seem to have that utility installed on my machine? Also, there is still one more kernel rpm I have to install so if anyone wants me to reproduce this problem, let me know and I can replicate the conditions again.
[zaitcev@niphredil zaitcev]$ rpm -qf /usr/bin/iostat sysstat-4.0.7-5
[root@fnpca root]# iostat -x 5 Linux 2.4.21-12.usbserial3.1.ELsmp (fnpca.fnal.gov) 04/23/2004 avg-cpu: %user %nice %sys %idle 2.73 0.00 6.64 90.62 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util /dev/sda 1.02 1.07 0.53 0.95 11.15 16.10 5.58 8.05 18.50 0.25 17.00 6.10 0.90 /dev/sda1 0.84 1.06 0.49 0.94 10.71 16.10 5.35 8.05 18.63 0.25 17.38 6.22 0.90 /dev/sda2 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 12.92 0.00 3.08 3.08 0.00 /dev/sda3 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 12.92 0.00 3.85 3.85 0.00 /dev/sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 1.43 1.43 0.00 /dev/sda5 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 12.92 0.00 2.31 2.31 0.00 /dev/sda6 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 12.92 0.00 3.08 3.08 0.00 /dev/sda7 0.01 0.00 0.00 0.00 0.02 0.00 0.01 0.00 12.08 0.00 4.23 3.46 0.00 /dev/sdb 1.88 192.87 2.06 1903.24 31.17 16769.29 15.59 8384.65 8.82 26.69 1.40 0.62 117.47 /dev/sdb1 1.87 192.87 2.05 1903.24 31.16 16769.29 15.58 8384.65 8.82 26.68 1.40 0.62 117.47 As you see, write queue to /dev/sdb (where lives our ganglia repository among other things, is very heavy.) /dev/sdb is IBM Serveraid, 16 spindles 2 channels arranged in hardware raid 50.
nash is just trying to read labels off of partitions so that it can do mounting by label.
I am seeing this same behavior happen again in the process of running up2date. The up2date is picking up, among others, kernel-2.4.21-15.0.2.EL and kernel-smp-2.4.21-15.0.2.EL This appears to be the second of two "nash" processes that are running. I began up2date at 16:30 last night. It appears that the first nash finished at about 04:04 this morning (from the first kernel rpm) and the second one started then and is still ongoing at 12:18 What is taking nash 8-10 hours to search my disk? It should not take 8-10 hours to just read the label.
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet that criteria, it is now being closed. For more information of the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.