Description of problem: When running upgrades to U2 of RHEL4, useradd (which is called from the %post section of the RPM) occasionally hangs on some systems with 100% CPU utilisation. The program is called from one of those rpm-tmp.XXXXX scripts in /var/tmp. When the script is killed, rpm reports that the %post scriptlet failed, but useradd doesn't die. Attempts to kill useradd with -9 and/or -15 are unsuccessful; it completely takes over one of the CPUs. BTW, the user that useradd is supposed to add already exists.
Version-Release number of selected component (if applicable): shadow-utils-4.0.3-52.RHEL4
How reproducible: Sometimes.
Steps to Reproduce: 1. Run upgrade of various RPMs from RHEL4-U2.
Actual results: useradd hangs.
Expected results: useradd should not hang.
Additional info: Happened on various hardware: HP xw9300 workstations (SMP), Sun V20z (SMP).
We just got bitten by this when trying to upgrade RHEL, also on an x86_64 machine. It hung when trying to update the nscd rpm:
    # ps aux | grep nscd
    root 3398 99.9 0.0 55076 1396 ? RN 21:28 87:33 /usr/sbin/useradd -M -o -r -d / -s /sbin/nologin -c NSCD Daemon -u 28 nscd
Interesting to note that I can't kill it with kill -9 or even pause it with kill -STOP. Especially interesting that /proc/3398/status shows:
    SigPnd: 0000000000040100
so it has pending signals for SIGKILL and SIGSTOP. My unix internals book says those signals can't be ignored... ya learn somethin' new every day.
strace and ltrace hang and can't be ^C'd out of (kill -9 from another terminal stops them). They show no output, so I have no debugging info for you.
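The reading of the SigPnd value quoted above can be checked by decoding the hex mask: in /proc/<pid>/status, bit N-1 of a signal mask corresponds to signal N. The following is only a sketch of that decoding; decode_sigmask is a hypothetical helper name, and the mapping of 9 and 19 to SIGKILL and SIGSTOP assumes x86/x86_64 Linux signal numbering.

```shell
# Decode a hex signal mask from /proc/<pid>/status into signal numbers.
# Bit N-1 set in the mask means signal N is in the set.
# (decode_sigmask is a hypothetical helper, not an existing tool.)
decode_sigmask() {
    mask=$((0x$1))
    sig=1
    out=""
    while [ "$sig" -le 64 ]; do
        if [ $(( (mask >> (sig - 1)) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$sig"
        fi
        sig=$((sig + 1))
    done
    printf '%s\n' "$out"
}

decode_sigmask 0000000000040100   # prints: 9 19
```

The same helper can be pointed at the ShdPnd, SigBlk, or SigIgn fields of a stuck process to see which other signals are queued, blocked, or ignored.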
I'm actually starting to think this must be a kernel bug (no process should be able to ignore a SIGKILL, right?). Here's something that might mean something to a developer, though:
    /proc/3398# cat status
    Name: useradd
    State: R (running)
    SleepAVG: 97%
    Tgid: 3398
    Pid: 3398
    PPid: 1
    TracerPid: 0
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    FDSize: 512
    Groups:
    VmSize: 55076 kB
    VmLck: 0 kB
    VmRSS: 1396 kB
    VmData: 924 kB
    VmStk: 36 kB
    VmExe: 55 kB
    VmLib: 1461 kB
    StaBrk: 0051d000 kB
    Brk: 0053e000 kB
    StaStk: 7fbffffcc0 kB
    Threads: 1
    SigPnd: 0000000000040100
    ShdPnd: 0000000000044100
    SigBlk: 0000000000000000
    SigIgn: 0000000021001000
    SigCgt: 0000000000000000
    CapInh: 0000000000000000
    CapPrm: 00000000fffffeff
    CapEff: 00000000fffffeff
Can we get this bug switched to high priority? This is gonna break a lot of people who scheduled weekend downtime for the U2 upgrade....
I can confirm the strace bit as well. Nothing was coming out on stderr/stdout and it couldn't be Ctrl-C-ed. I think one of the RPMs that caused me trouble was nscd as well, but I'm pretty sure there were a few others too. The funny bit is that it is not 100% repeatable (I did about 12 machines and only 2 gave me trouble). None of the i386 machines (UP/SMP) I tried (some 10 or so) experienced any problems.
I should have mentioned that we were also on an SMP machine (two dual-core x86_64 processors, for a total of 4 seen by the OS).
Same here, the useradd from nscd %post got hung eating all CPU, on a dual-CPU x86_64.
    cat /proc/1401/status
    Name: useradd
    State: R (running)
    SleepAVG: 68%
    Tgid: 1401
    Pid: 1401
    PPid: 1
    TracerPid: 0
    Uid: 0 0 0 0
    Gid: 102 102 102 102
    FDSize: 256
    Groups: 102 500
    VmSize: 7672 kB
    VmLck: 0 kB
    VmRSS: 1308 kB
    VmData: 924 kB
    VmStk: 32 kB
    VmExe: 55 kB
    VmLib: 1461 kB
    StaBrk: 0051d000 kB
    Brk: 0053e000 kB
    StaStk: 7fbffffd60 kB
    Threads: 1
    SigPnd: 0000000000040100
    ShdPnd: 0000000000040102
    SigBlk: 0000000000000000
    SigIgn: 0000000001001000
    SigCgt: 0000000000000000
    CapInh: 0000000000000000
    CapPrm: 00000000fffffeff
    CapEff: 00000000fffffeff
Just to add that this happened to me too, but on all 7 x86_64 (all SMP) machines I have RHEL4 installed on, and none of the many more i386 ones. If it's a kernel bug, it seems x86_64 specific at least. I reported it on the mailing-list before Panu pointed me to this bug. https://www.redhat.com/archives/nahant-list/2005-October/msg00103.html Matthias
Adding lsof output and observations from Matthias' post to nahant-list:
Here's the lsof output (edited to fit):
    cwd DIR 8,2 4096 2 /
    rtd DIR 8,2 4096 2 /
    txt REG 8,2 66536 11737046 /usr/sbin/useradd
    mem REG 8,2 103176 23789571 /lib64/ld-2.3.4.so
    mem REG 8,2 217016 23265695 /var/db/nscd/group
    mem REG 8,2 217016 23265694 /var/db/nscd/passwd
    mem REG 8,2 27270 23789583 /lib64/libcrypt-2.3.4.so
    mem REG 8,2 97896 23793686 /lib64/libaudit.so.0.0.0
    mem REG 8,2 1484017 23789613 /lib64/tls/libc-2.3.4.so
    mem REG 8,2 62504 23789792 /lib64/libselinux.so.1
    0r FIFO 0,7 2059676 pipe
    1w CHR 1,3 1518 /dev/null
    2w CHR 1,3 1518 /dev/null
    3u sock 0,4 2059679 can't identify protocol
    4r REG 8,2 96 4292677 /etc/default/useradd
I can't help but notice that libaudit is used, and that it was updated in the same transaction.
Matthias
Same for me on several x86_64 servers - all except the one I updated manually with rpm -Fvh, several packages at a time. The problem was made worse by an inability to "shutdown -r". The serial console (Dell DRAC4) also hung except for sysrq, but I was still unable to boot with sysrq; recovery required a hard-reset. John
Here is the yum stderr output of a stuck system, after killing the nscd rpm-tmp.XXXXXX scriptlet stuck on useradd:
    warning: /etc/nsswitch.conf created as /etc/nsswitch.conf.rpmnew
    Stopping sshd:[ OK ]
    Starting sshd:[ OK ]
    warning: /etc/ld.so.conf created as /etc/ld.so.conf.rpmnew
    warning: /etc/nsswitch.conf created as /etc/nsswitch.conf.rpmnew
    warning: /etc/issue saved as /etc/issue.rpmsave
    warning: /etc/issue.net saved as /etc/issue.net.rpmsave
    telinit: timeout opening/writing control channel /dev/initctl
    error: %pre(nscd-2.3.4-2.13.x86_64) scriptlet failed, exit status 0
    error: install: %pre scriptlet failed (2), skipping nscd-2.3.4-2.13
The telinit timeout seems like a suspicious bit, no? But maybe more of a consequence than a cause...
Matthias
Some additional information from what looks like a potentially related bug (bug 170272):
After installation of packages started we see the following:
    Installing...
    1:udev ########################################### [100%]
    2:mkinitrd ########################################### [100%]
    3:initscripts ########################################### [100%]
    4:openssh ########################################### [100%]
    5:glibc-kernheaders ########################################### [100%]
    6:glibc-headers ########################################### [100%]
    7:glibc-devel ########################################### [100%]
    8:gcc ########################################### [100%]
    9:vixie-cron ########################################### [100%]
    10:gcc-c++ ########################################### [100%]
    11:gcc-g77 ########################################### [100%]
    12:openssh-server ########################################### [100%]
    13:system-config-printer ########################################### [100%]
    14:xinetd ########################################### [100%]
    15:sudo ########################################### [100%]
    16:slocate ########################################### [100%]
    17:pdksh ########################################### [100%]
and nothing else! Running top in a second window shows:
    top - 09:33:04 up 21:52, 2 users, load average: 4.02, 3.52, 2.27
    Tasks: 90 total, 4 running, 86 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.0% us, 25.0% sy, 0.0% ni, 74.9% id, 0.1% wa, 0.0% hi, 0.0% si
    Mem: 15966556k total, 1479704k used, 14486852k free, 116676k buffers
    Swap: 4080456k total, 0k used, 4080456k free, 1163628k cached
      PID USER PR NI  VIRT  RES SHR S %CPU %MEM    TIME+ COMMAND
    13837 root 18  0 54416 1224 588 R 99.9  0.0 17:07.40 useradd
    14763 root 16  0  5284  908 688 R  0.3  0.0  0:00.24 top
Someone in another bug stopped ypbind to solve the problem. Trying that was too late; I am also unable to "kill" or "kill -9" the useradd process. Maybe this is the reason why things went wrong.
Let me tell you, it's a 4-CPU Sun V40z. Please excuse my German accent :-/
Comment #6 From Andreas Bock (info.redhat.de) on 2005-10-11 04:24 EST
On a 2nd identical machine I stopped ypbind before using "up2date -u":
    ...
    16:sudo ########################################### [100%]
    17:slocate ########################################### [100%]
    18:pdksh ########################################### [100%]
    19:nscd ########################################### [100%]
    20:nfs-utils ########################################### [100%]
    ...
and the process ended successfully. Maybe the postinstall script of pdksh or nscd tries to add a new user. Some time ago I was able to automatically update my systems every night. I think it wouldn't be a good idea to stop this and update them manually. Or maybe I should stop ypbind every time before running up2date.
FWIW, on a dev machine where the problem occurred (and where I had let useradd run until now), I also had prelink completely stuck (impossible to kill too), as well as a nash process which seemed forked from the kernel-smp scriptlet since mkinitrd was running (but could be killed). Pretty bad since in the end I had useradd, prelink and nash impossible to kill (even with SIGKILL), but on this machine a clean "shutdown -r now" worked. Matthias
Due to a design flaw in up2date (it installs all new RPMs, then removes all old RPMs, rather than doing the install/removal for each RPM), the up2date crashes are leaving people with many duplicated RPMs on their systems. Here are the commands I used to clean up our system:
    # try to remove everything, but save that list of problems to /tmp/dupes
    for file in `rpm -qa --queryformat="%{NAME} %{ARCH}\n" | sort | uniq -c | grep -v " 1 " | cut -c 9- | cut -d" " -f1`; do rpm -q --last $file | head -1 | cut -d" " -f1; done | grep -v gpg-pubkey | xargs rpm -e --justdb --nodeps 2> /tmp/dupes
    # explicitly remove the i386 and x86_64 versions of the problem packages
    for rpm in `cut -d\" -f2 /tmp/dupes`; do rpm -e --justdb --nodeps ${rpm}.i386 ${rpm}.x86_64; done
    # go back through and fix all the other packages
    for file in `rpm -qa --queryformat="%{NAME} %{ARCH}\n" | sort | uniq -c | grep -v " 1 " | cut -c 9- | cut -d" " -f1`; do rpm -q --last $file | head -1 | cut -d" " -f1; done | grep -v gpg-pubkey | xargs rpm -e --justdb --nodeps
    # re-sync with Red Hat Network
    up2date -p
    # try the upgrade again and hope for the best
    up2date
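Before removing anything from the RPM database, it may help to look at which name/arch pairs are actually duplicated. The following is a sketch of just that detection step, not a replacement for the commands above; list_duplicates is a hypothetical helper name, it assumes its input is one "name arch" pair per line (as produced by the --queryformat string used above), and it uses uniq -d rather than the uniq -c / grep -v combination.

```shell
# Print every "name arch" line that occurs more than once on stdin.
# gpg-pubkey entries are skipped, as in the cleanup commands above.
# (list_duplicates is a hypothetical helper name.)
list_duplicates() {
    grep -v '^gpg-pubkey ' | sort | uniq -d
}
```

On an affected system it would be fed from the package database, for example: rpm -qa --queryformat '%{NAME} %{ARCH}\n' | list_duplicates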
(In reply to comment #1)
> We just got bitten by this when trying to upgrade RHEL, also on an x86_64
> machine. It hung when trying to update the nscd rpm:
Same exact problem on FC4 x86_64 when trying to up2date openssh.
Created attachment 119938: Patch avoiding audit calls on old kernels
I think this patch will solve the problem. If the kernel is 2.6.9-11 or lower, useradd should not open the audit socket.
That patch looks highly RHEL4-specific. It should probably be generalized, since apparently this issue affects FC4 also as per Comment #13. Or is that a different bug?
The problem AFAICT is a kernel bug that affects early versions of the 2.6 kernel. I am working around the problem in RHEL4 because the number of kernel releases is low and the U2 kernel should work fine. The solution for FC4 is to use the most recent kernel. If the problem exists in FC4, a new bug report should be opened, since the kernel and the versions of all related software are different. I've seen hangs on FC4, too. They were futex related.
Created attachment 119976: Revised patch to avoid audit calls on old kernels
After some testing the patch was revised slightly to make an exception for the CAPP cert kernel, which is based on the -11 series but has a correct and functioning audit system.
Are there plans to push an updated util-linux package? It occurs to me that the bug probably affects i386 users also, but just not in a dramatic-enough way that they have noticed.
At this point there's no plans to push one. I haven't seen any issues related to util-linux. The problem that we've seen so far only happens during an upgrade. I'd be interested in any reports that occur after upgrading or with any other package in RHEL4 U2.
Why would a new version not be pushed? I'm sure not all machines have been updated yet, and the failed up2date caused 130 duplicate packages on machines I have updated that I am still working on cleaning up. Also, if an update is not made available, is it not safe to run '2.6.9-11.ELsmp' anymore? I'm currently experiencing a production problem that started after the U2 update that _may_ be related to the new kernel. I'd like the ability to boot to the older kernel if necessary.
(In reply to comment #22)
> At this point there's no plans to push one. I haven't seen any issues related to
> util-linux. The problem that we've seen so far only happens during an upgrade.
You mean you haven't seen any issues *other than completely breaking all x86_64-smp machines*, right? You're making three poor assumptions:
1: everyone has already upgraded
2: everyone has already rebooted to the new kernel
3: this bug only affects x86_64-smp machines
I think all three assumptions are false. I understand not wanting to release a kernel-specific patch, but I don't see any alternative.
You're having communication problems -- the fix is to shadow-utils, not util-linux. That's why no push of a new util-linux package. Furthermore, the fix to shadow-utils (or the kernel for that matter) will only help in the general case if special arrangements are made such that shadow-utils is the first package upgraded in the transaction (the kernel is even worse in that a reboot into the new kernel would be necessary). So just pushing the new shadow-utils with Steve's patch from above is not enough to solve this one. I can't comment on what is going to be done (or has been done) but someone needed to get you guys on the same page.
Oh, sorry. I should have said shadow-utils instead of util-linux above. In any case, as I understand it, the bug is that the new shadow-utils package doesn't work with the old kernel. But if you push a newer shadow-utils package that *does* work with the old kernel, wouldn't the newer one be picked in preference to the other one during the upgrade process? Seems like that would save a lot of hassle for those who haven't upgraded yet. I'm still curious whether this affects non-x86_64-smp users in any way. It seems strange to me that we can understand the bug, but not why it affects some architectures differently than others.
The plan is to push a new shadow-utils asap if that indeed solves the problem. I'm still waiting for confirmation that this fixes it. The patch above is not specific to x86_64, so it would help anyone. The update should favor the newer shadow-utils.
*** Bug 171038 has been marked as a duplicate of this bug. ***
I agree with comment #26. After I upgraded the systems I control (and suffered for it), I warned other teams in my organisation to NOT upgrade due to this particular bug. So, a shadow-utils that works with -11 kernels would be most welcome.
If it's actually eating CPU, then one fix for folks who can't reboot after the update sequence may be to renice the process to a very low priority; at that point it will basically just replace the idle thread.
(In reply to comment #20)
> Created an attachment (id=119976)
> Revised patch to avoid audit calls on old kernels
>
> After some testing the patch was revised slightly to make an exception for the
> CAPP cert kernel which is based off of -11 series, but has a correct and
> functioning audit system.
Looking at the patch, I think this won't work for 2.6.9-11.ELsmp (etc.) kernels. So the strcmp() should maybe be changed to something like
    strncmp(u.release, "2.6.9-11.EL", strlen("2.6.9-11.EL")) == 0
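The difference between the exact comparison and the suggested prefix comparison can be illustrated outside the patch. What follows is a shell analogue only, not the C code from the attachment: both function names are hypothetical, and it covers neither the "or lower" version check nor the CAPP kernel exception.

```shell
# Exact comparison, analogous to strcmp(): only the bare release matches.
# (Both function names here are hypothetical.)
is_old_kernel_exact() {
    [ "$1" = "2.6.9-11.EL" ]
}

# Prefix comparison, analogous to the suggested strncmp(): variants such
# as "2.6.9-11.ELsmp" match as well.
is_old_kernel_prefix() {
    case "$1" in
        2.6.9-11.EL*) return 0 ;;
        *)            return 1 ;;
    esac
}

# With the release string of an SMP kernel, only the prefix form matches.
is_old_kernel_exact  "2.6.9-11.ELsmp" || echo "exact: no match"
is_old_kernel_prefix "2.6.9-11.ELsmp" && echo "prefix: match"
```

On a live system the argument would come from uname -r, which reports the same release string that the C code reads from u.release.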
Created attachment 120787: New version of patch to avoid audit system on old kernels
This patch should handle smp & hugemem kernels.
*** Bug 168951 has been marked as a duplicate of this bug. ***
This bug has caused me a lot of pain trying to do an update on a fresh FC4 install (x86_64). I install FC4, then (whether I update the kernel or not), "yum update" gets roughly half way through the updates and then sits there. Control-z-ing out and looking at "ps" shows me it hangs in useradd. I've tried a few different package sets, same thing. Did many many attempts to install and update FC4 today before seeing this bug listing (under RHEL of all places.) While hung, I cannot log in as "su". If I do a shutdown, the message about the scriptlet failing is shown. I don't know if this is unique to my hardware setup, but this happened for me doing a vanilla "workstation" install on x86_64 and then doing "update." Nasty stuff!
I must not have updated the kernel, rebooted, and then proceeded with "yum"; going to try that now; also see bug 170098. Maybe package dependencies for FC could flag this somehow. Will post on other bug; thanks for having this info here.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-842.html