170087 – useradd causes 100% CPU utilisation and cannot be killed

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	shadow-utils
Version:	4.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Peter Vrabec
QA Contact:	David Lawrence
Duplicates (2):	168951 171038
Blocks:	168429

Reported:	2005-10-07 04:06 UTC by Bojan Smojver
Modified:	2007-11-30 22:07 UTC
CC List:	22 users
Fixed In Version:	RHBA-2005-842
Doc Type:	Bug Fix
Last Closed:	2005-11-18 14:33:29 UTC

Attachments
Patch avoiding audit calls on old kernels (1.33 KB, patch) 2005-10-13 18:39 UTC, Steve Grubb
Revised patch to avoid audit calls on old kernels (1.62 KB, patch) 2005-10-14 14:48 UTC, Steve Grubb
New version of patch to avoid audit system on old kernels (1.77 KB, patch) 2005-11-07 18:53 UTC, Steve Grubb

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2005:842	0	normal	SHIPPED_LIVE	shadow-utils bug fix update	2005-11-18 05:00:00 UTC

Description Bojan Smojver 2005-10-07 04:06:11 UTC

Description of problem:
When upgrading to U2 of RHEL4, occasionally and on some systems, useradd
(which is called from the %post section of the RPM) hangs at 100% CPU utilisation.
The program is called from one of those rpm-tmp.XXXXX scripts in /var/tmp. When
the script is killed, rpm reports that the %post scriptlet failed, but
useradd doesn't die. Attempts to kill useradd with -9 and/or -15 are
unsuccessful; it completely takes over one of the CPUs. BTW, the user that
useradd is supposed to add already exists.


Version-Release number of selected component (if applicable):
shadow-utils-4.0.3-52.RHEL4

How reproducible:
Sometimes.

Steps to Reproduce:
1. Run upgrade of various RPMS from RHEL4-U2.
 

Actual results:
useradd hangs.

Expected results:
useradd should not hang.

Additional info:
Happened on various hardware: HP xw9300 workstations (SMP), Sun V20z (SMP).

Comment 1 Damian Menscher 2005-10-08 03:59:37 UTC

We just got bitten by this when trying to upgrade RHEL, also on an x86_64
machine.  It hung when trying to update the nscd rpm:

# ps aux | grep nscd
root      3398 99.9  0.0 55076 1396 ?        RN   21:28  87:33 /usr/sbin/useradd -M -o -r -d / -s /sbin/nologin -c NSCD Daemon -u 28 nscd

Interesting to note that I can't kill it with kill -9 or even pause it with kill
-STOP.  Especially interesting that /proc/3398/status shows:

SigPnd: 0000000000040100

so it has pending signals for SIGKILL (9) and SIGSTOP (19).  My unix internals
book says those signals can't be ignored... ya learn somethin' new every day.

strace and ltrace hang, and can't be ^C out (kill -9 from another terminal stops
them).  They show no output.  So I have no debugging info for you.
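
For readers who want to check the SigPnd reading above, here is a minimal sketch of decoding such a mask. It assumes the usual /proc/<pid>/status convention that bit n-1 of the hex mask stands for signal number n, and Linux x86/x86_64 signal numbers; the function name and the NAMES table are illustrative only.

```python
def decode_sigmask(hex_mask):
    """Return the signal numbers whose bits are set in a /proc status mask."""
    mask = int(hex_mask, 16)
    # Bit n-1 of the mask corresponds to signal number n.
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Signal numbers as on Linux x86/x86_64 (assumed platform); partial table.
NAMES = {2: "SIGINT", 9: "SIGKILL", 15: "SIGTERM", 19: "SIGSTOP"}

pending = decode_sigmask("0000000000040100")
print([NAMES.get(number, str(number)) for number in pending])
# prints ['SIGKILL', 'SIGSTOP']
```

The same function can be applied to the ShdPnd, SigBlk and SigIgn fields, which use the same encoding.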

Comment 2 Damian Menscher 2005-10-08 04:14:52 UTC

I'm actually starting to think this must be a kernel bug (no process should be
able to ignore a SIGKILL, right?).  Here's something that might mean something
to a developer, though:

/proc/3398# cat status
Name:   useradd
State:  R (running)
SleepAVG:       97%
Tgid:   3398
Pid:    3398
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 512
Groups:
VmSize:    55076 kB
VmLck:         0 kB
VmRSS:      1396 kB
VmData:      924 kB
VmStk:        36 kB
VmExe:        55 kB
VmLib:      1461 kB
StaBrk: 0051d000 kB
Brk:    0053e000 kB
StaStk: 7fbffffcc0 kB
Threads:        1
SigPnd: 0000000000040100
ShdPnd: 0000000000044100
SigBlk: 0000000000000000
SigIgn: 0000000021001000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

Can we get this bug switched to high priority?  This is gonna break a lot of
people who scheduled weekend downtime for the U2 upgrade....

Comment 3 Bojan Smojver 2005-10-08 05:12:27 UTC

I can confirm the strace bit as well. Nothing was coming out on stderr/stdout
and it couldn't be Ctrl-C-ed.

I think one of the RPMS that caused me trouble was nscd as well, but I'm pretty
sure there were a few others too. The funny bit is that it is not 100%
repeatable (I did about 12 machines and only 2 gave me trouble). All i386
machines (UP/SMP) I tried (some 10 or so), didn't experience any problems.

Comment 4 Damian Menscher 2005-10-10 06:35:22 UTC

I should have mentioned that we were also on an SMP machine (two dual-core
x86_64 processors, for a total of 4 seen by the OS).

Comment 5 Panu Matilainen 2005-10-10 12:10:04 UTC

Same here, the useradd from nscd %post got hung eating all cpu - a dual cpu x86_64.

cat /proc/1401/status
Name:   useradd
State:  R (running)
SleepAVG:       68%
Tgid:   1401
Pid:    1401
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    102     102     102     102
FDSize: 256
Groups: 102 500
VmSize:     7672 kB
VmLck:         0 kB
VmRSS:      1308 kB
VmData:      924 kB
VmStk:        32 kB
VmExe:        55 kB
VmLib:      1461 kB
StaBrk: 0051d000 kB
Brk:    0053e000 kB
StaStk: 7fbffffd60 kB
Threads:        1
SigPnd: 0000000000040100
ShdPnd: 0000000000040102
SigBlk: 0000000000000000
SigIgn: 0000000001001000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff

Comment 6 Matthias Saou 2005-10-10 13:52:07 UTC

Just to add that this happened to me too, but on all 7 x86_64 (all SMP) machines
I have RHEL4 installed on, and none of the many more i386 ones. If it's a kernel
bug, it seems x86_64 specific at least.

I reported it on the mailing-list before Panu pointed me to this bug.

https://www.redhat.com/archives/nahant-list/2005-October/msg00103.html

Matthias

Comment 7 Jeff Needle 2005-10-10 14:17:50 UTC

Adding lsof output and observations from Matthias' post to nahant-list:

Here's the lsof output (edited to fit) :

cwd       DIR        8,2     4096          2 /
rtd       DIR        8,2     4096          2 /
txt       REG        8,2    66536   11737046 /usr/sbin/useradd
mem       REG        8,2   103176   23789571 /lib64/ld-2.3.4.so
mem       REG        8,2   217016   23265695 /var/db/nscd/group
mem       REG        8,2   217016   23265694 /var/db/nscd/passwd
mem       REG        8,2    27270   23789583 /lib64/libcrypt-2.3.4.so
mem       REG        8,2    97896   23793686 /lib64/libaudit.so.0.0.0
mem       REG        8,2  1484017   23789613 /lib64/tls/libc-2.3.4.so
mem       REG        8,2    62504   23789792 /lib64/libselinux.so.1
  0r     FIFO        0,7             2059676 pipe
  1w      CHR        1,3                1518 /dev/null
  2w      CHR        1,3                1518 /dev/null
  3u     sock        0,4             2059679 can't identify protocol
  4r      REG        8,2       96    4292677 /etc/default/useradd

I can't help but notice that libaudit is used, and that it was updated in
the same transaction.

Matthias

Comment 8 John Vasileff 2005-10-10 19:20:06 UTC

Same for me on several x86_64 servers - all except the one I updated manually
with rpm -Fvh, several packages at a time.

The problem was made worse by an inability to "shutdown -r".  The serial console
(Dell DRAC4) also hung except for sysrq, but I was still unable to boot with
sysrq; recovery required a hard-reset.


John

Comment 9 Matthias Saou 2005-10-11 11:48:43 UTC

Here is the yum stderr output of a stuck system, after killing the nscd
rpm-tmp.XXXXXX scriptlet stuck on useradd :

warning: /etc/nsswitch.conf created as /etc/nsswitch.conf.rpmnew
Stopping sshd:[  OK  ]
Starting sshd:[  OK  ]
warning: /etc/ld.so.conf created as /etc/ld.so.conf.rpmnew
warning: /etc/nsswitch.conf created as /etc/nsswitch.conf.rpmnew
warning: /etc/issue saved as /etc/issue.rpmsave
warning: /etc/issue.net saved as /etc/issue.net.rpmsave
telinit: timeout opening/writing control channel /dev/initctl
error: %pre(nscd-2.3.4-2.13.x86_64) scriptlet failed, exit status 0
error:   install: %pre scriptlet failed (2), skipping nscd-2.3.4-2.13

The telinit timeout seems like a suspicious bit, no? But maybe more of a
consequence than a cause...

Matthias

Comment 10 Suzanne Hillman 2005-10-11 15:43:31 UTC

Some additional information from what looks like a potentially related bug (bug
170272):

After installation of packages started we see the following:

Installing...
   1:udev                   ########################################### [100%]
   2:mkinitrd               ########################################### [100%]
   3:initscripts            ########################################### [100%]
   4:openssh                ########################################### [100%]
   5:glibc-kernheaders      ########################################### [100%]
   6:glibc-headers          ########################################### [100%]
   7:glibc-devel            ########################################### [100%]
   8:gcc                    ########################################### [100%]
   9:vixie-cron             ########################################### [100%]
  10:gcc-c++                ########################################### [100%]
  11:gcc-g77                ########################################### [100%]
  12:openssh-server         ########################################### [100%]
  13:system-config-printer  ########################################### [100%]
  14:xinetd                 ########################################### [100%]
  15:sudo                   ########################################### [100%]
  16:slocate                ########################################### [100%]
  17:pdksh                  ########################################### [100%]

and nothing else!
Running top in a second window shows:

top - 09:33:04 up 21:52,  2 users,  load average: 4.02, 3.52, 2.27
Tasks:  90 total,   4 running,  86 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us, 25.0% sy,  0.0% ni, 74.9% id,  0.1% wa,  0.0% hi,  0.0% si
Mem:  15966556k total,  1479704k used, 14486852k free,   116676k buffers
Swap:  4080456k total,        0k used,  4080456k free,  1163628k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13837 root      18   0 54416 1224  588 R 99.9  0.0  17:07.40 useradd
14763 root      16   0  5284  908  688 R  0.3  0.0   0:00.24 top


Someone in another bug stopped ypbind to solve the problem. Trying that was too
late; I am also unable to "kill" or "kill -9" the useradd process. Maybe this is
the reason why anything went wrong. I should mention it's a 4-CPU Sun V40z.

Please excuse my German accent :-/


Comment #6 from Andreas Bock (info.redhat.de) on 2005-10-11 04:24 EST

On a 2nd identical machine I stopped ypbind before using "up2date -u":

...
  16:sudo                   ########################################### [100%]
  17:slocate                ########################################### [100%]
  18:pdksh                  ########################################### [100%]
  19:nscd                   ########################################### [100%]
  20:nfs-utils              ########################################### [100%]
...

and the process ended successfully. Maybe the postinstall script of pdksh or nscd
tries to add a new user.

Some time ago I was able to update my systems automatically every night. I think
it wouldn't be a good idea to stop this and update them manually. Or maybe I
should stop ypbind every time before running up2date.

Comment 11 Matthias Saou 2005-10-11 16:11:24 UTC

FWIW, on a dev machine where the problem occurred (and where I had let useradd
run until now), I also had prelink completely stuck (impossible to kill too), as
well as a nash process which seemed forked from the kernel-smp scriptlet since
mkinitrd was running (but could be killed).

Pretty bad since in the end I had useradd, prelink and nash impossible to kill
(even with SIGKILL), but on this machine a clean "shutdown -r now" worked.

Matthias

Comment 12 Damian Menscher 2005-10-11 17:56:30 UTC

Due to a design flaw in up2date (it installs all new RPMs, then removes all old
RPMs rather than doing the install/removal for each rpm) the up2date crashes are
leaving people with many duplicated RPMs on their systems.  Here are the
commands I used to clean up our system:

# try to remove everything, but save that list of problems to /tmp/dupes
for file in `rpm -qa --queryformat="%{NAME} %{ARCH}\n" | sort | uniq -c |
    grep -v "  1 " | cut -c 9- | cut -d" " -f1`; do
  rpm -q --last $file | head -1 | cut -d" " -f1
done | grep -v gpg-pubkey | xargs rpm -e --justdb --nodeps 2> /tmp/dupes

# explicitly remove the i386 and x86_64 versions of the problem packages
for rpm in `cut -d\" -f2 /tmp/dupes`; do
  rpm -e --justdb --nodeps ${rpm}.i386 ${rpm}.x86_64
done

# go back through and fix all the other packages
for file in `rpm -qa --queryformat="%{NAME} %{ARCH}\n" | sort | uniq -c |
    grep -v "  1 " | cut -c 9- | cut -d" " -f1`; do
  rpm -q --last $file | head -1 | cut -d" " -f1
done | grep -v gpg-pubkey | xargs rpm -e --justdb --nodeps

# re-sync with Red Hat Network
up2date -p

# try the upgrade again and hope for the best
up2date
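
Before removing anything from the rpm database, it may help to just list which packages are duplicated. The following is a rough sketch of the detection step only (the sort | uniq -c part of the commands above), working on the text produced by the same rpm -qa query. The function name and the sample package list are made up for illustration; nothing here touches the rpm database.

```python
from collections import Counter


def duplicated_packages(query_output, ignore=("gpg-pubkey",)):
    """Return sorted (name, arch) pairs that are listed more than once.

    query_output is assumed to be the text printed by:
        rpm -qa --queryformat="%{NAME} %{ARCH}\n"
    """
    pairs = []
    for line in query_output.splitlines():
        fields = line.split()
        # Skip malformed lines and packages that are expected to repeat.
        if len(fields) == 2 and fields[0] not in ignore:
            pairs.append((fields[0], fields[1]))
    counts = Counter(pairs)
    return sorted(pair for pair, count in counts.items() if count > 1)


# Hypothetical sample output, not taken from a real system.
sample = """\
glibc x86_64
glibc x86_64
glibc i686
nscd x86_64
nscd x86_64
bash x86_64
gpg-pubkey (none)
gpg-pubkey (none)
"""

print(duplicated_packages(sample))
# prints [('glibc', 'x86_64'), ('nscd', 'x86_64')]
```

Keying on name and architecture together matters on x86_64, where an i386/i686 and an x86_64 build of the same package are legitimately installed side by side.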

Comment 13 Alex Lazarevich 2005-10-12 22:20:00 UTC

(In reply to comment #1)
> We just got bitten by this when trying to upgrade RHEL, also on an x86_64
> machine.  It hung when trying to update the nscd rpm:

Same exact problem on FC4 x86_64 when trying to up2date openssh.

Comment 17 Steve Grubb 2005-10-13 18:39:55 UTC

Created attachment 119938
Patch avoiding audit calls on old kernels

I think this patch will solve the problem. If the kernel is 2.6.9-11 or lower,
the patched code should not open the audit socket.
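
The attached patch is not reproduced in this report. As a rough illustration of the kind of check described, the sketch below decides from a kernel release string whether the build is 2.6.9-11 or lower. It assumes release strings of the form 2.6.9-<build>.<suffix>; the function name, the regular expression and the threshold parameter are assumptions for illustration, not the patch's actual logic.

```python
import re


def audit_unsafe(release, last_bad_build=11):
    """True if release looks like a 2.6.9-<build> kernel with build <= last_bad_build."""
    match = re.match(r"2\.6\.9-(\d+)", release)
    if match is None:
        # Not a 2.6.9-<build> release string; this sketch makes no claim about it.
        return False
    return int(match.group(1)) <= last_bad_build


for release in ("2.6.9-5.EL", "2.6.9-11.ELsmp", "2.6.9-22.EL"):
    print(release, audit_unsafe(release))
```

On a live system the string to test would come from os.uname().release.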

Comment 18 Damian Menscher 2005-10-14 00:08:52 UTC

That patch looks highly RHEL4-specific.  It should probably be generalized,
since apparently this issue affects FC4 also as per Comment #13.  Or is that a
different bug?

Comment 19 Steve Grubb 2005-10-14 01:54:01 UTC

The problem AFAICT is a kernel bug that affects early versions of the 2.6
kernel. I am working around the problem in RHEL4 because the number of kernel
releases is low and the U2 kernel should work fine.

The solution for FC4 is to use the most recent kernel. If the problem exists in
FC4, a new bug report should be opened since the kernel and versions of all
related software are different. I've seen hangs on FC4, too. They were futex related.

Comment 20 Steve Grubb 2005-10-14 14:48:11 UTC

Created attachment 119976
Revised patch to avoid audit calls on old kernels

After some testing the patch was revised slightly to make an exception for the
CAPP cert kernel which is based off of -11 series, but has a correct and
functioning audit system.

Comment 21 Damian Menscher 2005-10-17 23:48:24 UTC

Are there plans to push an updated util-linux package?  It occurs to me that the
bug probably affects i386 users also, but just not in a dramatic-enough way that
they have noticed.

Comment 22 Steve Grubb 2005-10-18 17:14:09 UTC

At this point there's no plans to push one. I haven't seen any issues related to
util-linux. The problem that we've seen so far only happens during an upgrade.
I'd be interested in any reports that occur after upgrading or with any other
package in RHEL4 U2.

Comment 23 John Vasileff 2005-10-18 17:28:29 UTC

Why would a new version not be pushed?  I'm sure not all machines have been
updated yet, and the failed up2date caused 130 duplicate packages on machines I
have updated that I am still working on cleaning up.

Also, if an update is not made available, is it not safe to run '2.6.9-11.ELsmp'
anymore?  I'm currently experiencing a production problem that started after the
U2 update that _may_ be related to the new kernel.  I'd like the ability to boot
to the older kernel if necessary.

Comment 24 Damian Menscher 2005-10-18 17:35:09 UTC

(In reply to comment #22)
> At this point there's no plans to push one. I haven't seen any issues related to
> util-linux. The problem that we've seen so far only happens during an upgrade.

You mean you haven't seen any issues *other than completely breaking all
x86_64-smp machines*, right?

You're making three poor assumptions:
 1: everyone has already upgraded
 2: everyone has already rebooted to the new kernel
 3: this bug only affects x86_64-smp machines

I think all three assumptions are false.  I understand not wanting to release a
kernel-specific patch, but I don't see any alternative.

Comment 25 David Lehman 2005-10-18 17:49:57 UTC

You're having communication problems -- the fix is to shadow-utils, not
util-linux. That's why no push of a new util-linux package. Furthermore, the fix
to shadow-utils (or the kernel for that matter) will only help in the general
case if special arrangements are made such that shadow-utils is the first
package upgraded in the transaction (the kernel is even worse in that a reboot
into the new kernel would be necessary). So just pushing the new shadow-utils
with Steve's patch from above is not enough to solve this one. 

I can't comment on what is going to be done (or has been done) but someone
needed to get you guys on the same page.

Comment 26 Damian Menscher 2005-10-18 18:03:20 UTC

Oh, sorry.  I should have said shadow-utils instead of util-linux above.

In any case, as I understand it, the bug is that the new shadow-utils package
doesn't work with the old kernel.  But if you push a newer shadow-utils package
that *does* work with the old kernel, wouldn't the newer one be picked in
preference to the other one during the upgrade process?  Seems like that would
save a lot of hassle for those who haven't upgraded yet.

I'm still curious whether this affects non-x86_64-smp users in any way.  It
seems strange to me that we can understand the bug, but not why it affects some
architectures differently than others.

Comment 27 Steve Grubb 2005-10-18 18:25:02 UTC

The plan is to push a new shadow-utils asap if that indeed solved the problem.
I'm still waiting for confirmation this fixes the problem. The patch above is
not specific to x86_64 so it would help anyone. The update should favor the
newer shadow-utils.

Comment 28 Suzanne Hillman 2005-10-18 19:43:04 UTC

*** Bug 171038 has been marked as a duplicate of this bug. ***

Comment 29 Bojan Smojver 2005-10-18 20:34:08 UTC

I agree with comment #26. After I upgraded the systems I control (and suffered
for it), I warned other teams in my organisation to NOT upgrade due to this
particular bug. So, a shadow-utils that works with -11 kernels would be most
welcome.

Comment 38 Alan Cox 2005-11-06 22:20:29 UTC

If it's actually eating CPU then one fix for folks who can't reboot after the
update sequence may be to renice the process to a very low priority. At that
point it will basically just replace the idle thread.
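
A minimal sketch of the renice suggestion, assuming a Linux system and sufficient privileges over the target process. The function name is hypothetical, and whether this actually helps for a process that cannot even be killed is not confirmed anywhere in this report.

```python
import os


def renice_to_lowest(pid, niceness=19):
    """Set the nice value of pid and return the value now in effect.

    19 is the lowest scheduling priority a normal process can have on Linux.
    """
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
    return os.getpriority(os.PRIO_PROCESS, pid)


# Example (run as root, with the PID of the stuck useradd from ps or top):
#     renice_to_lowest(3398)
```

This does the same job as the renice command; it is shown as a function only so the resulting nice value can be checked.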

Comment 40 Jos Vos 2005-11-07 12:47:14 UTC

(In reply to comment #20)

> Created an attachment (id=119976)
> Revised patch to avoid audit calls on old kernels
> 
> After some testing the patch was revised slightly to make an exception for the
> CAPP cert kernel which is based off of -11 series, but has a correct and
> functioning audit system.

If I look at the patch, I think this won't work for 2.6.9-11.ELsmp (etc.)
kernels.  So the strcmp() should maybe be changed to something like

   strncmp(u.release, "2.6.9-11.EL", strlen("2.6.9-11.EL")) == 0
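
To make the difference concrete, here is a small sketch comparing an exact match (what strcmp() == 0 tests) with a prefix match (what the strncmp() form above tests) against a few release strings. The helper names are illustrative, and the list of release strings is an assumption based on the kernel flavours mentioned in this report.

```python
BASE = "2.6.9-11.EL"


def exact_match(release, base=BASE):
    # Analogous to strcmp(u.release, base) == 0
    return release == base


def prefix_match(release, base=BASE):
    # Analogous to strncmp(u.release, base, strlen(base)) == 0
    return release.startswith(base)


for release in ("2.6.9-11.EL", "2.6.9-11.ELsmp", "2.6.9-11.ELhugemem", "2.6.9-22.EL"):
    print(release, exact_match(release), prefix_match(release))
```

Only the prefix form recognises the smp and hugemem flavours. Note that it also accepts any other string that merely starts with 2.6.9-11.EL.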

Comment 41 Steve Grubb 2005-11-07 18:53:53 UTC

Created attachment 120787
New version of patch to avoid audit system on old kernels

This patch should handle smp & hugemem kernels.

Comment 44 Beth Nackashi 2005-11-15 16:43:20 UTC

*** Bug 168951 has been marked as a duplicate of this bug. ***

Comment 45 Ben Youngdahl 2005-11-16 03:29:59 UTC

This bug has caused me a lot of pain trying to do an update on a fresh FC4
install (x86_64).  I install FC4, then (whether I update the kernel or not),
"yum update" gets roughly half way through the updates and then sits there. 
Control-z-ing out and looking at "ps" shows me it hangs in useradd.  I've tried
a few different package sets, same thing.  Did many many attempts to install and
update FC4 today before seeing this bug listing (under RHEL of all places.) 
While hung, I cannot log in as "su".  If I do a shutdown, the message about the
scriptlet failing is shown.

I don't know if this is unique to my hardware setup, but this happened for me
doing a vanilla "workstation" install on x86_64 and then doing "update."  Nasty
stuff!

Comment 46 Ben Youngdahl 2005-11-16 03:40:25 UTC

I must not have updated the kernel, rebooted, and then proceeded with "yum";
going to try that now; also see bug 170098.

Maybe package dependencies for FC could flag this somehow.  Will post on other
bug; thanks for having this info here.

Comment 49 Red Hat Bugzilla 2005-11-18 14:33:30 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-842.html

