Bug 495083

Summary: [RHEL4] nscd uses 100% cpu and stops responding
Product: Red Hat Enterprise Linux 4 Reporter: Alan Matsuoka <alanm>
Component: glibc Assignee: Andreas Schwab <schwab>
Status: CLOSED DUPLICATE QA Contact: BaseOS QE <qe-baseos-auto>
Severity: high Docs Contact:
Priority: high    
Version: 4.7 CC: carls, cevich, Colin.Simpson, drepper, jakub, k.georgiou, ofourdan, sputhenp
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-07 12:45:57 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
sosreport-MFrohm.1907516-249173-7f48db.tar.bz2 none

Description Alan Matsuoka 2009-04-09 17:14:46 UTC
Created attachment 338949
sosreport-MFrohm.1907516-249173-7f48db.tar.bz2

> ##### General Escalation Information
>
> State the problem
>
> 1. Provide time and date of the problem

Sporadic

> 2. Indicate the platform(s) (architectures) the problem is being reported
> against.

RHEL 4.7 ES and AS i386

> 3. Provide clear and concise problem description as it is understood at the
> time of escalation
>
> * Observed behavior

nscd hangs in a futex call, and the nscd processes use 100% CPU on several of
our machines. nscd does not respond at all: 'service nscd restart' cannot stop
the process, and nscd only responds to 'kill -9'. As a workaround, we currently
restart nscd from a daily cron job.
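
The stop behaviour described above (a normal stop request is ignored, only 'kill -9' works) can be sketched as a terminate-then-kill helper for a restart job. This is a minimal sketch, not the customer's actual cron job: the function name stop_process, the grace period and the poll interval are assumptions for illustration, and how the nscd PID is obtained is left to the caller. It assumes a Python 3 interpreter and sufficient privileges to signal the process.

```python
import os
import signal
import time


def stop_process(pid, grace_seconds=10.0, poll_interval=0.5):
    """Ask a process to exit with SIGTERM; fall back to SIGKILL if it stays alive.

    Returns "gone" if the process did not exist, "SIGTERM" if it exited
    within the grace period, or "SIGKILL" if the fallback was needed.
    (Hypothetical helper; names and defaults are illustrative.)
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the PID still exists
        except ProcessLookupError:
            return "SIGTERM"
        time.sleep(poll_interval)

    # The process ignored SIGTERM for the whole grace period.
    os.kill(pid, signal.SIGKILL)
    return "SIGKILL"
```

A return value of "SIGKILL" from such a helper would be worth logging, since it marks the runs in which nscd was actually wedged rather than merely restarted on schedule.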

We have also noticed that on the machines where the nscd processes are using
100% CPU, 'lsof' shows two open file descriptors for /var/run/nscd/socket,
whereas on the machines with a healthy nscd 'lsof' shows only one.
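
The 'lsof' comparison above could be automated so that affected machines are flagged without manual inspection. The following is a rough sketch under stated assumptions: it reads the Linux /proc/net/unix table to find the inodes bound to the socket path and matches them against a process's /proc/<pid>/fd links. The function names are hypothetical, a Python 3 interpreter is assumed, and reading another user's fd directory requires root.

```python
import os
import re

SOCKET_PATH = "/var/run/nscd/socket"  # path taken from the report


def socket_inodes(proc_net_unix_text, path=SOCKET_PATH):
    """Return the inode numbers of UNIX sockets bound to `path`.

    Expects the text of /proc/net/unix, whose columns are:
    Num RefCount Protocol Flags Type St Inode [Path]
    """
    inodes = set()
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 8 and fields[7] == path:
            inodes.add(int(fields[6]))
    return inodes


def count_matching_fds(fd_targets, inodes):
    """Count fd link targets of the form 'socket:[N]' with N in `inodes`."""
    count = 0
    for target in fd_targets:
        match = re.fullmatch(r"socket:\[(\d+)\]", target)
        if match and int(match.group(1)) in inodes:
            count += 1
    return count


def fd_targets_of(pid):
    """Return the symlink targets of every entry in /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            pass  # the descriptor was closed while we were scanning
    return targets
```

Given the nscd PID, count_matching_fds(fd_targets_of(pid), socket_inodes(open("/proc/net/unix").read())) would be expected to give 1 on a healthy machine and 2 on an affected one, going by the observation in this report.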

This problem occurs on machines both with and without an LDAP connection.

>
> * Desired behavior

nscd should not use 100% CPU and should respond normally to kill signals, etc.

> 4. State specific action requested of SEG

Analyse the problem and advise if we can gather any extra data.

> 5. State whether or not a defect in the product is suspected

This is suspected to be a bug in both RHEL 4.7 and CentOS. This customer and others have already opened a bug directly in Bugzilla (which I have discouraged them from doing in future) and another in the CentOS bug tracker:

> * Provide Bugzilla if one already exists

https://bugzilla.redhat.com/show_bug.cgi?id=492581
N.B. This has already been assigned to Jakub Jelinek
http://bugs.centos.org/view.php?id=3373

> 8. This is especially important for severity one and two issues. What is the
> impact to the customer when they experience this problem?

This is happening frequently and is affecting users and is frustrating the customer.

> ##### Provide supporting info
>
> 1. State other actions already taken in working the problem:
>
> * tech-list, google searches, fulltext, consulting with another engineer
>
> * Provide any relevant data found
>
> 2. Attach sosreport

Attached is an sosreport from an example system. It looks like they might be using a custom kernel, so I'm going to ask whether they can reproduce with the stock kernel. However, since they can reproduce on multiple architectures, on AS/ES and on CentOS, and other people have reported the same behaviour, my suspicion is that it's not related to the kernel version and we should progress without quibbling.

> 3. Attach other supporting data

See Bugzilla referenced above

>
> 4. Provide issue repro information:

None applicable

> 5. List any known hot-fix packages on the system

None

> 6. List any customer applied changes from the last 30 days

None

Comment 2 Andreas Schwab 2009-09-07 12:45:57 UTC

*** This bug has been marked as a duplicate of bug 495082 ***