154044 – login to system does not complete

Summary: login to system does not complete

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	powerpc
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	David Howells
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-04-06 20:06 UTC by Dale Perkins
Modified:	2007-11-30 22:07 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 19:05:11 UTC
Target Upstream Version:
Embargoed:

Attachments
System Report From Node 1 pocppd01 - in normal operation (4.84 MB, application/octet-stream) 2005-04-06 21:55 UTC, Dale Perkins
Systemreport for node 2 pocppd02 - login does not complete (4.45 MB, application/octet-stream) 2005-04-06 21:59 UTC, Dale Perkins
System Report Node 3 pocppd03 - login does not complete (4.43 MB, application/octet-stream) 2005-04-06 22:04 UTC, Dale Perkins
System Report Node 4 pocppd04 - in normal operation (4.61 MB, application/octet-stream) 2005-04-06 22:07 UTC, Dale Perkins

Description Dale Perkins 2005-04-06 20:06:57 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

Description of problem:
The system will not finish login. When a terminal session is opened, the user is prompted for a user ID and password, but the login never completes and no command prompt appears. Sessions that are already active continue to operate normally. There are 4 systems in an Oracle RAC cluster; the active database continues to run normally. The problem occurs at least once per week. The only way to recover is to reboot the nodes, which severely impacts production use of the machines.
Session initiation fails for a) login, b) scp, and c) su.
The same consistent result occurs when trying to su to oracle: the su session initiation hangs and a command prompt is never received. It also occurs when using scp to transfer a file to the node: an scp command from a remote system never completes, while an scp command from a session already active on the system runs normally.

Be aware that this environment is being used for a high-volume Oracle RAC proof of concept for Red Hat, and therefore utilization of key Red Hat parameters may be well beyond normal guidelines.

Version-Release number of selected component (if applicable):
kernel-2.4.21-27.EL

How reproducible:
Couldn't Reproduce

Additional info:

I have sysreports for the 4 nodes, which I will attach to this bug. At the time the reports were run, nodes 1 and 4 (pocppd01 and pocppd04) were running normally and nodes 2 and 3 (pocppd02 and pocppd03) were not accepting logins.

Please send this problem to Susan Proietti Conti.

Comment 1 Ernie Petrides 2005-04-06 20:30:43 UTC

Hi, Dale.  Could you please explain what evidence there is of this
being a kernel problem?  Thanks.

Also, I don't understand your request to "send this problem to
Susan Proietti Conti" (who isn't a Red Hat employee).  You may
add anyone to the cc: list of this bug who has a Bugzilla account.

Comment 2 Dale Perkins 2005-04-06 21:55:03 UTC

Created attachment 112784
System Report From Node 1 pocppd01 - in normal operation

File created by the sysreport command

Comment 3 Dale Perkins 2005-04-06 21:59:36 UTC

Created attachment 112786
Systemreport for node 2 pocppd02 - login does not complete

File created by the sysreport command

Comment 4 Dale Perkins 2005-04-06 22:04:19 UTC

Created attachment 112787
System Report Node 3 pocppd03 - login does not complete

File created by the sysreport command

Comment 5 Dale Perkins 2005-04-06 22:07:30 UTC

Created attachment 112788
System Report Node 4 pocppd04 - in normal operation

File created by the sysreport command

Comment 6 Dale Perkins 2005-04-07 01:40:03 UTC

Sorry for the confusion. Susan is part of the IBM support team for Red Hat and
requested that I put this into Bugzilla. I believe there may be another
Bugzilla site for IBM / Red Hat support that I was meant to use.

This problem is for a proof of concept that seems to fall outside normal
Red Hat or IBM support.

I have placed this against the kernel as I was not sure what other component
to select.  We are heavily stressing the system and seem to be exhausting some
system resource until we are unable to log in and, I expect, unable to start any
new process.

Any assistance would be appreciated ...

Comment 7 David Howells 2005-04-11 10:21:53 UTC

Before you start running your test, can you log in (ssh or over the console) 
and start top running? Then keep an eye on what top shows... 
 
Can you also try to capture the kernel console output directly over the console
port from the HMC? If there's a critical resource problem, then it's entirely
possible that klogd and syslogd wouldn't be able to capture it and write it to
disk, but the kernel would write it to the serial console.
 
Looking through your logs, I see lots of logins, but nothing especially
obvious as to why things don't work.
 
Also, it seems one of the applications you are running is faulty or uses a 
faulty library: 
 
Mar 29 01:59:01 pocppd02 kernel: application bug: sqlplus(1697) has SIGCHLD 
set to SIG_IGN but calls wait(). 
 
Any idea what sqlplus is? 
 
I don't think this'll be the problem though.

Comment 8 Dale Perkins 2005-04-12 11:52:56 UTC

I am not sure of the best way to define the issue. We are seeing the problem on a
weekly basis; that is to say, during a normal week one or two of the systems
will exhibit the symptom.  We normally discover it between tests when someone
attempts to log in to one of the systems and is unable to do so. At that point
top does not show anything very interesting: top itself is the busiest process.  To
this point I have not been able to find anything in the logs explaining why.

The four systems are an ORACLE RAC cluster and sqlplus provides the ORACLE 
interactive SQL capability.

I will attempt to get the kernel console output the next time the problem 
occurs.

Comment 9 Dario Landazuri 2005-08-10 14:57:58 UTC

Dale (and the RH guys reading this):

I'm not sure if I'm seeing *exactly* the same behavior as you, but I have a
handful of machines here (Astronomy Dept. at UT - Austin) that exhibit behavior
that sounds similar.  

The machine can't be logged into when it goes into this state, either locally
(gdm, console) or remotely (ssh, rlogin/rsh).  If I leave a login open before
this happens, I can go in and see the load start to creep up.  When I had to
reboot one of these machines today, the load was over 8.  There were a bunch of
D-state processes (some sshd logins, some crond's, a backup rsync-based script)
which either turned into zombies when I tried to kill them or couldn't be killed.
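
A minimal sketch of how the D-state processes described above could be listed from a session that is still open. It assumes a procps-style `ps` that accepts `-eo pid,stat,wchan,comm`; the helper name `filter_dstate` is made up for illustration. The WCHAN column may hint at where in the kernel each process is blocked.

```shell
# Keep the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Reads `ps -eo pid,stat,wchan,comm` output on stdin.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on a live system:
#   ps -eo pid,stat,wchan,comm | filter_dstate
```

Running this periodically while the load creeps up would show which processes block first.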

There are four particular machines I've had this happen on.  The most recent one
is a dual Opteron box.  Two of the others are dual Xeon (Dell Precisions), and
one's a single Pentium 4 (Dell Dimension).

With all these machines, I can reboot the machine safely either via a reboot
command (if I still am in somehow), a ctrl-alt-del from console, or the reboot
option from the gdm screen.

Does this sound like what you've been experiencing?  Are you still having this
problem?

Thanks,
Dario

Comment 10 Dale Perkins 2005-08-15 16:16:30 UTC

The problem description sounds very similar.  We were doing proof of concept
testing and succeeded in the test despite not being able to resolve this problem.
We are not currently actively testing. We will be updating the systems and
preparing for production this fall, and will test to see whether the problem is
still present.

Comment 11 Dario Landazuri 2005-08-15 16:22:00 UTC

Dale -

Ok.  On a lark, I turned off the auditd daemon on two of my affected machines
(the dual Opteron and the single P4).  I'd noticed that the hangs seemed to
happen right after auditd bombed out around 1AM or so in the logs.  Both systems
have stayed up and not exhibited symptoms for over 4 days now.  I'm not willing
to say they're fixed, because I don't understand well enough what LAUS does or
how it could have affected just these four systems and not others of mine (I've
got ~100 installed RHEL machines here), but maybe it's a data point.  *shrugs*

-Dario

Comment 12 Ernie Petrides 2005-08-15 18:26:41 UTC

Adding David Woodhouse to cc: list in case this is related to auditing.

Comment 13 David Woodhouse 2005-08-15 18:43:22 UTC

Not sure who owns auditing in RHEL3 but unless it's specifically been enabled by
the admin, it shouldn't be involved.

Comment 15 Dario Landazuri 2005-08-16 16:12:16 UTC

Guys,

A few more "data points."  I had to go reboot the other two machines I'd noticed
the behavior on (the dual Xeons) today.  On one of them, it continually hung at
the console login (it normally is supposed to go through to runlevel 5).  I
attempted an interactive login and didn't allow auditd to start, and the system
came up normally.

I'm fuzzy on what exactly LAUS is, and why it could be affecting four of my
boxes and not lots of the others.  By default on install, auditd runs on all my
RHEL3 boxes.

Should I be opening a separate bug under laus/auditd?

Thanks,
Dario

Comment 16 Jason Vas Dias 2005-08-16 16:34:14 UTC

FYI, LAuS WAS enabled by default pre RHEL-3-U5 ; so if you clean install a
RHEL-3-U5 system, LAuS will be disabled by default, but upgrading from 
RHEL-3-U4 will leave LAuS enabled by default.

Unless you require audit record data, and have a mechanism in place to deal with
rotation of audit logs, it makes sense NOT to enable auditing, ie. disable it:
  # chkconfig --del auditd
Once this is done, LAuS won't be re-enabled by upgrade, and you'd need to do:
  # chkconfig --level=2345 auditd on; chkconfig --level=016 auditd off
to enable it again.

The LAuS audit system will by default put the system into "Suspend Mode" when
it finds it has insufficient free disk space to save a rotated log file. By
default, this is in the /var/log/audit.d filesystem.

"Suspend Mode" blocks the current audited system call and any subsequent
audited system calls in an uninterruptible state until sufficient disk space
exists to save the rotated audit log, at which point the suspended system
calls are allowed to proceed and the system leaves suspend mode.

The suspend mode action is configured in /etc/audit/audit.conf and can be 
removed; but as that would lead to potential loss of audit data, suspend mode
must be the default when insufficient disk space is detected.
 
There are new mechanisms in place to assist with audit log rotation and ensure
auditd won't put the system into suspend mode - see 'man audbin' and the 
audbin -T and -N options, and 'man audit.conf' and the 'notify' option.
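
Since suspend mode is triggered by a lack of free space for rotated logs, a simple free-space check on the audit log filesystem may help spot the condition before logins start to hang. This is only a sketch: it assumes a POSIX `df`, and the function name `has_free_kb` and the 100 MB threshold are illustrative choices, not part of LAuS.

```shell
# Succeed when the filesystem holding directory $1 has at least $2 KB available.
has_free_kb() {
    dir=$1
    min_kb=$2
    # df -P prints one line per filesystem; column 4 is the available space in KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$min_kb" ]
}

# Usage, e.g. from cron (threshold is an arbitrary example):
#   has_free_kb /var/log/audit.d 102400 || echo "audit log filesystem is low on space"
```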

Comment 18 Jason Vas Dias 2005-10-13 16:16:49 UTC

RE: Comment #17 From David Howells :
> We should enquire at install time as to whether auditing should be enabled or 
> not. 

Actually, I disagree, because clicking on "Enable Auditing" during installation
by no means implies that users have read the LAuS documentation and configured
LAuS mechanisms to handle log file rotation and the out of disk space situation
that is the topic of this bug report.

Comment 19 David Woodhouse 2005-10-13 23:42:39 UTC

(In reply to comment #18)
> RE: Comment #17 From David Howells :
> > We should enquire at install time as to whether auditing should be enabled or 
> > not. 
> 
> Actually, I disagree, because clicking on "Enable Auditing" during installation
> by no means implies that users have read the LAuS documentation and configured
> LAuS mechanisms to handle log file rotation and the out of disk space situation


I'm inclined to agree. We shouldn't ask about it -- auditing should be
unconditionally disabled in the default install, and the user should need to go
out of their way to _enable_ it if required.

Comment 20 James Fillman 2006-03-29 23:10:16 UTC

Has anyone come to a resolution on this issue?  I have just begun to experience
the same problem on 3 of my 130 RHEL3 ES servers. It's happened maybe 4 or 5
times now.

If I'm logged into the system when it happens, everything seems normal and the
server is responsive. However, any ssh, console logins, or sudo attempts all
hang. I DON'T have auditd running on any of my systems. (ran into THAT system
suspend "problem" already ;)

I have a java application that runs on this server under tomcat 5.0.28. The
servers aren't even in production yet, so apart from some load tests, they're
idle. One interesting thing I found was that during one of these occurrences, I
killed tomcat and then suddenly I could ssh,login,sudo again.

What do all these processes have in common? Could there be a resource issue?
Could tomcat(my java app) be eating up some resource that prevents
ssh/sudo/login from forking a shell?  If so, what resource would that be?

I see that this bug is still 'ASSIGNED'.  Any insight?  This is, at the moment, a
show stopper for our application release. I was going to open a ticket with
Red Hat but then saw this bug.

Comment 21 Jason Vas Dias 2006-03-29 23:33:58 UTC

(In reply to comment #20, From James Fillman (jfillman) on 2006-03-29
18:10 EST):
> I have just begun to experience the same problem on 3 of my 130 RHEL3 ES
> servers. It's happened maybe 4 or 5 times now.
> 
Are you sure this is the same problem, i.e. being unable to log in because
the LAuS auditd has put the system into suspend mode?

It does not sound like it to me:
> If I'm logged into the system when it happens, everything seems normal and the
> server is responsive. However, any ssh, console logins, or sudo attempts all
> hang. I DON'T have auditd running on any of my systems. (ran into THAT system
> suspend "problem" already ;)

So it is NOT the same problem - you say LAuS is disabled on your system.

So, I'd suggest raising a separate bug report or (better) an issue tracker
ticket with your Red Hat account manager.

> One interesting thing I found was that during one of these occurrences, I
> killed tomcat and then suddenly I could ssh,login,sudo again.

So perhaps consider raising a bug against "tomcat" ?


> Could there be a resource issue?
> Could tomcat(my java app) be eating up some resource that prevents
> ssh/sudo/login from forking a shell?  If so, what resource would that be?

$ ulimit -a | egrep -v 'unlimited| 0$'
pending signals                 (-i) 4087
max locked memory       (kbytes, -l) 32
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
max user processes              (-u) 4087

Are any of these limits being approached when the problem occurs?
Do you have sufficient disk space on your /var partition?
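
One way to see whether the open-files limit is being approached is to count a process's descriptors under /proc and compare the count with the limit. The sketch below assumes a Linux /proc filesystem; the function names are made up for illustration, and it uses the calling shell's own `ulimit -n` as a stand-in for the target's limit, which may differ if the process was started with other settings.

```shell
# Count the file descriptors currently open in process $1 (Linux /proc).
count_open_fds() {
    ls "/proc/$1/fd" | wc -l | tr -d ' '
}

# Print "used/limit" for process $1. The limit shown is this shell's soft
# limit, which is only an assumption about the target process's limit.
fd_usage() {
    printf '%s/%s\n' "$(count_open_fds "$1")" "$(ulimit -n)"
}

# Usage, with the PID of the suspect process (e.g. the tomcat JVM):
#   fd_usage 1234
```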

Do logins hang for just one particular user, or type of user - eg. NIS users,
LDAP users, kerberos users ? Do you have any network lookup methods in your 
/etc/nsswitch.conf 'passwd:' or 'group:' databases before 'files' ? 
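
The lookup-order question can be checked mechanically. The sketch below prints whatever sources appear before 'files' for a given database; the function name `sources_before_files` is hypothetical, and it assumes the usual one-line-per-database nsswitch.conf layout.

```shell
# Print each source listed before "files" for database $2 in nsswitch file $1.
# Prints nothing when "files" comes first; prints every source when "files"
# is absent from the line.
sources_before_files() {
    awk -v db="$2:" '
        $1 == db {
            for (i = 2; i <= NF; i++) {
                if ($i == "files") exit
                print $i
            }
            exit
        }' "$1"
}

# Usage:
#   sources_before_files /etc/nsswitch.conf passwd
#   sources_before_files /etc/nsswitch.conf group
```

Any output here names a lookup method (NIS, LDAP and so on) that is consulted before the local files and could stall a login if its server is unreachable.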

When the problem occurs, does restarting sshd cure the problem? 
Perhaps putting sshd into debug mode with '-ddd' might help debug the problem.

In short, there are many potential causes of a login hang - please raise a new
bug report or problem request ticket and we'll try to find the cause of your
problem - thanks.

Comment 22 RHEL Program Management 2007-10-19 19:05:11 UTC

This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
