From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

Description of problem:
The system will not finish a login. When a terminal session is opened, the user is prompted for a user ID and password, but the login never completes and no command prompt appears. Sessions that are already active continue to operate normally.

There are 4 systems in an Oracle RAC cluster; the database that is active continues to run normally. The problem occurs at least once per week. The only way to recover is to reboot the nodes, which severely impacts production use of the machines.

Session initiation fails for a) login, b) scp, and c) su. When trying to su to oracle, the su session hangs and a command prompt is never received. An scp command run from a remote system to transfer a file to the node never completes, while an scp command run from a session already active on the system runs normally.

Be aware that this environment is being used for a high-volume Oracle RAC proof of concept for Red Hat, so utilization of key Red Hat parameters may be well beyond normal guidelines.

Version-Release number of selected component (if applicable):
kernel-2.4.21-27.EL

How reproducible:
Couldn't Reproduce

Additional info:
I have sysreports for the 4 nodes, which I will attach to this report. At the time the reports were run, nodes 1 and 4 (pocppd01 and pocppd04) were running normally and nodes 2 and 3 (pocppd02 and pocppd03) were not accepting logins.

Please send this problem to Susan Proietti Conti
Hi, Dale. Could you please explain what evidence there is of this being a kernel problem? Thanks. Also, I don't understand your request to "send this problem to Susan Proietti Conti" (who isn't a Red Hat employee). You may add anyone to the cc: list of this bug who has a Bugzilla account.
Created attachment 112784 [details] System Report From Node 1 pocppd01 - in normal operation File created by SYSTEM Report Command
Created attachment 112786 [details] Systemreport for node 2 pocppd02 - login does not complete file created by command sysreport
Created attachment 112787 [details] System Report Node 3 pocppd03 - login does not complete created by command system report
Created attachment 112788 [details] System Report Node 4 pocppd04 - in normal operation Created by command systemreport
Sorry for the confusion. Susan is part of the IBM Support Team for Red Hat and requested that I put this into Bugzilla. I believe there may be another Bugzilla site for IBM / Red Hat support that I was intended to use. This problem is for a proof of concept that seems to fall outside normal Red Hat or IBM support. I filed this against the kernel because I was not sure what other component to select. We are heavily stressing the system and seem to be exhausting some system resource until we are unable to log in or, I expect, to start any new process. Any assistance would be appreciated.
Before you start running your test, can you log in (ssh or over the console) and start top running? Then keep an eye on what top shows. Can you also try to capture the kernel console output directly over the console port from the HMC? If there's a critical resource problem, then it's entirely possible that klogd and syslogd wouldn't be able to capture it and write it to disk, but the kernel would still write it to the serial console.

Looking through your logs, I see lots of logins, but nothing especially obvious as to why things don't work. Also, it seems one of the applications you are running is faulty or uses a faulty library:

    Mar 29 01:59:01 pocppd02 kernel: application bug: sqlplus(1697) has SIGCHLD set to SIG_IGN but calls wait().

Any idea what sqlplus is? I don't think this'll be the problem though.
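A minimal sketch of one way to keep a record of what top showed in the run-up to a hang, rather than watching it by hand. The function name log_snapshots, the log path, and the interval are illustrative assumptions, not something taken from this report. If the exhausted resource turns out to be disk space or the ability to start new processes, the last snapshots may be missing, so this complements the serial console capture rather than replacing it.

```shell
#!/bin/sh
# Hypothetical helper: append timestamped snapshots of a command's output
# to a log file.
# Usage: log_snapshots COUNT INTERVAL_SECONDS LOGFILE COMMAND [ARGS...]
log_snapshots() {
    count=$1
    interval=$2
    logfile=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            # Timestamp header so each snapshot can be lined up with syslog.
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
            "$@"
        } >> "$logfile" 2>&1
        i=$((i + 1))
        # Sleep between snapshots, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Example (assumed path and timing): one snapshot a minute for an hour,
# using top in batch mode.
# log_snapshots 60 60 /var/tmp/top-snapshots.log top -b -n 1
```

Started from a session that is already logged in, the loop keeps running in that session even if new logins begin to hang.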
I am not sure of the best way to define the issue. We are seeing the problem on a weekly basis; that is, during a normal week one or two of the systems will exhibit the symptom. We normally discover it between tests, when someone attempts to log in to one of the systems and is unable to do so. At that point top does not show anything very interesting ... it is the top program. So far I have not been able to find anything in the logs explaining why. The four systems are an Oracle RAC cluster, and sqlplus provides the Oracle interactive SQL capability. I will attempt to get the kernel console output the next time the problem occurs.
Dale (and the RH guys reading this): I'm not sure if I'm seeing *exactly* the same behavior as you, but I have a handful of machines here (Astronomy Dept. at UT - Austin) that exhibit behavior that sounds similar. The machine can't be logged into when it goes into this state, either locally (gdm, console) or remotely (ssh, rlogin/rsh). If I leave a login open before this happens, I can go in and see the load start to creep up. When I had to reboot one of these machines today, the load was over 8. There were a bunch of D-state processes (some sshd logins, some crond's, a backup rsync-based script) which either turned into zombies when I tried to kill them or couldn't be killed. There are four particular machines I've had this happen on. The most recent one is a dual Opteron box. Two of the others are dual Xeon (Dell Precisions), and one's a single Pentium 4 (Dell Dimension). With all these machines, I can reboot the machine safely either via a reboot command (if I still am in somehow), a ctrl-alt-del from console, or the reboot option from the gdm screen. Does this sound like what you've been experiencing? Are you still having this problem? Thanks, Dario
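For anyone comparing notes on the D-state processes described above, here is a small sketch that filters ps output down to processes in uninterruptible sleep. The function name dstate_only is an assumption made for illustration; the filter only expects the process state to be in the second column.

```shell
#!/bin/sh
# Hypothetical helper: read ps output on stdin (state in column 2) and print
# the header line plus every process whose state starts with "D"
# (uninterruptible sleep).
dstate_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example, run from a session that is still logged in:
# ps -eo pid,stat,wchan,args | dstate_only
```

The wchan column shows the kernel function each process is sleeping in, which may help tell whether the stuck logins, crond jobs, and backup script are all waiting on the same thing.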
The problem description sounds very similar. We were doing proof of concept testing and succeeded in the test despite not being able to resolve this problem. We are not actively testing at the moment. We will be updating the systems and preparing for production this fall, and will test to see if the problem is still present.
Dale - Ok. On a lark, I turned off the auditd daemon on two of my affected machines (the dual Opteron and the single P4). I'd noticed that the hangs seemed to happen right after auditd bombed out around 1AM or so in the logs. Both systems have stayed up and not exhibited symptoms for over 4 days now. I'm not willing to say they're fixed, because I don't understand well enough what LAUS does or how it could have affected just these four systems and not others of mine (I've got ~100 installed RHEL machines here), but maybe it's a data point. *shrugs* -Dario
Adding David Woodhouse to cc: list in case this is related to auditing.
Not sure who owns auditing in RHEL3 but unless it's specifically been enabled by the admin, it shouldn't be involved.
Guys, A few more "data points." I had to go reboot the other two machines I'd noticed the behavior on (the dual Xeons) today. On one of them, it continually hung at the console login (it normally is supposed to go through to runlevel 5). I attempted an interactive login and didn't allow auditd to start, and the system came up normally. I'm fuzzy on what exactly LAUS is, and why it could be affecting four of my boxes and not lots of the others. By default on install, auditd runs on all my RHEL3 boxes. Should I be opening a separate bug under laus/auditd? Thanks, Dario
FYI, LAuS WAS enabled by default before RHEL-3-U5. A clean install of RHEL-3-U5 leaves LAuS disabled by default, but an upgrade from RHEL-3-U4 leaves it enabled. Unless you require audit record data and have a mechanism in place to deal with rotation of audit logs, it makes sense NOT to enable auditing, i.e. to disable it:

    # chkconfig --del auditd

Once this is done, LAuS won't be re-enabled by an upgrade, and to enable it again you'd need to do:

    # chkconfig --level=2345 auditd on; chkconfig --level=016 auditd off

By default, the LAuS audit system puts the system into "Suspend Mode" when it finds it has insufficient free disk space to save a rotated log file. By default, this is in the /var/log/audit.d filesystem. "Suspend Mode" blocks the current audited system call and any subsequent audited system calls in an uninterruptible state until sufficient disk space exists to save the rotated audit log, at which point the suspended system calls are allowed to proceed and the system leaves suspend mode.

The suspend mode action is configured in /etc/audit/audit.conf and can be removed, but as that would lead to potential loss of audit data, suspend mode must be the default when insufficient disk space is detected. There are new mechanisms in place to assist with audit log rotation and ensure auditd won't put the system into suspend mode: see 'man audbin' and the audbin -T and -N options, and 'man audit.conf' and the 'notify' option.
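As a rough illustration of the disk space condition described above, the sketch below reads the space available on the filesystem holding /var/log/audit.d and compares it with a threshold. The helper names and the 100 MB threshold are assumptions made for this example; how much space LAuS actually needs to save a rotated log depends on the audit configuration and is not stated in this report.

```shell
#!/bin/sh
# Hypothetical helper: read `df -Pk` output on stdin and print the
# "Available" column (in KB) of the first filesystem row.
avail_kb() {
    awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed if AVAILABLE_KB is at least REQUIRED_KB.
# Usage: has_room AVAILABLE_KB REQUIRED_KB
has_room() {
    [ "$1" -ge "$2" ]
}

# Example (the 102400 KB threshold is an assumed value, not a LAuS default):
# free=$(df -Pk /var/log/audit.d | avail_kb)
# has_room "$free" 102400 || echo "low space for audit logs: ${free} KB"
```

A check like this could run from cron to give a warning before the audit system reaches the point where it suspends system calls.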
RE: Comment #17 From David Howells :
> We should enquire at install time as to whether auditing should be enabled or
> not.

Actually, I disagree, because clicking on "Enable Auditing" during installation by no means implies that users have read the LAuS documentation and configured LAuS mechanisms to handle log file rotation and the out of disk space situation that is the topic of this bug report.
(In reply to comment #18)
> RE: Comment #17 From David Howells :
> > We should enquire at install time as to whether auditing should be enabled or
> > not.
>
> Actually, I disagree, because clicking on "Enable Auditing" during installation
> by no means implies that users have read the LAuS documentation and configured
> LAuS mechanisms to handle log file rotation and the out of disk space situation

I'm inclined to agree. We shouldn't ask about it -- auditing should be unconditionally disabled in the default install, and the user should need to go out of their way to _enable_ it if required.
Has anyone come to a resolution on this issue? I have just begun to experience the same problem on 3 of my 130 RHEL3 ES servers. It's happened maybe 4 or 5 times now. If I'm logged into the system when it happens, everything seems normal and the server is responsive. However, any ssh, console login, or sudo attempts all hang. I DON'T have auditd running on any of my systems. (ran into THAT system suspend "problem" already ;) I have a Java application that runs on this server under Tomcat 5.0.28. The servers aren't even in production yet, so apart from some load tests, they're idle. One interesting thing I found was that during one of these occurrences, I killed Tomcat and then suddenly I could ssh, log in, and sudo again. What do all these processes have in common? Could there be a resource issue? Could Tomcat (my Java app) be eating up some resource that prevents ssh/sudo/login from forking a shell? If so, what resource would that be? I see that this bug is still 'ASSIGNED'. Any insight? This is, at the moment, a show stopper for our application release. I was going to open a ticket with Red Hat but then saw this bug.
(In reply to comment #20, From James Fillman (jfillman) on 2006-03-29 18:10 EST):
> I have just begun to experience the same problem on 3 of my 130 RHEL3 ES
> servers. It's happened maybe 4 or 5 times now.

Are you sure this is the same problem, i.e. being unable to log in because the LAuS auditd has put the system into suspend mode? It does not sound like it to me:

> If i'm logged into the system when it happens, everything seem normal and the
> server is responsive. However, any ssh, console logins, or sudo attempts all
> hang. I DON'T have auditd running on any of my systems. (ran into THAT system
> suspend "problem" already ;)

So it is NOT the same problem: you say LAuS is disabled on your system. I'd suggest raising a separate bug report or (better) an issue tracker ticket with your Red Hat account manager.

> One interesting thing i found was that during one of these occurences, i
> killed tomcat and then suddenly I could ssh,login,sudo again.

So perhaps consider raising a bug against "tomcat"?

> Could there be a resource issue?
> Could tomcat(my java app) be eating up some resource that prevents
> ssh/sudo/login from forking a shell? If so, what resource would that be?

    $ ulimit -a | egrep -v 'unlimited| 0$'
    pending signals                 (-i) 4087
    max locked memory       (kbytes, -l) 32
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    stack size              (kbytes, -s) 10240
    max user processes              (-u) 4087

Are any of these limits being approached when the problem occurs?
Do you have sufficient disk space on your /var partition?
Do logins hang for just one particular user, or type of user, e.g. NIS users, LDAP users, Kerberos users?
Do you have any network lookup methods in your /etc/nsswitch.conf 'passwd:' or 'group:' databases before 'files'?
When the problem occurs, does restarting sshd cure the problem? Perhaps putting sshd into debug mode with '-ddd' might help debug the problem.
In short, there are many potential causes of a login hang - please raise a new bug report or problem request ticket and we'll try to find the cause of your problem - thanks.
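One way to answer the question above about limits being approached is to compare a current count with the matching ulimit value. This is only a sketch: the function name pct_of_limit, the user name tomcat, and the PID 1234 are illustrative assumptions. Note also that ulimit reports the limits of the shell it runs in, which are not necessarily those of an already running process.

```shell
#!/bin/sh
# Hypothetical helper: print CURRENT as a whole-number percentage of LIMIT,
# or "n/a" when the limit is "unlimited" or zero.
# Usage: pct_of_limit CURRENT LIMIT
pct_of_limit() {
    current=$1
    limit=$2
    if [ "$limit" = "unlimited" ] || [ "$limit" -eq 0 ]; then
        echo "n/a"
        return 0
    fi
    echo $(( current * 100 / limit ))
}

# Examples (user name and PID are assumed; run as the user being checked):
# processes owned by the user versus "max user processes":
# pct_of_limit "$(ps -u tomcat -o pid= | wc -l)" "$(ulimit -u)"
# open descriptors of one process versus "open files":
# pct_of_limit "$(ls /proc/1234/fd | wc -l)" "$(ulimit -n)"
```

Values close to 100 at the time logins hang would point toward that particular limit; low values would suggest looking elsewhere.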
This bug is filed against RHEL 3, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed. For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.