480330 – RHEL 4.8 kernel rhts audit-test seems to be hung

Bug 480330 - RHEL 4.8 kernel rhts audit-test seems to be hung

Summary: RHEL 4.8 kernel rhts audit-test seems to be hung

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.8
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Eric Paris
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-01-16 14:35 UTC by Vivek Goyal
Modified:	2012-06-20 16:19 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-06-20 16:19:11 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
/var/log/messages (443.06 KB, application/x-gzip) 2009-02-06 09:31 UTC, Vitaly Mayatskikh	no flags	Details
View All

Description Vivek Goyal 2009-01-16 14:35:43 UTC

Description of problem:

during rhts runs, we observed that audit-test did not finish hence rhts localwatchdog hits. Needs to be investigated why this task is not finishing.

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-78.30.EL&arch=i386&jobids=42002

Version-Release number of selected component (if applicable):

78.30.EL

How reproducible:

I have seen it 2-3 times in rhts.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Vivek Goyal 2009-01-19 15:18:28 UTC

Noticed three more instances of same issue during 79.EL run.

http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-79.EL&arch=i386&jobids=42262
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-79.EL&arch=ia64&jobids=42262
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-79.EL&arch=x86_64&jobids=42262

Comment 2 Vitaly Mayatskikh 2009-01-23 15:44:15 UTC

I have seen it also in 4.6.z kernel at 2008-04-17

Comment 3 Vitaly Mayatskikh 2009-02-06 09:31:02 UTC

Created attachment 331102 [details]
/var/log/messages

Network connection dies pretty often during audit-test. rhts script can't submit report to db before watchdog fires. Kernel and base system continues to work, but there are strange messages in system log:

Feb  5 09:24:59 nec-em16 ntpdate[3731]: step time server 10.16.71.254 offset -..7000pU.}9477656260.803217 sec

Comment 4 Jeff Burke 2009-02-26 14:20:35 UTC

Looks like we have seen it again with kernel 2.6.9-82.EL. This time it was on ppc.
Running on host squad7-lp1.rhts.bos.redhat.com base distro was RHEL4-U8-re20090223.nightly AS

Job   Recipe   Test                            Start             Status
48075 162168 /kernel/security/audit/audit-test 2009-25-02 14:50	 Aborted
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=7037395

Looking at the results it should have started the audit-test trustedprograms subtest. When you dig into the log you don't see that it actually wrote anything past the audit-test libpam subtest.
http://rhts.redhat.com/testlogs/48075/162168/1356615/TESTOUT.log
Note it the above log the last thing written was:

Testcase                                                     Result
--------                                                     ------
login                                                        

That output is from libpam not trustedprograms.

Comment 5 Vitaly Mayatskikh 2009-02-26 15:35:04 UTC

Jeff, if you'll run audit-test by hands in the loop, you'll see it kills network pretty often. That's why rhts scripts can't submit reports to remote db (hub). It's very easy to reproducer this bug.

Comment 6 Jeff Burke 2009-02-26 16:38:39 UTC

Vitaly,
   What do you mean the tests kills the network? Is the test shutting down the network interface of the system? Is that on purpose?

Comment 7 Vitaly Mayatskikh 2009-02-26 20:26:16 UTC

>   What do you mean the tests kills the network? 

Due to some reason, network is stopping to work (network interface goes down)

> Is the test shutting down the network interface of the system? 

By some side effect.

> Is that on purpose?

No, it's a bug.

Comment 8 Jeff Burke 2009-02-26 21:19:50 UTC

Vitaly,
   I think I see what you are saying in the attachment. 

----------------- snip -----------------
Feb  5 13:20:01 nec-em16 ntpd[17803]: Listening on interface wildcard, 0.0.0.0#123
Feb  5 13:20:01 nec-em16 ntpd[17803]: Listening on interface wildcard, ::#123
Feb  5 13:20:01 nec-em16 ntpd[17803]: Listening on interface lo, 127.0.0.1#123
Feb  5 13:20:01 nec-em16 ntpd[17803]: Listening on interface eth0, 10.16.64.139#123
Feb  5 13:20:01 nec-em16 ntpd[17803]: kernel time sync status 0040
Feb  5 13:20:01 nec-em16 ntpd[17803]: frequency initialized 22.110 PPM from /var/lib/ntp/drift
Feb  5 13:20:01 nec-em16 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Feb  5 13:20:01 nec-em16 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Feb  5 13:20:01 nec-em16 dhclient: send_packet: Network is down
Feb  5 13:20:02 nec-em16 ntpd[17803]: sendto(10.16.71.254): Invalid argument
Feb  6 13:17:00 nec-em16 ntpd[17803]: sendto(10.16.255.2): Invalid argument
Feb  6 13:17:01 nec-em16 ntpd[17803]: sendto(10.16.255.3): Invalid argument
Feb  6 13:17:01 nec-em16 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Feb  6 13:17:01 nec-em16 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Feb  6 13:18:03 nec-em16 ntpd[17803]: sendto(10.16.71.254): Invalid argument
Feb  6 13:18:03 nec-em16 ntpd[17803]: sendto(10.16.255.2): Invalid argument
Feb  6 13:18:07 nec-em16 ntpd[17803]: sendto(10.16.255.3): Invalid argument
Feb  6 13:18:13 nec-em16 auditd[16529]: The audit daemon is exiting.
Feb  6 13:18:13 nec-em16 kernel: audit(1170785893.691:197844): audit_pid=0 old=16529 by auid=0                                                   
Feb  6 13:18:13 nec-em16 auditd: auditd shutdown succeeded
----------------- /snip -----------------

I also see some segfaults when it is starting and stopping audit:

Feb  5 08:14:28 nec-em16 auditd: auditd startup succeeded
Feb  5 08:14:32 nec-em16 kernel: syscalls[5235]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000000514158 error 4
Feb  5 08:14:41 nec-em16 auditd[4870]: The audit daemon is exiting.

Comment 9 Vitaly Mayatskikh 2009-02-26 21:30:45 UTC

syscalls is a test program itself, it dies every time. There's usually much more strange message in log, see c#3.

Comment 11 Vivek Goyal 2010-10-13 15:48:27 UTC

Michal,

We have seen this issue more frequently recently and are little concerned. Is there any progress on this bz?

Vivek

Comment 13 Vivek Goyal 2010-10-20 14:14:10 UTC

I opened the vmcore with the help of crash and cpu seems to be idling. This is a single cpu machine.

crash-32> bt -a
PID: 0      TASK: c0371be0  CPU: 0   COMMAND: "swapper"
 #0 [c0410c98] netpoll_start_netdump at f89526f0
 #1 [c0410cb8] die at c0106862
 #2 [c0410ce8] do_page_fault at c011e0ac
 #3 [c0410dc0] error_code (via page_fault) at c0324a65
    EAX: 00000063  EBX: c038b094  ECX: 00000000  EDX: c03dbfb8  EBP: 00000000 
    DS:  007b      ESI: 00000063  ES:  007b      EDI: 00000006 
    CS:  0060      EIP: c0243403  ERR: ffffffff  EFLAGS: 00010046 
 #4 [c0410dfc] sysrq_handle_crash at c0243403
 #5 [c0410e00] __handle_sysrq at c02437b3
 #6 [c0410e3c] input_event at c02aa8c0
 #7 [c0410e58] hidinput_hid_event at c02a80b3
 #8 [c0410e7c] hid_process_event at c02a41cc
 #9 [c0410e9c] hid_input_field at c02a44a2
#10 [c0410ef4] hid_input_report at c02a4568
#11 [c0410f10] hid_irq_in at c02a461d
#12 [c0410f1c] usb_hcd_giveback_urb at c029a5c6
#13 [c0410f28] ehci_urb_done at f890d30a
#14 [c0410f70] scan_periodic at f89106e1
#15 [c0410fa8] ehci_work at f8911444
#16 [c0410fb4] ehci_irq at f891169f
--- <hard IRQ> ---
 #0 [c03dbfb8] do_IRQ at c0108764
 #1 [c03dbfb4] common_interrupt at c03249a3
    EAX: 00000000  EBX: c03db000  ECX: 00000000  EDX: 00000000  EBP: 0049b007 
    DS:  007b      ESI: 00099100  ES:  007b      EDI: c040c120 
    CS:  0060      EIP: c01040d6  ERR: ffffffc9  EFLAGS: 00000246 
 #2 [c03dbfe8] mwait_idle at c01040d6
 #3 [c03dbff4] start_kernel at c03dc6fb
 #4 [c03dbffc] is386 at c010019a

Comment 14 Eric Paris 2010-10-20 14:20:56 UTC

So everything was scheduled when they hit sysrq-c.  Any chance you could collect the equivalent of sysrq-t and add it as an attachment?

Comment 15 Vivek Goyal 2010-10-20 14:39:18 UTC

Sure. All the console logs are here. (including output of alt-sysrq-t).

http://rhts.redhat.com/testlogs/2010/10/176797/456989/3808943/console.txt

Still I would just extract the alt-sysrq-t output and attach the file with bz.

Comment 17 Jiri Pallich 2012-06-20 16:19:11 UTC

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. 
Please See https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.

Note You need to log in before you can comment on or make changes to this bug.