Description of problem: during rhts runs, we observed that audit-test did not finish hence rhts localwatchdog hits. Needs to be investigated why this task is not finishing. http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-78.30.EL&arch=i386&jobids=42002 Version-Release number of selected component (if applicable): 78.30.EL How reproducible: I have seen it 2-3 times in rhts. Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
Noticed three more instances of same issue during 79.EL run. http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-79.EL&arch=i386&jobids=42262 http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-79.EL&arch=ia64&jobids=42262 http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/security/audit/audit-test&result=Fail&rwhiteboard=kernel%202.6.9-79.EL&arch=x86_64&jobids=42262
I have seen it also in 4.6.z kernel at 2008-04-17
Created attachment 331102 [details] /var/log/messages Network connection dies pretty often during audit-test. rhts script can't submit report to db before watchdog fires. Kernel and base system continues to work, but there are strange messages in system log: Feb 5 09:24:59 nec-em16 ntpdate[3731]: step time server 10.16.71.254 offset -..7000pU.}9477656260.803217 sec
Looks like we have seen it again with kernel 2.6.9-82.EL. This time it was on ppc. Running on host squad7-lp1.rhts.bos.redhat.com base distro was RHEL4-U8-re20090223.nightly AS Job Recipe Test Start Status 48075 162168 /kernel/security/audit/audit-test 2009-25-02 14:50 Aborted http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=7037395 Looking at the results it should have started the audit-test trustedprograms subtest. When you dig into the log you don't see that it actually wrote anything past the audit-test libpam subtest. http://rhts.redhat.com/testlogs/48075/162168/1356615/TESTOUT.log Note it the above log the last thing written was: Testcase Result -------- ------ login That output is from libpam not trustedprograms.
Jeff, if you'll run audit-test by hands in the loop, you'll see it kills network pretty often. That's why rhts scripts can't submit reports to remote db (hub). It's very easy to reproducer this bug.
Vitaly, What do you mean the tests kills the network? Is the test shutting down the network interface of the system? Is that on purpose?
> What do you mean the tests kills the network? Due to some reason, network is stopping to work (network interface goes down) > Is the test shutting down the network interface of the system? By some side effect. > Is that on purpose? No, it's a bug.
Vitaly, I think I see what you are saying in the attachment. ----------------- snip ----------------- Feb 5 13:20:01 nec-em16 ntpd[17803]: Listening on interface wildcard, 0.0.0.0#123 Feb 5 13:20:01 nec-em16 ntpd[17803]: Listening on interface wildcard, ::#123 Feb 5 13:20:01 nec-em16 ntpd[17803]: Listening on interface lo, 127.0.0.1#123 Feb 5 13:20:01 nec-em16 ntpd[17803]: Listening on interface eth0, 10.16.64.139#123 Feb 5 13:20:01 nec-em16 ntpd[17803]: kernel time sync status 0040 Feb 5 13:20:01 nec-em16 ntpd[17803]: frequency initialized 22.110 PPM from /var/lib/ntp/drift Feb 5 13:20:01 nec-em16 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready Feb 5 13:20:01 nec-em16 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8 Feb 5 13:20:01 nec-em16 dhclient: send_packet: Network is down Feb 5 13:20:02 nec-em16 ntpd[17803]: sendto(10.16.71.254): Invalid argument Feb 6 13:17:00 nec-em16 ntpd[17803]: sendto(10.16.255.2): Invalid argument Feb 6 13:17:01 nec-em16 ntpd[17803]: sendto(10.16.255.3): Invalid argument Feb 6 13:17:01 nec-em16 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX Feb 6 13:17:01 nec-em16 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready Feb 6 13:18:03 nec-em16 ntpd[17803]: sendto(10.16.71.254): Invalid argument Feb 6 13:18:03 nec-em16 ntpd[17803]: sendto(10.16.255.2): Invalid argument Feb 6 13:18:07 nec-em16 ntpd[17803]: sendto(10.16.255.3): Invalid argument Feb 6 13:18:13 nec-em16 auditd[16529]: The audit daemon is exiting. Feb 6 13:18:13 nec-em16 kernel: audit(1170785893.691:197844): audit_pid=0 old=16529 by auid=0 Feb 6 13:18:13 nec-em16 auditd: auditd shutdown succeeded ----------------- /snip ----------------- I also see some segfaults when it is starting and stopping audit: Feb 5 08:14:28 nec-em16 auditd: auditd startup succeeded Feb 5 08:14:32 nec-em16 kernel: syscalls[5235]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000000514158 error 4 Feb 5 08:14:41 nec-em16 auditd[4870]: The audit daemon is exiting.
syscalls is a test program itself, it dies every time. There's usually much more strange message in log, see c#3.
Michal, We have seen this issue more frequently recently and are little concerned. Is there any progress on this bz? Vivek
I opened the vmcore with the help of crash and cpu seems to be idling. This is a single cpu machine. crash-32> bt -a PID: 0 TASK: c0371be0 CPU: 0 COMMAND: "swapper" #0 [c0410c98] netpoll_start_netdump at f89526f0 #1 [c0410cb8] die at c0106862 #2 [c0410ce8] do_page_fault at c011e0ac #3 [c0410dc0] error_code (via page_fault) at c0324a65 EAX: 00000063 EBX: c038b094 ECX: 00000000 EDX: c03dbfb8 EBP: 00000000 DS: 007b ESI: 00000063 ES: 007b EDI: 00000006 CS: 0060 EIP: c0243403 ERR: ffffffff EFLAGS: 00010046 #4 [c0410dfc] sysrq_handle_crash at c0243403 #5 [c0410e00] __handle_sysrq at c02437b3 #6 [c0410e3c] input_event at c02aa8c0 #7 [c0410e58] hidinput_hid_event at c02a80b3 #8 [c0410e7c] hid_process_event at c02a41cc #9 [c0410e9c] hid_input_field at c02a44a2 #10 [c0410ef4] hid_input_report at c02a4568 #11 [c0410f10] hid_irq_in at c02a461d #12 [c0410f1c] usb_hcd_giveback_urb at c029a5c6 #13 [c0410f28] ehci_urb_done at f890d30a #14 [c0410f70] scan_periodic at f89106e1 #15 [c0410fa8] ehci_work at f8911444 #16 [c0410fb4] ehci_irq at f891169f --- <hard IRQ> --- #0 [c03dbfb8] do_IRQ at c0108764 #1 [c03dbfb4] common_interrupt at c03249a3 EAX: 00000000 EBX: c03db000 ECX: 00000000 EDX: 00000000 EBP: 0049b007 DS: 007b ESI: 00099100 ES: 007b EDI: c040c120 CS: 0060 EIP: c01040d6 ERR: ffffffc9 EFLAGS: 00000246 #2 [c03dbfe8] mwait_idle at c01040d6 #3 [c03dbff4] start_kernel at c03dc6fb #4 [c03dbffc] is386 at c010019a
So everything was scheduled when they hit sysrq-c. Any chance you could collect the equivalent of sysrq-t and add it as an attachment?
Sure. All the console logs are here. (including output of alt-sysrq-t). http://rhts.redhat.com/testlogs/2010/10/176797/456989/3808943/console.txt Still I would just extract the alt-sysrq-t output and attach the file with bz.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please See https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.