The customers system has crashed several times while running usermod. Below is the area of code it is crashing in. 971 static void audit_update_watch(struct audit_parent *parent, ... 999 list_for_each_entry_safe(r, nextr, &owatch->rules, rlist) { 1000 1001 oentry = container_of(r, struct audit_entry, rule); 1002 list_del(&oentry->rule.rlist); 1003 list_del_rcu(&oentry->list); 1004 1005 nentry = audit_dupe_rule(&oentry->rule, nwatch); 1006 if (unlikely(IS_ERR(nentry))) 1007 audit_panic("error updating watch, removing"); 1008 else { 1009 int h = audit_hash_ino((u32)ino); 1010 list_add(&nentry->rule.rlist, &nwatch->rules); 1011 list_add_rcu(&nentry->list, &audit_inode_hash[h]); 1012 } 1013 1014 call_rcu(&oentry->rcu, audit_free_rule_rcu); 1015 } 1016 1017 if (audit_enabled) { 1018 struct audit_buffer *ab; 1019 ab = audit_log_start(NULL, GFP_KERNEL, 1020 AUDIT_CONFIG_CHANGE); 1021 audit_log_format(ab, 1022 "op=updated rules specifying path="); 1023 audit_log_untrustedstring(ab, owatch->path); 1024 audit_log_format(ab, " with dev=%u ino=%lu\n", 1025 dev, ino); 1026 audit_log_format(ab, " list=%d res=1", r->listnr); 1027 audit_log_end(ab); 1028 } line 999 “r” is populated from owatch->rules. Line 1001, r is used to calculate the address of oentry. Line 1002, oentry is dereferenced. It is a valid address or the system would have crashed at that point. Since oentry is valid, then conversely, “r” is also valid. Between lines 1002 and 1014, oentry is used a few other times and is still valid. Line 1015 the “for” loop started on line 999 ends. Line 1026, the system crashes because “r” is now an invalid address. Also, oentry is no longer valid. They have been unmapped. It looks as if something else has been manipulating the audit rules for this inode while audit_update_watch() is working with it. https://bugzilla.redhat.com/show_bug.cgi?id=453042 may possibly be related since comment 4 mentions an item being removed from the watchlist when it is not suppose to be. But with the number of changes to this area of code, it is hard to say for sure. megatron.gsslab.rdu.redhat.com:/cores/20090810203804/work/ KERNEL: /cores/20090810203804/work/vmlinux DUMPFILE: /cores/20090810203804/work/2009-0803-0402.05-ssnladen.1.core CPUS: 8 DATE: Mon Aug 3 04:02:05 2009 UPTIME: 150 days, 11:47:42 LOAD AVERAGE: 0.24, 0.05, 0.02 TASKS: 211 NODENAME: ssnladen.ces.census.gov RELEASE: 2.6.18-92.1.10.el5xen VERSION: #1 SMP Wed Jul 23 04:11:52 EDT 2008 MACHINE: x86_64 (2800 Mhz) MEMORY: 58.6 GB PANIC: "Oops: 0000 [1] SMP " (check log for details) PID: 27986 COMMAND: "usermod" TASK: ffff880e70dc80c0 [THREAD_INFO: ffff880e7ef74000] CPU: 3 STATE: TASK_RUNNING (PANIC) crash> bt PID: 27986 TASK: ffff880e70dc80c0 CPU: 3 COMMAND: "usermod" #0 [ffff880e7ef75a10] xen_panic_event at ffffffff802705a3 #1 [ffff880e7ef75a40] notifier_call_chain at ffffffff80267749 #2 [ffff880e7ef75a60] panic at ffffffff8028ce39 #3 [ffff880e7ef75b50] oops_end at ffffffff802650e0 #4 [ffff880e7ef75b60] do_page_fault at ffffffff8026765f #5 [ffff880e7ef75c50] error_exit at ffffffff80260823 [exception RIP: audit_update_watch+618] RIP: ffffffff802ab7d5 RSP: ffff880e7ef75d08 RFLAGS: 00010296 RAX: ffff880e6db3d87c RBX: ffff880ea5929ac0 RCX: 0000000000000007 RDX: ffff880e6db3d85c RSI: 0000000000000020 RDI: ffff880e6db3d87b RBP: ffff880e72008fb0 R8: 00000000ffffffff R9: 0000000000000020 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880e75ed9dc0 R13: ffff880e720090c0 R14: ffffffff80685aa0 R15: ffff880e72008fb0 ORIG_RAX: ffffffffffffffff CS: e030 SS: e02b #6 [ffff880e7ef75d00] audit_update_watch at ffffffff802ab7d5 #7 [ffff880e7ef75d60] inotify_inode_queue_event at ffffffff8020d77c #8 [ffff880e7ef75dc0] vfs_rename at ffffffff8022b1f4 #9 [ffff880e7ef75e10] sys_renameat at ffffffff80237e16 #10 [ffff880e7ef75f80] tracesys at ffffffff802602f1 (via system_call) RIP: 0000003edb65f5b7 RSP: 00007fff4101e658 RFLAGS: 00000287 RAX: ffffffffffffffda RBX: ffffffff802602f1 RCX: ffffffffffffffff RDX: 0000000000000000 RSI: 0000000000611a00 RDI: 00007fff4101f790 RBP: 0000000000000001 R8: 0000000000000000 R9: 00000000ffffffff R10: 5a5969655a6f426a R11: 0000000000000287 R12: 00007fff4101f790 R13: 0000000000611a00 R14: 0000000000000000 R15: 0000000000611a00 ORIG_RAX: 0000000000000052 CS: 0033 SS: e02b
Created attachment 363484 [details] minimal backport of mainline fix
That's not a race, it's a straight free-then-use bug fixed in mainline by eparis; yes, we need it in 5.4, the minimal backport in the attachment above. Original mainline commit is commit e85188f424c8eec7f311deed9a70bec57aeed741 Author: Eric Paris <eparis> Date: Thu Jun 11 14:31:34 2009 -0400 Audit: dereferencing krule as if it were an audit_watch audit_update_watch() runs all of the rules for a given watch and duplicates them, attaches a new watch to them, and then when it finishes that process and has called free on all of the old rules (ok maybe still inside the rcu grace period) it proceeds to use the last element from list_for_each_entry_safe() as if it were a krule rather than being the audit_watch which was anchoring the list to output a message about audit rules changing. This patch unfies the audit message from two different places into a helper function and calls it from the correct location in audit_update_rules(). We will now get an audit message about the config changing for each rule (with each rules filterkey) rather than the previous garbage.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch posted
Event posted on 2009-10-14 12:13 EDT by jwilleford Still looking for this for 5.4.z. What is the update on this issue? This event sent from IssueTracker by jwilleford issue 325421
in kernel-2.6.18-170.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days