Description of problem:

We are seeing a strange panic when creating a socket. The machine panics with the error:

Unable to handle kernel paging request at 00000000bf3c48e5

This value is not random, and the panic happens across multiple blades (although they all boot from the same PXE service).

#3 [ffff8103b97ebe10] error_exit at ffffffff8005ede9
    [exception RIP: __d_rehash+24]
    RIP: ffffffff8003a3f8  RSP: ffff8103b97ebec0  RFLAGS: 00010206
    RAX: 00000000bf3cc8dd  RBX: ffff8103e4adfdf8  RCX: 0000000000000015
    RDX: ffff8103e4adfe10  RSI: ffff810001c88708  RDI: ffff8103e4adfdf8
    RBP: ffff81042e544dc0   R8: 00000000ffffffff   R9: 0000000000000020
    R10: 0000000000000000  R11: ffffffff80128780  R12: ffff8103e4a865c0
    R13: ffff8103e4a86610  R14: 00007ffff012df50  R15: 00002b843ca17d51
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#4 [ffff8103b97ebec0] d_rehash at ffffffff80042367
#5 [ffff8103b97ebed0] sock_attach_fd at ffffffff8022718e
#6 [ffff8103b97ebf30] sock_map_fd at ffffffff8004d2be
#7 [ffff8103b97ebf60] sys_socket at ffffffff802272e9
#8 [ffff8103b97ebf80] system_call at ffffffff8005e116
    RIP: 00002b843f551b97  RSP: 00007ffff012dd40  RFLAGS: 00010246
    RAX: 0000000000000029  RBX: ffffffff8005e116  RCX: 00626546006e614a
    RDX: 0000000000000000  RSI: 0000000000000002  RDI: 0000000000000001
    RBP: 0000000000000014   R8: 0000000000000070   R9: 7220726f66207965
    R10: 6d6f726620746f6f  R11: 0000000000000246  R12: ffff8103e4a865c0
-------

When we look at the disassembly of __d_rehash and its inline friends, we can see that the value is held in %rax.
crash> dis -l __d_rehash
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/fs/dcache.c: 1525
0xffffffff8003a3e0 <__d_rehash>:     andl   $0xffffffffffffffef,0x4(%rdi)
0xffffffff8003a3e4 <__d_rehash+4>:   lea    0x18(%rdi),%rdx
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/include/linux/list.h: 866
0xffffffff8003a3e8 <__d_rehash+8>:   mov    (%rsi),%rax
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/include/linux/list.h: 868
0xffffffff8003a3eb <__d_rehash+11>:  mov    %rsi,0x8(%rdx)
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/include/linux/list.h: 867
0xffffffff8003a3ef <__d_rehash+15>:  mov    %rax,0x18(%rdi)
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/include/linux/list.h: 870
0xffffffff8003a3f3 <__d_rehash+19>:  test   %rax,%rax
0xffffffff8003a3f6 <__d_rehash+22>:  je     0xffffffff8003a3fc <__d_rehash+28>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/include/linux/list.h: 871
0xffffffff8003a3f8 <__d_rehash+24>:  mov    %rdx,0x8(%rax)   <-- We're looking at what is in RAX.

static void __d_rehash(struct dentry * entry, struct hlist_head *list)
{
        entry->d_flags &= ~DCACHE_UNHASHED;
        hlist_add_head_rcu(&entry->d_hash, list);   <-- this is inline.
}

static inline void hlist_add_head_rcu(struct hlist_node *n,
                                      struct hlist_head *h)
{
        struct hlist_node *first = h->first;
        n->next = first;
        n->pprev = &h->first;
        smp_wmb();
        if (first)                          <- test %rax,%rax
                first->pprev = &n->next;    <- mov %rdx,0x8(%rax)
        h->first = n;
}

So essentially we're looking at %rax, which is 00000000bf3cc8dd.

struct hlist_node ffff8103e4adfe10   <-- rdx
struct hlist_node {
  next = 0xbf3cc8dd,             (pointer)
  pprev = 0xffff810001c88708     (** pointer to a pointer)
}

It looks as though the value stored in the next pointer is incorrect (in the exception above). The same value appears in multiple panics; I can't figure out what overwrites the next value. The panic does not occur on the first socket created, but may happen after some time.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install RHEL 5.5 on a 2-CPU Westmere system (12 cores in total).
2. Log in.
3. Run "while [ 1 ] ;do ssh localhost ls ;done"

Actual results:
Machine panics with "Unable to handle kernel paging request at 00000000bf3c48e5". We have a vmcore available for inspection if required.

Expected results:
System continues running as normal.

Additional info:
I am unable to reproduce this issue, and don't know what could be causing the corruption of the next value. The customer believes that creating many files can cause the same problem. I tend to believe this, as it seems to exercise the same code.
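The mapping between the C source and the faulting instruction in the description can be checked with a small userspace model. This is an illustrative sketch only, not kernel code: the struct definitions are copied from the report, `model_hlist_add_head` is a made-up name for a copy of `hlist_add_head_rcu()` without `smp_wmb()`, and `plausible_kernel_pointer` is a hypothetical helper that assumes x86_64 kernel addresses sit in the upper canonical half and that hlist_node pointers are 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace copies of the kernel list types, for illustration only. */
struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;
};

struct hlist_head {
    struct hlist_node *first;
};

/* Same steps as hlist_add_head_rcu() in the report, minus smp_wmb(). */
void model_hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    assert(n != NULL && h != NULL);

    struct hlist_node *first = h->first;  /* mov (%rsi),%rax      */

    n->next = first;                      /* mov %rax,0x18(%rdi)  */
    n->pprev = &h->first;                 /* mov %rsi,0x8(%rdx)   */
    if (first)                            /* test %rax,%rax       */
        first->pprev = &n->next;          /* mov %rdx,0x8(%rax): the store that faults */
    h->first = n;
}

/*
 * Hypothetical helper: rough plausibility check for a pointer value read
 * from an x86_64 vmcore. Assumes kernel addresses are in the upper
 * canonical half and that the pointer is 8-byte aligned.
 */
int plausible_kernel_pointer(uint64_t value)
{
    return value >= 0xffff800000000000ULL && (value & 7) == 0;
}
```

Under those assumptions, the next value 0xbf3cc8dd fails both checks (it is neither in the kernel half nor aligned), while the pprev value 0xffff810001c88708 passes. In the model, the only dereference of `first` is the `first->pprev` store, which matches the instruction at __d_rehash+24.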
Kernel: 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64
Red Hat Enterprise Linux version: Red Hat Enterprise Linux Server release 5.5 (Tikanga)
CPU model: Intel(R) Xeon(R) CPU L5638 @ 2.00GHz
Memory: 32882100 kB
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
(In reply to comment #2)
> Kernel:
> 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64

I've had a report of what could be the same problem. I'll attach a screenshot (which is all I have).

Call trace was:
  d_rehash+0x21/0x34
  create_write_pipe+0x155/0x1dc
  do_sigaction+0x76/0x199
  do_pipe+0x16/0xee
  sys_pipe+0x13/0x4e
  sysenter_do_call+0x1e/0x76

RIP was:
  __d_rehash+0x18/0x20

Kernel was 2.6.18-194.17.1.el5 (CentOS compiled). System has four six-core Xeon E7-4820, with 64G.

Let me know if there's anything I can do to help.
(In reply to comment #9)
> Kernel was 2.6.18-194.17.1.el5 (CentOS compiled). System has four six-core Xeon
> E7-4820, with 64G.

Sorry, dual six-core, with HT, showing 32 CPUs.
Here's more information from syslog:

Aug 9 11:44:30 micd-01 kernel: Unable to handle kernel paging request at 000000007f873133 RIP:
Aug 9 11:44:30 micd-01 kernel: [<ffffffff8003a0cf>] __d_rehash+0x18/0x20
Aug 9 11:44:30 micd-01 kernel: PGD 1063c28067 PUD 0
Aug 9 11:44:30 micd-01 kernel: Oops: 0002 [1] SMP
Aug 9 11:44:30 micd-01 kernel: last sysfs file: /devices/pci0000:00/0000:00:00.0/local_cpus
Aug 9 11:44:30 micd-01 kernel: CPU 9
Aug 9 11:44:30 micd-01 kernel: Modules linked in: tun bonding ipv6 xfrm_nalgo crypto_api xt_multiport xt_connmark xt_CONNMARK ipt_REJECT ipt_MASQUERADE xt_state ipt_TOS xt_tcpudp ip_nat_ftp ip_conntrack_ftp iptable_mangle iptable_nat ip_nat ip_conntrack nfnetlink iptable_filter ip_tables ipt_ULOG x_tables loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod cdrom joydev sg bnx2 serio_raw pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata shpchp megaraid_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Aug 9 11:44:30 micd-01 kernel: Pid: 19291, comm: calctimezoneoff Not tainted 2.6.18-194.17.1.el5 #1
Aug 9 11:44:30 micd-01 kernel: RIP: 0010:[<ffffffff8003a0cf>] [<ffffffff8003a0cf>] __d_rehash+0x18/0x20
Aug 9 11:44:30 micd-01 kernel: RSP: 0018:ffff8110650fbeb0 EFLAGS: 00010206
Aug 9 11:44:30 micd-01 kernel: RAX: 000000007f87312b RBX: ffff81106a14ab70 RCX: 0000000000000017
Aug 9 11:44:30 micd-01 kernel: RDX: ffff81106a14ab88 RSI: ffff810004013000 RDI: ffff81106a14ab70
Aug 9 11:44:30 micd-01 kernel: RBP: ffff81107d55a910 R08: 00000000ffffffff R09: 0000000000000020
Aug 9 11:44:30 micd-01 kernel: R10: 0000000000000000 R11: ffffffff80127cd8 R12: ffff81107ce38c80
Aug 9 11:44:30 micd-01 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff8110650fbf58
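The syslog values are consistent with the faulting store being `first->pprev = &n->next`, i.e. `mov %rdx,0x8(%rax)`. A rough arithmetic check, as a sketch rather than kernel code: `hlist_node_x86_64` is a hypothetical fixed-width copy of the node layout, the function names are made up for illustration, and the `PF_*` constants mirror the low three bits of the x86 page-fault error code (present, write, user).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width copy of struct hlist_node as laid out on x86_64. */
struct hlist_node_x86_64 {
    uint64_t next;   /* offset 0 */
    uint64_t pprev;  /* offset 8 */
};

_Static_assert(offsetof(struct hlist_node_x86_64, pprev) == 8,
               "pprev is expected 8 bytes into the node");

/* Address written by "mov %rdx,0x8(%rax)" for a given RAX value. */
uint64_t faulting_address(uint64_t rax)
{
    return rax + offsetof(struct hlist_node_x86_64, pprev);
}

/* Low bits of the x86 page-fault error code. */
enum {
    PF_PROT  = 1 << 0,  /* 0: page not present, 1: protection violation */
    PF_WRITE = 1 << 1,  /* 0: read access,      1: write access         */
    PF_USER  = 1 << 2   /* 0: kernel mode,      1: user mode            */
};

/* True when the error code describes a kernel-mode write to an unmapped page. */
int is_kernel_write_to_unmapped_page(unsigned int error_code)
{
    return !(error_code & PF_PROT) &&
            (error_code & PF_WRITE) &&
           !(error_code & PF_USER);
}
```

With RAX = 000000007f87312b, `faulting_address` gives 000000007f873133, which is the address in the "Unable to handle kernel paging request" line, and "Oops: 0002" decodes as a kernel-mode write to a not-present page. Applied to the RAX from the original report (00000000bf3cc8dd), the same arithmetic gives 00000000bf3cc8e5, one hex digit away from the quoted 00000000bf3c48e5; the thread does not say whether that is a transcription slip.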
Wade, did you do a memory test? Have you learnt any more since June?
This crash on a gentoo system (EIP is at __d_rehash+0x1c/0x30) looks like it was a memory issue: http://forums.gentoo.org/viewtopic-t-438365-start-0.html
Charlie: According to the customer:

-- After a BIOS upgrade including new Intel microcode, the crash does not occur anymore. Thanks for your assistance. --

Because this did not happen on the same make/model machine in the labs with any version, the information is inconclusive. A BIOS update may solve the problem for you as well; it prevented this corruption from happening and solved the problem for the customer. Thanks.