Description of problem:

Summary:
In the past this customer hit an issue with invalid kernel page table entries. At that time, 6 server machines in their environment were rebooting randomly every 7-10 days. Based on the analysis of many vmcores we provided the RCA; the customer applied the then-current z-stream kernel and we heard of no further problems until this new ticket.

Tickets and BZs for the previous cases:
----------------------------------------------
SFDC: 362384
Desc: crash analysis on xen host ussp-pb20

SFDC: 00339543
Desc: crash analysis on xen hosts
Splintering this off from case 00339543 to track the issue/crash specifically with ussp-pb20.

Bugzilla opened for the above cases:
https://bugzilla.redhat.com/show_bug.cgi?id=645043
Desc: RHEL5.3 - xen host - crashing randomly

Errata that the customer applied:
http://rhn.redhat.com/errata/RHSA-2009-1193.html

Bugzilla we pointed to that resolved the previous issue:
RH5.3 x64 RC2 reboots while installing a virtual machine
https://bugzilla.redhat.com/show_bug.cgi?id=479754

Currently the customer is facing a new memory issue which appears after 3-5 weeks of running 1 guest. They have 6 nodes and have thus far seen 3 crashes. The crashes weren't all on the same node, nor did they happen at the same time.

Data provided by the customer:
==================================

VMcore:
------------
00395680_vmcore.gz
# kinit <kerberos username>
# ssh kerberos_username.rdu.redhat.com
# cd /cores/20101222171342

Sosreport:
------------
sosreport-KVan.395680-445481-b0a7ca.tar.bz2
sosreport-DeepakYadav.00395680-954083-42b91f.tar.bz2 (Hypervisor)

DeepakYadav:
-------------------------------
$ cat uname
Linux ussp-pb13 2.6.18-128.7.1.el5xen #1 SMP Wed Aug 19 04:17:26 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
$ cat etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)
Memory = 32G

(Guest) KVan:
--------------
$ cat uname
Linux ussp-qi2ap00 2.6.18-128.1.1.el5xen #1 SMP Mon Jan 26 14:19:09 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
$ cat etc/redhat-release
Red Hat Enterprise Linux Server release 5.3 (Tikanga)

We have asked the customer for more information:
- new vmcores (in case it creates more)
- add 'loglvl=all guest_loglvl=all' to the xen.gz command line (grub.conf), and once the host is started again, run 'xm dmesg' and send us the output
- in case the host crashes and we still cannot see the issue, set up a serial console
- can we ask the customer to disable ballooning and try again? Edit xend-config.sxp to change dom0-min-mem to 0, and set the amount of memory for dom0 in grub.conf: dom0_mem=<memory value> (a sketch of both changes is shown below)

Additional info (Flavio's update):
----------------------------------------
crash> sys
      KERNEL: /cores/20101222171342/work/vmlinux
    DUMPFILE: /cores/20101222171342/work/00395680_vmcore
        CPUS: 4
        DATE: Tue Dec 21 22:25:59 2010
      UPTIME: 27 days, 09:44:29
LOAD AVERAGE: 0.05, 0.09, 0.12
       TASKS: 170
    NODENAME: ussp-pb13
     RELEASE: 2.6.18-128.7.1.el5xen
     VERSION: #1 SMP Wed Aug 19 04:17:26 EDT 2009
     MACHINE: x86_64 (3191 Mhz)
      MEMORY: 31 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)

Note that the annoying message "peth0: too many iterations (6) in nv_nic_irq." can be ignored; it is unrelated to the crash.
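For reference, here is a minimal sketch of the two ballooning-related changes requested above. The dom0_mem value, menu title, and kernel/initrd paths are illustrative and will differ on the customer's hosts:

# /boot/grub/grub.conf -- pin dom0 memory and raise hypervisor log levels
title Red Hat Enterprise Linux Server (2.6.18-128.7.1.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-128.7.1.el5 dom0_mem=2048M loglvl=all guest_loglvl=all
        module /vmlinuz-2.6.18-128.7.1.el5xen ro root=<root device>
        module /initrd-2.6.18-128.7.1.el5xen.img

# /etc/xen/xend-config.sxp -- keep xend from ballooning dom0 down
(dom0-min-mem 0)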
The oops happens at:

Unable to handle kernel paging request at ffff8807bf502000 RIP:
 [<ffffffff80212542>] csum_partial+0x219/0x4bc

crash> dis -l ffffffff80212542
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/lib/csum-partial.c: 98
0xffffffff80212542 <csum_partial+523>:  add    (%rdi),%r8

RDI: ffff8807bf502000

But it is accessible in crash:

crash> kmem ffff8807bf502000
      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff880023f25870 7bf502000            0           0 733 10

ffff8807bf501fc0:  0000000000000000 0000000000000000   ................
ffff8807bf501fd0:  0000000000000000 0000000000000000   ................
ffff8807bf501fe0:  4e21525600000000 00000000000b8402   ....VR!N........
ffff8807bf501ff0:  0000000000000000 0000000000000000   ................
>ffff8807bf502000:  0100000098393bc0 02801c009b541000   .;9.......T.....
                    ^^^^^^^^^^^^^^^^
ffff8807bf502010:  0000000000000000 0000000000000000   ................
ffff8807bf502020:  3e801c0000000000 000000000a000000   .......>........
ffff8807bf502030:  0000000000000000 0000000000000000   ................

though, as Fabio pointed out, the PTE is 0.

Anyway, the bt command doesn't work (maybe due to another problem?), so I'm using bt -T and trying to match it against the log. We can see this:

[ffff8807aadc1510] skb_checksum at ffffffff8023ccd2   <--[0]
[ffff8807aadc1550] ip_nat_fn at ffffffff8858f491

Disassembling skb_checksum shows this:

0xffffffff8023cccd <skb_checksum+0x1c6>:  callq  <do_munmap+0x26c>   <---[1]
0xffffffff8023ccd2 <skb_checksum+0x1cb>:  testb  $0x1,0xc(%rsp)      <--[0]

The call at [1] should actually be csum_partial, not do_munmap(). I couldn't find any consistency between the bt -t/-T output and the stack shown in the log output (which includes an apparently good skb on the stack). Thus, I'm not sure whether all the mismatches are a post-mortem effect or whether they caused the crash.

Doing a sanity check on csum_partial, we see this at its start:

0xffffffff80212337 <csum_partial>:      mov    %rdi,%rbp
0xffffffff8021233a <csum_partial+0x3>:  and    $0x1,%ebp      <--- [2]
0xffffffff8021233d <csum_partial+0x6>:  jne    0xffffffff80212344 <csum_partial+0xd>
0xffffffff8021233f <csum_partial+0x8>:  xor    %r8d,%r8d
0xffffffff80212342 <csum_partial+0xb>:  jmp    0xffffffff80212352 <csum_partial+0x1b>

[2] looks like this part:

static __force_inline unsigned do_csum(const unsigned char *buff, unsigned len)
{
        unsigned odd, count;
        unsigned long result = 0;

        if (unlikely(len == 0))
                return result;
        odd = 1 & (unsigned long) buff;   <---

As far as I can tell, it is missing the usual push, push, sub sequence that starts the function. I have another vmcore from another case (RELEASE: 2.6.18-194.17.4.el5) which shows the following:

0xffffffff80011efe <csum_partial+0>:    push   %rbp
0xffffffff80011eff <csum_partial+1>:    xor    %eax,%eax
0xffffffff80011f01 <csum_partial+3>:    test   %esi,%esi      <-- looks like the len == 0 test
0xffffffff80011f03 <csum_partial+5>:    push   %rbx
0xffffffff80011f04 <csum_partial+6>:    mov    %edx,%ebx
0xffffffff80011f06 <csum_partial+0x8>:  je     0xffffffff80012031 <csum_partial+0x133>
0xffffffff80011f0c <csum_partial+0xe>:  mov    %rdi,%rbp
0xffffffff80011f0f <csum_partial+0x11>: and    $0x1,%ebp      <--- same operation as above

Well, it could be a compiler optimization, but there's another difference:

crash> dis skb_checksum | grep "callq.*csum_partial"
0xffffffff8003b9ee <skb_checksum+0x4e>:  callq 0xffffffff80011efe <csum_partial>
0xffffffff8003bae8 <skb_checksum+0x148>: callq 0xffffffff80011efe <csum_partial>

and on this vmcore:

crash> dis skb_checksum | grep "csum_partial"
crash>

So I cannot see how the stack could show skb_checksum() calling csum_partial() when the assembly code doesn't make that call. My guess here is that Xen changed some pages and somehow that didn't go well. Therefore, I would suggest disabling memory ballooning (the main suspect when dealing with memory issues like this) for a while and seeing whether the customer can still reproduce the issue (i.e. pass dom0_mem= to the kernel and change dom0-min-mem to 0 in /etc/xen/xend-config.sxp).

Also, I tried to focus on that valid skb in the stack; here are my findings:

crash> sk_buff ffff88007c9312b0
  sk = 0xffff88009d2bf380,
  ...
  dst = 0xffff88022afdde40,
  ...
  len = 0x5dc,
  data_len = 0x5a8,      <-- non-linear skb

which means (len - data_len) bytes live in the main skb and the remaining bytes live in the frags:

crash> eval 0x5dc - 0x5a8
hexadecimal: 34
    decimal: 52

  data = 0xffff8807bbd074cc "E",
  tail = 0xffff8807bbd07500 "\001",

crash> eval 0xffff8807bbd07500 - 0xffff8807bbd074cc
hexadecimal: 34
    decimal: 52

OK, let's find the frags, since the linear area is not the faulty memory region:

crash> sk_buff.end ffff88007c9312b0
  end = 0xffff8807bbd07500 "\001"      <-- struct skb_shared_info starts here

However, the vmcore doesn't have this symbol, so we have to translate it manually. Hopefully I did everything right:

struct skb_shared_info {
    [0x0] atomic_t dataref;              = 0
    [0x4] short unsigned int nr_frags;   = 2
    [0x6] short unsigned int gso_size;   = 0
    [0x8] short unsigned int gso_segs;   = 0
    [0xa] short unsigned int gso_type;   = 0
    [0xc] unsigned int ip6_frag_id;      = 0
   [0x10] struct sk_buff *frag_list;     = NULL
   [0x18] skb_frag_t frags[18];          -> SIZE 0x10

    [0] crash> skb_frag_t ffff8807bbd07518   <-- first frag
        struct skb_frag_t {
          page = 0xffff88000c474928,
          page_offset = 0x8c,
          size = 0xa0
        }

    [1] crash> skb_frag_t ffff8807bbd07528   <-- second frag
        struct skb_frag_t {
          page = 0xffff880023f25870,
          page_offset = 0x0,
          size = 0x508
        }
}

The sizes match the skb, so that part is consistent. Here are both struct pages:

crash> page 0xffff88000c474928
struct page {
  flags = 0x0,
  _count = {
    counter = 0xb
  },
  _mapcount = {
    counter = 0xffffffff
  },
  {
    {
      private = 0x0,
      mapping = 0x0
    }
  },
  index = 0x2b0191cdc,
  lru = {
    next = 0x100100,
    prev = 0x200200
  }
}

Also, here is the second one:

crash> page 0xffff880023f25870
struct page {
  flags = 0x10,      <-- PG_reserved (weird..)
  _count = {
    counter = 0x2dd
  },
  _mapcount = {
    counter = 0xffffffff
  },
  {
    {
      private = 0x0,
      mapping = 0x0
    }
  },
  index = 0x0,
  lru = {
    next = 0x100100,
    prev = 0x200200
  }
}

And checking the bad address, we can see that its page is actually the second skb fragment:

crash> vtop ffff8807bf502000      <-- buggy address
VIRTUAL           PHYSICAL
ffff8807bf502000  7bf502000

PML4 DIRECTORY: ffffffff80201000
PAGE DIRECTORY: 854e37067 [machine]
PAGE DIRECTORY: 4e37000
   PUD: 4e370f0 => 85b656067 [machine]
   PUD: 8a56000
   PMD: 8a56fd0 => 85b451067 [machine]
   PMD: 8c51000
   PTE: 8c51810 => 0 [machine]

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff880023f25870 7bf502000            0           0 733 10
^^^-- the second fragment page

Well, the data had been written into that fragment at some point shortly before the crash, and there's no apparent reason for those pages to have been invalid then. It looks like a page mapping problem (which lines up with the PTE being 0).
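As a cross-check on the vtop output, here is a small standalone C sketch (mine, not from the vmcore) of the x86_64 4-level page-table index arithmetic. It reproduces the entry offsets seen above: 0xf0 into the PUD table, 0xfd0 into the PMD table, and 0x810 into the PTE table, where the entry is 0.

/* Compute the 4-level page-table indices for the faulting address;
 * each table entry is 8 bytes, so offset = index * 8. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t vaddr = 0xffff8807bf502000ULL;   /* buggy address from the oops */

        unsigned pml4 = (vaddr >> 39) & 0x1ff;
        unsigned pud  = (vaddr >> 30) & 0x1ff;
        unsigned pmd  = (vaddr >> 21) & 0x1ff;
        unsigned pte  = (vaddr >> 12) & 0x1ff;

        printf("PML4 index 0x%x, PUD index 0x%x (entry offset 0x%x)\n",
               pml4, pud, pud * 8);
        printf("PMD index 0x%x (entry offset 0x%x), PTE index 0x%x (entry offset 0x%x)\n",
               pmd, pmd * 8, pte, pte * 8);
        return 0;
}

So the walk itself is self-consistent; only the final PTE has been cleared.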
Does the NIC use scatter/gather? From dom0, check with 'ethtool -i peth0'. If so, we could try turning it off.
Err, use 'ethtool -k peth0' to check. Also, just to note, I think the suggestion to run with ballooning turned off is a good idea, as it should alleviate some of the risk of running into problems with a flipping netfront. Or just switch the vifs to copying.
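For reference, a minimal sketch of the check and the change from dom0 (peth0 is the interface name used in this environment; note that the setting does not persist across reboots on its own):

# ethtool -k peth0           # show offload settings; look for "scatter-gather: on"
# ethtool -K peth0 sg off    # turn scatter/gather off
# ethtool -k peth0           # confirm it now reports "scatter-gather: off"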
Andrew, note that we do have a similar PTE=0 bug with copying: https://bugzilla.redhat.com/show_bug.cgi?id=629213
Changing scatter/gather to off (ethtool -K peth0 sg off) is definitely a good idea, since Douglas's analysis points at problems in the fragments beyond the skb. As another semi-wild shot, I wonder if any of these net{front,back} corruptions we're seeing might be caused by BZ630129, which has been backported to 5.4.z and 5.5.z, but not 5.3.z. Therefore, I also wonder what kernel versions are running in the _guests_, for both the crashing and the non-crashing hosts. (Maybe the corruption in the guest breaks some invariant in the host as well; it seems a bit far-fetched, but the information should be easy to get.)
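To collect the guest kernel information mentioned above, something like the following run inside each guest (on both the crashing and non-crashing hosts) should be enough; the rpm query assumes the guests run the kernel-xen variant, as the sosreport suggests:

$ uname -r                   # currently running kernel
$ rpm -q kernel-xen          # installed kernel-xen packages (if applicable)
$ cat /etc/redhat-release    # minor release level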
Ping? There's really nothing I can do with this bug without being able to experiment with the particular box, or to get a system set up locally that reproduces the issue. Turning off scatter/gather also has a good chance of being a workaround and it should certainly be attempted as soon as possible. I'll leave this bug open for now for 5.7, but without further information I'll eventually have to close it as insufficient data.
The customer has closed their ticket and as I said in comment 6, without further information there's nothing that can be done. I'm closing this bug as insufficient data.