Bug 666453 - RHEL5.3 - xen host - random memory crash
Summary: RHEL5.3 - xen host - random memory crash
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.3.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Andrew Jones
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 514489
 
Reported: 2010-12-30 21:53 UTC by Douglas Schilling Landgraf
Modified: 2018-11-27 19:24 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-09 12:07:45 UTC
Target Upstream Version:
Embargoed:



Description Douglas Schilling Landgraf 2010-12-30 21:53:38 UTC
Description of problem:

Summary:

In the past, this customer faced an issue with invalid kernel page table entries.
At that time, 6 server machines in their environment were rebooting randomly
every 7-10 days.

Based on analysis of the many vmcores, we provided the RCA; the customer applied the current
z-stream kernel and we heard of no further problems until this new ticket.

Tickets and BZs for the previous cases:
----------------------------------------------

SFDC: 362384
Desc: crash analysis on xen host ussp-pb20

SFDC: 00339543
Desc: crash analysis on xen hosts
      Splintering this off from case 00339543 to track the issue/crash specifically with ussp-pb20.

Bugzilla opened for the above cases
https://bugzilla.redhat.com/show_bug.cgi?id=645043
Desc:  RHEL5.3 - xen host - crashing randomly

Errata that customer applied:
http://rhn.redhat.com/errata/RHSA-2009-1193.html

Bugzilla we pointed to as having solved the previous issue:
RH5.3 x64 RC2 reboots while installing a virtual machine
https://bugzilla.redhat.com/show_bug.cgi?id=479754

Currently the customer is facing a new memory issue which appears after 3-5 weeks of running 1 guest.
They have 6 nodes and have thus far seen 3 crashes. The crashes weren't all on the same node, nor did they happen at the same time.

Data provided by the customer:
==================================

VMcore:
------------

00395680_vmcore.gz

# kinit <kerberos username>
# ssh kerberos_username.rdu.redhat.com
# cd /cores/20101222171342

Sosreport:
------------

sosreport-KVan.395680-445481-b0a7ca.tar.bz2
sosreport-DeepakYadav.00395680-954083-42b91f.tar.bz2

(Hypervisor) DeepakYadav:
-------------------------------
$ cat uname 
Linux ussp-pb13 2.6.18-128.7.1.el5xen #1 SMP Wed Aug 19 04:17:26 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

$ cat etc/redhat-release 
Red Hat Enterprise Linux Server release 5.3 (Tikanga)

Memory = 32G

(Guest) KVan:
--------------
$ cat uname 
Linux ussp-qi2ap00 2.6.18-128.1.1.el5xen #1 SMP Mon Jan 26 14:19:09 EST 2009 x86_64 x86_64 x86_64 GNU/Linux

$ cat etc/redhat-release 
Red Hat Enterprise Linux Server release 5.3 (Tikanga)

We have asked the customer for more info:

      - new vmcores (in case more are generated)
      - Add 'loglvl=all guest_loglvl=all' to their xen.gz command line (grub.conf),
        and when they start it again run 'xm dmesg' and provide the output to us.
      - In case they crash and cannot see the issue, set up a serial console.
      - Ask the customer to disable ballooning and give it a new try:

         Edit xend-config.sxp to change dom0-min-mem to 0, and set the amount of
         memory for dom0 in grub.conf:

           dom0_mem=<memory value>

Additional info (Flavio's update):
----------------------------------------

crash> sys
      KERNEL: /cores/20101222171342/work/vmlinux
    DUMPFILE: /cores/20101222171342/work/00395680_vmcore
        CPUS: 4
        DATE: Tue Dec 21 22:25:59 2010
      UPTIME: 27 days, 09:44:29
LOAD AVERAGE: 0.05, 0.09, 0.12
       TASKS: 170
    NODENAME: ussp-pb13
     RELEASE: 2.6.18-128.7.1.el5xen
     VERSION: #1 SMP Wed Aug 19 04:17:26 EDT 2009
     MACHINE: x86_64  (3191 Mhz)
      MEMORY: 31 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)

The annoying message "peth0: too many iterations (6) in nv_nic_irq." can be ignored because it is unrelated to the crash.

The oops happens at:
Unable to handle kernel paging request at ffff8807bf502000 RIP: [<ffffffff80212542>] csum_partial+0x219/0x4bc
crash> dis -l ffffffff80212542
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/lib/csum-partial.c: 98
0xffffffff80212542 <csum_partial+523>:  add    (%rdi),%r8
RDI: ffff8807bf502000

But it is accessible in crash:
crash> kmem ffff8807bf502000
      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff880023f25870 7bf502000                0        0 733 10

 ffff8807bf501fc0:  0000000000000000 0000000000000000   ................
 ffff8807bf501fd0:  0000000000000000 0000000000000000   ................
 ffff8807bf501fe0:  4e21525600000000 00000000000b8402   ....VR!N........
 ffff8807bf501ff0:  0000000000000000 0000000000000000   ................
>ffff8807bf502000:  0100000098393bc0 02801c009b541000   .;9.......T.....
                    ^^^^^^^^^^^^^^^^
 ffff8807bf502010:  0000000000000000 0000000000000000   ................
 ffff8807bf502020:  3e801c0000000000 000000000a000000   .......>........
 ffff8807bf502030:  0000000000000000 0000000000000000   ................

though, as Fabio pointed out, the PTE is 0. Anyway, the bt command doesn't work (maybe due to another problem?), so I'm using bt -T and trying to match it against the log. We can see this:
 [ffff8807aadc1510] skb_checksum at ffffffff8023ccd2 <--[0]
 [ffff8807aadc1550] ip_nat_fn at ffffffff8858f491

disassembling skb_checksum shows this:
0xffffffff8023cccd       <skb_checksum+0x1c6>:        callq   <do_munmap+0x26c> <---[1]
0xffffffff8023ccd2<--[0] <skb_checksum+0x1cb>:        testb  $0x1,0xc(%rsp)

Well, that [1] should actually be csum_partial(), not do_munmap(). I couldn't find any consistent logic in the bt -t/-T output, just as with the stack in the log output (which includes an apparently good skb on the stack).

Thus, I'm not sure whether all these mismatches are a post-mortem effect or whether they caused the crash.
Doing a sanity check on csum_partial, we see this at the start:
0xffffffff80212337 <csum_partial>:      mov    %rdi,%rbp
0xffffffff8021233a <csum_partial+0x3>:  and    $0x1,%ebp  <--- [2]
0xffffffff8021233d <csum_partial+0x6>:  jne    0xffffffff80212344 <csum_partial+0xd>
0xffffffff8021233f <csum_partial+0x8>:  xor    %r8d,%r8d
0xffffffff80212342 <csum_partial+0xb>:  jmp    0xffffffff80212352 <csum_partial+0x1b>

[2] 
static __force_inline unsigned do_csum(const unsigned char *buff, unsigned len)
{
        unsigned odd, count;
        unsigned long result = 0;

        if (unlikely(len == 0))
                return result;
        odd = 1 & (unsigned long) buff;  <--- looks like this part:

As far as I can tell, it is missing the usual push, push, sub sequence to start the function. I have another vmcore from another case which shows the following:
RELEASE: 2.6.18-194.17.4.el5
0xffffffff80011efe <csum_partial+0>:    push   %rbp
0xffffffff80011eff <csum_partial+1>:    xor    %eax,%eax
0xffffffff80011f01 <csum_partial+3>:    test   %esi,%esi  <-- looks like len == 0 test.
0xffffffff80011f03 <csum_partial+5>:    push   %rbx
0xffffffff80011f04 <csum_partial+6>:    mov    %edx,%ebx
0xffffffff80011f06 <csum_partial+0x8>:  je     0xffffffff80012031 <csum_partial+0x133>
0xffffffff80011f0c <csum_partial+0xe>:  mov    %rdi,%rbp
0xffffffff80011f0f <csum_partial+0x11>: and    $0x1,%ebp  <--- same operation as above

Well, it could be a compiler optimization, but there's another difference:
crash> dis skb_checksum | grep "callq.*csum_partial"
0xffffffff8003b9ee <skb_checksum+0x4e>: callq  0xffffffff80011efe <csum_partial>
0xffffffff8003bae8 <skb_checksum+0x148>:        callq  0xffffffff80011efe <csum_partial>

and on this vmcore:
crash> dis skb_checksum | grep "csum_partial"
crash> 

Thus I can't see how the stack could show skb_checksum() calling csum_partial() if the assembly code doesn't do that.

So, my guess here is that Xen had changed some pages and somehow that didn't go well. Therefore, I would suggest disabling memory ballooning (the main suspect when dealing with memory issues like this) for a while and seeing whether the customer can still reproduce the issue (i.e. pass dom0_mem= to the kernel and change dom0-min-mem to 0 in /etc/xen/xend-config.sxp).

Also, I tried to focus on that valid skb in the stack, here are my findings:

crash> sk_buff ffff88007c9312b0
  sk = 0xffff88009d2bf380,
...
  dst = 0xffff88022afdde40,
...
  len = 0x5dc,
  data_len = 0x5a8,  <-- non linear skb

which means len - data_len in the main skb and the remaining in the frags:
crash> eval 0x5dc - 0x5a8
hexadecimal: 34
    decimal: 52

  data = 0xffff8807bbd074cc "E",
  tail = 0xffff8807bbd07500 "\001",
crash> eval 0xffff8807bbd07500 - 0xffff8807bbd074cc
hexadecimal: 34
    decimal: 52
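
Just to spell out why the fragment pages matter for this oops: skb_checksum() runs csum_partial() over the linear area first and then over each fragment's mapped page, which is how a bad fragment page ends up faulting inside csum_partial(). The snippet below is a deliberately simplified, self-contained model of that walk, not the kernel code (csum_partial is replaced by a plain byte sum), using the sizes from this skb:

/* Simplified model of skb_checksum()'s walk over a non-linear skb:
 * checksum the linear area, then each fragment's page data.
 * Sizes mirror the skb above: headlen = 0x34, frags = 0xa0 + 0x508 = 0x5a8 = data_len. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct frag { const uint8_t *data; size_t size; };

/* Stand-in for csum_partial(): a plain byte sum is enough to show the
 * control flow; the real routine does ones'-complement arithmetic. */
static uint32_t csum_partial_model(const uint8_t *buf, size_t len, uint32_t sum)
{
	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

static uint32_t skb_checksum_model(const uint8_t *linear, size_t headlen,
				   const struct frag *frags, size_t nr_frags)
{
	uint32_t csum = csum_partial_model(linear, headlen, 0);

	/* The per-fragment pass is where this oops happens: the fragment's
	 * page is mapped and csum_partial() reads from that address. */
	for (size_t i = 0; i < nr_frags; i++)
		csum = csum_partial_model(frags[i].data, frags[i].size, csum);
	return csum;
}

int main(void)
{
	static uint8_t linear[0x34], f0[0xa0], f1[0x508];
	struct frag frags[] = { { f0, sizeof f0 }, { f1, sizeof f1 } };

	printf("csum = %#x\n",
	       (unsigned) skb_checksum_model(linear, sizeof linear, frags, 2));
	return 0;
}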

OK, going to find the frags, as this is not the faulty memory area.

crash> sk_buff.end ffff88007c9312b0
  end = 0xffff8807bbd07500 "\001" <-- here starts struct skb_shared_info
however, the vmcore doesn't have this symbol, so we have to translate manually.
Hopefully I did everything right:

struct skb_shared_info {
    [0x0] atomic_t dataref; = 0
    [0x4] short unsigned int nr_frags;  = 2
    [0x6] short unsigned int gso_size;  = 0
    [0x8] short unsigned int gso_segs;  = 0
    [0xa] short unsigned int gso_type;  = 0
    [0xc] unsigned int ip6_frag_id; = 0
   [0x10] struct sk_buff *frag_list; = NULL
   [0x18] skb_frag_t frags[18]; -> SIZE 0x10
         [0] crash> skb_frag_t ffff8807bbd07518  <-- first frag
                  struct skb_frag_t {
                     page = 0xffff88000c474928,
                     page_offset = 0x8c,
                     size = 0xa0
                  }
         [1] crash> skb_frag_t ffff8807bbd07528  <-- second frag
                  struct skb_frag_t {
                     page = 0xffff880023f25870,
                     page_offset = 0x0,
                     size = 0x508
                  }

}
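
As a cross-check of the manual translation above, here is a small userspace program laying out skb_shared_info and skb_frag_t the way I read them from the 2.6.18-era headers (the kernel types are replaced by plain C stand-ins, so treat the exact field widths as assumptions); it reproduces the offsets used above (nr_frags at 0x4, frag_list at 0x10, frags at 0x18 with a 0x10 stride):

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel types; only sizes/alignment matter here. */
typedef struct { int counter; } atomic_t;
struct page;
struct sk_buff;

typedef struct skb_frag_struct {
	struct page *page;
	uint16_t page_offset;
	uint16_t size;
} skb_frag_t;

struct skb_shared_info {
	atomic_t dataref;               /* 0x0  */
	unsigned short nr_frags;        /* 0x4  */
	unsigned short gso_size;        /* 0x6  */
	unsigned short gso_segs;        /* 0x8  */
	unsigned short gso_type;        /* 0xa  */
	unsigned int ip6_frag_id;       /* 0xc  */
	struct sk_buff *frag_list;      /* 0x10 */
	skb_frag_t frags[18];           /* 0x18, MAX_SKB_FRAGS on 4k pages */
};

int main(void)
{
	printf("nr_frags  @ 0x%zx\n", offsetof(struct skb_shared_info, nr_frags));
	printf("frag_list @ 0x%zx\n", offsetof(struct skb_shared_info, frag_list));
	printf("frags     @ 0x%zx, stride 0x%zx\n",
	       offsetof(struct skb_shared_info, frags), sizeof(skb_frag_t));
	return 0;
}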

The sizes match the skb (0xa0 + 0x508 = 0x5a8 = data_len), so it's all good. Here are both struct pages:
crash> page 0xffff88000c474928
struct page {
  flags = 0x0, 
  _count = {
    counter = 0xb
  }, 
  _mapcount = {
    counter = 0xffffffff
  }, 
  {
    {
      private = 0x0, 
      mapping = 0x0
    }
  }, 
  index = 0x2b0191cdc, 
  lru = {
    next = 0x100100, 
    prev = 0x200200
  }
}

Also, here is the second one:
crash> page 0xffff880023f25870
struct page {
  flags = 0x10,  <-- PG_reserved (weird..)
  _count = {
    counter = 0x2dd
  }, 
  _mapcount = {
    counter = 0xffffffff
  }, 
  {
    {
      private = 0x0, 
      mapping = 0x0
    }
  }, 
  index = 0x0, 
  lru = {
    next = 0x100100, 
    prev = 0x200200
  }
}

and checking the bad address, we can see that its page is actually the second skb fragment's page:
crash> vtop ffff8807bf502000 <-- buggy address
VIRTUAL           PHYSICAL
ffff8807bf502000  7bf502000

PML4 DIRECTORY: ffffffff80201000
PAGE DIRECTORY: 854e37067 [machine]
PAGE DIRECTORY: 4e37000
   PUD: 4e370f0 => 85b656067 [machine]
   PUD: 8a56000
   PMD: 8a56fd0 => 85b451067 [machine]
   PMD: 8c51000
   PTE: 8c51810 => 0 [machine]

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff880023f25870 7bf502000                0        0 733 10
 ^^^-- the second fragment page

Well, at some point right before the crash, data had been written into that fragment, and there's no apparent reason for those pages to have been invalid then.
It sounds like a page mapping problem (which fits with the PTE being 0).
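
To make the correspondence explicit: subtracting the direct-map base implied by the kmem/vtop output above (this base is read off those outputs, not taken from a header, so treat it as an assumption) from the faulting virtual address gives exactly the physical address of the second fragment's page:

#include <stdio.h>

int main(void)
{
	unsigned long long direct_map_base = 0xffff880000000000ULL; /* implied by the kmem/vtop output above */
	unsigned long long faulting_va     = 0xffff8807bf502000ULL; /* RDI from the oops */

	/* Prints 0x7bf502000 -- the PHYSICAL column for the second fragment's page. */
	printf("phys = %#llx\n", faulting_va - direct_map_base);
	return 0;
}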

Comment 1 Andrew Jones 2011-01-03 10:07:58 UTC
Does the NIC use scatter/gather?

From dom0 check
ethtool -i peth0

If so, we could try turning it off.

Comment 2 Andrew Jones 2011-01-03 11:55:17 UTC
Err, use 'ethtool -k peth0' to check.

Also, just to note, I think the suggestion to run with ballooning turned off is a good idea, as it should alleviate some risk of running into problems with a flipping netfront. Or just switch the vifs to copying.

Comment 3 Paolo Bonzini 2011-01-03 13:31:50 UTC
Andrew,

note that we do have a similar PTE=0 bug with copying: 

https://bugzilla.redhat.com/show_bug.cgi?id=629213

Comment 4 Paolo Bonzini 2011-01-04 16:41:14 UTC
Changing scatter/gather to off (ethtool -K peth0 sg off) is definitely a good idea since Douglas's analysis points at problems in the fragments beyond the skb.

As another semi-wild shot, I wonder if any of these net{front,back} corruptions we're seeing might be caused by BZ630129, which has been backported to 5.4.z and 5.5.z, but not 5.3.z.  Therefore, I wonder what kernel versions are running in the _guests_ instead, for both the crashing and the non-crashing hosts.  (Maybe the corruption in the guest breaks some invariant in the host as well---seems a bit far fetched, but the information should be easy to get).

Comment 6 Andrew Jones 2011-01-17 12:06:45 UTC
Ping? There's really nothing I can do with this bug without being able to experiment with the particular box, or to get a system set up locally that reproduces the issue. Turning off scatter/gather also has a good chance of being a workaround and it should certainly be attempted as soon as possible. I'll leave this bug open for now for 5.7, but without further information I'll eventually have to close it as insufficient data.

Comment 7 Andrew Jones 2011-05-09 12:07:45 UTC
The customer has closed their ticket and as I said in comment 6, without further information there's nothing that can be done. I'm closing this bug as insufficient data.

