> Description of problem:

This customer reported that the server crashed and rebooted on Sep 27 at 18:50:06. The kernel panicked with the following message:

<snip>
Initializing CPU#1
netfront: device eth0 has flipping receive path.
netfront: device eth1 has flipping receive path.
Initializing CPU#1
Kernel panic - not syncing: Unable to reduce memory reservation
</snip>

So we know where the panic occurred, but we still don't know why.

> Version-Release:

Dom0: Red Hat Enterprise Linux 5.5
Kernel: 2.6.18-194.11.1.el5xen

DomU: Red Hat Enterprise Linux 5.5
Kernel: 2.6.18-194.8.1.el5xen

> Steps to Reproduce:

We couldn't reproduce it; however, here is the domU config:

name = "test0000"
uuid = "c768369e-f781-2f19-7c6f-6d229320660b"
maxmem = 4096
memory = 4096
vcpus = 2
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "destroy"
vfb = [ ]
disk = [ "phy:/dev/mapper/mpath175,xvda,w", "phy:/dev/mapper/mpath233,xvdb,w", "phy:/dev/mapper/mpath248,xvdc,w" ]
vif = [ "mac=00:16:3e:66:10:f5,bridge=xenvlanbr204", "mac=00:16:3e:63:fb:ad,bridge=xenbr3" ]
vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]

Dom0 Hardware:

Manufacturer: HP
Product Name: ProLiant DL580 G5
Family: ProLiant

<snip>
22:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
22:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
25:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
25:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
</snip>

# cat /etc/modprobe.conf
alias eth0 e1000e
alias eth1 e1000e
alias eth2 e1000e
alias eth3 e1000e
alias eth4 bnx2
alias eth5 bnx2
alias scsi_hostadapter cciss
alias scsi_hostadapter1 usb-storage
alias bond0 bonding
alias bond1 bonding
options bonding miimon=100 mode=active-backup
install pppox /bin/true
install bluetooth /bin/true
options scsi_mod dev_flags="EMC:SYMMETRIX:0x20240,DGC:RAID 3:0x240,DGC:RAID 5:0x240" max_luns=1024
options lpfc lpfc_topology=0x02 lpfc_max_luns=1024 lpfc_nodev_tmo=10

> Expected results:

Understanding why Xen is experiencing this behavior, and a fix for it.

> Additional info:

The log shows "Unable to reduce memory reservation". The only place this message occurs in the code is in network_alloc_rx_buffers(), and the vmcore confirms that:

crash> dis -l network_alloc_rx_buffers
...
0xffffffff8811668d <network_alloc_rx_buffers+0x494>: mov    $0xffffffff8811918b,%rdi
0xffffffff88116694 <network_alloc_rx_buffers+0x49b>: xor    %eax,%eax
0xffffffff88116696 <network_alloc_rx_buffers+0x49d>: callq  0xffffffff8028d114 <panic>

crash> rd 0xffffffff8811918b 10
ffffffff8811918b:  7420656c62616e55 656375646572206f   Unable to reduce
ffffffff8811919b:  2079726f6d656d20 7461767265736572    memory reservat
ffffffff881191ab:  006425000a6e6f69 2d65727574616566   ion..%d.feature-
ffffffff881191bb:  7574616566006773 742d6f73672d6572   sg.feature-gso-t
ffffffff881191cb:  3e313c0034767063 5f6b726f7774656e   cpv4.<1>network_
crash>

The network driver must be passing Xen a page frame number that does not belong to that domain. As you may well know, incoming network packets are received into buffers in the backend, and those buffers are then page-flipped into the appropriate destination. Domains with virtual interfaces must therefore return pages to Xen (by reducing their reservation) in exchange for the pages that are flipped in. The net effect is that both backend and frontend domains vary in size by a small, bounded amount, and copies on the receive path are avoided.

DomU reduces its reservation to give receive buffers to the backend driver in Dom0. In netfront.c (the virtual network driver for conversing with remote driver backends), network_alloc_rx_buffers() checks whether the return status of HYPERVISOR_memory_op() is OK.
<snip>
	/* Check return status of HYPERVISOR_memory_op(). */
	if (unlikely(np->rx_mcl[i].result != i))
		panic("Unable to reduce memory reservation\n");
} else {
	if (HYPERVISOR_memory_op(XENMEM_decrease_reservation,
				 &reservation) != i)
		panic("Unable to reduce memory reservation\n");
}
</snip>

HYPERVISOR_memory_op() is responsible for increasing or decreasing the number of page frames reserved for the domain.

Best Regards,
Alberto dos Santos Silva Jr
There aren't that many ways for this hypercall to fail, so this is strange. We could narrow it down further if we had 'xm dmesg' output, ideally with more logging turned on, i.e. booted with 'loglvl=all guest_loglvl=all' on Xen's command line. Checking the load and memory usage of the host during the problem might be interesting as well.
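For reference, on a RHEL5 host those options go on the hypervisor line of /boot/grub/grub.conf; a sketch, with kernel versions and paths illustrative for this host:

```
title Red Hat Enterprise Linux Server (2.6.18-194.11.1.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-194.11.1.el5 loglvl=all guest_loglvl=all
        module /vmlinuz-2.6.18-194.11.1.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-194.11.1.el5xen.img
```

Note the options belong on the `kernel /xen.gz...` line (hypervisor), not on the `module /vmlinuz...` line (Dom0 kernel).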
One thing we may be able to check with the guest core we have is whether all the pages in the rx_pfn_array look valid.
Did we get a sos report from the host at the time of the crash to look at? I don't think the guest core will tell us any more, although you could check if the balloon driver is running on the long-shot that this crash is a side-effect of bug 653262, possibly making this bug a dup of that bug. dmesg and 'xm dmesg' from the host would be useful in checking that theory too, as another guest could have been ballooning.
I don't think the balloon driver is involved. The only case in which *de*crease_reservation fails is if the arguments are wrong. The guest core could help us find why guest_remove_page is failing (that's the workhorse of decrease_reservation in the hypervisor). If the customer desires a workaround, I suggest switching the netfront module to copying by using the "rx_copy=1" module option. However, a guest without this option, running on a hypervisor booted with "loglvl=all guest_loglvl=all", is necessary in order to collect the required information.
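For the workaround, assuming the PV network frontend is built as the xennet module (as in RHEL5 kernel-xen guests), the option would go in the guest's modprobe configuration:

```
# /etc/modprobe.conf in the guest -- module name assumed to be xennet
options xennet rx_copy=1
```

If the driver turns out to be built into the guest kernel instead, the equivalent would be appending `xennet.rx_copy=1` to the guest's kernel command line. Either way, receive buffers are then copied rather than page-flipped, so the decrease_reservation path that panicked is never taken.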
(In reply to comment #4) > Did we get a sos report from the host at the time of the crash to look at? > > I don't think the guest core will tell us any more, although you could check if > the balloon driver is running on the long-shot that this crash is a side-effect > of bug 653262, possibly making this bug a dup of that bug. dmesg and 'xm dmesg' > from the host would be useful in checking that theory too, as another guest > could have been ballooning. We did get a sosreport from dom0, but it's from a while after the initial incident: the core is from Sep 27 and the sosreport from Oct 19. I am attaching it anyway.
Created attachment 462417 [details] dom0 sosreport
Unfortunately nothing interesting in the sos :(
Waiting for info on the 5.6 beta. As a workaround, the guest netfront module could also be started with the rx_copy=1 option.