> Description of problem:

This customer reported that the server crashed and rebooted on Sep 27 at 18:50:06. The kernel panicked with the following message:

<snip>
Initializing CPU#1
netfront: device eth0 has flipping receive path.
netfront: device eth1 has flipping receive path.
Initializing CPU#1
Kernel panic - not syncing: Unable to reduce memory reservation
</snip>

So we know where the panic occurred, but we still don't know why.

> Version-Release:

Dom0: Red Hat Enterprise Linux 5.5
Kernel: 2.6.18-194.11.1.el5xen

DomU: Red Hat Enterprise Linux 5.5
Kernel: 2.6.18-194.8.1.el5xen

> Steps to Reproduce:

We couldn't reproduce it; however, here is the domU config:

name = "test0000"
uuid = "c768369e-f781-2f19-7c6f-6d229320660b"
maxmem = 4096
memory = 4096
vcpus = 2
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "destroy"
vfb = [ ]
disk = [ "phy:/dev/mapper/mpath175,xvda,w", "phy:/dev/mapper/mpath233,xvdb,w", "phy:/dev/mapper/mpath248,xvdc,w" ]
vif = [ "mac=00:16:3e:66:10:f5,bridge=xenvlanbr204", "mac=00:16:3e:63:fb:ad,bridge=xenbr3" ]
vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]

Dom0 Hardware:

Manufacturer: HP
Product Name: ProLiant DL580 G5
Family: ProLiant

<snip>
22:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
22:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
25:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
25:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
</snip>

# cat /etc/modprobe.conf
alias eth0 e1000e
alias eth1 e1000e
alias eth2 e1000e
alias eth3 e1000e
alias eth4 bnx2
alias eth5 bnx2
alias scsi_hostadapter cciss
alias scsi_hostadapter1 usb-storage
alias bond0 bonding
alias bond1 bonding
options bonding miimon=100 mode=active-backup
install pppox /bin/true
install bluetooth /bin/true
options scsi_mod dev_flags="EMC:SYMMETRIX:0x20240,DGC:RAID 3:0x240,DGC:RAID 5:0x240" max_luns=1024
options lpfc lpfc_topology=0x02 lpfc_max_luns=1024 lpfc_nodev_tmo=10

> Expected results:

Understanding why Xen is experiencing this behavior, and a fix for it.

> Additional info:

The log shows "Unable to reduce memory reservation". The only place this message occurs in the code is in network_alloc_rx_buffers(), and the vmcore confirms that:

crash> dis -l network_alloc_rx_buffers
...
0xffffffff8811668d <network_alloc_rx_buffers+0x494>: mov    $0xffffffff8811918b,%rdi
0xffffffff88116694 <network_alloc_rx_buffers+0x49b>: xor    %eax,%eax
0xffffffff88116696 <network_alloc_rx_buffers+0x49d>: callq  0xffffffff8028d114 <panic>

crash> rd 0xffffffff8811918b 10
ffffffff8811918b:  7420656c62616e55 656375646572206f   Unable to reduce
ffffffff8811919b:  2079726f6d656d20 7461767265736572    memory reservat
ffffffff881191ab:  006425000a6e6f69 2d65727574616566   ion..%d.feature-
ffffffff881191bb:  7574616566006773 742d6f73672d6572   sg.feature-gso-t
ffffffff881191cb:  3e313c0034767063 5f6b726f7774656e   cpv4.<1>network_
crash>

The network driver must be passing Xen a page frame number that does not belong to that domain. As you may well know, incoming network packets are received into buffers in the backend, and those buffers are then page-flipped into the appropriate destination. Domains with virtual interfaces must therefore return pages to Xen (by reducing their reservation) in exchange for the pages that are flipped in. The net effect is that both backend and frontend domains vary in size by a small, bounded amount, and copies on the receive path are avoided.

DomU reduces its reservation to give receive buffers to the backend driver in Dom0. In netfront.c (the virtual network driver for conversing with remote driver backends), network_alloc_rx_buffers() checks whether the return status of HYPERVISOR_memory_op() is OK.
<snip>
	/* Check return status of HYPERVISOR_memory_op(). */
	if (unlikely(np->rx_mcl[i].result != i))
		panic("Unable to reduce memory reservation\n");
} else {
	if (HYPERVISOR_memory_op(XENMEM_decrease_reservation,
				 &reservation) != i)
		panic("Unable to reduce memory reservation\n");
}
</snip>

HYPERVISOR_memory_op() is responsible for increasing or decreasing the number of page frames reserved for the domain.

Best Regards,
Alberto dos Santos Silva Jr
There aren't that many ways for this hypercall to fail, so this is strange. We could narrow it down further if we had 'xm dmesg' output, ideally with more logging turned on, i.e. booted with 'loglvl=all guest_loglvl=all' on Xen's command line. Checking the load and memory usage of the host during the problem might be interesting as well.
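For reference, on a RHEL5 host those options go on the hypervisor line of /boot/grub/grub.conf; a sketch, with kernel versions and paths illustrative for this host:

```
title Red Hat Enterprise Linux Server (2.6.18-194.11.1.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-194.11.1.el5 loglvl=all guest_loglvl=all
        module /vmlinuz-2.6.18-194.11.1.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-194.11.1.el5xen.img
```

Note the options belong on the `kernel /xen.gz...` line (hypervisor), not on the `module /vmlinuz...` line (Dom0 kernel).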
One thing we may be able to check with the guest core we have is whether all the pages in the rx_pfn_array look valid.
Did we get a sos report from the host at the time of the crash to look at? I don't think the guest core will tell us any more, although you could check if the balloon driver is running on the long-shot that this crash is a side-effect of bug 653262, possibly making this bug a dup of that bug. dmesg and 'xm dmesg' from the host would be useful in checking that theory too, as another guest could have been ballooning.
I don't think the balloon driver is involved. The only case in which *de*crease_reservation fails is if the arguments are wrong. The guest core could help us find why guest_remove_page is failing (that's the workhorse of decrease_reservation in the hypervisor). If the customer desires a workaround, I suggest switching the netfront module to copying by using the "rx_copy=1" module option. However, a guest without this option, running on a hypervisor booted with "loglvl=all guest_loglvl=all", is necessary in order to collect the required information.
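For the workaround, assuming the PV network frontend is built as the xennet module (as in RHEL5 kernel-xen guests), the option would go in the guest's modprobe configuration:

```
# /etc/modprobe.conf in the guest -- module name assumed to be xennet
options xennet rx_copy=1
```

If the driver turns out to be built into the guest kernel instead, the equivalent would be appending `xennet.rx_copy=1` to the guest's kernel command line. Either way, receive buffers are then copied rather than page-flipped, so the decrease_reservation path that panicked is never taken.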
(In reply to comment #4) > Did we get a sos report from the host at the time of the crash to look at? > > I don't think the guest core will tell us any more, although you could check if > the balloon driver is running on the long-shot that this crash is a side-effect > of bug 653262, possibly making this bug a dup of that bug. dmesg and 'xm dmesg' > from the host would be useful in checking that theory too, as another guest > could have been ballooning. We did get a sosreport from dom0, but it's from a while after the initial incident: the core is from Sep 27 and the sosreport from Oct 19. I am attaching it anyway.
Created attachment 462417 [details] dom0 sosreport
Unfortunately nothing interesting in the sos :(
Waiting for info on the 5.6 beta. As a workaround, the guest netfront module could also be started with the rx_copy=1 option.