+++ This bug was initially created as a clone of Bug #212826 +++

{If a broken domU can take down bridged networking for all guests, then that's something that definitely needs fixing for RHEL-5 too. --sct}

Description of problem:
Networking between all domains fails. The following message is flooding /var/log/messages on all domains (dom0 + domU):

"xen_net: Memory squeeze in netback driver."

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Generate network traffic between domU and dom0. For example:
   root@xendom0# ping xendomU1

Actual results:
/var/log/messages gets flooded on dom0 and on all domUs (even those not on the same virtual network (bridge)).

Expected results:
Should work :-)

Additional info:

-- Additional comment from sct on 2006-10-30 06:32 EST --

When do these occur? On boot? On domain creation? Under load? What networking configuration are you using?

-- Additional comment from buffalo on 2006-10-30 16:51 EST --

They occur when the first few bytes go over any bridge. It doesn't matter which system the data is coming from or going to, as long as at least one of them is on xenbr(0-2), which in my configuration means any domain.

The system has 2 physical network interfaces. Each of them is connected to a bridge: one bridge for the LAN, another for the internet connection. There is also a purely virtual bridge for the DMZ.

dom0   domU* (multiple internal systems)
 |      |
 xenbr0-eth0 (LAN)
      |
 domU (router/fw)
      |
 xenbr1 (DMZ) - domU* (multiple DMZ systems)
      |
 domU (router/fw)
      |
 xenbr2-eth1 (internet)
      |
 internet

-- Additional comment from buffalo on 2006-11-01 05:21 EST --

Problem solved. A domU running kernel-xenU-2.6.17-1.2174_FC5 breaks networking with dom0 2.6.18-1.2798.fc6xen. When all domUs are running 2.6.17-1.2187_FC5xenU, everything works fine. Booting one with kernel-xenU-2.6.17-1.2174_FC5 breaks everything on all xenbr*. After stopping the "bad" domU, everything resumed working.
-- Additional comment from sct on 2006-11-01 08:10 EST --

So when you have one 2174 kernel running, all other domains fail all the time? That one domain breaks all the others? Or is only the 2174 domain itself broken?

-- Additional comment from buffalo on 2006-11-01 09:00 EST --

Yes. All interfaces connected to a xen bridge become unavailable, even those which are not on the same xen bridge. The only devices that keep working are the physical devices on dom0.

I removed all domains from autostart and started each domU manually. First I started my firewall/router domU, which separates the LAN from the DMZ, and pinged it from dom0; it worked. Then I started the outer firewall/router domU, and it was pingable too. After starting a domU connected to xenbr0 with the 2174 kernel, the pings to my outer router stopped. After shutting it down, the pings made their way again. Since it was the only domU still on 2174, I upgraded it and tried again, and everything worked.

Since I still have the old kernel installed, it's very easy for me to reproduce. If you want any traces, please let me know.

Regards
Heiko
I can provide more info on this. What would help? I *am* seeing this in RHEL 5.1. kernel-xen-2.6.18-53.el5 xen-libs-3.0.3-41.el5 xen-3.0.3-41.el5 I am able to reproduce with 9 VMs running. Each has 2 interfaces presented. As soon as I add the 10th VM (19th and 20th vif) I get the memory squeeze and lose network connection to all VMs.
Mark, your problem sounds quite different. I think in your case the HV is simply running out of memory. With the default RHEL setup, we assign almost all the available RAM to guests, leaving the HV with very little. This is simply broken, as the HV needs to have some free memory so that things like networking can operate. The original problem here is a suspicion that a broken domU can bring down the whole machine. However, there is currently no proof of that. So if you want to pursue your issue, please open a new bug against xen regarding memory distribution between the HV and the domains. In any case, if you adjust your memory allocation (by shrinking your guests or dom0) so that the HV has some memory (64M should be more than enough), then it should work correctly.
Herbert, I just ran the VMs all up on the same system again. I'll get the logs, xm list, and networking configs for you today because I believe this is my bug. I have 10 VMs running right now. The total memory assigned to them is 11G. I have a 32G system. Domain-0 in xm list reports 20552 memory available. I was watching one of the VMs, which is in a cluster, through its console. As soon as I started up the 11th VM, which brought up vif 20 and 21 in this case, I saw my clustered VM lose all its DLM connections. That shows me a definite loss of networking. I've also tested this with pings. As soon as I shut down that 11th VM, bringing my total VIF count to 19, connections are re-established and the memory squeeze error stops.
Herbert, I just re-read your comment after thinking a bit more. Are you saying I need to shrink dom0 itself to keep it from taking all the system memory? If so, do I do that in /etc/xen/xend-config.sxp?
Yes you need to shrink dom0. The easiest way is probably "xm mem-set" or its virsh equivalent.
Which would be virsh setmem.
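A quick sketch of the two commands named above, with a hypothetical target size (10 GiB is an assumption for illustration only). The easy thing to get wrong is the unit: to the best of my knowledge `xm mem-set` takes MiB while `virsh setmem` takes KiB by default, so check your man pages before running anything:

```shell
# Hypothetical example: shrink dom0 to 10 GiB.
DOM0_MB=$((10 * 1024))        # 10240 MiB, the unit xm mem-set expects
DOM0_KB=$((DOM0_MB * 1024))   # 10485760 KiB, the unit virsh setmem expects
# Printed rather than executed, since this is only a sketch:
echo "xm mem-set Domain-0 $DOM0_MB"
echo "virsh setmem Domain-0 $DOM0_KB"
```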
I tried both ways (virsh setmem and xm mem-set) and set Domain-0 down to 25G. I started up the same amount of VMs, which only take a total of 11G, and I get the same error "memory squeeze in netback driver". Then I lose all my network connections to the VMs. After about 30 seconds, I also lost my ssh connection, though I could still ping the system.
You've got a 32G system and you set dom0 to 25G. That leaves only 7G free for the other guests. You then start 11G worth of guests, which means now the HV has almost no memory. So I suggest that you try setting dom0 down to 10G as a test. Thanks!
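To make the arithmetic above concrete, here is a rough sketch of the accounting, using the approximate figures from this thread and treating 1G as 1024 MiB:

```shell
# Approximate memory accounting from this thread, all figures in MiB.
TOTAL=$((32 * 1024))    # 32G machine
DOM0=$((25 * 1024))     # dom0 set to 25G
GUESTS=$((11 * 1024))   # 11G worth of running domUs
LEFT=$((TOTAL - DOM0 - GUESTS))
echo "left for the hypervisor: ${LEFT} MiB"
```

A negative result means the allocations cannot all be satisfied and the HV gets squeezed. With dom0 at 10G instead of 25G, the same sum leaves roughly 11G of headroom.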
OK, Herbert straightened me out in e-mails about what is going on with the memory, I wasn't understanding properly. I set dom0 down to 10G, then started up 13 domU systems at a total of 14G and do not have the memory squeeze. Sorry for the misunderstanding, I've set this bug back to medium/medium.
Since the original bug report has now been closed and I've not received any new info indicating any bugs in xen netfront/netback, I'm going to close this bug. In conclusion, the original issue was most likely due to an incorrect memory assignment, i.e., leaving too little memory for the hypervisor.
I have at least three customers (a fourth case is about to be escalated) still reporting this problem in RHEL5.1 and RHEL5.2 beta. It is true that the problem arises if the memory allocated to Dom0 and the DomUs comes close to the total physical memory. However, I would expect - maybe I am wrong - that in an Enterprise OS, xen is able to prevent the hypervisor from running out of memory in the first place.
While the conclusion of comment 12 is correct, newer RHELs are sidestepping the problem by disabling flipping. So, re-closing as a dup of the bug about flipping-induced network failures. *** This bug has been marked as a duplicate of bug 648763 ***
*** Bug 723919 has been marked as a duplicate of this bug. ***