Bug 213430 - xen_net: Memory squeeze in netback driver.
Summary: xen_net: Memory squeeze in netback driver.
Keywords:
Status: CLOSED DUPLICATE of bug 648763
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Herbert Xu
QA Contact:
URL:
Whiteboard:
Depends On: 212826
Blocks:
 
Reported: 2006-11-01 14:27 UTC by Stephen Tweedie
Modified: 2011-07-21 15:04 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-04-01 12:21:10 UTC
Target Upstream Version:
Embargoed:



Description Stephen Tweedie 2006-11-01 14:27:05 UTC
+++ This bug was initially created as a clone of Bug #212826 +++

{If a broken domU can take down bridged networking for all guests, then that's
something that definitely needs to be fixed for RHEL-5 too. --sct}

Description of problem:

Networking between all domains fails. The following message is flooding
/var/log/messages in all domains (dom0 + domU): "xen_net: Memory squeeze in
netback driver."

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Generate network traffic between domU and dom0. For example:
root@xendom0# ping xendomU1

  
Actual results:
/var/log/messages gets flooded on dom0 and on all domUs (even those not on
the same virtual network (bridge)).


Expected results:
Should work :-)

Additional info:

-- Additional comment from sct on 2006-10-30 06:32 EST --
When do these occur?  On boot?  On domain creation?  Under load?  What
networking configuration are you using?


-- Additional comment from buffalo on 2006-10-30 16:51 EST --
They occur when the first few bytes go over any bridge.
It doesn't matter which system the data is coming from or going to, as long as
at least one of them is on xenbr(0-2), which in my configuration means any
domain.

The system has two physical network interfaces, each connected to a bridge:
one bridge for the LAN, another for the internet connection. There is also a
purely virtual bridge for the DMZ.

dom0         domU*(multiple internal systems)
 |             |
xenbr0-eth0 (LAN)
 |            
domU(router/fw)     
 |
xenbr1 (DMZ) - domU*(multiple DMZ systems) 
 |
domU(router/fw)
 |
xenbr2-eth1 (internet)
 |
internet

-- Additional comment from buffalo on 2006-11-01 05:21 EST --
Problem solved.
A domU running kernel-xenU-2.6.17-1.2174_FC5 breaks networking with dom0
2.6.18-1.2798.fc6xen.

When all domUs are running 2.6.17-1.2187_FC5xenU, everything works fine.
Booting one with kernel-xenU-2.6.17-1.2174_FC5 breaks networking on all
xenbr* bridges. After stopping the "bad" domU, everything resumed working.

-- Additional comment from sct on 2006-11-01 08:10 EST --
So when you have one 2174 kernel running, all other domains fail all the time? 
That one domain breaks all the others?  Or is only the 2174 domain itself broken?

-- Additional comment from buffalo on 2006-11-01 09:00 EST --
Yes. All interfaces connected to a xenbridge become unavailable, even those
that are not on the same xenbridge. The only devices still working are the
physical devices on dom0.

I've removed all domains from autostart and started each domU manually.

First I started my firewall/router domU, which separates the LAN from the
DMZ, pinged it from dom0, and it worked. Then I started the outer
firewall/router domU, and it was pingable too. After starting a domU
connected to xenbr0 with the 2174 kernel, the pings to my outer router
stopped. After shutting it down, the pings made their way through again.
Since it was the only domU still on 2174, I upgraded it, tried again, and
everything worked. Since I still have the old kernel installed, it is very
easy for me to reproduce. If you want any traces, please let me know.

Regards Heiko

Comment 3 Mark Nielsen 2007-11-14 14:29:17 UTC
I can provide more info on this. What would help? I *am* seeing this in RHEL 5.1.
kernel-xen-2.6.18-53.el5
xen-libs-3.0.3-41.el5
xen-3.0.3-41.el5

I am able to reproduce this with 9 VMs running; each has 2 interfaces
presented. As soon as I add the 10th VM (the 19th and 20th vif), I get the
memory squeeze and lose network connectivity to all VMs.
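
(A quick way to see which vifs are attached to each bridge, assuming the stock
bridge-utils tools on RHEL 5, is something like:

root@xendom0# brctl show
root@xendom0# brctl show xenbr0 | grep -c vif

The first command lists every bridge with its attached interfaces; the second
counts the vifs on a single bridge. This is only a sketch of how the vif count
per bridge might be checked, not part of the original report.)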

Comment 4 Herbert Xu 2007-11-15 06:00:09 UTC
Mark, your problem sounds quite different.  I think in your case the HV is
simply running out of memory.  With the default RHEL setup, we assign almost all
the available RAM to guests, leaving the HV with very little.  This is simply
broken as the HV needs to have some free memory so that things like networking
can operate.

The original problem here is a suspicion that a broken domU can bring down the
whole machine.  However, there is currently no proof of that.

So if you want to pursue your issue, please open a new bug against xen
regarding memory distribution between the HV and the domains.

In any case, if you adjust your memory allocation (by shrinking your guests or
dom0) so that the HV has some memory (64M should be more than enough) then it
should work correctly.
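
(For reference, a quick way to see how much memory the hypervisor itself has
left, assuming the standard RHEL 5 xm tooling, is something like:

root@xendom0# xm info | grep -E 'total_memory|free_memory'

Both values are reported in MB; if free_memory is close to zero while the
guests are running, the "memory squeeze" messages above are the expected
symptom. This is an illustrative check, not a command taken from the thread.)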

Comment 5 Mark Nielsen 2007-11-15 13:15:49 UTC
Herbert, I just ran the VMs all up on the same system again. I'll get the
logs, xm list output, and networking configs for you today, because I believe
this is my bug. I have 10 VMs running right now; the total memory assigned to
them is 11G, and I have a 32G system. Domain-0 in xm list reports 20552 MB of
memory available. I was watching one of the VMs, which is in a cluster,
through its console. As soon as I started up the 11th VM, which brought up
vif 20 and 21 in this case, I saw my clustered VM lose all its DLM
connections. That shows me a definite loss of networking. I've also tested
this with pings. As soon as I shut down that 11th VM, bringing my total vif
count back to 19, connections are re-established and the memory squeeze error
stops.

Comment 6 Mark Nielsen 2007-11-15 13:37:45 UTC
Herbert, I just re-read your comment after thinking a bit more. Are you saying I
need to shrink dom0 itself to keep it from taking all the system memory? If so,
do I do that in /etc/xen/xend-config.sxp?

Comment 7 Herbert Xu 2007-11-15 13:42:53 UTC
Yes, you need to shrink dom0. The easiest way is probably "xm mem-set" or
its virsh equivalent.

Comment 8 Herbert Xu 2007-11-15 13:45:36 UTC
Which would be virsh setmem.
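
(As an illustration only, assuming dom0 keeps its default name "Domain-0",
shrinking it to roughly 10 GB would look something like:

root@xendom0# xm mem-set Domain-0 10240
root@xendom0# virsh setmem Domain-0 10485760

xm mem-set takes the size in MB while virsh setmem takes it in kB, so both
commands request about the same amount; verify the result with "xm list". The
usual persistent alternative is the dom0_mem= option on the xen.gz line in
grub.conf. These commands are a sketch of the approach described above, not
the exact invocations used in this bug.)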

Comment 9 Mark Nielsen 2007-11-15 20:38:28 UTC
I tried both ways (virsh setmem and xm mem-set) and set Domain-0 down to
25G. I started up the same number of VMs, which only take a total of 11G, and
got the same "memory squeeze in netback driver" error. Then I lost all my
network connections to the VMs. After about 30 seconds, I also lost my ssh
connection, though I could still ping the system.

Comment 10 Herbert Xu 2007-11-16 02:02:48 UTC
You've got a 32G system and you set dom0 to 25G.  That leaves only 7G free for
the other guests.  You then start 11G worth of guests, which means now the HV
has almost no memory.

So I suggest that you try setting dom0 down to 10G as a test.

Thanks!

Comment 11 Mark Nielsen 2007-11-16 13:09:37 UTC
OK, Herbert straightened me out in e-mails about what is going on with the
memory, I wasn't understanding properly. I set dom0 down to 10G, then started up
13 domU systems at a total of 14G and do not have the memory squeeze. Sorry for
the misunderstanding, I've set this bug back to medium/medium.

Comment 12 Herbert Xu 2008-04-01 12:21:10 UTC
Since the original bug report has now been closed and I've not received any new
info indicating any bugs in xen netfront/netback, I'm going to close this bug.

In conclusion, the original issue was most likely due to an incorrect memory
assignment, i.e., leaving too little memory for the hypervisor.

Comment 13 Michael Mayer 2008-04-11 11:49:22 UTC
I have at least three customers (a fourth case is about to be escalated)
still reporting this problem in RHEL 5.1 and RHEL 5.2 beta. It is true that
the problem arises when the memory allocated to dom0 and the domUs comes
close to the total physical memory. However, I would expect (maybe I am
wrong) an enterprise OS to ensure that Xen prevents the hypervisor from
running out of memory in the first place.

Comment 14 Paolo Bonzini 2011-01-25 13:57:16 UTC
While the conclusion of comment 12 is correct, newer RHELs are sidestepping the problem by disabling flipping.  So, re-closing as a dup of the bug about flipping-induced network failures.

*** This bug has been marked as a duplicate of bug 648763 ***

Comment 15 Andrew Jones 2011-07-21 15:04:50 UTC
*** Bug 723919 has been marked as a duplicate of this bug. ***

