Bug 714283

Summary: Packet loss on Xen vif interface
Product: Red Hat Enterprise Linux 5
Reporter: Simon Gao <gao>
Component: kernel-xen
Assignee: Xen Maintenance List <xen-maint>
Status: CLOSED WONTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified
Priority: unspecified
Version: 5.5
CC: drjones, gm.outside+redhat, gozen, jburke, jzheng, leiwang, lersek, mrezanin, pbonzini, pcao, qguan, qwan, vkuznets, xen-maint
Target Milestone: rc
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-11-05 11:41:25 UTC
Bug Blocks: 514491
Attachments:
    dmesg.714283
    add some info messages to netback/netif_be_start_xmit()
    Remove the 500ms timeout to restart the netif queue (backport 1d453c86 from Jeremy's tree)
    Xen top

Description Simon Gao 2011-06-17 19:25:42 UTC
Description of problem:

Many vif interfaces have packet loss in TX. The physical and bonding interfaces are clean.

Version-Release number of selected component (if applicable):

RHEL 5.5, kernel 2.6.18-194.3.1.el5xen

How reproducible:


Steps to Reproduce:
1. Install a Xen host server with RHEL 5.5 and kernel 2.6.18-194.3.1.el5xen
2. Install a dozen Xen VMs
3. Generate network traffic
  
Actual results:

Vif interfaces for the VMs showed packet loss in TX.

Expected results:

No packet losses in TX or RX.

Additional info:
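(For reference: the per-vif TX drop counters were read in dom0. Below is a minimal sketch of polling such a counter via sysfs; the interface name and the polling interval are illustrative assumptions, not part of this report.)

    /* Sketch (assumptions: interface name, 5s interval): watch the
     * tx_dropped counter of one vif from dom0 via sysfs. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/class/net/vif1.0/statistics/tx_dropped";
        unsigned long long prev = 0, cur = 0;
        int first = 1;

        for (;;) {
            FILE *f = fopen(path, "r");
            if (!f) {
                perror("fopen");
                return 1;
            }
            if (fscanf(f, "%llu", &cur) != 1)
                cur = 0;
            fclose(f);

            if (!first && cur != prev)
                printf("tx_dropped grew by %llu (now %llu)\n", cur - prev, cur);
            prev = cur;
            first = 0;
            sleep(5);
        }
        return 0;
    }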

Comment 1 Andrew Jones 2011-06-20 07:05:27 UTC
Hi Simon,

is this a regression? Have you tried RHEL 5.6?

Thanks,
Drew

Comment 3 Laszlo Ersek 2011-06-20 08:23:10 UTC
Hi Simon,

I can see two drop causes in netfront / network_start_xmit():

  932    frags += (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;
  933    if (unlikely(frags > MAX_SKB_FRAGS + 1)) {
  934      printk(KERN_ALERT "xennet: skb rides the rocket: %d frags\n",
  935             frags);
  936      dump_stack();
  937      goto drop;
  938    }
  939  
  940    spin_lock_irq(&np->tx_lock);
  941  
  942    if (unlikely(!netif_carrier_ok(dev) ||
  943           (frags > 1 && !xennet_can_sg(dev)) ||
  944           netif_needs_gso(dev, skb))) {
  945      spin_unlock_irq(&np->tx_lock);
  946      goto drop;
  947    }

 1019   drop:
 1020    np->stats.tx_dropped++;
 1021    dev_kfree_skb(skb);
 1022    return 0;

Do you see messages implying the first kind? Thanks.
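(For illustration of the first drop cause: line 932 counts how many page-sized chunks the skb needs, rounding the linear data up to whole pages and adding the fragments already attached. A standalone sketch of that arithmetic, with made-up values:)

    /* Standalone sketch of the frag-count arithmetic at line 932 above.
     * The skb values are made-up examples; MAX_SKB_FRAGS is assumed to
     * be 18, a typical value for this kernel generation. */
    #include <stdio.h>

    #define PAGE_SIZE     4096UL
    #define MAX_SKB_FRAGS 18UL

    int main(void)
    {
        unsigned long nr_frags = 2;    /* page fragments already on the skb */
        unsigned long offset   = 3000; /* offset of linear data in its page */
        unsigned long len      = 9000; /* linear data length */

        /* Same rounding as netfront: pages touched by the linear area. */
        unsigned long frags =
            nr_frags + (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;

        printf("frags = %lu\n", frags);
        if (frags > MAX_SKB_FRAGS + 1)
            printf("would hit the \"skb rides the rocket\" drop path\n");
        else
            printf("fits within the limit, no drop\n");
        return 0;
    }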

Comment 4 Simon Gao 2011-06-20 16:33:28 UTC
(In reply to comment #1)
> Hi Simon,
> 
> is this a regression? Have you tried RHEL 5.6?
> 
> Thanks,
> Drew

Vif packet loss is also seen on another host server running RHEL 5.6.

Simon

Comment 5 Simon Gao 2011-06-20 16:36:11 UTC
(In reply to comment #3)
> Hi Simon,
> 
> I can see two drop causes in netfront / network_start_xmit():
> 
>   932    frags += (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;
>   933    if (unlikely(frags > MAX_SKB_FRAGS + 1)) {
>   934      printk(KERN_ALERT "xennet: skb rides the rocket: %d frags\n",
>   935             frags);
>   936      dump_stack();
>   937      goto drop;
>   938    }
>   939  
>   940    spin_lock_irq(&np->tx_lock);
>   941  
>   942    if (unlikely(!netif_carrier_ok(dev) ||
>   943           (frags > 1 && !xennet_can_sg(dev)) ||
>   944           netif_needs_gso(dev, skb))) {
>   945      spin_unlock_irq(&np->tx_lock);
>   946      goto drop;
>   947    }
> 
>  1019   drop:
>  1020    np->stats.tx_dropped++;
>  1021    dev_kfree_skb(skb);
>  1022    return 0;
> 
> Do you see messages implying the first kind? Thanks.

"xm log" or "xm dmesg" does not show any packet drop in logs. Do I need to turn up log level?

Comment 6 Laszlo Ersek 2011-06-20 16:46:56 UTC
(In reply to comment #5)
> (In reply to comment #3)

> > I can see two drop causes in netfront / network_start_xmit():

> >   934      printk(KERN_ALERT "xennet: skb rides the rocket: %d frags\n",
> >   935             frags);

> > Do you see messages implying the first kind? Thanks.
> 
> "xm log" or "xm dmesg" does not show any packet drop in logs. Do I need to turn
> up log level?

No, this message is logged (if it is logged) by the guest kernel, and on the second highest criticality level. You probably don't have this message and we take the other branch; I'd just like to be sure. Thanks.

Comment 7 Simon Gao 2011-06-20 16:52:09 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #3)
> 
> > > I can see two drop causes in netfront / network_start_xmit():
> 
> > >   934      printk(KERN_ALERT "xennet: skb rides the rocket: %d frags\n",
> > >   935             frags);
> 
> > > Do you see messages implying the first kind? Thanks.
> > 
> > "xm log" or "xm dmesg" does not show any packet drop in logs. Do I need to turn
> > up log level?
> 
> No, this message is logged (if it is logged) by the guest kernel, and on the
> second highest criticality level. You probably don't have this message and we
> take the other branch; I'd just like to be sure. Thanks.

There are no packet drops on the guest interfaces, so the guests are not aware of any packet loss on the vif interfaces.

Comment 8 Laszlo Ersek 2011-06-20 20:53:48 UTC
(In reply to comment #7)

> There are no packet drops on the guest interfaces, so the guests are not aware
> of any packet loss on the vif interfaces.

It wasn't clear to me from comment 0 ("Vif interfaces  for VMs showed packet loss in TX") that it's about netback.

Netback also seems to have two drop cases:

   239  int netif_be_start_xmit(struct sk_buff *skb, struct net_device *dev)
   240  {
   241    netif_t *netif = netdev_priv(dev);
   242  
   243    BUG_ON(skb->dev != dev);
   244  
   245    /* Drop the packet if the target domain has no receive buffers. */
   246    if (unlikely(!netif_schedulable(dev) || netbk_queue_full(netif)))
   247      goto drop;
   248  
   249    /*
   250     * Copy the packet here if it's destined for a flipping interface
   251     * but isn't flippable (e.g. extra references to data).
   252     */
   253    if (!netif->copying_receiver) {
   254      struct sk_buff *nskb = netbk_copy_skb(skb);
   255      if ( unlikely(nskb == NULL) )
   256        goto drop;

   273      if (netbk_queue_full(netif)) {
   274        netif_stop_queue(dev);
   275        /*
   276         * Schedule 500ms timeout to restart the queue, thus
   277         * ensuring that an inactive queue will be drained.
   278         * Packets will be immediately be dropped until more
   279         * receive buffers become available (see
   280         * netbk_queue_full() check above).
   281         */

   293   drop:
   294    netif->stats.tx_dropped++;
   295    dev_kfree_skb(skb);
   296    return 0;

I think the "netbk_queue_full(netif)" case is likely; that IMHO can be a consequence of CPU starvation of the guest. (Note that bug 681962 / bug 456171 took care of any evtchn notification imbalance, so I'm fairly sure the packet loss is evenly distributed among the guests, proportionally to the generated network/CPU load.)

If the host machine is massively over-subscribed (from comment 0: "2. Install a dozen Xen VMs"), then a domU may not get a chance to drain its queue / rx shared ring -- it may remain unscheduled for a longer time. In this case netback drops the packets on the floor, and the transport level (TCP) or the application level (UDP) is supposed to retransmit.

Does the packet loss worsen as the number of guests increases, assuming the same relative load on each guest? If so, I think there won't be much we can do and what you see is designed-in "graceful" degradation.
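(A simplified model of that drop mechanism, not the actual netback code: the backend can only hand a packet to the guest if the frontend has posted free receive slots on the shared ring; a stalled guest stops posting, the ring runs dry, and further packets are dropped and counted as TX drops on the vif.)

    /* Simplified model (not the actual netback code) of "queue full ->
     * drop".  The frontend posts rx buffers by advancing req_prod; the
     * backend consumes them by advancing req_cons.  While the guest is
     * not scheduled, req_prod stands still and packets are dropped. */
    #include <stdio.h>

    #define RING_SIZE 256

    struct rx_ring {
        unsigned int req_prod;   /* slots posted by the frontend (guest) */
        unsigned int req_cons;   /* slots consumed by the backend (dom0) */
    };

    static int queue_full(const struct rx_ring *r, unsigned int needed)
    {
        return (r->req_prod - r->req_cons) < needed;
    }

    int main(void)
    {
        struct rx_ring ring = { RING_SIZE, 0 };
        unsigned long tx_dropped = 0;
        int pkt;

        /* Guest stalls: req_prod no longer advances while dom0 keeps
         * receiving packets destined for it. */
        for (pkt = 0; pkt < 300; pkt++) {
            if (queue_full(&ring, 1)) {
                tx_dropped++;        /* shows up as a TX drop on the vif */
                continue;
            }
            ring.req_cons++;         /* packet copied into a guest buffer */
        }

        printf("delivered=%u dropped=%lu\n", ring.req_cons, tx_dropped);
        return 0;
    }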

Comment 9 Miroslav Rezanina 2011-06-22 07:06:23 UTC
Simon,
is this the case described by Laszlo?

(In reply to comment #8)
> 
> If the host machine is massively over-subscribed (from comment 0: "2. Install
> a dozen Xen VMs"), then a domU may not get a chance to drain its queue / rx
> shared ring -- it may remain unscheduled for a longer time. In this case
> netback drops the packets on the floor, and the transport level (TCP) or the
> application level (UDP) is supposed to retransmit.
> 
> Does the packet loss worsen as the number of guests increases, assuming the
> same relative load on each guest? If so, I think there won't be much we can do
> and what you see is designed-in "graceful" degradation.

Comment 10 Simon Gao 2011-06-28 17:57:37 UTC
There are 8 VMs running on the host server. Out of 14 vif interfaces, TX packet drops are found on 6 vif interfaces (on two of them, very minimal).

The host server has 2 quad-core CPUs, so CPU-wise it is not over-subscribed.

Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      894     8 r----- 1637600.6
xen_vm1                                   48      775     1 ------  19360.5
xen_vm2                                   72      512     1 -b----    188.7
xen_vm3                                   30     1023     1 -b---- 235466.4
xen_vm4                                   63     2048     1 -b----  12394.1
xen_vm5                                   71     2047     2 -b----    514.5
xen_vm6                                   55     2055     1 -b---- 192625.5
xen_vm7                                   69     2047     1 -b----    876.3
xen_vm8                                   64     4095     2 -b----  74964.2

Simon

Comment 11 Paolo Bonzini 2011-07-13 12:24:19 UTC
Can you post the output of xentop while packet drops are happening?

Comment 13 Paolo Bonzini 2011-08-08 11:06:46 UTC
> the dropped packets will not change after guest booted.

Laszlo, could this be fixed by your change to move the Connected status?

Comment 19 Qin Guan 2011-08-10 05:37:22 UTC
Created attachment 517519 [details]
dmesg.714283

Comment 20 Laszlo Ersek 2011-08-10 08:31:06 UTC
Created attachment 517544 [details]
add some info messages to netback/netif_be_start_xmit()

This is the patch that generated the new output in dmesg.

(In reply to comment #18)

> 4 out of 8 vifs have TX packet drops

Actually, five vifs show up with "netif_be_start_xmit(): ...":

    vif1.0, vif2.0, vif6.0, vif7.0, vif8.0

I grepped the dmesg separately for each of them, and the output always looks like this:

    device vif<X>.0 entered promiscuous mode
    ADDRCONF(NETDEV_UP): vif<X>.0: link is not ready
    ADDRCONF(NETDEV_CHANGE): vif<X>.0: link becomes ready
    xenbr0: port <Z>(vif<X>.0) entering forwarding state
    netif_be_start_xmit(): vif<X>.0: can_queue && full
    netif_be_start_xmit(): vif<X>.0: stopping queue

This is where the vif1.0 grep output stops, but the other four continue like this:

    netif_be_start_xmit(): vif<X>.0: schedulable=1 full=1 txdrop=0
    netif_be_start_xmit(): vif<X>.0: schedulable=1 full=1 txdrop=1
    ...
    netif_be_start_xmit(): vif<X>.0: schedulable=1 full=1 txdrop=<Y>

vif8.0 stops here, but vif2.0, vif6.0, and vif7.0 end with the additional entries

    netif_be_start_xmit(): vif<X>.0: can_queue && full
    netif_be_start_xmit(): vif<X>.0: stopping queue

Note that these entries are all interleaved in the dmesg. Unfortunately, I didn't ask for the /var/log/messages file instead of the dmesg, but I suspect these were all very close to each other in time.

I figure some guests were started at the same time (based on the adjacent NETDEV_UP messages as well), and during that short time frame, the processors *were* over-subscribed, or at least the guests were not advancing for some reason. I think what netback does is (was...) intended, since these logs perfectly match the comment near where I added the "stopping queue" message:

/*
 * Schedule 500ms timeout to restart the queue, thus
 * ensuring that an inactive queue will be drained.
 * Packets will be immediately be dropped until more
 * receive buffers become available (see
 * netbk_queue_full() check above).
 */

Instead of stopping the queue, we could try to crash the guest immediately from netback (and configure the vm to have xend dump its core on crash). This way we could look at the guest state (post-mortem).

Hm, instead of that, perhaps netback is wrong to sleep, see 

http://lists.xensource.com/archives/html/xen-devel/2010-12/msg00921.html
http://lists.xensource.com/archives/html/xen-devel/2010-12/msg01035.html
http://lists.xensource.com/archives/html/xen-devel/2010-12/msg01080.html

and

http://git.kernel.org/?p=linux/kernel/git/jeremy/xen.git;a=commitdiff;h=1d453c86

It wasn't backported to linux-2.6.18-xen.hg though.

Comment 21 Laszlo Ersek 2011-08-10 09:01:08 UTC
If I understand the upstream patch in comment 20, the problem is not the sleep -- the problem is that after 500ms of sleep, we prematurely re-enable the queue. The guest could be unresponsive for a bit longer than that, and so netback throws away packets destined for the guest.

The upstream patch leaves the queue disabled as long as necessary, and the packets destined for the guest sit tight wherever they are. Once the guest returns to life, the net_rx_action() tasklet function should wake the queue again.

        if (netif_queue_stopped(netif->dev) &&
            netif_schedulable(netif->dev) &&
            !netbk_queue_full(netif))
                netif_wake_queue(netif->dev);
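(As a toy illustration of the difference between the two policies -- not kernel code, and the guest's recovery via the net_rx_action() wakeup is deliberately left out of the model -- one packet per tick arrives for a stalled guest; with a timer-based restart the packets held back above netback are flushed into the still-full ring and tail-dropped, without the timer they simply wait until the frontend posts buffers again:)

    /* Toy model (assumption-laden sketch, not kernel code) contrasting the
     * 500ms-timer restart with leaving the queue stopped.  One packet per
     * tick arrives for a guest that has stalled; the guest's eventual
     * recovery is not modeled. */
    #include <stdio.h>

    #define RING_SLOTS 4   /* rx buffers the guest had posted before stalling */

    static void run(int stall_ticks, int timeout /* 0 = no timer */)
    {
        int free_slots = RING_SLOTS;
        int stopped = 0, stopped_for = 0, t;
        unsigned long dropped = 0, held_back = 0;

        for (t = 0; t < stall_ticks; t++) {
            if (stopped) {
                stopped_for++;
                if (timeout && stopped_for >= timeout) {
                    /* premature restart: everything held back is flushed
                     * into the still-full ring and tail-dropped */
                    dropped += held_back;
                    held_back = 0;
                    stopped = 0;
                    stopped_for = 0;
                } else {
                    held_back++;    /* waits above netback, no drop */
                    continue;
                }
            }
            if (free_slots == 0) {  /* netbk_queue_full() analogue */
                dropped++;
                stopped = 1;
                continue;
            }
            free_slots--;           /* delivered into a guest buffer */
        }
        printf("timeout=%-3d dropped=%lu still-held=%lu\n",
               timeout, dropped, held_back);
    }

    int main(void)
    {
        run(2000, 500);   /* timer restart: repeated tail drops */
        run(2000, 0);     /* no timer: one drop, the rest waits */
        return 0;
    }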

Comment 22 Laszlo Ersek 2011-08-10 09:33:44 UTC
Created attachment 517553 [details]
Remove the 500ms timeout to restart the netif queue (backport 1d453c86 from Jeremy's tree)

"It is generally unhelpful as it results in a massive tail-drop should a guest
become unresponsive for a relatively short period of time and no back-pressure
(other than that caused by a higher layer protocol) is applied to the sender."

"The code to restart the queue when space becomes available in the shared  ring
is already there [in net_rx_action()]. Removal of the timeout simply means that
the queue will  not be restarted until that code runs."

Comment 23 Paolo Bonzini 2011-08-10 09:35:13 UTC
I agree.  Since the code is under netbk_can_queue(dev), it's surely better to let netfront handle flow control, rather than re-enabling after an arbitrary amount of time.  It looks like the code was added to band-aid a separate bug.

Comment 30 Laszlo Ersek 2011-08-20 19:39:25 UTC
Simon, can you please describe what kind of actual problems the TX drop caused? Thank you.

Comment 33 Laszlo Ersek 2011-08-25 12:23:07 UTC
Hello Simon,

(In reply to comment #30)
> Simon, can you please describe what kind of actual problems the TX drop caused?
> Thank you.

again, can you please elaborate on this? Our internal testing and the candidate upstream patch's commit log seem to confirm that the TX droppage is caused by short periods of guest unresponsiveness. The testing done on this end seems to imply that such unresponsiveness is mostly present at guest bootup, thus the harm caused by the TX drop appears to have low impact.

OTOH, applying the patch (attachment 517553 [details]) caused hard-to-reproduce & hard-to-debug, but recurring network hangs in our automated test environment. The hangs are most probably inherent in the *removal* of the automatic queue wakeup -- the same automatic wakeup that is also the root cause of the TX droppage.

So, unless you state this problem does serious damage to your operations, I'm going to close this with WONTFIX resolution tomorrow. Thank you.

Comment 35 Simon Gao 2011-10-20 18:13:52 UTC
Packet loss on the vifs caused performance degradation and intermittent service interruptions on the VMs.

These are serious problems, since there is no guarantee of stable and reliable service from the affected VMs.

In fact, the more heavily a VM is used, the more packet loss can be seen on its vif interface.

Simon

Comment 36 Laszlo Ersek 2011-10-20 18:35:25 UTC
Hello Simon,

(In reply to comment #35)

> In fact, the more heavily a VM is used, the more packet loss can be seen on
> its vif interface.

At least this confirms comment 8, i.e. the more heavily a VM is used, the less time it can spend on draining the queue of its netfront driver, and so netback drops packets. I'll try to look into netbk_queue_full() with the help of bug 697991 comment 18.

Can you tell us more about the vm configs and the kind of load on the VMs, so that we may be able to reproduce it?

Does it help if you increase the VCPU number in the guest (so that the host is still not overloaded)? Or if you decrease the thread count in the application(s) that the VM runs?

When the VM is heavily loaded, what are the sys/user percentages? What is the biggest sys time consumer? (Perhaps we'll have to look at this with oprofile.)

Thanks.
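(If it helps with gathering those numbers inside the guest: a small sketch that samples /proc/stat twice and reports user/sys percentages over the interval. The field order follows proc(5); the 5-second interval is an arbitrary choice.)

    /* Sketch: sample the aggregate cpu line of /proc/stat twice and
     * report user/sys percentages over the interval. */
    #include <stdio.h>
    #include <unistd.h>

    struct cpu_sample {
        unsigned long long user, nice, sys, idle, iowait, irq, softirq;
    };

    static int read_cpu(struct cpu_sample *s)
    {
        FILE *f = fopen("/proc/stat", "r");
        int n;
        if (!f)
            return -1;
        n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                   &s->user, &s->nice, &s->sys, &s->idle,
                   &s->iowait, &s->irq, &s->softirq);
        fclose(f);
        return n == 7 ? 0 : -1;
    }

    int main(void)
    {
        struct cpu_sample a, b;
        double user, sys, total;

        if (read_cpu(&a))
            return 1;
        sleep(5);
        if (read_cpu(&b))
            return 1;

        user  = (double)(b.user - a.user + b.nice - a.nice);
        sys   = (double)(b.sys - a.sys);
        total = user + sys
              + (b.idle - a.idle) + (b.iowait - a.iowait)
              + (b.irq - a.irq) + (b.softirq - a.softirq);

        printf("user %.1f%%  sys %.1f%%\n",
               100.0 * user / total, 100.0 * sys / total);
        return 0;
    }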

Comment 37 Andrew Jones 2011-10-21 06:19:20 UTC
If this issue can be avoided by adding more slack to the system, i.e. by reducing the number of running VMs by one or by some other load reduction, then I'm inclined to leave it as WONTFIX. It's already too late for any fix of this nature to make the next minor release, and in general fixes that have the potential to introduce regressions (as we've seen with the initial trial patch for this bug) are way too risky for RHEL Xen at this stage in its life.

Simon,
Please find the percentage of load where all your VMs are able to perform well. Naturally, if this percentage is way too low, then we still need to investigate this issue. However, if you can keep all the VMs happy with a high, but perhaps not full, load on the host, then I'd say we should take that as the solution.

Comment 38 Simon Gao 2011-10-21 15:31:44 UTC
Created attachment 529525 [details]
Xen top

Comment 39 Paolo Bonzini 2011-10-21 15:41:42 UTC
There is no CPU overcommit in this xentop output.

Comment 40 Simon Gao 2011-10-21 18:55:17 UTC
More details about the host server:

PowerEdge R710, RHEL 5.6 with kernel 2.6.18-238.9.1.el5xen, 2 quad-core CPUs, 24 GB memory.

Networking: two bonded network interfaces, 5 VLANs (one default).


The VM config is like:

name = "xenvm7"
memory = "4096"
disk = [ 'tap:aio:/nfs/vms/xenvm7.img,xvda,w', 'tap:aio:/nfs/vms/xenvm7_2.img,xvdb,w', 'tap:aio:/nfs/vms/xenvm7_3.img,xvdc,w', 'tap:aio:/nfs/vms/xenvm7_4.img,xvdd,w' ]
vif = [ 'mac=00:16:e3:15:50:4f, bridge=xenbr120', ]
bootloader="/usr/bin/pygrub"
vcpus=2
on_reboot = 'restart'
on_crash = 'restart'

Comment 41 Simon Gao 2011-10-21 19:09:12 UTC
> Does it help if you increase the VCPU number in the guest (so that the host is
> still not overloaded)? Or if you decrease the thread count in the
> application(s) that the VM runs?


I can't test them easily since those VMs are production machines.

 
> When the VM is heavily loaded, what are the sys/user percentages? What is the
> biggest sys time consumer? (Perhaps we'll have to look at this with oprofile.)

I'd be happy to gather additional data if you let me know how.

Comment 43 Laszlo Ersek 2011-11-02 14:05:34 UTC
Hello Simon,

unfortunately we can't dedicate any further resources to this BZ. If you have a RHEL subscription, you might want to open a Customer Portal case, and then Global Support Services could become involved. Thank you for your understanding.

Comment 44 Simon Gao 2011-11-04 22:55:42 UTC
(In reply to comment #43)
> Hello Simon,
> 
> unfortunately we can't dedicate any further resources to this BZ. If you have
> a RHEL subscription, you might want to open a Customer Portal case, and then
> Global Support Services could become involved. Thank you for your
> understanding.

This issue is more like a software bug than a system configuration problem. That's the reason I filed the bug here. For a configuration-related issue, I would contact Global Support Services.

I would really appreciate it if you could re-open this bug and fix it.

Simon

Comment 45 Andrew Jones 2011-11-05 11:41:25 UTC
Simon,

I still haven't seen a response to my inquiry in comment 37. Without knowing exactly how much load is on the system when it experiences packet loss, we're inclined to believe this is a config issue, as this is the only report of its kind. Also, even if a software change could allow the load to increase without packet loss, if the patch is too risky for RHEL5 Xen then we won't be putting it in, and a config change would still be necessary as a workaround. This is why we've recommended contacting GSS in order to get help with your config based on some best practices they may be aware of.

I'm closing this bug again. Please don't reopen it. After working with GSS, if all parties believe a bug should be opened, then a new bug can be opened at that time.

Drew

Comment 46 (GalaxyMaster) 2014-09-22 03:54:02 UTC
I believe this bug was just recently fixed by the following patch:

https://lkml.org/lkml/2014/7/30/315 by Zoltan Kiss (xen-netfront: Fix handling packets on compound pages with skb_segment).

Could we get a fix in RHEL kernels too please?