Bug 453574

Summary: virtual ethernet device stops working on reception of duplicate backend state change signals
Product: Red Hat Enterprise Linux 5
Reporter: Alex Zeffertt <alex.zeffertt>
Component: kernel-xen
Assignee: Markus Armbruster <armbru>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: high
Docs Contact:
Priority: low
Version: 5.2
CC: ijc, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A race condition could occur when creating and destroying virtual network devices. In some circumstances — especially high load situations — this would cause the virtual device to not respond. In this update, the state of the virtual device is checked to prevent the race condition from occurring.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 20:21:23 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 449772, 454962, 464676    
Attachments:
[NET] front: Fix crashes when xenstore watches fire multiple times. (Flags: none)

Description Alex Zeffertt 2008-07-01 12:59:10 UTC
Description of problem:

A virtual interface is created and plugged into the VM.  Within the VM it is
given an IP address using ifconfig.  A ping is then attempted but there appears
to be no network connectivity.

This happens infrequently during stress testing.  /var/log/messages shows that
the occurrences coincide with drivers/xen/netfront/netfront.c:network_connect()
being called twice.  This suggests the problem is that the netfront driver is
receiving a duplicate "backend_changed" signal, the second of which it should be
ignoring.

The duplicate "backend_changed" signals are a known issue, and there is a
xen-3.1 guest kernel patch to protect against them.  However, it looks like this
hasn't been applied in either the RHEL4.6 or RHEL5.2 kernel-xen packages.
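
The kind of guard such a fix relies on can be sketched in user space as below.
This is only an illustrative model, assuming the fix works by checking the
frontend's own state before acting on a backend notification.  The enum, the
struct and the function names (bus_state, front_dev, model_connect,
model_backend_changed) are hypothetical stand-ins, not the netfront code or
the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the xenbus device states. */
enum bus_state {
    STATE_INITIALISING,
    STATE_INIT_WAIT,
    STATE_CONNECTED,
    STATE_CLOSING,
    STATE_CLOSED
};

/* Hypothetical model of a frontend device. */
struct front_dev {
    enum bus_state state;  /* the frontend's own state */
    int connect_calls;     /* how often the connect path has run */
};

/* Stand-in for the connect path; it only counts invocations. */
static void model_connect(struct front_dev *dev)
{
    dev->connect_calls++;
}

/*
 * Handle a notification that the backend changed state.  A duplicate
 * notification is dropped because the frontend has already left the
 * state in which that notification is meaningful.
 */
void model_backend_changed(struct front_dev *dev, enum bus_state backend_state)
{
    assert(dev != NULL);

    switch (backend_state) {
    case STATE_INIT_WAIT:
        if (dev->state != STATE_INITIALISING)
            break;              /* duplicate event: already connected */
        model_connect(dev);
        dev->state = STATE_CONNECTED;
        break;
    case STATE_CLOSING:
        if (dev->state == STATE_CLOSED)
            break;              /* duplicate event: already closed */
        dev->state = STATE_CLOSED;
        break;
    default:
        break;                  /* other states are not modelled here */
    }
}
```

In this model, delivering STATE_INIT_WAIT twice to a device that starts in
STATE_INITIALISING runs the connect path only once; the second event is
ignored.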

I'll attach the patch to this bug report.


Version-Release number of selected component (if applicable):

RHEL4.6: kernel-2.6.9-67.0.20.EL
RHEL5.2: kernel-2.6.18-92.1.6.el5

How reproducible:

Sporadic; difficult to reproduce.

Steps to Reproduce:
1. Create a VIF (in the VM management tool)
2. Assign the virtual interface an IP address (in the VM)
3. Try to ping a known IP address on the same network
  
Actual results:

ping says host "Unreachable"

Expected results:

ping contacts host

Additional info:

Comment 1 Alex Zeffertt 2008-07-01 12:59:10 UTC
Created attachment 310658 [details]
[NET] front: Fix crashes when xenstore watches fire multiple times.

Comment 2 Markus Armbruster 2008-07-16 22:04:46 UTC
Alex,

Many thanks for the patch.  I can see the first and the third patch hunk in the
drivers I got from http://xenbits.xensource.com/linux-2.6.18-xen.hg, but not the
second.  How come?  Could you point me to the relevant upstream changeset(s)?

Comment 3 Ian Campbell 2008-07-21 14:14:23 UTC
Alex is away at the moment but let me try and answer.

The upstream changeset is
http://xenbits.xensource.com/xen-unstable.hg?rev/79315be2c9b9

The second hunk is indeed not present any longer. I had a dig and found that it
was subsequently removed by
http://xenbits.xensource.com/xen-unstable.hg?rev/e99ba0c6c046
which came out of
http://lists.xensource.com/archives/html/xen-devel/2006-12/msg00843.html

We haven't observed that failure, though I don't know whether we test for it.

Comment 4 RHEL Program Management 2008-07-31 13:01:17 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Don Zickus 2008-09-15 14:17:59 UTC
Included in kernel-2.6.18-115.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 10 Ryan Lerch 2008-11-07 00:15:14 UTC
This bug has been marked for inclusion in the Red Hat Enterprise Linux 5.3
Release Notes.

To aid in the development of relevant and accurate release notes, please fill
out the "Release Notes" field above with the following 4 pieces of information:


Cause:   What actions or circumstances cause this bug to present.

Consequence:  What happens when the bug presents.

Fix:   What was done to fix the bug.

Result:  What now happens when the actions or circumstances above occur. (NB:
this is not the same as 'the bug doesn't present anymore')

Comment 12 Markus Armbruster 2008-11-07 19:45:39 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: A race condition exists in the Xenbus protocols for device
creation and destruction.  The netfront driver didn't cope with it.

Consequence: Network device creation can result in a device that is
hung.  Happens rarely, typically when stress testing.

Fix: Backport fix from upstream.

Result: Network device creation works reliably, even when stress
testing.

Comment 14 Ryan Lerch 2008-11-17 01:42:55 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,10 +1 @@
-Cause: A race condition exists in the Xenbus protocols for device
+A race condition could occur when creating and destroying virtual network devices. In some circumstances — especially high load situations — this would cause the virtual device to not respond. In this update, the state of the virtual device is checked to prevent the race condition from occurring.
-creation and destruction.  The netfront driver didn't cope with it.
-
-Consequence: Network device creation can result in a device that is
-hung.  Happens rarely, typically when stress testing.
-
-Fix: Backport fix from upstream.
-
-Result: Network device creation works reliably, even when stress
-testing.

Comment 17 errata-xmlrpc 2009-01-20 20:21:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html