Bug 683171

Summary: Windows PV netfront driver spams event channel/xenstored on startup
Product: Red Hat Enterprise Linux 5
Reporter: Jacob Hunt <jhunt>
Component: xenpv-win
Assignee: Paolo Bonzini <pbonzini>
Status: CLOSED WONTFIX
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Docs Contact:
Version: 5.6
CC: drjones, jwest, jzheng, leiwang, msw, pbonzini, qwan, tburke, yuzhou
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-08-01 11:00:33 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 584249
Bug Blocks: 518407, 807971

Description Jacob Hunt 2011-03-08 18:13:46 UTC
Description of problem:

The Windows PV netfront driver has this little nugget of code:

do {
    prevState = curState;
    curState = xenbus_read_driver_state(info->xbdev->otherend);
    RhelDbgPrint(TRACE_LEVEL_INFORMATION,
                 ("Device state is %s.\n", xenbus_strstate(curState)));

    if (prevState != curState) {
        backend_changed(info->xbdev, curState);
    }
} while (curState == XenbusStateInitWait ||
         curState == XenbusStateInitialised);

This means that the driver polls the backend state in a tight loop until it leaves XenbusStateInitWait/XenbusStateInitialised (normally by reaching XenbusStateConnected). On the dom0 side, this results in extreme event channel traffic, with xenstored spending unusual amounts of CPU time returning nodes from the xenstore (e.g., /local/domain/0/backend/vif/144/0/state).

If for some reason hotplug scripts fail to bring up the backend side correctly, the Windows driver will never recover. xenstored will continue to be pummeled with requests, spending precious dom0 CPU cycles.

Why isn't this using a xenbus_watch* function?
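
For comparison, a watch-driven variant would let the backend's state change wake the init path instead of re-reading xenstore in a loop. The sketch below is only illustrative: it assumes Linux-style xenbus and wait-queue primitives, and the names backend_state_wq, wait_for_backend_connected and netfront_info are hypothetical; the Windows port's actual synchronization primitives may differ.

/*
 * Hypothetical watch-driven variant (Linux-style API; names are
 * illustrative, not the Windows driver's actual ones).  xenbus already
 * watches the backend's state node and invokes backend_changed() on
 * every write, so the init path can sleep instead of polling xenstore.
 */
#include <linux/wait.h>
#include <linux/jiffies.h>
#include <xen/xenbus.h>

static DECLARE_WAIT_QUEUE_HEAD(backend_state_wq);

static void backend_changed(struct xenbus_device *dev,
                            enum xenbus_state backend_state)
{
        /* ... existing state-machine handling ... */
        wake_up(&backend_state_wq);   /* let any waiter re-check the state */
}

/* Sleep until the backend leaves InitWait/Initialised, or give up. */
static int wait_for_backend_connected(struct netfront_info *info)
{
        long ret = wait_event_timeout(backend_state_wq,
                xenbus_read_driver_state(info->xbdev->otherend) !=
                        XenbusStateInitWait &&
                xenbus_read_driver_state(info->xbdev->otherend) !=
                        XenbusStateInitialised,
                30 * HZ);

        return ret ? 0 : -ETIMEDOUT;
}

Even a timeout-and-fail path like this would avoid the unrecoverable hammering of dom0 when the hotplug scripts never bring the backend up.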

Comment 1 Paolo Bonzini 2011-03-09 11:08:37 UTC
I've never seen it in practice.  This piece of code is _very_ heavily hit by WHQL tests and they pass (though admittedly they do so on an otherwise idle machine).  Still, I am aware of it.

Historically, the polling was done this way because this code ran with interrupts disabled.  It was even worse then, because the Windows guest would be completely hung while waiting.  Newer versions of the drivers changed it so that at least the Windows guest will "just" hammer on dom0.

The block drivers have the same problem too (and with the complete-hang behavior, unfortunately).

Comment 3 Matt Wilson 2011-03-09 17:46:37 UTC
This may be triggered by some problem in the backend bringup. When the Windows driver is in this state, we see:

# xenstore-ls /local/domain/0/backend/vif/488 
0 = "" 
domain = "dom_93116181" 
handle = "0" 
uuid = "8fca2ad4-5cc0-034a-5d3a-ee3f7860ba0a" 
script = "/etc/xen/scripts/vif-route" 
state = "2" 
frontend = "/local/domain/488/device/vif/0" 
mac = "12:31:3D:04:69:F4" 
online = "1" 
frontend-id = "488" 
type = "front" 
feature-sg = "1" 
feature-gso-tcpv4 = "1" 
feature-rx-copy = "1" 
hotplug-status = "connected" 

# xenstore-ls /local/domain/488/device/vif/0 
backend-id = "0" 
mac = "12:31:3D:04:69:F4" 
handle = "0" 
state = "4" 
backend = "/local/domain/0/backend/vif/488/0" 
tx-ring-ref = "916" 
rx-ring-ref = "797" 
event-channel = "5" 
request-rx-copy = "1" 
feature-rx-notify = "1" 
feature-sg = "0" 
feature-gso-tcpv4 = "0"
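
For reference, the numeric state values above map to the xenbus protocol states defined in Xen's public io/xenbus.h header: the backend's state = "2" is XenbusStateInitWait and the frontend's state = "4" is XenbusStateConnected, which is exactly the combination that keeps the polling loop from the description spinning indefinitely.

enum xenbus_state {
        XenbusStateUnknown       = 0,
        XenbusStateInitialising  = 1,
        XenbusStateInitWait      = 2,  /* early init done; waiting on peer/hotplug */
        XenbusStateInitialised   = 3,
        XenbusStateConnected     = 4,
        XenbusStateClosing       = 5,
        XenbusStateClosed        = 6,
        XenbusStateReconfiguring = 7,
        XenbusStateReconfigured  = 8
};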

Comment 23 RHEL Program Management 2012-04-02 10:29:33 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 25 Paolo Bonzini 2012-08-01 11:00:33 UTC
Destabilizing change, closing as WONTFIX.