| Summary: | Windows PV netfront driver spams event channel/xenstored on startup | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Jacob Hunt <jhunt> |
| Component: | xenpv-win | Assignee: | Paolo Bonzini <pbonzini> |
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.6 | CC: | drjones, jwest, jzheng, leiwang, msw, pbonzini, qwan, tburke, yuzhou |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-08-01 11:00:33 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 584249 | | |
| Bug Blocks: | 518407, 807971 | | |
I've never seen it in practice. This piece of code is _very_ heavily hit by WHQL tests and they pass (though admittedly they do so on an otherwise idle machine). Still, I am aware of it.

Historically it was even worse, because this code ran with interrupts disabled, so the Windows guest would hang completely. Newer versions of the drivers changed this, so that at least the Windows guest will "just" hammer on dom0. The block drivers have the same problem too (and with the complete-hang behavior, unfortunately).

This may be triggered by some problem in the backend bringup. When the Windows driver is in this state, we see:

```
# xenstore-ls /local/domain/0/backend/vif/488
0 = ""
 domain = "dom_93116181"
 handle = "0"
 uuid = "8fca2ad4-5cc0-034a-5d3a-ee3f7860ba0a"
 script = "/etc/xen/scripts/vif-route"
 state = "2"
 frontend = "/local/domain/488/device/vif/0"
 mac = "12:31:3D:04:69:F4"
 online = "1"
 frontend-id = "488"
 type = "front"
 feature-sg = "1"
 feature-gso-tcpv4 = "1"
 feature-rx-copy = "1"
 hotplug-status = "connected"

# xenstore-ls /local/domain/488/device/vif/0
backend-id = "0"
mac = "12:31:3D:04:69:F4"
handle = "0"
state = "4"
backend = "/local/domain/0/backend/vif/488/0"
tx-ring-ref = "916"
rx-ring-ref = "797"
event-channel = "5"
request-rx-copy = "1"
feature-rx-notify = "1"
feature-sg = "0"
feature-gso-tcpv4 = "0"
```

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Destabilizing change, closing as WONTFIX.
Description of problem:

The Windows PV netfront driver has this little nugget of code:

```c
do {
    prevState = curState;
    curState = xenbus_read_driver_state(info->xbdev->otherend);
    RhelDbgPrint(TRACE_LEVEL_INFORMATION,
                 ("Device state is %s.\n", xenbus_strstate(curState)));
    if (prevState != curState) {
        backend_changed(info->xbdev, curState);
    }
} while (curState == XenbusStateInitWait || curState == XenbusStateInitialised);
```

This means that the driver polls the backend state in a tight loop until it reads XenbusStateConnected. On the dom0 side, this results in extreme event channel traffic, with xenstored spending unusual amounts of CPU time returning nodes from the xenstore (e.g., /local/domain/0/backend/vif/144/0/state).

If for some reason the hotplug scripts fail to bring up the backend side correctly, the Windows driver never recovers: xenstored continues to be pummeled with requests, burning precious dom0 CPU cycles.

Why isn't this using a xenbus_watch* function?