Description of problem:
My goal was to add several devices to a kvm-guest using VT-d.
When I attempt to remove a PCI device from the host and add it to pci-stub, the system hangs.
For example, the following hangs on the HP P812 Controller which has the following info:
slot 08: 08:00.0 103c:323a-103c:3249 AM312A [HP PCIe SAS SA P812 1GB Flash Cache.] (cciss)
I try to execute the following commands:
echo "103c 323a" > /sys/bus/pci/drivers/pci-stub/new_id
echo 0000:08:00.0 > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
echo 0000:08:00.0 > /sys/bus/pci/drivers/pci-stub/bind
However, the system just hangs after the first echo. I can hit enter and a new line scrolls on the console, but ctrl-c will not exit this state. I can open a new ssh connection to the system, so it's still running, but the original terminal where I tried to echo to new_id is still hung.
This happens on many storage cards. On some I have gotten past the first echo, but the system hung soon after the next command.
These exact same commands work with SLES11SP1, which just pulled in the qemu-kvm .12 stable branch.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. On a DL980G7, install RHEL5.5 to core storage. Install KVM.
2. Install several storage HBAs, such as HP-P812, Emulex 8Gb FC cards, etc.
3. Begin the process of assigning storage to a kvm guest using VT-d.
Actual results:
Console hangs on an echo to new_id
Expected results:
No hang should occur.
I set up a basic config on other machines to verify this. It appears that the hang occurs most often after the echo to "unbind" then to "new_id".
I also realize that the version of qemu-kvm doesn't matter here; this is kernel related. So I guess I mean that whatever SLES11SP1 is doing works for this. Maybe that helps in debug.
does the following sequence work?: unbind; new_id
I believe adding a new_id to pci-stub & having it unbound from its original driver will cause a probe of the driver; iow, the bind isn't necessary.
btw the original description & c#1 seem to conflict as to the order you
did the cmds. did you do: new_id, unbind, bind; or unbind, new_id, <shell-hang> ?
or, if you tried multiple sequences, pls list which were tried.
Here are some details on this one:
"does the following sequence work?: unbind; new_id"
The echo to unbind causes a hang right away. The system doesn't appear to be doing anything.. just sitting idle.. but that console will only scroll blank lines on "enter". No ctrl-c or anything similar appears to have any impact.
"I believe adding a new_id to pci-stub & having it unbound from its original
driver will cause a probe of the driver; iow, the bind isn't necessary."
If I do this, how does the pci-stub know which card to bind to?
In this system, 3 smart array cards share the address "103c 323a", but only the p812 is located in 0000:08:00.0
Or does that even matter... Maybe it just matters which pci id you pass to the guest to boot with, with VT-D?
Here is an example of what I mean by "share the address "103c 323a"":
103c:323a-103c:3249 AM312A [HP PCIe SAS SA P812 1GB Flash Cache.] (cciss)
103c:323a-103c:3247 AM311A [HP PCIe SAS SA P411 256MB Ctlr] (cciss)
103c:323a-103c:3241 SA-P212 [HP PCIe SAS SA P212 Ctlr] (cciss)
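To see which slots share a vendor:device pair and how their subsystem IDs differ, something like the following could be used. It is only a sketch: it assumes the standard sysfs attribute files (vendor, device, subsystem_vendor, subsystem_device) are present for each device, and the function name list_pci_matches and its optional sysfs-root argument are made up for illustration.

```shell
# Hypothetical helper: list_pci_matches VID DID [SYSFS_ROOT]
# Prints one line per matching device: "BDF svid:sdid driver".
list_pci_matches() (
    vid=$1 did=$2 root=${3:-/sys}
    for dev in "$root"/bus/pci/devices/*; do
        [ -r "$dev/vendor" ] || continue
        # sysfs reports IDs as 0x-prefixed hex, e.g. 0x103c
        [ "$(cat "$dev/vendor")" = "0x$vid" ] || continue
        [ "$(cat "$dev/device")" = "0x$did" ] || continue
        sv=$(cat "$dev/subsystem_vendor")
        sd=$(cat "$dev/subsystem_device")
        # The "driver" symlink exists only while a driver is bound.
        if [ -L "$dev/driver" ]; then
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv="(none)"
        fi
        echo "$(basename "$dev") ${sv#0x}:${sd#0x} $drv"
    done
)
```

Running list_pci_matches 103c 323a on the host should print one line per Smart Array card, so the P812 at 0000:08:00.0 can be told apart from the others by its subsystem ID.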
"btw the original description & c#1 seem to conflict as to the order you
did the cmds. did you do: new_id, unbind, bind; or unbind, new_id, <shell-hang> ?"
I always have done new_id, unbind, bind. In C#1 I meant to say that the hang is after unbind and not new_id.
Thanks for detailed info... it explains possible issues.
So, some background on new_id, bind, unbind....
the echo "vid did" adds that 'vid did' to the pci table that the driver
can 'match on' for a given PCI device, and thus the driver's probe routine
is called when such a device is scanned (or "bind"-ed) *if* the device
doesn't already have a driver associated with it (have a driver ... er, um,
driving it .... already).
So, adding a vid-did pair to new_id should never cause a hang because it just
expands a table for possible device<->driver matching.
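Since new_id only extends the match table, the entry can also be made narrower. The kernel's new_id format accepts optional subsystem vendor and subsystem device fields after the vendor and device IDs, so a sketch like the one below could restrict pci-stub to the P812 (103c:3249) instead of every 103c:323a card. The helper name add_stub_id and its argument order are hypothetical, and whether the RHEL5.5 pci-stub backport honors the extra fields is an assumption that would need checking on the target kernel.

```shell
# Hypothetical helper: add_stub_id SYSFS_ROOT VID DID [SVID SDID]
# Writes "VID DID" or "VID DID SVID SDID" to pci-stub's new_id file.
add_stub_id() (
    root=$1 vid=$2 did=$3 svid=${4:-} sdid=${5:-}
    line="$vid $did"
    # Append the subsystem pair only when both halves are given.
    if [ -n "$svid" ] && [ -n "$sdid" ]; then
        line="$line $svid $sdid"
    fi
    echo "$line" > "$root/bus/pci/drivers/pci-stub/new_id"
)
```

For example, add_stub_id /sys 103c 323a 103c 3249 would add an entry for the P812 only, while add_stub_id /sys 103c 323a matches the whole controller family.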
Now, the unbind for a given device will invoke the 'remove' entry point of a driver for a given device.
Given you are seeing console hangs at this point, I would surmise your driver is
hanging during this function(stack), and you have a driver problem.
Tracing the remove code flow in the driver should show where the hang is.
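While tracing, it can help to issue the sysfs write from a background process so the login shell stays usable when the driver's remove routine blocks. The following is a rough sketch under that assumption; write_with_timeout is a made-up name, and the helper does not (and cannot) kill a task stuck in the kernel, it only reports that the write has not returned.

```shell
# Hypothetical helper: write_with_timeout VALUE FILE [SECONDS]
# Exit status 0 if the write finished, 1 if it is still blocked
# after SECONDS (default 10).
write_with_timeout() (
    value=$1 file=$2 limit=${3:-10}
    # Do the write in a child so a blocked write does not block us.
    ( echo "$value" > "$file" ) &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "write to $file still blocked after ${limit}s (pid $pid)" >&2
            exit 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Child has exited; pass its exit status back to the caller.
    wait "$pid"
)
```

If write_with_timeout 0000:08:00.0 /sys/bus/pci/devices/0000:08:00.0/driver/unbind reports a blocked write, the reported pid is the task to look for in a task dump, for instance one triggered with echo t > /proc/sysrq-trigger where sysrq is enabled.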
Bind will cause the driver to see if that device has a matching vid-did, and if so, invoke that driver's probe routine. For pci-stub, it does zippo to the device, but tags pci-stub as the in-use driver for that device (by invoking pci_register_driver()), so a pci scan won't try to attach another (the original) driver to that device.
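Putting new_id, unbind, and bind together, the sequence could be scripted roughly as below. This is a sketch, not a tested procedure for this hardware: rebind_to_stub and its arguments are hypothetical, and it assumes the device's "driver" symlink in sysfs reflects the currently bound driver. The explicit bind is skipped when pci-stub has already claimed the device, which is what would happen if the device had no driver at the time new_id was written.

```shell
# Hypothetical helper: rebind_to_stub BDF VID DID [SYSFS_ROOT]
# Prints the name of the driver bound to BDF after the sequence.
rebind_to_stub() (
    bdf=$1 vid=$2 did=$3 root=${4:-/sys}
    dev="$root/bus/pci/devices/$bdf"
    stub="$root/bus/pci/drivers/pci-stub"

    # Name of the driver currently bound to the device, or "(none)".
    current_driver() {
        if [ -L "$dev/driver" ]; then
            basename "$(readlink "$dev/driver")"
        else
            echo "(none)"
        fi
    }

    # 1. Let pci-stub match this vendor:device pair.
    echo "$vid $did" > "$stub/new_id" || exit 1

    # 2. Detach the original driver (this calls its remove routine).
    case "$(current_driver)" in
        "(none)"|pci-stub) ;;
        *) echo "$bdf" > "$dev/driver/unbind" || exit 1 ;;
    esac

    # 3. Bind explicitly only if pci-stub has not claimed the device yet.
    if [ "$(current_driver)" != "pci-stub" ]; then
        echo "$bdf" > "$stub/bind" || exit 1
    fi

    current_driver
)
```

A call such as rebind_to_stub 0000:08:00.0 103c 323a should end by printing pci-stub on a working setup. Step 2 is still the point where a driver without a usable remove path would be expected to hang.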
Last but not least, the vid-did-svid-sdid you show in c#3 does not appear to be an 'expected' use of vid-did-svid-sdid. A vid-did pair should uniquely identify a device; svid-sdid should be used to identify variations of a PCI device, like size of buffer ram provided, (sub-)vendor unique tweaks (like SROM attached or not attached if a device has such variances & can't be i-d'd through other registers). It appears from the listing above that you have 3 different devices that use the same driver, but have the same vid-did. The pci driver tables are designed to handle this case by assigning a unique did to each device, and in the case above, listing 3 vid-did pairs in an array of pci_device_id structs, registered by the driver with the pci subsystem (via pci_register_driver() ).
Although it shouldn't be a problem in the above scenario (since you unbind a specific device via its BDF, and the bind uses the same one), it's not the typical use of vid-did-svid-sdid.
So, I'm guessing the cciss driver is not designed for hot-plug, or else
it would fail on unplug (which invokes remove as well; possibly suspend before the remove too).
Do you have a system you can try hot-adding this cciss device to & from & see if it mimics this hang behavior (at unplug time)?
For Smart Array we key off the subsystem ID. The device ID identifies a family of Smart Array controllers. IOW, 103c323a covers the P410, P410i, P411, P212, and P812. The 103c3249 identifies this controller as a P812. Not sure if that helps at all.
The cciss driver is not designed for hot-plug.
(In reply to comment #5)
> For Smart Array we key off the subsystem ID. The device ID identifies a family
> of Smart Array controllers. IOW, 103c323a covers the P410, P410i, P411,P212,
> and P812. The 103c3249 identifies this controller as a P812. Not sure if that
> helps at all.
> The cciss driver is not designed for hot-plug.
And this last sentence is the arrow in the heart:
A driver must be designed for hot-plug (support remove) in order
to do device-assignment.
Closing this bz as "NOTABUG" wrt device-assignment, since it is a driver issue.
Feel free to re-open if you have further data stating (proving it works in hw hot-plug configuration) otherwise.