Bug 1319544
| Summary: | nodedev-destroy sometimes failed when create vHBA via /sys/class/fc_host/hostN/vport_create | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Han Han <hhan> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED WONTFIX | QA Contact: | qe-baseos-daemons |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | dyuan, hhan, systemd-maint-list, udev-maint-list, xuzhang, yisun |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Mac OS | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-15 07:40:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Running 'vport_create' by hand, then removing via libvirt, borders on the fringes of what is really supported. I further assume, based on prior experience, that the environment you're running is a script, which makes things even more tenuous. If a customer wants to manage it on their own by using vport_create, then it stands to reason the customer would perform the vport_delete when they are done with it - that's why there's the managed='no' attribute. Using virsh nodedev-delete may or may not work, depending on a number of factors.

When nodedev-create is used, libvirt calls find_new_device() in src/node_device/node_device_driver.c. This routine's sole purpose is to fill in the internal data by calling nodeDeviceSysfsGetSCSIHostCaps from within nodeDeviceLookupSCSIHostByWWN. This stores the wwnn/wwpn and sets a specific flag indicating that we "know" this is an FC scsi_host (and that's the bit that's failing on the destroy path, causing the message you see). It's all remarkably complex, yet fragile with respect to timing conditions.

When using "vport_create" by hand, nodeDeviceSysfsGetSCSIHostCaps will get called when udevEventHandleCallback eventually gets around to processing the new-device event. But if virsh nodedev-destroy is run before that completes, then the correct bits won't be set and libvirt will rightfully choose not to destroy the node device, giving the error you got (because libvirt doesn't have enough information to do the destroy yet, since the create hasn't told us it's done). Even if it is called, it's possible that the 'fabric_wwn' and/or 'wwnn/wwpn' aren't filled in properly - there was a udev/kernel bug (bz 1240912) dealing with fabric_wwn, and in my testing environment the callback function ends up filling the wwnn/wwpn with wwnn=ffffffffffffffff wwpn=ffffffffffffffff. I think there was another bug on that, but I don't have those details.
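As a shell-level mitigation for the race described above, one could poll the new host's node_name until it holds a real WWN instead of trusting a fixed sleep. This is a hypothetical sketch, not libvirt code; `wait_for_vport` and the `FC_HOST_BASE` override (added so the helper can be exercised without FC hardware) are my own names:

```shell
# Poll /sys/class/fc_host/hostN/node_name until it is populated with a
# value other than the all-f's placeholder seen while udev has not yet
# finished setting the vport up. Returns 0 once a real WWN appears,
# 1 on timeout. FC_HOST_BASE is overridable for testing.
wait_for_vport() {
    n=$1
    timeout=${2:-10}
    d=${FC_HOST_BASE:-/sys/class/fc_host}/host$n
    i=0
    while [ "$i" -lt "$timeout" ]; do
        wwnn=$(cat "$d/node_name" 2>/dev/null)
        if [ -n "$wwnn" ] && [ "$wwnn" != 0xffffffffffffffff ]; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

Usage would be along the lines of `wait_for_vport 11 && virsh nodedev-destroy scsi_host11` (host number is an example). Note this only narrows the window; libvirt's own udev callback can still lag behind what the script sees.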
If I cat the /sys/class/fc_host/hostN/{node_name,port_name} files after the vport_create, they have the wwnn/wwpn I used for the vport_create, but internally I didn't get the right answer. So of course, on deletion I get a different error (invalid wwnn/wwpn).

A "work-around" of sorts is to run 'virsh nodedev-dumpxml scsi_hostN' on the scsi_hostN that was created by the vport_create command. That will also fill in the flag bits and wwnn, wwpn, fabric_name, etc. It is still susceptible to timing, though: if the udev callback hasn't run yet, the data may not be right - it could be, but it may not be. I'm hesitant to put code into the destroy function to perform the check, as I really don't think it's needed and it could eventually be susceptible to a timing issue given the size of the udev database.

I'd be curious to find out: in the event you get that message, does retrying the virsh nodedev-delete ever work? That is, if you get that specific failure, is there any point at which a follow-up attempt at the same command would succeed? Can the output of the node_name and port_name files be displayed in the script? I'm not going to close this yet, but I am leaning towards NOTABUG.

Well, since the first nodedev-destroy failure, the following nodedev-destroy will always fail until libvirtd is restarted.
I ran the following to catch the error and the node_name/port_name:
```bash
#!/bin/bash
for i in {166..200}; do
    echo '2101001b32a90001:2100001b32a9da4e' > /sys/class/fc_host/host4/vport_create
    sleep 3
    echo "node_name:$(cat /sys/class/fc_host/host$i/node_name) port_name:$(cat /sys/class/fc_host/host$i/port_name)"
    virsh nodedev-destroy scsi_host$i
    if [ $? -ne 0 ]; then
        echo error
        break
    fi
    sleep 3
done
```
The output is:

```
node_name:0x2100001b32a9da4e port_name:0x2101001b32a90001
Destroyed node device 'scsi_host166'
node_name:0x2100001b32a9da4e port_name:0x2101001b32a90001
Destroyed node device 'scsi_host167'
node_name:0x2100001b32a9da4e port_name:0x2101001b32a90001
error: Failed to destroy node device 'scsi_host168'
error: internal error: Device is not a fibre channel HBA
error
```
Still leaning towards NOTABUG. Even though you put in a 3-second "pause" between the vport_create and the echo of the wwnn/wwpn, it doesn't mean that things are completely set up properly by the underlying infrastructure (e.g. udev) when libvirt checks.
The first check is whether "/sys/class/fc_host/host%d" exists; if it doesn't, then the device is not an FC host. The 3-second pause in the script means nothing here, since the race is between udev and libvirt and is unaffected by the script's pause. Sure, by the time the script gets its look, things look right.
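That first check can be mirrored from the shell. A minimal hypothetical sketch, not libvirt's actual code (the `FC_HOST_BASE` override exists only so the helper can be exercised without FC hardware):

```shell
# Mirror of the first check described above: a scsi_hostN is treated as
# FC-capable only if /sys/class/fc_host/hostN exists. Hypothetical
# helper; FC_HOST_BASE is overridable for testing.
is_fc_host() {
    [ -d "${FC_HOST_BASE:-/sys/class/fc_host}/host$1" ]
}
```

For example, `is_fc_host 11 || echo "not an FC host"` would report the same condition libvirt complains about, without going through libvirtd's cached state.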
Does running 'virsh nodedev-dumpxml scsi_host$i', either prior to the destroy or after a failed destroy followed by a retry of the destroy, make things work? In pseudo-code:

```
while virsh nodedev-destroy scsi_host$i fails:
    virsh nodedev-dumpxml scsi_host$i
```

The dumpxml checks the environment again.
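The pseudo-code could be fleshed out as a small retry helper. A sketch assuming only that `virsh` is on PATH; the helper name and retry count are arbitrary:

```shell
# Retry nodedev-destroy, running nodedev-dumpxml between attempts; per
# the suggestion above, dumpxml re-reads the sysfs data and can refresh
# libvirt's cached FC capability info so a later destroy succeeds.
destroy_with_retry() {
    dev=$1
    tries=${2:-5}
    n=0
    until virsh nodedev-destroy "$dev"; do
        n=$((n + 1))
        [ "$n" -lt "$tries" ] || return 1
        virsh nodedev-dumpxml "$dev" >/dev/null 2>&1
        sleep 1
    done
}

# Usage sketch: destroy_with_retry scsi_host168
```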
Other assignment priorities have subsided a bit and I'm looking at this again.
If I alter your script just slightly to:

```bash
virsh nodedev-dumpxml scsi_host$i
virsh nodedev-destroy scsi_host$i
```

in the if/error path, then the scsi_host$i does delete properly. This is a timing issue related to valid data in the files. I can check for valid data rather easily and perform a retry; in a way it may be overkill, but it does work in my limited test.
I would like to test my adjustments in whatever environment you have. No hurry, since I'll be away from Jun 30 through July 8.
I posted a patch upstream: http://www.redhat.com/archives/libvir-list/2016-June/msg02213.html which, after a gentle nudge, got a response: http://www.redhat.com/archives/libvir-list/2016-July/msg00912.html indicating the fix should come from the kernel/udev, which fires off the add event before the device is fully set up. Reassigning to udev for further triage.

Moving to systemd, as udev bugs now live under the systemd component.

Description of problem:
As subject

Version-Release number of selected component (if applicable):
libgudev1-219-30.el7.x86_64
libvirt-3.2.0-4.el7.x86_64
qemu-kvm-rhev-2.9.0-3.el7.x86_64
kernel-3.10.0-568.el7.bz1421008.x86_64

How reproducible:
50%

Steps to Reproduce:
1. Frequently nodedev-create/destroy a vHBA through the script.

The xml file:

```xml
<device>
  <parent>scsi_host7</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
      <wwnn>20000024ff370144</wwnn>
      <wwpn>2101001b32a9f013</wwpn>
    </capability>
  </capability>
</device>
```

The script:

```bash
#!/bin/bash -x
hba_xml=./a.xml
for i in {1..1000}
do
    virsh nodedev-create $hba_xml
    node_name=scsi_$(ls --time=ctime /sys/class/fc_host/ | head -1)
    echo $node_name
    virsh nodedev-destroy $node_name
done
```

2. Execute the script; some error info displays as follows:

```
error: Failed to destroy node device 'scsi_host191'
error: Write of '2101001b32a9f012:2001001b32a9da4e' to '/sys/class/fc_host/host7/vport_delete' during vport create/delete failed: Resource temporarily unavailable
error: Failed to create node device from ./a.xml
error: Write of '2101001b32a9f012:2001001b32a9da4e' to '/sys/class/fc_host/host7/vport_create' during vport create/delete failed: No such file or directory
```

Actual results:
As step 2.

Expected results:
nodedev-destroy and nodedev-create succeed.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
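The reproducer above picks the new scsi_host with `ls --time=ctime | head -1`, which is itself timing-sensitive. A sketch of a diff-based alternative (the helper name `newest_host` is hypothetical):

```shell
# Print entries present in the "after" listing but not in "before"; a
# less timing-sensitive way to find the scsi_host that vport/nodedev
# creation just added. POSIX sh; listings are whitespace-separated.
newest_host() {
    before=$1
    after=$2
    for h in $after; do
        case " $before " in
            *" $h "*) ;;                 # existed before the create
            *) printf '%s\n' "$h" ;;     # newly appeared host
        esac
    done
}

# Usage sketch (assumes a real FC setup):
# before=$(ls /sys/class/fc_host)
# virsh nodedev-create a.xml
# after=$(ls /sys/class/fc_host)
# node_name=scsi_$(newest_host "$before" "$after")
```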
Created attachment 1138390 [details]
The libvirtd log of the nodedev-destroy command

Description of problem:
As subject

Version-Release number of selected component (if applicable):
libgudev1-219-19.el7_2.4.x86_64
libvirt-1.3.2-1.el7.x86_64
qemu-kvm-rhev-2.5.0-2.el7.x86_64
kernel-3.10.0-363.el7.x86_64

How reproducible:
50%

Steps to Reproduce:
1. Create a vHBA via /sys/class/fc_host/hostN/vport_create

```
# echo '2101001b32a90000:2100001b32a9da4e' > /sys/class/fc_host/host5/vport_create
# ls /sys/class/fc_host/
host11  host4  host5
```

2. Wait for a while, then destroy the vHBA via nodedev-destroy

```
# virsh nodedev-destroy scsi_host11
error: Failed to destroy node device 'scsi_host11'
error: internal error: Device is not a fibre channel HBA
```

Actual results:
As step 2.

Expected results:
nodedev-destroy succeeds.

Additional info:
After a libvirtd restart, the nodedev-destroy will be successful. If the vHBA is both created and destroyed via /sys/class/fc_host/host5/vport_{create,delete}, or both via nodedev-{create,destroy}, there is no bug.