Robb/Jeremy,

According to the bug description, all your troubles happened on the compute node; at least, that is what I understand. Please confirm that.
Hi Sergey,

Sure; let me summarize what we were able to determine over the weekend. In a three-controller / five-compute-node setup, with Cinder running on the controllers (backing storage presented to the compute nodes over iSCSI, as directed by Cinder on the controllers), a series of volumes suddenly could not be attached to or detached from a running instance. Only a single instance was tested to my knowledge, and I don't know whether any other instances were running on this particular compute node.

The crux of this was controller01, which was serving compute01 as its Cinder `client` while the instance was hosted there. During operation, the controller, which is part of a high-availability Cinder setup via pacemaker, hit a libqp bug that caused the controller01 node to reboot, as described in this document: https://access.redhat.com/solutions/1415463

During the reboot, the state of the storage could not be altered from Horizon. The service successfully migrated over to controller02, the pacemaker standby node, and the question remains why, after migration, the new controller running Cinder was still unable to alter the state of the storage. The condition was cleared by rebooting the instance, which to my knowledge was a RHEL 6 KVM instance. I'll test this myself as well, just to see if this is reproducible on similar versions.
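For anyone retracing this, a quick way to confirm which controller owned the Cinder resources after the failover is a sketch like the following (the pacemaker resource names are hypothetical; adjust to the actual cluster configuration):

```shell
# On a surviving controller node, show where pacemaker is currently
# running each resource (resource names such as "cinder-volume" are
# placeholders for whatever this cluster actually defines):
pcs status resources

# Ask Cinder itself which service hosts it believes are up; a stale
# "down" cinder-volume entry for controller01 would be expected here
# right after the reboot:
cinder service-list
```

If `pcs status resources` shows Cinder on controller02 but `cinder service-list` still reports the controller01 host, that mismatch would be consistent with the stuck attach/detach behavior described above.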
Robb/Jeremy,

According to the bug description, the Nova compute node lost its iSCSI connection to the iSCSI targets. Most probably this has nothing to do with Cinder. Cinder is responsible for creating volumes and exporting their connection information to Nova; from that point on, Nova is connected directly to the iSCSI targets. So if a detach operation fails on the Nova compute node, that probably means something went wrong between the compute node and the iSCSI targets.

Unfortunately, I can't say more without the relevant nova-compute and messages logs from the compute node. The sosreports contain only controller logs (nova-api and scheduler). Please get and upload sosreports from the compute nodes ASAP.
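Since Nova talks to the targets directly, the compute-node side of the path can be checked with the standard open-iscsi tooling; a minimal sketch of what to look at on compute01:

```shell
# On the compute node (compute01), list the active iSCSI sessions to
# verify the data path to the targets is still up:
iscsiadm -m session

# Verbose view, including the SCSI device state for each attached LUN:
iscsiadm -m session -P 3

# Check the system log for iscsid connection errors around the time
# of the failure:
grep -i iscsid /var/log/messages | tail -50
```

Sessions still listed as logged in, with no iscsid errors in the log, would support Robb's reading that the target stayed connected throughout.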
Hi Sergey,

> According to the bug description, the Nova compute node lost its iSCSI
> connection to the iSCSI targets.

Not exactly; the iSCSI target appears to have remained connected to the compute01 host even though its Cinder controller went offline due to the node reboot. The issue only appeared, in our case, once controller02 took over Cinder from the now-rebooting controller01: we were unable to attach/detach the storage from the compute01 host until the instance was rebooted. Unfortunately, we weren't there when they rebooted the instance.

This could be tested, I think, by standing up an HA Cinder setup with backing storage presented to the compute nodes via iSCSI, then using pcs to fence the currently active Cinder node and checking whether shared storage can still be attached to and detached from running instances. Of course, if this is at all racy, we may not catch it in the same scenario/operations it was in the middle of when the active Cinder service went down, so it may be difficult to reproduce as described.

> Unfortunately, I can't say more without the relevant nova-compute and
> messages logs from the compute node. The sosreports contain only controller
> logs (nova-api and scheduler). Please get and upload sosreports from the
> compute nodes ASAP.

Jeremy, could you do this for us, please? It'd probably be best to host them on a system where Sergey can look at them, as we also collected all of /var in the meantime. Thanks!
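The fencing test proposed above could look something like this; all node, instance, and volume names are hypothetical placeholders, and the exact commands would depend on the cluster and OpenStack CLI versions in use:

```shell
# Rough reproduction sketch for the scenario described above.

# 1. Confirm which node currently hosts the Cinder resources:
pcs status

# 2. Fence the active Cinder node to simulate the crash/reboot
#    (here assumed to be controller01):
pcs stonith fence controller01

# 3. After pacemaker moves Cinder to the standby node, try to detach
#    and re-attach a volume from a running instance:
nova volume-detach <instance-id> <volume-id>
cinder list    # watch for the volume getting stuck in "detaching"
nova volume-attach <instance-id> <volume-id>
```

A volume wedged in "attaching"/"detaching" after the fence, clearing only on instance reboot, would reproduce what we saw.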
Robb/Jeremy,

The link to the controller logs in the collab shell from comment #1 no longer works for me. Can you please restore it?