Description of problem:
Customer is unable to remove a Cinder volume in "available" state after taking the following actions:
- creation of a new Ceph-backed volume (same tenant)
- build of an instance that uses this new volume at boot
- ungraceful shutoff of the compute node where this new instance was running
- power-on of the compute node
- manual start of the instance
- instance deletion
- volume deletion

Cinder fails to delete the volume (logs and related commands to be provided privately).

Version-Release number of selected component (if applicable): cinder-volume 13.0-99
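For anyone trying to reproduce, the steps above could be driven from the CLI roughly as follows. This is a sketch only; the flavor, image, network, and resource names are placeholders I made up, not taken from the case:

```shell
# 1. Create a new Ceph-backed volume from an image in the same tenant
#    (image/flavor/network names below are placeholders)
openstack volume create --image cirros --size 10 test-vol

# 2. Boot an instance from that volume
openstack server create --volume test-vol --flavor m1.small \
    --network private test-inst

# 3. Power off the compute node hosting test-inst ungracefully
#    (e.g. pull power, or "echo o > /proc/sysrq-trigger" on the hypervisor),
#    then power the node back on.

# 4. Start the instance manually, then delete the instance and the volume
openstack server start test-inst
openstack server delete --wait test-inst
openstack volume delete test-vol   # this is the step that fails
```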
[1] provides some useful background info on what appears to be a dangling RBD lock, and the Ceph caps required to break the lock. I checked the sosreport attached to the case, and it doesn't include a copy of ceph.client.openstack.keyring, so I cannot tell whether the openstack client has "profile rbd" or "allow r". If it's the latter, then that explains why the openstack client cannot break the lock. The next thing to try is to use the "admin" client, which should have the necessary caps to break the lock. However, this case comment [2] indicates there's no admin keyring on the controller. I don't know if this is a director-based or external Ceph deployment, but you might find the admin keyring on an OSD node.

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020690.html
[2] https://access.redhat.com/support/cases/#/case/02653676?commentId=a0a2K00000V1BZSQA3
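For reference, a sketch of how a dangling RBD lock is usually inspected and broken, assuming the image lives in the "volumes" pool; VOLID, the lock id, the client id, and the address are all placeholders:

```shell
# List locks on the image; the output shows the locker (client.NNNN),
# a lock id, and the locker's address
rbd lock list volumes/volume-VOLID

# Break the lock, naming both the lock id and the locker
rbd lock remove volumes/volume-VOLID <lock-id> client.<NNNN>

# Breaking the lock of a dead-but-registered client may also require
# blacklisting its address, which is what needs the mon
# 'allow command "osd blacklist"' cap discussed in this bug
ceph osd blacklist add <client-address>
```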
Alan, thank you for checking this bugzilla. I asked the customer to provide information about the client.openstack user's permissions. It looks like our official document [1] instructs the user to create client.openstack, and the overcloud doesn't actually get admin access to the Ceph cluster. Let's wait and see.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/integrating_an_overcloud_with_an_existing_red_hat_ceph_cluster/index#config-ceph-cluster
We got confirmation from the customer that "profile rbd" is there [1]. I am wondering what our next steps should be: trying to delete the affected volume using admin credentials, or collecting more data?

[1]
[client.openstack]
    key = ## our 40-char-length key here ##
    caps mgr = "allow *"
    caps mon = "profile rbd"
    caps osd = "profile rbd pool=volumes, profile rbd pool=backups, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=metrics"

Kind Regards, Alex.
Because this is an external cluster, it makes sense that the OSP nodes don't have an admin key. Based on what I read in comment #7 [1], I hoped the "profile rbd" setting in the openstack keyring would be sufficient to break the lock, but maybe a ceph expert can provide deeper insight. Meanwhile, if breaking the lock requires an admin keyring, then that would have to be done on an external (non-OSP) node that has that keyring. I'm adding a few people from the Ceph squad in case they have other suggestions.
Alan, we got the following KCS [1] from Ceph support; it is likely related to the issue. At the same time, I am confused about why the issue persists after such a long time. I asked the customer to provide the following outputs:

ceph auth list | grep client.openstack -A3
rados listwatchers volumes/volume-VOLID

[1] https://access.redhat.com/solutions/3391211
It looks like the "rados listwatchers volumes/volume-VOLID" output is not consistent: it is empty for the affected volume, but it is also empty for a volume that actually has watchers according to "rbd status volumes/volume-VOLID" output. I will return the case back to collaboration with Ceph support; hopefully they will provide more insight...
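One possible explanation for the empty listwatchers output (an assumption on my side, not confirmed in the case): for format-2 RBD images, the client watch is registered on the image's internal rbd_header.<image-id> object, not on an object named after the image, so listing watchers on the image name would always come back empty. A sketch of commands to cross-check, assuming the "volumes" pool and placeholder VOLID:

```shell
# Find the image's internal id from the block name prefix
# (e.g. "block_name_prefix: rbd_data.<image-id>")
rbd info volumes/volume-VOLID | grep block_name_prefix

# Watchers as reported by rbd itself
rbd status volumes/volume-VOLID

# Watchers on the header object -- substitute <image-id> from the info output
rados -p volumes listwatchers rbd_header.<image-id>
```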
Hi Alan and Jon. The customer confirmed that KCS [1] fixed the issue: after updating the MON caps for the openstack client to [2], we were able to successfully run the "rbd .. rm .." command on the controller node. I am wondering if we need to update our "Integrating an Overcloud with an Existing Red Hat Ceph Cluster" guide [3] to request the proper caps?

Kind Regards, Alex.

[1] https://access.redhat.com/solutions/3391211
[2] caps mon: allow r, allow command "osd blacklist"
[3] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/integrating_an_overcloud_with_an_existing_red_hat_ceph_cluster/index
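For reference, a sketch of the caps update described above, reusing the client name and pool list from the keyring posted earlier in this bug (run on a node that has an admin keyring; note that ceph auth caps replaces all caps, so the osd/mgr caps must be restated):

```shell
# Grant the openstack client permission to blacklist dead clients,
# which is what breaking a dangling exclusive lock requires
ceph auth caps client.openstack \
    mon 'allow r, allow command "osd blacklist"' \
    osd 'profile rbd pool=volumes, profile rbd pool=backups, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=metrics' \
    mgr 'allow *'
```

My understanding (to be confirmed by the Ceph squad) is that on Luminous and later the mon "profile rbd" cap already includes the blacklist permission, which is why the explicit grant only matters for older (Jewel) clusters.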
I'm flipping this one to the Ceph squad. Is it something that could be handled in our ceph-external THT?
I have reported bug #1844360 for the external scenario, so the documentation team will have a separate one.
Note: the customer confirmed that after tuning the caps the original problem is no longer there, and a compute node power outage no longer leaves RBD images locked.
It looks like "profile rbd" is already configured by TripleO, so I guess that this bug can be closed (but I could be wrong here)
(In reply to Alex Stupnikov from comment #33) > It looks like "profile rbd" is already configured by TripleO, so I guess > that this bug can be closed (but I could be wrong here) As per comment #25 I think the fix we need is to add 'allow command "osd blacklist"' to the jewel (osp10/rhcs2) keyring
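For illustration, the pre-created external-cluster keyring for a Jewel (osp10/rhcs2) deployment would then look something like this. This is a sketch based only on the caps discussed in this bug; the key is a placeholder, and the pool list should match the actual deployment:

```ini
[client.openstack]
    key = <generated key>
    caps mon = "allow r, allow command \"osd blacklist\""
    caps osd = "profile rbd pool=volumes, profile rbd pool=backups, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=metrics"
```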