Bug 1838145 - Instructions for creation of ceph.client.openstack keyring on external Ceph cluster don't use "profile rbd"
Summary: Instructions for creation of ceph.client.openstack keyring on external Ceph c...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 13.0 (Queens)
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: z13
Target Release: 13.0 (Queens)
Assignee: ndeevy
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-20 14:25 UTC by Alex Stupnikov
Modified: 2023-10-06 20:09 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-27 15:33:26 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Knowledge Base (Solution) 3391211 - last updated 2020-06-04 12:21:46 UTC

Description Alex Stupnikov 2020-05-20 14:25:37 UTC
Description of problem:

The customer is unable to remove a Cinder volume in the "available" state after taking the following actions:

- creation of a new Ceph-backed volume (same tenant)
- build of an instance that uses this new volume at boot
- shutoff (not graceful) of the compute node where this new instance was running
- power-on of the compute node
- manual start of the instance
- instance deletion
- volume deletion

Cinder fails to delete the volume (logs and related commands to be provided privately).

Version-Release number of selected component (if applicable):

cinder-volume 13.0-99

Comment 7 Alan Bishop 2020-05-28 14:27:32 UTC
[1] provides some useful background info on what appears to be a dangling RBD lock, and the Ceph caps required to break the lock. I checked the sosreport attached to the case, and it doesn't include a copy of ceph.client.openstack.keyring, so I cannot tell if the openstack client has "profile rbd" or "allow r". If it's the latter, then that explains why the openstack client cannot break the lock.

The next thing to try is the "admin" client, which should have the necessary caps to break the lock. However, this case comment [2] indicates there's no admin keyring on the controller. I don't know whether this is a director-based or external Ceph deployment, but you might find the admin keyring on an OSD node.

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020690.html
[2] https://access.redhat.com/support/cases/#/case/02653676?commentId=a0a2K00000V1BZSQA3
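
For reference, a rough sketch of how such a lock is usually inspected and broken (the pool, image, and lock names below are placeholders, not values from this case):

# list locks held on the image
rbd --id admin lock ls volumes/volume-<VOLID>
# break the lock reported above; the client doing this needs mon caps
# that permit the "osd blacklist" command (or an admin keyring)
rbd --id admin lock rm volumes/volume-<VOLID> <lock_id> <locker>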

Comment 8 Alex Stupnikov 2020-05-29 10:14:21 UTC
Alan, thank you for checking this bugzilla. I asked the customer to provide information about the client.openstack user's permissions. It looks like our official document [1] says to create client.openstack, and the overcloud doesn't actually get admin access to the Ceph cluster.

Let's wait and see.

[1]
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/integrating_an_overcloud_with_an_existing_red_hat_ceph_cluster/index#config-ceph-cluster
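
For reference, a quick way to check what caps the client actually has (a generic sketch; the "ceph auth get" form needs to be run where an admin keyring is available, e.g. on the external cluster):

# show the caps granted to the openstack client
ceph auth get client.openstack
# or inspect the keyring deployed on the overcloud node (path may vary)
cat /etc/ceph/ceph.client.openstack.keyring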

Comment 9 Alex Stupnikov 2020-05-29 13:35:05 UTC
We got confirmation from the customer that "profile rbd" is there [1]. I am wondering what our next steps should be: trying to delete the affected volume using admin credentials, or collecting more data?

[1]
[client.openstack]
        key = ## our 40-character key here ##
        caps mgr = "allow *"
        caps mon = "profile rbd"
        caps osd = "profile rbd pool=volumes, profile rbd pool=backups, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=metrics"

Kind Regards, Alex.

Comment 10 Alan Bishop 2020-05-29 14:16:46 UTC
Because this is an external cluster, it makes sense that the OSP nodes don't have an admin key. Based on what I read in comment #7 [1], I hoped the "profile rbd" setting in the openstack keyring would be sufficient to break the lock, but maybe a ceph expert can provide deeper insight. Meanwhile, if breaking the lock requires an admin keyring, then that would have to be done on an external (non-OSP) node that has that keyring.

I'm adding a few people from the Ceph squad in case they have other suggestions.

Comment 11 Alex Stupnikov 2020-06-01 08:31:50 UTC
Alan, we got the following KCS [1] from Ceph support; it is likely related to the issue. At the same time, I am confused about why the issue is still there after such a long time.

I asked the customer to provide the following outputs:

ceph auth list | grep client.openstack -A3
rados listwatchers volumes/volume-VOLID

[1]
https://access.redhat.com/solutions/3391211

Comment 12 Alex Stupnikov 2020-06-01 14:20:55 UTC
It looks like the "rados listwatchers volumes/volume-VOLID" output is not consistent: it is empty for the affected volume, but it is also empty for a volume that actually has watchers according to the "rbd status volumes/volume-VOLID" output.

I will return the case to collaboration with Ceph support and hope they will provide more insights...
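
For what it's worth, watchers are registered on the image's header object, so the rados query usually needs the rbd_header object name rather than the volume name (a sketch; the IDs are placeholders):

# find the image's internal id (block_name_prefix shows rbd_data.<id>)
rbd info volumes/volume-<VOLID>
# list watchers on the corresponding header object
rados -p volumes listwatchers rbd_header.<id>
# rbd status reports the same watchers directly
rbd status volumes/volume-<VOLID>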

Comment 13 Alex Stupnikov 2020-06-04 12:21:32 UTC
Hi Alan and Jon.

The customer confirmed that the KCS [1] fixed the issue: after updating the MON caps for the openstack client to [2], we were able to successfully run the "rbd ... rm ..." command on the controller node. I am wondering if we need to update our "Integrating an Overcloud with an Existing Red Hat Ceph Cluster" guide [3] to require the proper caps?

Kind Regards, Alex.

[1]
https://access.redhat.com/solutions/3391211
[2]
        caps mon: allow r, allow command "osd blacklist"
[3]
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/integrating_an_overcloud_with_an_existing_red_hat_ceph_cluster/index
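
A sketch of how the caps change from [2] is typically applied on the external cluster ("ceph auth caps" replaces all caps for the client, so the mgr/osd caps have to be restated; pool names follow the keyring in comment #9):

ceph auth caps client.openstack \
  mgr 'allow *' \
  mon 'allow r, allow command "osd blacklist"' \
  osd 'profile rbd pool=volumes, profile rbd pool=backups, profile rbd pool=vms, profile rbd pool=images, profile rbd pool=metrics'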

Comment 14 Alan Bishop 2020-06-04 17:14:19 UTC
I'm flipping this one to the Ceph squad. Is it something that could be handled in our ceph-external THT?

Comment 16 Alex Stupnikov 2020-06-05 08:16:51 UTC
I have reported bug #1844360 for the external scenario, so the documentation team will have a separate one.

Comment 17 Alex Stupnikov 2020-06-08 09:28:34 UTC
Note: the customer confirmed that after tuning the caps, the original problem is no longer there, and a compute node power outage doesn't leave any stale locks on RBD images.

Comment 33 Alex Stupnikov 2020-06-12 07:04:44 UTC
It looks like "profile rbd" is already configured by TripleO, so I guess that this bug can be closed (but I could be wrong here)

Comment 35 Giulio Fidente 2020-06-15 12:05:15 UTC
(In reply to Alex Stupnikov from comment #33)
> It looks like "profile rbd" is already configured by TripleO, so I guess
> that this bug can be closed (but I could be wrong here)

As per comment #25, I think the fix we need is to add 'allow command "osd blacklist"' to the jewel (OSP 10 / RHCS 2) keyring.
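
A sketch of what that would look like on an RHCS 2 (jewel) cluster, where the classic rwx-style osd caps are used instead of "profile rbd" (the osd caps must be restated to match the existing keyring; this is an illustration, not a tested configuration):

ceph auth caps client.openstack \
  mon 'allow r, allow command "osd blacklist"' \
  osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rwx pool=images'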

