1765178 – [OSP 17.0.0] Instances with encrypted boot volumes are unable to be started after a hypervisor crash

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	13.0 (Queens)
Hardware:	All
OS:	All
Priority:	medium
Severity:	medium
Target Milestone:	beta
Target Release:	17.0
Assignee:	Lee Yarwood
QA Contact:	OSP DFG:Compute
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1905010 1905016 1905017

Reported:	2019-10-24 13:18 UTC by Elf Lewis
Modified:	2023-03-21 19:24 UTC
CC List:	13 users
Fixed In Version:	openstack-nova-23.0.3-0.20210908140341.e39bbdc.el9ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1905010
Environment:
Last Closed:	2022-09-21 12:07:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	764246	None	MERGED	libvirt: Skip encryption metadata lookups if secret already exists on host	2021-01-26 10:53:24 UTC
Red Hat Issue Tracker	OSP-2003	None	None	None	2021-11-18 14:46:37 UTC
Red Hat Product Errata	RHEA-2022:6543	None	None	None	2022-09-21 12:09:15 UTC

Description Elf Lewis 2019-10-24 13:18:05 UTC

Description of problem:
The customer is testing Cinder/Barbican LUKS encryption. They followed the Red Hat guides to set this up; the only changes we have made are the following overrides to Barbican's policies:

{
  "creator": "role:_member_",
  "secret:decrypt": "rule:secret_decrypt_non_private_read or rule:secret_project_creator or rule:secret_project_admin or rule:secret_acl_read or role:admin",
  "secret:delete": "rule:secret_project_admin or rule:secret_project_match or rule:secret_project_creator or role:admin",
  "secret:get": "rule:secret_non_private_read or rule:secret_project_creator or rule:secret_project_admin or rule:secret_acl_read or role:admin"
}

This allows the member user to be both a creator and a consumer of Barbican secrets. It also allows the Cinder/Nova servers to handle secrets; these changes were needed for live migration to work with encrypted volumes.
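To make the override easier to read, here is a minimal Python sketch that loads the policy above and lists which roles each rule admits directly. It naively splits each rule on " or " and is not a substitute for the oslo.policy rule parser; the helper name direct_role_checks is hypothetical.

```python
import json

# Policy override copied from the report above.
POLICY_JSON = (
    '{"creator":"role:_member_",'
    '"secret:decrypt":"rule:secret_decrypt_non_private_read or '
    'rule:secret_project_creator or rule:secret_project_admin or '
    'rule:secret_acl_read or role:admin",'
    '"secret:delete":"rule:secret_project_admin or rule:secret_project_match or '
    'rule:secret_project_creator or role:admin",'
    '"secret:get":"rule:secret_non_private_read or rule:secret_project_creator or '
    'rule:secret_project_admin or rule:secret_acl_read or role:admin"}'
)


def direct_role_checks(rule):
    """Return the role:<name> alternatives of a rule (naive split on ' or ')."""
    parts = [part.strip() for part in rule.split(" or ")]
    return [part for part in parts if part.startswith("role:")]


policy = json.loads(POLICY_JSON)
roles_by_rule = {name: direct_role_checks(rule) for name, rule in policy.items()}

for name, roles in roles_by_rule.items():
    print(f"{name}: {roles}")
```

The rule: references (for example rule:secret_project_creator) point at other policy entries that are not shown in this report, so they are left unresolved here.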

They also have "resume_guest_on_boot" set to true for Nova.

They use the following procedure to test:

1. Create an encrypted boot volume
2. Boot the instance from this volume
3. Check that the instance is working
4. Power off the hypervisor (hard reboot - poweroff)
5. Power on the hypervisor

At this point the following is seen in the nova-compute log:

2019-10-22 13:24:29.707 1 ERROR os_brick.encryptors [req-16cd0b51-6fd3-40b4-ac54-c4486c9d8e1b - - - - -] Failed to retrieve encryption metadata for volume <volume ID>: Unknown auth type: None (HTTP 401): Unauthorized: Unknown auth type: None (HTTP 401)
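When scanning a large nova-compute log for this failure, a small parser can help. The following Python sketch extracts the request ID, the volume field, and the HTTP status from the line quoted above. The regular expression is written against that single sample line, so treat it as an assumption about the format rather than a stable os-brick log contract; parse_failure is a hypothetical helper name.

```python
import re

# Sample line copied from the report above (volume ID was redacted by the reporter).
LOG_LINE = (
    "2019-10-22 13:24:29.707 1 ERROR os_brick.encryptors "
    "[req-16cd0b51-6fd3-40b4-ac54-c4486c9d8e1b - - - - -] "
    "Failed to retrieve encryption metadata for volume <volume ID>: "
    "Unknown auth type: None (HTTP 401): Unauthorized: "
    "Unknown auth type: None (HTTP 401)"
)

PATTERN = re.compile(
    r"ERROR os_brick\.encryptors \[(?P<req>req-[0-9a-f-]+)[^\]]*\] "
    r"Failed to retrieve encryption metadata for volume (?P<volume>.+?): "
    r".*\(HTTP (?P<status>\d{3})\)"
)


def parse_failure(line):
    """Return request ID, volume and HTTP status, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "request_id": match.group("req"),
        "volume": match.group("volume"),
        "http_status": int(match.group("status")),
    }


result = parse_failure(LOG_LINE)
print(result)
```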

The instance is now shown in the error state and cannot be started.

What we have tried:
1. Reset state, reboot
2. Reset state, reboot --hard

At the same time, virsh list --all appears to show that the instance has been removed completely from the hypervisor, and as far as we can see the domain XML file is also missing.

A soft reboot (os shutdown) works as expected.

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Always

Steps to Reproduce:
1. Deploy an instance from an encrypted Cinder volume
2. Power off the hypervisor by removing power (not an OS shutdown)
3. Bring the host back online
4. Try to power on the instance

Actual results:
Instance does not boot and is in the error state

Expected results:
Instance boots normally

Additional info:

Comment 2 Eric Harney 2019-10-24 13:24:42 UTC

What version of Nova is installed here?  I'd like to know if

https://review.opendev.org/#/c/656464/

is present in their environment already.

Comment 4 Elf Lewis 2019-10-24 16:19:46 UTC

One further note:

Further testing shows that the problem does not occur when "resume_guest_on_boot" is set to false. With it set to false, we can hard power off hosts and everything works as expected (although we obviously then have to restart instances manually). The problem occurs only when "resume_guest_on_boot" is set to true.
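For anyone checking which mode a compute node is in, here is a minimal Python sketch. It assumes that "resume_guest_on_boot" in this report refers to Nova's resume_guests_state_on_host_boot option in the [DEFAULT] section of nova.conf, and it uses the standard-library configparser as an approximation of how oslo.config reads the file. The sample file content and the helper name are made up for illustration.

```python
import configparser

# Hypothetical nova.conf content, for illustration only.
SAMPLE_NOVA_CONF = """
[DEFAULT]
resume_guests_state_on_host_boot = true
"""


def resumes_guests_on_boot(conf_text):
    """Return True if the config text enables resuming guests on host boot."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    # An unset option is treated as disabled here (assumed default: False).
    return parser.getboolean(
        "DEFAULT", "resume_guests_state_on_host_boot", fallback=False
    )


enabled = resumes_guests_on_boot(SAMPLE_NOVA_CONF)
print("resume guests on host boot:", enabled)
```

In a containerized deployment the effective nova.conf may live inside the nova_compute container, so the path to read is deployment-specific and is deliberately not assumed here.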

Comment 5 Alan Bishop 2019-10-29 15:59:21 UTC

The upstream patch Eric referenced in comment #2 seems relevant, and looks to be included in the upcoming 13z9 release (it does not appear to be in z8).

I'd like the nova team to confirm all of this is true.

Comment 6 Lee Yarwood 2019-11-01 10:18:21 UTC

https://review.opendev.org/#/c/656464/ could potentially work around this, but I think the issue is slightly different. I believe the issue here is that we don't have the required user or admin context to satisfy the b-api policy at n-cpu startup. Even with this change in place, I think we still call out to b-api to fetch the encryption metadata, so this might continue to fail. IIRC, when testing upstream with devstack I didn't hit this issue, so I wonder if service tokens are the real solution here with TripleO?
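The upstream change linked on this bug (OpenStack gerrit 764246) is titled "libvirt: Skip encryption metadata lookups if secret already exists on host". The Python sketch below illustrates only the idea expressed in that title: avoid the remote lookup when the host already holds a secret for the volume. It is not Nova's code; the function and callback names are hypothetical.

```python
from typing import Callable, Optional


def encryption_metadata_for_attach(
    volume_id: str,
    secret_exists_on_host: Callable[[str], bool],
    fetch_metadata: Callable[[str], dict],
) -> Optional[dict]:
    """Return encryption metadata, or None when the host secret can be reused.

    Both callbacks are hypothetical stand-ins: one checks the hypervisor for an
    existing secret, the other performs the remote call that needs credentials.
    """
    if secret_exists_on_host(volume_id):
        # Secret already present: skip the lookup that would fail
        # without a user or admin context at compute startup.
        return None
    return fetch_metadata(volume_id)


# Usage with fakes: the remote call is only made when no secret exists.
remote_calls = []


def fake_fetch(volume_id):
    remote_calls.append(volume_id)
    return {"provider": "luks"}


reused = encryption_metadata_for_attach("vol-1", lambda v: True, fake_fetch)
fetched = encryption_metadata_for_attach("vol-2", lambda v: False, fake_fetch)
print(reused, fetched, remote_calls)
```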

Comment 7 Lee Yarwood 2020-07-16 19:45:45 UTC

I'm going to close this out as WONTFIX as the real solution here is to use service tokens which are enabled by default from OSP 16.0 onwards downstream.

Outside of that in OSP 13 users will need to manually start instances using encrypted volumes after the compute has restarted in order for n-cpu to fetch encryption keys from b-api.

Comment 8 Lee Yarwood 2020-10-09 10:01:28 UTC

(In reply to Lee Yarwood from comment #7)
> I'm going to close this out as WONTFIX as the real solution here is to use
> service tokens which are enabled by default from OSP 16.0 onwards downstream.
> 
> Outside of that in OSP 13 users will need to manually start instances using
> encrypted volumes after the compute has restarted in order for n-cpu to
> fetch encryption keys from b-api.

I'm reopening this bug because related issues have been raised upstream, even with the policy changes and service tokens enabled as discussed here.

Comment 15 Steve Relf 2021-04-08 07:27:56 UTC

Any update on this backport to OSP13?

Would be great to be able to get my instances to auto restart on a hypervisor crash.

Happy to test some stuff if needed, I have a dedicated OSP13 test platform.

Comment 16 Lee Yarwood 2021-04-08 08:38:02 UTC

(In reply to Steve Relf from comment #15)
> Any update on this backport to OSP13?
> 
> Would be great to be able to get my instances to auto restart on a
> hypervisor crash.
> 
> Happy to test some stuff if needed, I have a dedicated OSP13 test platform.

Hey Steve, 

This was fixed and released as part of OSP 13 z15 via bug #1905017 (linked in the blocks field).

Please let me know if you have any additional issues with that version of the fix in that bug.

Regards,

Lee

Comment 25 errata-xmlrpc 2022-09-21 12:07:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543
