1253033 – User tries to create snapshot of instance and it stays queued in saving state.

Summary: User tries to create snapshot of instance and it stays queued in saving state.

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-glance
Sub Component:
Version:	5.0 (RHEL 7)
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	5.0 (RHEL 7)
Assignee:	Flavio Percoco
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2015-08-12 19:09 UTC by Jeremy
Modified:	2022-05-16 13:22 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-12-15 13:45:29 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-15243	0	None	None	None	2022-05-16 13:22:10 UTC

Description Jeremy 2015-08-12 19:09:31 UTC

Description of problem:  User tries to create snapshot of instance and it stays queued in saving state. 


Version-Release number of selected component (if applicable):  RHOS 5.0 on RHEL 7


How reproducible: 100%


Steps to Reproduce:
1. Try to create a snapshot of an instance.
2. Note that the image stays in the saving state for days.

Actual results:
Snapshot creation stays in the saving state for days.

Expected results:
The snapshot should be created and be in the active state after completion.

Additional Info:

attachments are here: http://collab-shell.usersys.redhat.com/01486503/

Comment 3 Flavio Percoco 2015-08-13 08:25:47 UTC

Hey Jeremy,

Thanks for the report. I'll need a couple of things from you in order to be able to debug this issue properly:

1) Set the services to debug=True (at the very least nova and glance have to be in debug mode)

2) Get the logs from the compute nodes as well.

Thanks

Comment 5 Sergey Gotliv 2015-08-25 11:41:22 UTC

Dave/Jeremy,

Any updates from the customer? According to the bug description, the case is 100% reproducible, so why can't they upload the relevant logs?

Comment 6 Sergey Gotliv 2015-09-10 10:01:16 UTC

We are still missing the logs requested a month ago. Please reopen the case if the issue still happens.

Comment 8 Flavio Percoco 2015-09-22 09:38:16 UTC

Jeremy,

I've checked the configs and logs. The former seem to be OK, but the latter do not contain debug info. Was glance restarted after updating the config files?

Comment 9 Jeremy 2015-09-23 13:48:47 UTC

Hello,
I instructed the customer to restart glance after enabling debugging. Now we have logs with debug enabled.

 http://collab-shell.usersys.redhat.com/01486503/

Info on the tests:

Today, 2015-09-23, I made two tests:

- First test: Based on the instance 'testsnapshot1' (details in testsnapshot1_details.pdf and testsnapshot1_details.png), I created a snapshot 'snap150923'.
  - 10:10 -> snap150923_initial.png (instance 'testsnapshot1' shut down)
  - 11:30: snapshot 'snap150923' has status deleted (but the 'delete image' button is still visible in the interface) -> snap150923_deleted.png

- Second test: Based on the instance 'ls' (details in ls_details.png and ls_details.pdf), I created a snapshot 'snap230915'.
  - 11:34: create snapshot 'snap230915' from another image
    -> 02.ls_Instance.png (instance 'ls' shut down)
    -> snap230915_initial.png (snapshot taken, state queued)
  - 12:15: snapshot 'snap230915' has status active -> snap230915_active.png

Notes:
- All instances are based on the same flavor (R7ApplicationServer) with a 130 GB hard disk
- All images have been created by me on the same computer, and uploaded by me.
- Instance 'testsnapshot1' is running on compute node 'oscomp3'
- Instance 'ls' is running on compute node 'oscomp4'

Attached are the sosreports of the 3 controller nodes and two compute nodes (oscomp3 and oscomp4)

Comment 10 Flavio Percoco 2015-09-23 16:26:06 UTC

Jeremy,

I tracked down the issue and the snapshot creation is failing with a PermissionError. This leads me to think that the `images` user doesn't have access to the `volumes` pool. This causes the rbd driver in glance to fail reading the snapshot and therefore the image creation fails as well.

You can verify the permissions of the `images` user on the `volumes` pool by checking the `ceph auth list` or by simply trying to list volumes in the `volumes` pool: `rbd --id images -p volumes ls`

The `images` rbd user must have read access to the `volumes` pool. I'd recommend giving it write access as well.
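
The listing check suggested above can be scripted so it is easy to repeat on each node. The following is only a minimal sketch: the helper names `build_rbd_ls_command` and `can_list_pool` and the 60-second timeout are assumptions made for illustration, and a successful listing only suggests read access to the pool, not write access.

```python
import subprocess


def build_rbd_ls_command(user="images", pool="volumes"):
    """Return the argv for the check from this comment: rbd --id <user> -p <pool> ls."""
    return ["rbd", "--id", user, "-p", pool, "ls"]


def can_list_pool(user="images", pool="volumes", run=subprocess.run):
    """Try to list the pool as the given rbd user.

    Returns (ok, stderr_text). `run` is injectable so the helper can be
    exercised without a Ceph cluster. The 60 s timeout is an arbitrary
    assumption, not a recommended value.
    """
    try:
        result = run(
            build_rbd_ls_command(user, pool),
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # rbd is not installed on this host, or the cluster did not answer in time.
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()
```

Calling `can_list_pool()` on a host that has the `images` keyring should return `(True, "")` if the user can read the `volumes` pool; otherwise the second element carries whatever error text `rbd` printed.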

Comment 11 Jeremy 2015-09-24 13:54:25 UTC

Hello, the customer is able to create small snapshots (40G) but can't do large snapshots. So this doesn't seem to be a permission problem for them. This problem only occurs with large snapshots.

Comment 14 Flavio Percoco 2015-09-25 10:25:14 UTC

Jeremy,

I'll take another look at this but it'd also be helpful if you'd provide more info about the environment.

For example, instead of providing an image name "ls" (which is hard to grep in logs), it'd be useful to have the IDs of the instances the snapshot creation is failing for and, if available, the IDs of the images resulting for those operations.

PNGs are nice to have but it's certainly better to have that information on the bug.

Comment 15 Flavio Percoco 2015-09-25 15:52:53 UTC

I checked the logs again and it's possible that David is correct and this is being caused by an invalid token:

./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:26:17.598 8020 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:26:39.169 8018 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:27:19.405 8024 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:27:19.761 8016 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500


Increasing the token expiration timeout should help,
Thanks, David
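
To see whether these token validation errors line up with the time window of a failing snapshot, the matching log lines can be pulled out with their timestamps. This is a rough sketch that assumes the line layout shown in the excerpt above; the function name `find_token_errors` is hypothetical.

```python
import re
from datetime import datetime

# Matches the keystonemiddleware error lines quoted in this comment.
# search() is used below so a leading "path/to/api.log:" grep prefix is tolerated.
TOKEN_ERROR = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \d+ ERROR "
    r"keystonemiddleware\.auth_token .*"
    r"Bad response code while validating token: (?P<code>\d+)"
)


def find_token_errors(lines):
    """Return a list of (timestamp, http_code) for token validation errors."""
    hits = []
    for line in lines:
        match = TOKEN_ERROR.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            hits.append((ts, int(match.group("code"))))
    return hits
```

Passing an open glance `api.log` file object to `find_token_errors` yields the timestamps, which can then be compared against the times at which the snapshot was started and at which it stalled.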

Comment 17 Jeremy 2015-10-08 13:41:42 UTC

Changing the token expiration does not seem to fix the issue, although I no longer see token expiration messages.

Comment 29 John Chronister 2015-10-29 08:23:45 UTC

What additional information would be helpful?  I will attempt to get more information.

-chron

Comment 30 Flavio Percoco 2015-11-09 14:03:23 UTC

Hey John,

At this point, I think we need to try to replicate the issue in-house. 

That said, I think it'd be super useful to get fresh logs and the list of glance, ceph and nova packages the customer is currently using.

Comment 31 Flavio Percoco 2015-11-09 14:05:28 UTC

Also,

Let's make sure all the glance-api and nova-compute nodes are in debug mode.

Set this in the config files:

debug=True
verbose=True
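
Since an earlier set of logs arrived without debug output, it may help to confirm the two flags on each node before collecting logs again. The following is a small read-only sketch, assuming both options live in the `[DEFAULT]` section of the service's config file; the function name `logging_flags` is hypothetical. It only reads the file contents, so the service still has to be restarted for the setting to take effect.

```python
import configparser


def logging_flags(config_text):
    """Report whether debug and verbose are enabled in [DEFAULT].

    interpolation=None keeps '%' characters in values from being
    interpreted, and strict=False tolerates repeated options or sections.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    return {
        name: parser.getboolean("DEFAULT", name, fallback=False)
        for name in ("debug", "verbose")
    }
```

For example, reading a node's glance or nova config file into a string and passing it to `logging_flags` returns a dict with both flags, where `False` means the option is missing, commented out, or disabled.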

Comment 37 nlevinki 2015-12-15 13:45:29 UTC

Not relevant anymore

Comment 38 ing.mohamed.miladi@gmail.com 2022-05-16 13:15:11 UTC

(In reply to Jeremy from comment #11)
> Hello, the customer is able to create small snapshots (40G) but can't do
> large snapshots. So this doesn't seem to be a permission problem for them.
> This problem only occurs with large snapshots.


Hello,
I face the same problem on our platform: we are able to create small snapshots (40G) but can't create
large snapshots. Any advice on how to solve this problem would be appreciated.
