Bug 1253033

Summary: User tries to create snapshot of instance and it stays queued in saving state.
Product: Red Hat OpenStack Reporter: Jeremy <jmelvin>
Component: openstack-glance Assignee: Flavio Percoco <fpercoco>
Status: CLOSED DEFERRED QA Contact: nlevinki <nlevinki>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 5.0 (RHEL 7) CC: dhill, dmaley, eglynn, fpercoco, ing.mohamed.miladi, jchronis, jmelvin, jobernar, jwaterwo, sgotliv, yeylon
Target Milestone: --- Keywords: Reopened, Unconfirmed, ZStream
Target Release: 5.0 (RHEL 7)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-15 13:45:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jeremy 2015-08-12 19:09:31 UTC
Description of problem:  User tries to create snapshot of instance and it stays queued in saving state. 


Version-Release number of selected component (if applicable):  rhos 5.0 on rhel7


How reproducible: 100%


Steps to Reproduce:
1. try to create snapshot of an instance
2. note image stays in saving state for days

Actual results:
snapshot creation stays in saving state for days

Expected results:
snapshot should be created and in active state after completion

Additional Info:

attachments are here: http://collab-shell.usersys.redhat.com/01486503/

Comment 3 Flavio Percoco 2015-08-13 08:25:47 UTC
Hey Jeremy,

Thanks for the report. I'll need a couple of things from you in order to be able to debug this issue properly:

1) Set the services to debug=True (at the very least nova and glance have to be in debug mode)

2) Get the logs from the compute nodes as well.

Thanks

Comment 5 Sergey Gotliv 2015-08-25 11:41:22 UTC
Dave/Jeremy,

Any updates from the customer? According to the bug description the case is 100% reproducible, so why can't they upload the relevant logs?

Comment 6 Sergey Gotliv 2015-09-10 10:01:16 UTC
We are still missing the logs requested a month ago. Please reopen the case if the issue still happens.

Comment 8 Flavio Percoco 2015-09-22 09:38:16 UTC
Jeremy,

I've checked the configs and logs. The former seem to be OK, but the latter do not contain debug info. Was glance restarted after updating the config files?

Comment 9 Jeremy 2015-09-23 13:48:47 UTC
Hello,
I instructed the customer to restart glance after enabling debugging. Now we have logs with debug enabled.

 http://collab-shell.usersys.redhat.com/01486503/

Info on the tests:

Today, 2015-09-23, I made two tests:

- First test: Based on the instance 'testsnapshot1' (details in testsnapshot1_details.pdf and testsnapshot1_details.png), I've created a snapshot 'snap150923'
  - 10:10 -> snap150923_initial.png (instance 'testsnapshot1' shut down)
  - 11:30: snapshot 'snap150923' - status deleted (but the 'delete image' button is still visible in the interface)
-> snap150923_deleted.png

- Second test: Based on the instance 'ls' (details in ls_details.png and ls_details.pdf), I've created a snapshot 'snap230915'
- 11:34: create snapshot 'snap230915' from another image 
-> 02.ls_Instance.png (instance 'ls'  shut down)
-> snap230915_initial.png (snapshot taken, state queued)

- 12:15: snapshot 'snap230915' with status active
-> snap230915_active.png

Notes:
- All instances are based on the same flavor (R7ApplicationServer) with a 130 GB hard disk
- All images have been created by me on the same computer, and uploaded by me.
- Instance 'testsnapshot1' is running on compute node 'oscomp3'
- Instance 'ls' is running on compute node 'oscomp4'

Attached are the sosreports of the 3 controller nodes and two compute nodes (oscomp3 and oscomp4)

Comment 10 Flavio Percoco 2015-09-23 16:26:06 UTC
Jeremy,

I tracked down the issue and the snapshot creation is failing with a PermissionError. This leads me to think that the `images` user doesn't have access to the `volumes` pool. This causes the rbd driver in glance to fail when reading the snapshot and, therefore, the image creation fails as well.

You can verify the permissions of the `images` user on the `volumes` pool by checking the `ceph auth list` or by simply trying to list volumes in the `volumes` pool: `rbd --id images -p volumes ls`

The `images` rbd user must have read access to the `volumes` pool. I'd recommend giving it write access as well.
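
As a rough illustration of the check described above, the sketch below reads an OSD capability string (as copied by hand from `ceph auth list`) and reports which permissions it grants on a given pool. The function name `pool_permissions` and the sample capability string are hypothetical, and the parser is deliberately simplified: it only understands grants of the form `allow <perms>` or `allow <perms> pool=<name>`, not profiles or namespaces.

```python
def pool_permissions(osd_caps: str, pool: str) -> str:
    """Return the r/w/x letters an OSD caps string grants on `pool`.

    Simplified sketch: handles only comma-separated grants written as
    'allow <perms>' or 'allow <perms> pool=<name>'.
    """
    granted = set()
    for grant in osd_caps.split(","):
        words = grant.split()
        if len(words) < 2 or words[0] != "allow":
            continue
        perms = "rwx" if words[1] == "*" else words[1]
        if not set(perms) <= set("rwx"):
            continue  # e.g. 'class-read', which is not a plain r/w/x grant
        pools = [w.split("=", 1)[1] for w in words[2:] if w.startswith("pool=")]
        # A grant without a pool= restriction applies to every pool.
        if not pools or pool in pools:
            granted |= set(perms)
    return "".join(p for p in "rwx" if p in granted)


# Hypothetical caps for the `images` user, with no grant on `volumes`.
caps = "allow class-read object_prefix rbd_children, allow rwx pool=images"
print(pool_permissions(caps, "images"))   # rwx
print(pool_permissions(caps, "volumes"))  # empty string: no access
```

An empty result for `volumes` would match the PermissionError scenario; the `rbd --id images -p volumes ls` command remains the authoritative check.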

Comment 11 Jeremy 2015-09-24 13:54:25 UTC
Hello, the customer is able to create small snapshots (40G) but can't create large ones. So this doesn't seem to be a permission problem for them. The problem only occurs with large snapshots.

Comment 14 Flavio Percoco 2015-09-25 10:25:14 UTC
Jeremy,

I'll take another look at this but it'd also be helpful if you'd provide more info about the environment.

For example, instead of providing an image name "ls" (which is hard to grep in logs), it'd be useful to have the IDs of the instances the snapshot creation is failing for and, if available, the IDs of the images resulting from those operations.

PNGs are nice to have but it's certainly better to have that information on the bug.

Comment 15 Flavio Percoco 2015-09-25 15:52:53 UTC
I checked the logs again and it's possible that David is correct and this is being caused by an invalid token:

./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:26:17.598 8020 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:26:39.169 8018 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:27:19.405 8024 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:27:19.761 8016 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500


Increasing the token expiration timeout should help,
Thanks, David
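
To judge whether the token validation errors quoted above coincide with a failing snapshot, the matching lines can be tallied per minute. This is a minimal sketch that assumes the log line format shown in this comment; `count_token_errors` is a hypothetical helper, not part of glance or keystonemiddleware.

```python
import re
from collections import Counter

# Captures the timestamp (truncated to the minute) and the HTTP status code
# from a keystonemiddleware token-validation error line.
_TOKEN_ERROR = re.compile(
    r"(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ \d+ ERROR "
    r"keystonemiddleware\.auth_token .*"
    r"Bad response code while validating token: (?P<code>\d+)"
)


def count_token_errors(lines):
    """Count token-validation failures per (minute, status code)."""
    counts = Counter()
    for line in lines:
        match = _TOKEN_ERROR.search(line)
        if match:
            counts[(match["minute"], match["code"])] += 1
    return counts
```

Feeding it the lines of `api.log` (for example `count_token_errors(open(path))`, where `path` is the log file) gives a count per minute that can be compared with the time the snapshot was requested.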

Comment 17 Jeremy 2015-10-08 13:41:42 UTC
Changing the token expiration does not seem to fix the issue, although I no longer see token expiration messages.

Comment 29 John Chronister 2015-10-29 08:23:45 UTC
What additional information would be helpful?  I will attempt to get more information.

-chron

Comment 30 Flavio Percoco 2015-11-09 14:03:23 UTC
Hey John,

At this point, I think we need to try to replicate the issue in-house. 

That said, I think it'd be super useful to get fresh logs and the list of glance, ceph and nova packages the customer is currently using.

Comment 31 Flavio Percoco 2015-11-09 14:05:28 UTC
Also,

Let's make sure all the glance-api and nova-compute nodes are in debug mode.

Set this in the config files:

debug=True
verbose=True
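
Since an earlier round of logs came back without debug output, it may help to confirm the setting on each node before collecting logs again. The sketch below is one way to do that, assuming the option lives in the `[DEFAULT]` section of the service config file (as it does for oslo.config based services); `debug_enabled` is a hypothetical helper name.

```python
import configparser


def debug_enabled(conf_text: str) -> bool:
    """Report whether debug is switched on in the [DEFAULT] section."""
    # interpolation=None: values may contain '%' characters.
    # strict=False: tolerate options that are repeated in the file.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getboolean("DEFAULT", "debug", fallback=False)


print(debug_enabled("[DEFAULT]\ndebug=True\nverbose=True\n"))  # True
```

This only inspects the file contents; the service still has to be restarted for the change to take effect.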

Comment 37 nlevinki 2015-12-15 13:45:29 UTC
Not relevant anymore

Comment 38 ing.mohamed.miladi@gmail.com 2022-05-16 13:15:11 UTC
(In reply to Jeremy from comment #11)
> Hello, the customer is able to create small snapshots (40G) but can't do
> large snapshots. So this doesn't seem to be a permission problem for them.
> This problem only occurs with large snapshots.


Hello,
I face the same problem on our platform: we are able to create small snapshots (40G) but can't create large ones. Any advice on how to solve this problem would be appreciated.