Red Hat Bugzilla – Bug 1253033
User tries to create snapshot of instance and it stays queued in saving state.
Last modified: 2016-04-26 18:58:37 EDT
Description of problem: User tries to create snapshot of instance and it stays queued in saving state.
Version-Release number of selected component (if applicable): RHOS 5.0 on RHEL 7
How reproducible: 100%
Steps to Reproduce:
1. try to create snapshot of an instance
2. note image stays in saving state for days
Actual results: snapshot creation stays in saving state for days
Expected results: snapshot should be created and in active state after completion
attachments are here: http://collab-shell.usersys.redhat.com/01486503/
Thanks for the report. I'll need a couple of things from you in order to be able to debug this issue properly:
1) Set the services to debug=True (at the very least nova and glance have to be in debug mode)
2) Get the logs from the compute nodes as well.
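For reference, a minimal sketch of the debug stanza (assuming the default RHEL 7 locations, /etc/nova/nova.conf and /etc/glance/glance-api.conf):

```ini
[DEFAULT]
# Enable verbose and debug logging; the services must be restarted
# afterwards for the change to take effect.
debug = True
verbose = True
```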
Any updates from the customer? According to the bug description the issue is 100% reproducible, so why can't they upload the relevant logs?
We are still missing the logs requested a month ago. Please reopen this bug if the issue still happens.
I've checked the configs and logs. The former seem to be OK, but the latter do not contain debug info. Was glance restarted after updating the config files?
I instructed the customer to restart glance after enabling debugging. Now we have logs with debug enabled.
Info on the tests:
Today, 2015-09-23, I made two tests:
- First test: Based on the instance 'testsnapshot1' (details in testsnapshot1_details.pdf and testsnapshot1_details.png), I've created a snapshot 'snap150923'
- 10:10 -> snap150923_initial.png (instance 'testsnapshot1' shut down)
- 11:30: snapshot 'snap150923' - status deleted (but the 'delete image' button is still visible in the interface)
- Second test: Based on the instance 'ls' (details in ls_details.png and ls_details.pdf), I've created a snapshot 'snap230915'
- 11:34: create snapshot 'snap230915' from another image
-> 02.ls_Instance.png (instance 'ls' shut down)
-> snap230915_initial.png (snapshot taken, state queued)
- 12:15: snapshot 'snap230915' with status active
- All instances are based on the same flavor (R7ApplicationServer) with a 130 GB hard disk
- All images have been created by me on the same computer, and uploaded by me.
- Instance 'testsnapshot1' is running on compute node 'oscomp3'
- Instance 'ls' is running on compute node 'oscomp4'
Attached are the sosreports of the three controller nodes and the two compute nodes (oscomp3 and oscomp4).
I tracked down the issue: the snapshot creation is failing with a PermissionError. This leads me to think that the `images` user doesn't have access to the `volumes` pool. This causes the rbd driver in glance to fail reading the snapshot and, therefore, the image creation fails as well.
You can verify the permissions of the `images` user on the `volumes` pool by checking the output of `ceph auth list`, or by simply trying to list volumes in the `volumes` pool: `rbd --id images -p volumes ls`
The `images` rbd user must have read access to the `volumes` pool. I'd recommend giving it write access as well.
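A sketch of how that could be checked and fixed with the Ceph CLI, assuming the glance client is named `client.images` and the stock pool layout described above; the exact names and existing caps should be confirmed with `ceph auth list` first, because `ceph auth caps` replaces the whole cap set rather than appending to it:

```shell
# Show the current caps of the (assumed) glance client.
ceph auth get client.images

# Grant read access to the volumes pool while keeping the usual caps
# on the images pool (use rwx on volumes if write access is wanted).
ceph auth caps client.images \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=images, allow r pool=volumes'

# Re-test: this should now list volumes instead of failing with a
# permission error.
rbd --id images -p volumes ls
```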
Hello, the customer is able to create small snapshots (40 GB) but large snapshots fail, so this doesn't seem to be a permission problem for them; the problem only occurs with large snapshots.
I'll take another look at this but it'd also be helpful if you'd provide more info about the environment.
For example, instead of providing an image name like "ls" (which is hard to grep in logs), it'd be useful to have the IDs of the instances for which snapshot creation is failing and, if available, the IDs of the images resulting from those operations.
PNGs are nice to have but it's certainly better to have that information on the bug.
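To illustrate the grep problem with a short name, here is a toy example (the log lines and the UUID below are invented, not taken from the customer's logs):

```shell
# Create a tiny sample log; the instance UUID is made up.
cat > /tmp/sample_nova.log <<'EOF'
INFO nova.compute.manager [instance: 9b2f1c3a-1111-2222-3333-444455556666] snapshot start
INFO nova.virt.libvirt executing ls -la inside the guest
INFO nova.compute.manager [instance: 9b2f1c3a-1111-2222-3333-444455556666] snapshot complete
EOF

# Grepping the name "ls" only hits an unrelated line here, and in real
# logs it matches countless substrings; grepping the UUID hits exactly
# the lines about the instance in question.
grep -wc 'ls' /tmp/sample_nova.log
grep -c '9b2f1c3a' /tmp/sample_nova.log
```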
I checked the logs again and it's possible that David is correct and this is being caused by failing token validation:
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:26:17.598 8020 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:26:39.169 8018 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:27:19.405 8024 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
./osctrl2.ecloud.siemens.de/var/log/glance/api.log:2015-09-23 10:27:19.761 8016 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 500
Increasing the token expiration timeout should help.
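If expiring tokens are the cause (a large snapshot upload can easily outlive a token, whose default lifetime is commonly one hour), the timeout is set in /etc/keystone/keystone.conf; the value below is only an example, and keystone has to be restarted afterwards:

```ini
[token]
# Token validity in seconds (example: 24 hours instead of the usual 3600).
expiration = 86400
```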
Changing the token expiration does not seem to fix the issue, although I no longer see token expiration messages.
What additional information would be helpful? I will attempt to get more information.
At this point, I think we need to try to replicate the issue in-house.
That said, I think it'd be super useful to get fresh logs and the list of glance, ceph and nova packages the customer is currently using.
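For the package list, something along these lines works on RHEL 7. The filter below runs against an invented sample list so the example is self-contained; on the customer's nodes the same filter would be fed from `rpm -qa`:

```shell
# Illustrative package names only; on a real node use:
#   rpm -qa | grep -E '^(openstack-glance|openstack-nova|ceph|python-rbd)' | sort
printf '%s\n' \
    openstack-nova-compute-2014.1.3-1.el7ost.noarch \
    openstack-glance-2014.1.3-1.el7ost.noarch \
    bash-4.2.46-12.el7.x86_64 \
    ceph-common-0.80.7-0.el7.x86_64 |
grep -E '^(openstack-glance|openstack-nova|ceph)' | sort
```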
Let's make sure all the glance-api and nova-compute nodes are in debug mode.
Set this in the config files:
Not relevant anymore