Bug 1503352 - Cinder backup on in-use volume from Ceph backend failure
Summary: Cinder backup on in-use volume from Ceph backend failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-os-brick
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Gorka Eguileor
QA Contact: Avi Avraham
Docs Contact: Don Domingo
URL:
Whiteboard:
Depends On: 1375207 1710946 1790752
Blocks:
Reported: 2017-10-17 23:34 UTC by James Biao
Modified: 2020-01-14 05:55 UTC (History)

Fixed In Version: python-os-brick-2.3.0-0.20180211233135.7dd2076.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:37:31 UTC
Target Upstream Version:
lkuchlan: automate_bug+


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 476503 None MERGED Fix ceph incremental backup fail 2020-03-27 19:04:40 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:38:31 UTC

Description James Biao 2017-10-17 23:34:15 UTC
Description of problem:

This is a follow-up query from the customer on Bug 1501637.

The upstream Gerrit change https://review.openstack.org/#/c/476503/ has been tried, and the customer has the following further findings and enquiries:

The customer needs to understand the logic of the current Cinder backup service on Ceph. Based on the code in cinder/backup/manager.py, cinder/volume/driver.py, and cinder/backup/drivers/ceph.py, backing up an in-use volume won't work properly.

In this section of code,
/usr/lib/python2.7/site-packages/cinder/backup/manager.py:        backup_dic = self.volume_rpcapi.get_backup_device(context,

Cinder asks the volume driver/service to return a device object for the volume backup request. This device object is used later in cinder/backup/drivers/ceph.py at,

def _backup_rbd(self, backup_id, volume_id, volume_file, volume_name, length):

where volume_file is the object returned from get_backup_device(). A snapshot of this volume is then created by the following code (lines 597 and 643 of ceph.py):

    source_rbd_image = volume_file.rbd_image
    .....
    source_rbd_image.create_snap(new_snap)

So by the logic of ceph.py, it really should just use the original source volume that the user asked to back up, and the snapshot would then be used for "rbd export-diff". Instead, when the volume is in "in-use" status, the get_backup_device() call to the Cinder volume service creates a snap-clone of the original Ceph volume and returns the snap-clone's object handle, which creates two problems:

(a) rbd export-diff in ceph.py gets the source volume mixed up: it tries to use the original Cinder volume as the source of the snapshot in CMD1, but the snapshot is nowhere to be found on the original Cinder volume, because the new_snap creation call above is actually run against the snap-clone volume. As a result, the differential backup fails.

(b) When the above step fails, the code path in ceph.py falls back to a full backup with a brute-force (block-by-block) copy from the true original Cinder volume to the backup Ceph volume. For an active ("in-use") volume, the result is clearly not crash-consistent, yet the tenant gets no warning and is left with the impression that the backup fully succeeded. This should not be the case.

If non-disruptive backup mode is offered as a service under Newton, this will have to be fixed.
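
The mismatch described in (a) and (b) can be illustrated with a small toy model. This is only a sketch of the reported behaviour, not the actual Cinder code: FakeRBDImage, the simplified get_backup_device() and backup() below, and the ".snap-clone" and "backup.snap" names are all hypothetical stand-ins introduced for illustration.

```python
class FakeRBDImage:
    """Toy stand-in for an RBD image handle; not the real rbd.Image API."""

    def __init__(self, name):
        self.name = name
        self.snaps = []

    def create_snap(self, snap_name):
        self.snaps.append(snap_name)


def get_backup_device(volume, in_use):
    """Model of the reported behaviour: an in-use volume yields a temporary clone."""
    if in_use:
        return FakeRBDImage(volume.name + ".snap-clone")
    return volume


def backup(volume, in_use, new_snap="backup.snap"):
    """Return 'incremental' if the diff source snapshot is found, else 'full'."""
    source_rbd_image = get_backup_device(volume, in_use)
    # The snapshot lands on whatever handle get_backup_device() returned.
    source_rbd_image.create_snap(new_snap)
    # The diff step then looks for that snapshot on the ORIGINAL volume.
    if new_snap in volume.snaps:
        return "incremental"
    # Fallback described in (b): a full copy, with no warning to the tenant.
    return "full"


available = backup(FakeRBDImage("volume-a"), in_use=False)
attached = backup(FakeRBDImage("volume-b"), in_use=True)
print(available, attached)
```

In this model the available volume takes the differential path, while the attached volume silently drops to the full-copy path because the snapshot was created on the clone rather than on the volume that the diff step inspects.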








Version-Release number of selected component (if applicable):

OSP 10 

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 James Biao 2017-10-23 13:05:00 UTC
On OSP 11 I was able to reproduce this as follows. In general, the "incremental" backup is not really an increment but a full copy of the volume.

1. This is the volume to be attached
[stack@instack ~]$ openstack volume list
+--------------------------------------+--------------+-----------+------+-------------+
| ID                                   | Display Name | Status    | Size | Attached to |
+--------------------------------------+--------------+-----------+------+-------------+
| 1d12eb29-08bd-4457-98fe-0debf8dbcf59 | backupvol    | available |   10 |             |
+--------------------------------------+--------------+-----------+------+-------------+

2. Attaching it to my instance

[stack@instack ~]$ openstack server add volume rhel7-volume-backup backupvol
[stack@instack ~]$ openstack volume list
+--------------------------------------+--------------+--------+------+----------------------------------------------+
| ID                                   | Display Name | Status | Size | Attached to                                  |
+--------------------------------------+--------------+--------+------+----------------------------------------------+
| 1d12eb29-08bd-4457-98fe-0debf8dbcf59 | backupvol    | in-use |   10 | Attached to rhel7-volume-backup on /dev/vdd  |
+--------------------------------------+--------------+--------+------+----------------------------------------------+

3. Logged in to the instance, ran mkfs on the volume and mounted it. Copied a file to the mount directory

4. Create backup

[stack@instack ~]$ cinder backup-create backupvol --force
+-----------+--------------------------------------+
| Property  | Value                                |
+-----------+--------------------------------------+
| id        | 9eed827c-4100-4b20-b520-8bab2d769521 |
| name      | None                                 |
| volume_id | 1d12eb29-08bd-4457-98fe-0debf8dbcf59 |
+-----------+--------------------------------------+

5. Checking on the Ceph side
[root@overcloud-controller-0 ~]# rbd -p backups ls
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521

[root@overcloud-controller-0 ~]# rbd -p backups info volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521
rbd image 'volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.26e4977099a5b
	format: 2
	features: layering, striping
	flags: 
	stripe unit: 4096 kB
	stripe count: 1

6. Log back in to the instance and add another file to the mount directory

7. Create incremental backup

[stack@instack ~]$ cinder backup-create backupvol --force --incremental
+-----------+--------------------------------------+
| Property  | Value                                |
+-----------+--------------------------------------+
| id        | 2b72b678-4f5b-4c0c-a0e5-3dcb3c487210 |
| name      | None                                 |
| volume_id | 1d12eb29-08bd-4457-98fe-0debf8dbcf59 |
+-----------+--------------------------------------+

8. Check on the Ceph side; we can see that the "incremental" backup is 10G in size.

[root@overcloud-controller-0 ~]# rbd -p backups ls
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521

[root@overcloud-controller-0 ~]# rbd -p backups info volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210
rbd image 'volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.26e82583ef15d
	format: 2
	features: layering, striping
	flags: 
	stripe unit: 4096 kB
	stripe count: 1

9. None of the backup RBD images has a snapshot

[root@overcloud-controller-0 ~]# rbd -p backups snap ls volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521
[root@overcloud-controller-0 ~]# rbd -p backups snap ls volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210
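
The observations in steps 8 and 9 can be checked with a short script. This is a rough sketch that works only on text copied from the output above. It assumes the naming convention of Cinder's Ceph backup driver, in which diff-based backups are kept as snapshots on an image named volume-<id>.backup.base; that convention is not stated in this report. The helper names are hypothetical. Note also that the size printed by rbd info is the provisioned size of the image, so it does not by itself show how much data was written.

```python
import re

# Image names copied from the `rbd -p backups ls` output in step 8.
BACKUP_IMAGES = [
    "volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210",
    "volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521",
]

# Fragment of the `rbd info` output for the "incremental" backup image.
RBD_INFO = "size 10240 MB in 2560 objects\norder 22 (4096 kB objects)\n"


def is_diff_base(image_name):
    """Assumed convention: diff-based backups live under '<volume>.backup.base'."""
    return image_name.endswith(".backup.base")


def provisioned_mb(info_text):
    """Return (size reported by rbd info, object count * object size), both in MB."""
    size_match = re.search(r"size (\d+) MB in (\d+) objects", info_text)
    size_mb, objects = int(size_match.group(1)), int(size_match.group(2))
    object_kb = int(re.search(r"\((\d+) kB objects\)", info_text).group(1))
    return size_mb, objects * object_kb // 1024


# Images that are not a diff base are standalone copies, one per backup ID.
full_copies = [name for name in BACKUP_IMAGES if not is_diff_base(name)]
reported, computed = provisioned_mb(RBD_INFO)
print(len(full_copies), reported, computed)
```

Under that assumption, both backups are standalone images with no base image and no snapshots, which is consistent with the second backup having taken the full-copy path.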

Comment 17 errata-xmlrpc 2018-06-27 13:37:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

