1503352 – Cinder backup on in-use volume from Ceph backend failure

Summary: Cinder backup on in-use volume from Ceph backend failure

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	python-os-brick
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	beta
Target Release:	13.0 (Queens)
Assignee:	Gorka Eguileor
QA Contact:	Avi Avraham
Docs Contact:	Don Domingo
URL:
Whiteboard:
Depends On:	1375207 1710946 1790752
Blocks:

Reported:	2017-10-17 23:34 UTC by James Biao
Modified:	2022-08-09 13:45 UTC
CC List:	10 users
Fixed In Version:	python-os-brick-2.3.0-0.20180211233135.7dd2076.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-06-27 13:37:31 UTC
Target Upstream Version:
Embargoed:
Flags:	lkuchlan: automate_bug+

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	476503	None	MERGED	Fix ceph incremental backup fail	2020-10-12 06:16:47 UTC
Red Hat Issue Tracker	OSP-8656	None	None	None	2022-08-09 13:45:11 UTC
Red Hat Product Errata	RHEA-2018:2086	None	None	None	2018-06-27 13:38:31 UTC

Description James Biao 2017-10-17 23:34:15 UTC

Description of problem:

This is a follow-up query from the customer on Bug 1501637.

The upstream Gerrit change https://review.openstack.org/#/c/476503/ has been tried, and the customer has the following further findings and enquiries:

The customer needs to understand the logic of the current cinder backup service on Ceph. Based on the code in cinder/backup/manager.py, cinder/volume/driver.py and cinder/backup/drivers/ceph.py, backing up an in-use volume won't work properly.

In this section of code,
/usr/lib/python2.7/site-packages/cinder/backup/manager.py:        backup_dic = self.volume_rpcapi.get_backup_device(context,

Cinder asks the volume driver/service to return a device object for the volume backup request. This device object is used later in cinder/backup/drivers/ceph.py at,

def _backup_rbd(self, backup_id, volume_id, volume_file, volume_name, length):

where volume_file is the object returned from get_backup_device(). A snapshot is then created for this volume by the following code,

        source_rbd_image = volume_file.rbd_image        # ceph.py line 597
.....
        source_rbd_image.create_snap(new_snap)          # ceph.py line 643

So by the logic of ceph.py, the backup should simply use the original source volume that the user asked to back up, and the snapshot is then used for "rbd export-diff". Instead, when the volume is in "in-use" status, the get_backup_device() call to the cinder volume service creates a snap-clone of the original Ceph volume and returns the snap-clone's object handle, which creates two problems,

(a) rbd export-diff in ceph.py gets the source volume mixed up: it tries to use the original cinder volume as the source of the snapshot in CMD1, but the snapshot is nowhere to be found on the original cinder volume, because the new_snap creation call above is actually run against the snap-clone volume. As a result, the differential backup fails.

(b) When the above step fails, the code path in ceph.py falls back to a full backup with a brute-force (block-by-block) copy from the true original cinder volume to the backup Ceph volume. For an active ("in-use") volume this is clearly not a crash-consistent copy, yet the tenant gets no warning and is left with the impression that the backup fully succeeded. This should not be the case.

If a non-disruptive backup mode is offered as a service under Newton, this will have to be fixed.
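
The failure mode described in (a) and (b) can be illustrated with a toy in-memory model. This is only a sketch of the reported control flow, not Cinder code: the Image class and the get_backup_device(), export_diff() and backup() functions below are hypothetical stand-ins that borrow names from the description, and no real Cinder or librbd call is made.

```python
# Toy model of the reported behaviour; nothing here is a Cinder or librbd API.


class Image:
    """Stand-in for an RBD image: a name plus the snapshots taken on it."""

    def __init__(self, name):
        self.name = name
        self.snaps = set()

    def create_snap(self, snap_name):
        self.snaps.add(snap_name)


def get_backup_device(volume, in_use):
    """Return the image handed to the backup driver.

    Per the report, an in-use volume yields a temporary snap-clone
    instead of the original volume.
    """
    if in_use:
        return Image("clone-of-" + volume.name)
    return volume


def export_diff(source, snap_name):
    """A differential export needs the snapshot to exist on the source."""
    if snap_name not in source.snaps:
        raise LookupError(f"{snap_name} not found on {source.name}")
    return "incremental"


def backup(volume, in_use, snap_name="backup-snap"):
    device = get_backup_device(volume, in_use)
    device.create_snap(snap_name)  # snapshot lands on whatever was returned
    try:
        # The diff is requested against the ORIGINAL volume, as in (a).
        return export_diff(volume, snap_name)
    except LookupError:
        # Silent fallback to a block-by-block full copy, as in (b).
        return "full"


if __name__ == "__main__":
    print(backup(Image("volume-a"), in_use=False))
    print(backup(Image("volume-b"), in_use=True))
```

In this model the available volume gets a differential backup, while the in-use volume silently ends up with a full copy because the snapshot was created on the clone rather than on the volume that export_diff() inspects.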

Version-Release number of selected component (if applicable):

OSP 10 

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 James Biao 2017-10-23 13:05:00 UTC

On OSP 11 I was able to reproduce the problem as follows. In general, the "incremental" backup is not really incremental but a full copy of the volume.

1. This is the volume to be attached
[stack@instack ~]$ openstack volume list
+--------------------------------------+--------------+-----------+------+-------------+
| ID                                   | Display Name | Status    | Size | Attached to |
+--------------------------------------+--------------+-----------+------+-------------+
| 1d12eb29-08bd-4457-98fe-0debf8dbcf59 | backupvol    | available |   10 |             |
+--------------------------------------+--------------+-----------+------+-------------+

2. Attaching it to my instance

[stack@instack ~]$ openstack server add volume rhel7-volume-backup backupvol
[stack@instack ~]$ openstack volume list
+--------------------------------------+--------------+--------+------+----------------------------------------------+
| ID                                   | Display Name | Status | Size | Attached to                                  |
+--------------------------------------+--------------+--------+------+----------------------------------------------+
| 1d12eb29-08bd-4457-98fe-0debf8dbcf59 | backupvol    | in-use |   10 | Attached to rhel7-volume-backup on /dev/vdd  |
+--------------------------------------+--------------+--------+------+----------------------------------------------+

3. Logged in to the instance, ran mkfs on the volume and mounted it. Copied a file to the mount directory

4. Create backup

[stack@instack ~]$ cinder backup-create backupvol --force
+-----------+--------------------------------------+
| Property  | Value                                |
+-----------+--------------------------------------+
| id        | 9eed827c-4100-4b20-b520-8bab2d769521 |
| name      | None                                 |
| volume_id | 1d12eb29-08bd-4457-98fe-0debf8dbcf59 |
+-----------+--------------------------------------+

5. Checking on the Ceph side
[root@overcloud-controller-0 ~]# rbd -p backups ls
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521

[root@overcloud-controller-0 ~]# rbd -p backups info volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521
rbd image 'volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.26e4977099a5b
	format: 2
	features: layering, striping
	flags: 
	stripe unit: 4096 kB
	stripe count: 1

6. Log back in to the instance and add another file to the mount directory

7. Create incremental backup

[stack@instack ~]$ cinder backup-create backupvol --force --incremental
+-----------+--------------------------------------+
| Property  | Value                                |
+-----------+--------------------------------------+
| id        | 2b72b678-4f5b-4c0c-a0e5-3dcb3c487210 |
| name      | None                                 |
| volume_id | 1d12eb29-08bd-4457-98fe-0debf8dbcf59 |
+-----------+--------------------------------------+

8. Check on Ceph side, we can see that the "incremental" backup is 10G in size.

[root@overcloud-controller-0 ~]# rbd -p backups ls
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521

[root@overcloud-controller-0 ~]# rbd -p backups info volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210
rbd image 'volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210':
	size 10240 MB in 2560 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.26e82583ef15d
	format: 2
	features: layering, striping
	flags: 
	stripe unit: 4096 kB
	stripe count: 1

9. None of the backup RBD images has a snapshot

[root@overcloud-controller-0 ~]# rbd -p backups snap ls volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521
[root@overcloud-controller-0 ~]# rbd -p backups snap ls volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210
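
The check in steps 8 and 9 can be scripted against the text printed by "rbd -p backups ls". The sketch below rests on an assumption that this report does not state: that a diff-capable backup is kept on a shared image whose name ends in ".backup.base", while a full copy gets its own image whose name ends in ".backup.<backup id>". The function name full_copy_backups() is hypothetical, and the sample listing is the output shown in step 8.

```python
import re

# Assumed naming convention (not stated in this report): a full-copy backup
# image is named "volume-<volume uuid>.backup.<backup uuid>".
_FULL_COPY = re.compile(
    r"^volume-(?P<volume>[0-9a-f-]{36})\.backup\.(?P<backup>[0-9a-f-]{36})$"
)


def full_copy_backups(rbd_ls_output):
    """Return {backup_id: volume_id} for images that look like full copies."""
    found = {}
    for line in rbd_ls_output.splitlines():
        match = _FULL_COPY.match(line.strip())
        if match:
            found[match.group("backup")] = match.group("volume")
    return found


if __name__ == "__main__":
    # Listing copied from step 8 above.
    listing = """\
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.2b72b678-4f5b-4c0c-a0e5-3dcb3c487210
volume-1d12eb29-08bd-4457-98fe-0debf8dbcf59.backup.9eed827c-4100-4b20-b520-8bab2d769521
"""
    for backup_id, volume_id in full_copy_backups(listing).items():
        print(backup_id, "looks like a full copy of", volume_id)
```

If the assumption holds, both backups from this reproduction are reported as full copies, which matches the observation that the "incremental" backup has its own 10G image and no snapshots.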

Comment 15 lkuchlan 2018-05-24 14:14:12 UTC

Tested using:
python2-os-brick-2.3.1-1.el7ost.noarch

Automation result:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-storage-qe-13_director-rhel-virthost-3cont_2comp_1ceph-ipv4-vxlan-qe-storage-tests/5/testReport/tempest_storage_plugin.tests.scenario.test_volume_backup/TestVolumeBackup/Second_tempest_run___test_volume_backup_increment_restore_compute_id_2ce5e55c_4085_43c1_98c6_582525334ad7_image_volume_/

Comment 17 errata-xmlrpc 2018-06-27 13:37:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

Comment 18 liyu zhou 2020-10-12 06:05:55 UTC

@jbiao:
Hi, I have run into the same problem. How can it be resolved? Thanks!
