Bug 1561862
Summary: | cinder-backup incremental flag marked as false when volume attached to instance (ceph backend)
---|---
Product: | Red Hat OpenStack
Component: | openstack-cinder
Version: | 13.0 (Queens)
Target Release: | 14.0 (Rocky)
Target Milestone: | z3
Hardware: | x86_64
OS: | Linux
Status: | CLOSED ERRATA
Reporter: | bkopilov <bkopilov>
Assignee: | Eric Harney <eharney>
QA Contact: | Tzach Shefi <tshefi>
Docs Contact: | RHOS Documentation Team <rhos-docs>
Severity: | medium
Priority: | medium
Keywords: | Triaged, ZStream
CC: | abishop, chtwerdna, cinder-bugs, eharney, geguileo, jdillama, pgrist, scohen, slinaber, srevivo, tshefi, wesun, yrabl
Fixed In Version: | openstack-cinder-12.0.7-1.el7ost
Type: | Bug
Bug Blocks: | 1375207, 1710946, 1790752
Last Closed: | 2019-07-02 19:43:59 UTC
Description (bkopilov, 2018-03-29 05:03:11 UTC)
Created attachment 1414522 [details]
cinder-backup.log
2018-03-29 04:50:58.941 1 DEBUG cinder.backup.drivers.ceph [req-948d4670-aa96-4eb7-b11e-607f971450a0 38a3a0070b694fc3b693e476ec0388d4 4c60b49dfff543f6958f328515a51d09 - - -] Performing differential transfer from 'volume-8e12d2a8-9384-48c6-a372-4398d1c2ff92' to 'volume-8e12d2a8-9384-48c6-a372-4398d1c2ff92.backup.base' _rbd_diff_transfer /usr/lib/python2.7/site-packages/cinder/backup/drivers/ceph.py:579
2018-03-29 04:50:58.942 1 DEBUG cinder.backup.drivers.ceph [req-948d4670-aa96-4eb7-b11e-607f971450a0 38a3a0070b694fc3b693e476ec0388d4 4c60b49dfff543f6958f328515a51d09 - - -] Piping cmd1='rbd export-diff --id openstack --conf /tmp/brickrbd__lBGIN --pool volumes volumes/volume-8e12d2a8-9384-48c6-a372-4398d1c2ff92.snap.1522299057.95 -' into... _piped_execute /usr/lib/python2.7/site-packages/cinder/backup/drivers/ceph.py:540
2018-03-29 04:50:58.942 1 DEBUG cinder.backup.drivers.ceph [req-948d4670-aa96-4eb7-b11e-607f971450a0 38a3a0070b694fc3b693e476ec0388d4 4c60b49dfff543f6958f328515a51d09 - - -] cmd2='rbd import-diff --id openstack --conf /etc/ceph/ceph.conf --pool backups - backups/volume-8e12d2a8-9384-48c6-a372-4398d1c2ff92.backup.base' _piped_execute /usr/lib/python2.7/site-packages/cinder/backup/drivers/ceph.py:541
2018-03-29 04:50:59.054 1 INFO cinder.backup.drivers.ceph [req-948d4670-aa96-4eb7-b11e-607f971450a0 38a3a0070b694fc3b693e476ec0388d4 4c60b49dfff543f6958f328515a51d09 - - -] RBD diff op failed - (ret=33 stderr=rbd: import-diff failed: (33) Numerical argument out of domain )
2018-03-29 04:50:59.055 1 DEBUG cinder.backup.drivers.ceph [req-948d4670-aa96-4eb7-b11e-607f971450a0 38a3a0070b694fc3b693e476ec0388d4 4c60b49dfff543f6958f328515a51d09 - - -] Differential backup transfer failed _backup_rbd /usr/lib/python2.7/site-packages/cinder/backup/drivers/ceph.py:716

This seems to be a bug in RBD: http://tracker.ceph.com/issues/18844 -- version 10.2.6 or 10.2.7 should have this fixed.

Gorka,
Thanks for the lead. I'm not sure which ceph version Benny was using when this was posted, but I'm a bit confused by the version numbers I'm getting. I've just now started a deployment with an internal ceph (it looks like it may fail for whatever reason); the ceph versions I get on it:

puppet-ceph-2.5.1-0.20180305100232.928fb38.el7ost.noarch
ceph-common-12.2.1-46.el7cp.x86_64
libcephfs2-12.2.1-46.el7cp.x86_64
ceph-selinux-12.2.1-46.el7cp.x86_64
ceph-base-12.2.1-46.el7cp.x86_64
collectd-ceph-5.8.0-5.el7ost.x86_64
python-cephfs-12.2.1-46.el7cp.x86_64
ceph-mon-12.2.1-46.el7cp.x86_64

While on the external TLV ceph cluster I get these versions:

python-cephfs-10.2.7-32.el7cp.x86_64
ceph-selinux-10.2.7-32.el7cp.x86_64
ceph-osd-10.2.7-32.el7cp.x86_64
ceph-ansible-2.1.9-1.el7scon.noarch
libcephfs1-10.2.7-32.el7cp.x86_64
ceph-common-10.2.7-32.el7cp.x86_64
ceph-base-10.2.7-32.el7cp.x86_64
ceph-mds-10.2.7-32.el7cp.x86_64
ceph-radosgw-10.2.7-32.el7cp.x86_64
ceph-mon-10.2.7-32.el7cp.x86_64

Looking at the versions, both are >= 10.2.7, so it's fine to test with either of them. I'm just waiting for a deployment to finish so I can rerun the test and update this bz.

Jason,
Could you comment on this? As Eric noted in comment #2, we're seeing an RBD import-diff error.
From the cinder-backup log, it pipes this export-diff command:

"rbd export-diff --id openstack --conf /tmp/brickrbd__lBGIN --pool volumes volumes/volume-8e12d2a8-9384-48c6-a372-4398d1c2ff92.snap.1522299057.95 -"

into this import-diff command:

"rbd import-diff --id openstack --conf /etc/ceph/ceph.conf --pool backups - backups/volume-8e12d2a8-9384-48c6-a372-4398d1c2ff92.backup.base"

And that results in this error:

"RBD diff op failed - (ret=33 stderr=rbd: import-diff failed: (33) Numerical argument out of domain)"

Gorka noticed the same error message reported in an upstream ceph bug [1], but the circumstances are different (our test is not fiddling with the parent volume size), and according to Tzach the OSP-13 deployment is running rhcs-12.2.1-46.

[1] http://tracker.ceph.com/issues/18844

Can you provide debug-level RBD logs from the Cinder Backup controller node when this occurs? You will need something like the following:

[client]
log file = /var/log/ceph/$name.$pid.log
debug rbd = 20

Specifically, the 'rbd import-diff' command appears to fail with that error if the pipe input is corrupt, so I am interested in what occurs with the 'rbd export-diff' of that image:

$ echo "" | rbd import-diff - image0
rbd: import-diff failed: (33) Numerical argument out of domain

(In reply to Jason Dillaman from comment #6)
> Can you provide debug-level RBD logs from the Cinder Backup controller node
> when this occurs? You will need something like the following:
>
> [client]
> log file = /var/log/ceph/$name.$pid.log
> debug rbd = 20

Hi,
Sure, no problem. Could you explain how to enable rbd debug? Should we restart any docker containers or processes? For some reason we have both systemctl units and docker containers shown:

[root@controller-0 heat-admin]# docker ps | grep ceph
9f13626b64fb registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest "/entrypoint.sh" 3 minutes ago Up 3 minutes ceph-mon-controller-0
d5fcf89c0439 registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest "/entrypoint.sh" 3 minutes ago Up 3 minutes ceph-mgr-controller-0
[root@controller-0 heat-admin]# systemctl -a | grep ceph-
ceph-mgr loaded active running Ceph Manager
ceph-mon loaded active running Ceph Monitor
ceph-mon.target

See comment #8 for how to configure debug for rbd.

The config changes go on the Cinder backup controller host, not within the Ceph cluster.

(In reply to Jason Dillaman from comment #10)
> The config changes go on the Cinder backup controller host, not within the
> Ceph cluster.

Hi,
I ran the test case again, providing all controller logs.
See backup_id: d3ea4530-de98-410c-b793-12fccda2eb45
Attaching file: conainter_logs_contorller.tar.gz

Jason,
Where should I paste these lines exactly, in which file?

[client]
log file = /var/log/ceph/$name.$pid.log
debug rbd = 20

Thanks,
Benny

Created attachment 1419685 [details]
conainter_logs_contorller.tar.gz
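
As an aside, one way to check what the export side of the pipe actually produces, roughly along the lines Jason asks about, is to run the two rbd commands by hand on the controller. This is only a sketch; the image, snapshot and base-image names below are placeholders, not values from this environment:

    # Rough sketch only -- <volume-image>, <backup-snap> and <base-image> are placeholders.
    rbd export-diff --id openstack --conf /etc/ceph/ceph.conf --pool volumes \
        volumes/<volume-image>@<backup-snap> /tmp/backup.diff
    head -c 16 /tmp/backup.diff | hexdump -C    # a valid stream starts with the "rbd diff v1" header
    rbd import-diff --id openstack --conf /etc/ceph/ceph.conf --pool backups \
        /tmp/backup.diff backups/<base-image>

Going through a temporary file rather than a pipe makes it possible to inspect the diff stream before the import side ever sees it.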
Created attachment 1419707 [details]
SOSreport from where Cinder backup ran
Note: I posted the previous backup IDs in the comment above; they won't show up in the sosreport.
Below are the correct/valid IDs, which will show up in the sosreport.
cinder backup-list
+--------------------------------------+--------------------------------------+-----------+-----------------------+------+--------------+-----------+
| ID | Volume ID | Status | Name | Size | Object Count | Container |
+--------------------------------------+--------------------------------------+-----------+-----------------------+------+--------------+-----------+
| 794f7740-a899-4290-8ac6-f7a66646c7af | 364ab166-2cd3-4e53-aa00-315de2f03a66 | available | AttachedVolIncBackup | 1 | 0 | backups |
| e485e61d-7b60-4375-b51f-5bdd5e953b8d | 364ab166-2cd3-4e53-aa00-315de2f03a66 | available | AttachedVolFullBackup | 1 | 0 | backups |
+--------------------------------------+--------------------------------------+-----------+-----------------------+------+--------------+-----------+
(overcloud) [stack@undercloud-0 ~]$ cinder backup-show e485e61d-7b60-4375-b51f-5bdd5e953b8d | grep incre
| is_incremental | False |
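
For reference, a rough sketch of the CLI flow that produces a listing like the one above, assuming the volume is already attached to an instance (the backup names match the listing; the backup ID placeholder and the expected/observed note are illustrative only):

    # Sketch only; the volume ID is the attached volume from the listing above.
    cinder backup-create --force --name AttachedVolFullBackup 364ab166-2cd3-4e53-aa00-315de2f03a66
    cinder backup-create --force --incremental --name AttachedVolIncBackup 364ab166-2cd3-4e53-aa00-315de2f03a66
    cinder backup-show <incremental-backup-id> | grep is_incremental   # reported in this bug as False instead of True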
Priority - high: blocking urgently needed Cinder OSP-13 RFE verification.
Severity - urgent: no workaround exists and the basic flow of RBD incremental backup support is broken, so the feature is useless in its current state.

(In reply to bkopilov from comment #11)
> (In reply to Jason Dillaman from comment #10)
> > The config changes go on the Cinder backup controller host, not within the
> > Ceph cluster.
>
> Hi,
>
> I ran the test case again, providing all controller logs.
> See backup_id: d3ea4530-de98-410c-b793-12fccda2eb45
>
> Attaching file: conainter_logs_contorller.tar.gz

I also require Ceph's logs.

> Jason,
> Where should I paste these lines exactly, in which file?
> [client]
> log file = /var/log/ceph/$name.$pid.log
> debug rbd = 20

/etc/ceph/<cluster name>.conf (usually <cluster name> is just 'ceph').

> Thanks,
> Benny

I logged into the test environment listed above and don't see any issues on the Ceph/RBD side. I can pipe an export-diff of the image over to an import-diff into the backup image just fine manually. My hunch is that something is broken inside OpenStack Cinder's pipe helper [1]. Perhaps it either cannot handle binary data correctly or, worse, it is merging stdout and stderr into the same stdin stream. If that were the case, you'd end up with a result like the following, where the binary data (from stdout) is merged with the status output sent to stderr:

# rbd export-diff --id openstack --pool volumes volumes/volume-364ab166-2cd3-4e53-aa00-315de2f03a66.snap.1523367066.14 - --debug-rbd=0 | hexdump -C
00000000 72 62 64 20 64 69 66 66 20 76 31 0a 74 3e 00 00 |rbd diff v1.t>..|
00000010 00 62 61 63 6b 75 70 2e 65 66 30 62 39 64 33 31 |.backup.ef0b9d31|
00000020 2d 37 63 64 30 2d 34 30 64 65 2d 61 37 66 63 2d |-7cd0-40de-a7fc-|
00000030 63 31 64 65 33 61 39 31 38 35 39 63 2e 73 6e 61 |c1de3a91859c.sna|
00000040 70 2e 31 35 32 33 33 36 37 30 36 36 2e 31 34 73 |p.1523367066.14s|
Exporting image: 100% complete...done.
00000050 00 00 00 40 00 00 00 00 65 |...@....e|
00000059

[1] https://github.com/openstack/cinder/blob/stable/pike/cinder/backup/drivers/ceph.py#L544

Alternatively, it's not immediately clear to me how the Ceph configuration file is written to disk, since it appears to be using a temporary file [2] that is extracted from the volume_file:

2018-04-10 13:31:07.065 1 DEBUG cinder.backup.drivers.ceph [req-7775e2ad-a518-4099-a2ca-619485e7ed57 82e9e698e0c34efc856ea8972e30c07a 4adf2dd4b5844ca88db24a01289eaf49 - - -] Piping cmd1='rbd export-diff --id openstack --conf /tmp/brickrbd_ouc1ot --pool volumes volumes/volume-364ab166-2cd3-4e53-aa00-315de2f03a66.snap.1523367066.14 -' into... _piped_execute /usr/lib/python2.7/site-packages/cinder/backup/drivers/ceph.py:540

[2] https://github.com/openstack/cinder/blob/stable/pike/cinder/backup/drivers/ceph.py#L646

The same issue happens with Swift in place of RBD as Cinder's backup back-end. Not surprising or helpful; just mentioned since I happened to hit it there as well.

Added link to the upstream bug I opened for this.
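
A minimal bash illustration of the merged-stream hypothesis above; the image and base names are placeholders, and this only demonstrates the failure mode Jason describes (the analysis further down attributes the actual failure to a missing source snapshot rather than the pipe itself):

    # Keeping stderr out of the pipe: the import side receives a clean "rbd diff v1" stream.
    rbd export-diff --pool volumes volumes/<image>@<snap> - 2>/dev/null | rbd import-diff --pool backups - backups/<base-image>
    echo "export rc=${PIPESTATUS[0]} import rc=${PIPESTATUS[1]}"

    # Merging stderr into stdout corrupts the stream; import-diff would then be expected to fail
    # the same way as the 'echo "" | rbd import-diff - image0' test shown earlier.
    rbd export-diff --pool volumes volumes/<image>@<snap> - 2>&1 | rbd import-diff --pool backups - backups/<base-image>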
Recently I've done some tests on this bug and found something. Before posting this I found eharney seemed to already have a solution ( https://bugs.launchpad.net/cinder/+bug/1763399 ). Still, here is what I found; maybe it'll help.

First, this bug is not about the piping error. It's caused by a different device_to_backup being passed to cinder backup when the volume is available versus in-use.

For an available volume, everything works fine. The volume itself is passed in and cinder backup takes a snapshot of it, so ceph has:

1. originVolume
2. originVolume@snapshot

After that, cinder backup uses:

1. the new snapshot (originVolume@snapshot) for the first backup
2. the last snapshot (originVolume@snapshot-latest) and the new snapshot (originVolume@snapshot) for the second and later backups

to do rbd export-diff and pipe it into rbd import-diff. The rbd export-diff source is named like originVolume@snapshot.

For an in-use volume, it's a little more complicated. Before passing it to cinder backup, a temporary cloned volume is created: https://github.com/openstack/cinder/blob/stable/pike/cinder/volume/driver.py#L1161

The cloned volume is passed in and cinder backup takes a snapshot of it. At that point, ceph has:

1. originVolume
2. originVolume@clone_snap (temp snapshot used to create cloneVolume)
3. cloneVolume (passed into cinder backup)
4. cloneVolume@snapshot

But in cinder backup, rbd export-diff still looks for a source named "originVolume@snapshot". It can't find it, so the export fails, which then causes the piping error (comment 7 ran some tests and gave a hint of this). After the piping error, cinder backup falls back to making a full copy every time. So the backup is always full and the incremental flag is false.

Note that we can't just change the export source path from originVolume@snapshot to cloneVolume@snapshot, because cloneVolume and cloneVolume@snapshot are new each time; even though cinder backup would use export-diff, it would still effectively back up the whole new volume.

After all the tests, if I ignore the in-use state and just pass the originVolume directly to cinder backup, it seems to work fine, like available volumes. So I'm curious why it's designed like this (creating a clone volume for an in-use volume); maybe because of other storage backends' constraints? For ceph, both ways mentioned above take snapshots first.

Hope this post helps. And if eharney has already sorted all this out, I hope the bug fix comes out shortly; otherwise I'll need to patch my environment myself. (This is my first post here, just because this bug has lived a little too long for me.)

Thanks,
werdna

(In reply to werdna from comment #22)
> Recently I've done some tests on this bug and found something.
> Before posting this I found eharney seemed to already have a solution.
> ( https://bugs.launchpad.net/cinder/+bug/1763399 )

Hi,

Just to keep track of the changes here -- my patch has been submitted against a different upstream bug which is tracking the same problems:

https://bugs.launchpad.net/cinder/+bug/1578036

I believe https://review.openstack.org/#/c/579606/ will fix this.

Thank you for adding notes from your analysis, this kind of info is always appreciated.

Thanks for the response. The discussion there is very helpful (I didn't find it before).

Another small question: do we need to freeze I/O before taking a ceph snapshot? I'm not quite sure whether ceph itself does this or not. If not, then taking a snapshot of an in-use volume may result in an inconsistent state. OpenStack snapshots seem to handle the I/O freezing, but in cinder backup we use ceph snapshots directly. From http://docs.ceph.com/docs/mimic/rbd/rbd-snapshot/ :

Note: If a snapshot is taken while I/O is still in progress on an image, the snapshot might not get the exact or latest data of the image, and the snapshot may have to be cloned to a new image to be mountable. So we recommend stopping I/O before taking a snapshot of an image. If the image contains a filesystem, the filesystem must be in a consistent state before taking a snapshot. To stop I/O you can use the fsfreeze command. See the fsfreeze(8) man page for more details. For virtual machines, qemu-guest-agent can be used to automatically freeze filesystems when creating a snapshot.
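
A minimal bash sketch of the manual quiescing the Ceph documentation describes, assuming the volume's filesystem is mounted at /mnt/data inside the guest (the mount point and image/snapshot names are placeholders; as the reply below notes, Cinder itself does not do this):

    fsfreeze --freeze /mnt/data                     # flush and block new writes
    rbd snap create volumes/<volume-image>@<snap>   # take the RBD snapshot while the filesystem is quiesced
    fsfreeze --unfreeze /mnt/data                   # resume I/O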
(In reply to werdna from comment #25)
> Thanks for the response.
> The discussion there is very helpful (I didn't find it before).
>
> Another small question: do we need to freeze I/O before taking a ceph
> snapshot?

The model that Cinder uses is that snapshots are crash-consistent. Freezing/quiescing within the guest is not performed.

(In reply to Eric Harney from comment #23)
> Hi,
>
> Just to keep track of the changes here -- my patch has been submitted
> against a different upstream bug which is tracking the same problems:
>
> https://bugs.launchpad.net/cinder/+bug/1578036
>
> I believe https://review.openstack.org/#/c/579606/ will fix this.
>
> Thank you for adding notes from your analysis, this kind of info is always
> appreciated.

Hi Eric,

Thanks for creating the patch. I applied the commit from 579606 to my Newton cluster, and incremental backup of an attached cinder/ceph volume now seems to work as expected with the ceph backup driver, i.e. 'rbd export-diff' leverages the snapshot left from the previous backup for the differential, and debug output shows the proper piping into 'rbd import-diff' against the target ceph base volume.

However, since your patch went into the base RBD volume class, it seems to break the Swift backup driver for ceph-based cinder source volumes. When I switch to the Swift backup driver, no snapshot of any kind is created for the source cinder/ceph volume any more, and the driver literally copies from the source volume directly to the Swift backend. After backing out your patch, I see the normal behavior with the Swift backup driver again, i.e. cinder asks the ceph cluster for a snapshot of the source ceph volume and creates a cloned cinder volume (named backup-xxxx) from that snapshot; the Swift driver then uses this cloned volume for copying data to Swift containers. Can the patch be restricted to the ceph backup driver only?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1678