Bug 2027260

Summary: Cold backup fails in various ways - backup is reported ready before add_bitmap jobs complete

Product: [oVirt] ovirt-engine
Component: Backup-Restore.VMs
Version: 4.4.10
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Reporter: Nir Soffer <nsoffer>
Assignee: Benny Zlotnik <bzlotnik>
QA Contact: Amit Sharir <asharir>
Docs Contact: bugs <bugs>
CC: ahadas, asharir, bugs, bzlotnik, eshames, lsvaty, pbar
Keywords: ZStream
Target Milestone: ovirt-4.4.10
Target Release: 4.4.10.1
Flags: pm-rhel: ovirt-4.4+, lsvaty: blocker-
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-engine-4.4.10.1
Doc Type: If docs needed, set a value
Last Closed: 2022-01-19 07:00:13 UTC
Type: Bug
oVirt Team: Storage
How to reproduce: 1. Setup VM with 2 disks: - OS qcow2 disk using a template (I used 6g fedora image with ~1.6g of data) - Empty 10g qcow2 data disk Single host - other wise add_bitmap jobs may run on another host, hiding the incorrect flow. 2. Start full backup to create a checkpoint: ./backup_vm.py -c engine-dev full \ --backup-dir backups \ 6e95d38f-d9b8-4955-878c-da6d631d0ab2 I tried 3 times, first 2 invocations failed. 3. Run incremental backup using the checkpoint created in step 2. ./backup_vm.py -c engine-dev incremental \ --from-checkpoint-uuid 0345dff0-cd6a-4a5f-bf3a-bf230318c5c0 \ --backup-dir backups \ 6e95d38f-d9b8-4955-878c-da6d631d0ab2 I tried few times, most runs failed. Since this is a race, you need to run both full backup and incremental backup in a loop to reproduce the errors. You can run incremental backup multiple times using the same checkpoint. The backup creates a new checkpoint, but backing up the same checkpoint multiple times is valid, and will create identical backup. Example reproduction with full backup: 10 backups, 1 succeeded, 9 failed! $ for i in $(seq 10); do echo iteration $i; ./backup_vm.py -c engine-dev full --backup-dir /data/scratch/backups 6e95d38f-d9b8-4955-878c-da6d631d0ab2; echo "---"; done iteration 1 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.6 ] Waiting until backup 'ce9da0d8-7237-4ad0-85ec-61e36df7c43a' is ready [ 1.6 ] Created checkpoint '88fb7415-9267-4171-add9-1607239a3213' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.7 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 3.7 ] Image transfer '8babfd11-b878-480d-bc60-ae02115c8072' is ready [ 3.7 ] Finalizing image transfer0 bytes/s [ 6.7 ] Download completed successfully [ 6.7 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 2 [ 0.0 ] Starting full backup for VM 
'6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'be2c34f0-46bd-427d-9227-43902636f7b2' is ready [ 1.2 ] Created checkpoint 'b0be8af4-59e6-4a26-b7b4-3d54978a1cdb' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 3.1 ] Image transfer 'f87e12e7-fe19-4130-be2e-08b20ba1c878' is ready [ 3.2 ] Finalizing image transfer0 bytes/s [ 6.2 ] Download completed successfully [ 6.2 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 3 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'e07bde95-31c8-4cde-823f-3e473e076df6' is ready [ 1.2 ] Created checkpoint '8172c36d-8e2f-49d3-a6d9-dfefcda65141' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 3.1 ] Image transfer '73e3088a-5e26-4513-b940-ccd06461a2fb' is ready [ 3.1 ] Finalizing image transfer0 bytes/s [ 6.2 ] Download completed successfully [ 6.2 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in 
_open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 4 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup '84de46ed-4917-474d-897d-77ed56030084' is ready [ 1.2 ] Created checkpoint 'f51a45e3-10f6-4e99-bd25-9833c1256a8b' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 3.1 ] Image transfer 'dda45081-f8dd-477c-ab78-e49ed7f237e5' is ready [ 3.1 ] Finalizing image transfer0 bytes/s [ 5.2 ] Download completed successfully [ 5.2 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 5 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'd438aafb-d57d-4027-87db-cf33f0817dd3' is ready [ 1.2 ] Created checkpoint '86911693-4156-4452-aa55-4acb1a90be23' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 2.9 ] Image transfer '0ae2fe92-6b95-4631-8b02-4d7e549eb4e5' is ready [ 2.9 ] Finalizing image transfer0 bytes/s [ 5.9 ] Download 
completed successfully [ 5.9 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 6 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'ba327596-5c27-48da-beb8-f2c2e51db0ad' is ready [ 1.2 ] Created checkpoint '3672cb60-edad-4794-b38e-50ca3473d40d' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 3.1 ] Image transfer '6375c4b6-7fd9-405e-990c-978e88de1bb1' is ready [ 3.1 ] Finalizing image transfer0 bytes/s [ 6.1 ] Download completed successfully [ 6.1 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File 
"/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 7 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'ff7d4150-561b-4cf3-b5fc-a8ff98f256fe' is ready [ 2.2 ] Created checkpoint '88179ce2-093e-48e6-b44a-e71e04b7f32d' (to use in --from-checkpoint-uuid for the next incremental backup) [ 2.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 3.6 ] Image transfer '1e1cd2cb-b15e-4e66-a9f0-04bc99e3a50b' is ready [ 100.00% ] 6.00 GiB, 4.75 seconds, 1.26 GiB/s [ 8.4 ] Finalizing image transfer [ 18.6 ] Download completed successfully [ 18.6 ] Downloading full backup for disk '74d8e801-f555-494f-a788-a7e64c8e7606' [ 20.2 ] Image transfer 'df782eb3-799d-40c5-aa95-20f0791f6bcd' is ready [ 100.00% ] 10.00 GiB, 0.07 seconds, 135.31 GiB/s [ 20.2 ] Finalizing image transfer [ 22.3 ] Download completed successfully [ 22.3 ] Finalizing backup [ 31.5 ] Full backup 'ff7d4150-561b-4cf3-b5fc-a8ff98f256fe' completed successfully --- iteration 8 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'b965e31e-7f7a-4ba5-8173-00749b2f5f19' is ready [ 1.2 ] Created checkpoint 'cd2d2948-20b7-4a6a-96ab-0feab49148fb' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.2 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 2.8 ] Image transfer '76d82aad-60af-4f41-85f5-3a98809cfcf0' is ready [ 2.8 ] Finalizing image transfer0 bytes/s [ 4.9 ] Download completed successfully [ 4.9 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 9 
[ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup '9a072ad5-c1dc-4750-a33a-49f6e5889a49' is ready [ 1.2 ] Created checkpoint '6e90df94-9a7d-4050-b4a7-0cb0801b13df' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 2.9 ] Image transfer 'a59b4603-9870-43a5-a621-152bb66c5a16' is ready [ 2.9 ] Finalizing image transfer0 bytes/s [ 5.9 ] Download completed successfully [ 5.9 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- iteration 10 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup '9ff432bc-59cb-4ac1-b73e-e2d0028ff4ad' is ready [ 1.2 ] Created checkpoint '713ea2f3-4b1b-4981-adf8-01489aa4925b' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.2 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 2.9 ] Image transfer '88b473b0-19f0-4b5c-af3d-9eadc9db64de' is ready [ 3.0 ] Finalizing image transfer0 bytes/s [ 6.0 ] Download completed successfully [ 6.0 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 444, in download_disk client.download( File "/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 150, in download with _open_http( File 
"/usr/lib64/python3.10/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.10/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- Example reproduction with incremental backup. First do full backup - I was lucky and it succeeded on the first attempt: $ ./backup_vm.py -c engine-dev full --backup-dir /data/scratch/backups 6e95d38f-d9b8-4955-878c-da6d631d0ab2 [ 0.0 ] Starting full backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup '5cf4f02c-2eaf-4e2b-aa91-c90b6431e592' is ready [ 2.3 ] Created checkpoint 'e802b0b4-1765-4f55-8119-3d79399b5f2d' (to use in --from-checkpoint-uuid for the next incremental backup) [ 2.3 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 3.8 ] Image transfer '28ac3eb2-e5f1-4e72-9ed3-f7e96268b6ae' is ready [ 100.00% ] 6.00 GiB, 4.37 seconds, 1.37 GiB/s [ 8.2 ] Finalizing image transfer [ 10.3 ] Download completed successfully [ 10.3 ] Downloading full backup for disk '74d8e801-f555-494f-a788-a7e64c8e7606' [ 11.8 ] Image transfer 'e81ad73c-6758-4883-bc82-9d3bd2a76aa4' is ready [ 100.00% ] 10.00 GiB, 0.07 seconds, 141.55 GiB/s [ 11.9 ] Finalizing image transfer [ 14.9 ] Download completed successfully [ 14.9 ] Finalizing backup [ 21.1 ] Full backup '5cf4f02c-2eaf-4e2b-aa91-c90b6431e592' completed successfully Now try incremental backup in a loop, all using checkpoint e802b0b4-1765-4f55-8119-3d79399b5f2d. First attempt succeeded, second attempt timed out after 60 seconds, third attempt blocked forever waiting for backup completion. 
$ for i in $(seq 10); do echo iteration $i; ./backup_vm.py -c engine-dev incremental --from-checkpoint-uuid e802b0b4-1765-4f55-8119-3d79399b5f2d --backup-dir /data/scratch/backups 6e95d38f-d9b8-4955-878c-da6d631d0ab2; echo "---"; done iteration 1 [ 0.0 ] Starting incremental backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'da90ef42-4bbc-4fe6-8b6f-feb9938854bc' is ready [ 1.2 ] Created checkpoint '964c2cb0-5f47-4ba1-b4c6-929db015ed85' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.3 ] Downloading incremental backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 2.9 ] Image transfer '42038e4a-0b7b-4a42-8a66-7fe4741acf13' is ready [ 100.00% ] 6.00 GiB, 0.08 seconds, 71.81 GiB/s [ 3.0 ] Finalizing image transfer [ 5.0 ] Download completed successfully [ 5.0 ] Downloading incremental backup for disk '74d8e801-f555-494f-a788-a7e64c8e7606' [ 6.5 ] Image transfer '685dafe2-a8f4-4851-9a94-a82b42836bdc' is ready [ 100.00% ] 10.00 GiB, 0.07 seconds, 152.78 GiB/s [ 6.6 ] Finalizing image transfer [ 9.6 ] Download completed successfully [ 9.6 ] Finalizing backup [ 12.8 ] Incremental backup 'da90ef42-4bbc-4fe6-8b6f-feb9938854bc' completed successfully --- iteration 2 [ 0.0 ] Starting incremental backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'd21d6285-fca6-4b3d-bc23-bc7367a802c0' is ready [ 1.2 ] Created checkpoint 'b91f3a63-7457-4349-8e13-845e9da2347a' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.2 ] Downloading incremental backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 62.6 ] Finalizing backup Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 225, in cmd_incremental download_backup(connection, backup.id, args, incremental=True) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 420, in download_disk transfer = imagetransfer.create_transfer( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 240, in create_transfer raise RuntimeError( RuntimeError: Timed out waiting for transfer 411acaea-dc6b-4ff9-904c-e3e47e6d9394 --- iteration 3 [ 0.0 ] Starting incremental backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.2 ] Waiting until backup 'fd31a96d-7df0-4674-9bee-1ab7d4120a02' is ready [ 1.2 ] Created checkpoint 'a428b068-766d-4e0d-8671-22b59ab79eb0' (to use in --from-checkpoint-uuid for the next incremental backup) [ 1.2 ] Downloading incremental backup for disk '74d8e801-f555-494f-a788-a7e64c8e7606' [ 62.4 ] Finalizing backup (I interrupted the script after several minutes) [ 62.4 ] Finalizing backup ^CTraceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 225, in cmd_incremental download_backup(connection, backup.id, args, incremental=True) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 420, in download_disk transfer = imagetransfer.create_transfer( File 
"/home/nsoffer/src/python-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 240, in create_transfer raise RuntimeError( RuntimeError: Timed out waiting for transfer 08864ab4-9745-450a-b5ca-98f107c43de1 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 228, in cmd_incremental stop_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 369, in stop_backup time.sleep(1) KeyboardInterrupt Checking vms backups: https://engine-dev/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups <backups> <backup href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups/fd31a96d-7df0-4674-9bee-1ab7d4120a02" id="fd31a96d-7df0-4674-9bee-1ab7d4120a02"> <actions> <link href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups/fd31a96d-7df0-4674-9bee-1ab7d4120a02/finalize" rel="finalize"/> </actions> <link href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups/fd31a96d-7df0-4674-9bee-1ab7d4120a02/disks" rel="disks"/> <creation_date>2021-11-29T12:27:59.893+02:00</creation_date> <from_checkpoint_id>e802b0b4-1765-4f55-8119-3d79399b5f2d</from_checkpoint_id> <modification_date>2021-11-29T12:29:02.228+02:00</modification_date> <phase>finalizing</phase> <to_checkpoint_id>a428b068-766d-4e0d-8671-22b59ab79eb0</to_checkpoint_id> <host href="/ovirt-engine/api/hosts/00000000-0000-0000-0000-000000000000" id="00000000-0000-0000-0000-000000000000"/> <vm href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2" id="6e95d38f-d9b8-4955-878c-da6d631d0ab2"/> </backup> </backups> Backup was stuck in finalizing state. Pavel, isn't this fixed in engine 4.5.0-0.0.master.20211118145311.el8? 
Testing incremental backup again with latest engine (built yesterday): $ rpm -q ovirt-engine ovirt-engine-4.5.0-0.2.master.20211128100822.el8.noarch $ for i in $(seq 10); do echo iteration $i; ./backup_vm.py -c engine-dev incremental --from-checkpoint-uuid e802b0b4-1765-4f55-8119-3d79399b5f2d --backup-dir /data/scratch/backups 6e95d38f-d9b8-4955-878c-da6d631d0ab2; echo "---"; done iteration 1 [ 0.0 ] Starting incremental backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' [ 0.6 ] Waiting until backup 'e0b2608b-7652-4fec-b8f2-10e82c545d80' is ready [ 2.7 ] Created checkpoint 'cebe36f1-1aed-4440-9eff-fa54fda14e00' (to use in --from-checkpoint-uuid for the next incremental backup) [ 2.8 ] Downloading incremental backup for disk '74d8e801-f555-494f-a788-a7e64c8e7606' [ 4.7 ] Image transfer '6fa5ea1f-67ee-47be-b63a-55f30774a050' is ready [ 100.00% ] 10.00 GiB, 0.07 seconds, 144.69 GiB/s [ 4.8 ] Finalizing image transfer [ 7.8 ] Download completed successfully [ 7.9 ] Downloading incremental backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 9.5 ] Image transfer '36d590c7-60f2-460f-bdfb-1e9d4e16b361' is ready [ 100.00% ] 6.00 GiB, 0.08 seconds, 78.79 GiB/s [ 9.6 ] Finalizing image transfer [ 12.6 ] Download completed successfully [ 12.6 ] Incremental backup not available for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 12.6 ] Downloading full backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b' [ 12.7 ] Finalizing backup (Interrupt script after few minutes) ^CTraceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 225, in cmd_incremental download_backup(connection, backup.id, args, incremental=True) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup download_disk( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 420, in download_disk transfer = imagetransfer.create_transfer( File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 200, in create_transfer transfer = transfers_service.add(transfer) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/services.py", line 14153, in add return self._internal_add(image_transfer, headers, query, wait) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/service.py", line 232, in _internal_add return future.wait() if wait else future File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/service.py", line 55, in wait return self._code(response) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/service.py", line 229, in callback self._check_fault(response) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/service.py", line 132, in _check_fault self._raise_error(response, body) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/service.py", line 118, in _raise_error raise error ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk. The VM backup is not in READY phase, backup phase is FAILED.]". HTTP response code is 409. 
During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module> main() File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main args.command(args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 228, in cmd_incremental stop_backup(connection, backup.id, args) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 370, in stop_backup backup = get_backup(connection, backup_service, backup_uuid) File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 510, in get_backup backup = backup_service.get() File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/services.py", line 33917, in get return self._internal_get(headers, query, wait) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/service.py", line 211, in _internal_get return future.wait() if wait else future File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/service.py", line 54, in wait response = self._connection.wait(self._context) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/__init__.py", line 496, in wait return self.__wait(context, failed_auth) File "/home/nsoffer/.local/lib/python3.10/site-packages/ovirt_engine_sdk_python-4.4.15-py3.10-linux-x86_64.egg/ovirtsdk4/__init__.py", line 532, in __wait self._multi.select(1.0) KeyboardInterrupt Looks like the backup_vm.py detected a failed backup: ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk. The VM backup is not in READY phase, backup phase is FAILED.]". HTTP response code is 409. And tried to finalize the backup, but the backup is stuck in finalizing phase again: https://engine-dev/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups <backups> <backup href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups/e0b2608b-7652-4fec-b8f2-10e82c545d80" id="e0b2608b-7652-4fec-b8f2-10e82c545d80"> <actions> <link href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups/e0b2608b-7652-4fec-b8f2-10e82c545d80/finalize" rel="finalize"/> </actions> <link href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2/backups/e0b2608b-7652-4fec-b8f2-10e82c545d80/disks" rel="disks"/> <creation_date>2021-11-29T12:54:25.777+02:00</creation_date> <from_checkpoint_id>e802b0b4-1765-4f55-8119-3d79399b5f2d</from_checkpoint_id> <modification_date>2021-11-29T12:54:38.019+02:00</modification_date> <phase>finalizing</phase> <to_checkpoint_id>cebe36f1-1aed-4440-9eff-fa54fda14e00</to_checkpoint_id> <host href="/ovirt-engine/api/hosts/00000000-0000-0000-0000-000000000000" id="00000000-0000-0000-0000-000000000000"/> <vm href="/ovirt-engine/api/vms/6e95d38f-d9b8-4955-878c-da6d631d0ab2" id="6e95d38f-d9b8-4955-878c-da6d631d0ab2"/> </backup> </backups> Looks like the bug when backup is stuck in finalizing state is not fixed yet in 4.5. Notes for testing on 4.4.9/10: 1. Test with single host - make it much more reproducible. 2. Test with the standard environment (3 hosts) - closer to real system. This will tell how this failure is likely for users. 
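For the repeated runs used in these comments (and in the testing notes above), a small driver script can tally failures. This is only a sketch, assuming backup_vm.py from the SDK examples and the same 'engine-dev' configuration as the shell loop shown earlier:

    import subprocess

    # Hypothetical driver for repeated full backups; not part of the SDK examples.
    VM_ID = "6e95d38f-d9b8-4955-878c-da6d631d0ab2"
    RUNS = 10
    failures = 0

    for i in range(1, RUNS + 1):
        print(f"iteration {i}")
        try:
            result = subprocess.run(
                ["./backup_vm.py", "-c", "engine-dev", "full",
                 "--backup-dir", "backups", VM_ID],
                timeout=600,  # guard against backups that hang in finalizing
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            failures += 1
        print("---")

    print(f"{failures}/{RUNS} runs failed")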
> Looks like the bug when backup is stuck in finalizing state is not fixed > yet in 4.5. Right, and it is already reported by bug 1985746 and it is not fixed yet. Amit, can you please check if this bug was also found on 4.4.9/4.4.10? Nir, what is the severity of this bug? why it is urgent? Following our meeting today, I reproduced the issue on vdsm-4.40.100.1-1.el8ev.x86_64 / ovirt-engine-4.4.10-0.17.el8ev.noarch I created a VM with 2 disks: - OS qcow2 disk using a template - Empty 10g qcow2 data disk For the "full backup" procedure with 3 hosts I got: 3/10 failures. 2/3 failures were: Error message- Traceback (most recent call last): File "./backup_vm.py", line 525, in <module> main() File "./backup_vm.py", line 191, in main args.command(args) File "./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "./backup_vm.py", line 407, in download_backup incremental=has_incremental) File "./backup_vm.py", line 425, in download_disk timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy)) File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 241, in create_transfer "Timed out waiting for transfer {}".format(transfer.id)) RuntimeError: Timed out waiting for transfer c9945bd1-0d4b-4c07-96dc-b655225ae515 1/3 failures I got: backup stuck in finalizing state like mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1985746 I did the same flow mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2027260#c10 with less active hosts (full backup, 2 hosts, for a VM with 2 disks): The issue didn't reproduce on the older version: vdsm-4.40.90.4-1.el8ev.x86_64/ovirt-engine-4.4.9.5-0.1.el8ev.noarch Please give me a "need info" if any additional information/reproductions are required from QE. When using only 1 active host it was easier to reproduce the issue. Tested on vdsm-4.40.100.1-1.el8ev.x86_64 / ovirt-engine-4.4.10-0.17.el8ev.noarch I created a VM with 2 disks: - OS qcow2 disk using a template - Empty 10g qcow2 data disk For the "full backup" procedure I got: 4/6 failures (in the 6th out of 10 iterations of the script, the backup was stuck in finalizing). 
Example of error messages I received: Traceback (most recent call last): File "./backup_vm.py", line 525, in <module> main() File "./backup_vm.py", line 191, in main args.command(args) File "./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "./backup_vm.py", line 407, in download_backup incremental=has_incremental) File "./backup_vm.py", line 425, in download_disk timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy)) File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 241, in create_transfer "Timed out waiting for transfer {}".format(transfer.id)) RuntimeError: Timed out waiting for transfer bf600dd1-8675-4164-a6ac-64ad106c4747 --- Traceback (most recent call last): File "./backup_vm.py", line 525, in <module> main() File "./backup_vm.py", line 191, in main args.command(args) File "./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "./backup_vm.py", line 407, in download_backup incremental=has_incremental) File "./backup_vm.py", line 452, in download_disk **extra_args) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 155, in download proxy_url=proxy_url) as src: File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 514, in _open_http return http.open(url, mode, cafile=cafile, secure=secure) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 48, in open return Backend(url, **options) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 76, in __init__ self._connect() File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 119, in _connect options = self._options() File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 482, in _options raise self._reraise(res.status, body) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise raise http.Error(status, msg) ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs --- (In reply to Amit Sharir from comment #12) > When using only 1 active host it was easier to reproduce the issue. > Tested on vdsm-4.40.100.1-1.el8ev.x86_64 / > ovirt-engine-4.4.10-0.17.el8ev.noarch ... > For the "full backup" procedure I got: 4/6 failures (in the 6th out of 10 > iterations of the script, the backup was stuck in finalizing). So we tried 6 times, 1 succeeded, 4 failed, 1 stuck forever. So this is not a 4.5 regression. (In reply to Eyal Shenitzky from comment #9) > Nir, what is the severity of this bug? why it is urgent? From development point of view, this is major issue, doing something which is not safe: - modifying an image during backup - running operations that activate/deactivate or lock/unlock an image at the same time From user point of view, this makes cold backup unreliable, it will fail randomly. This is very bad for RHV and for Red Hat; we should not release code that fails randomly. However, this bug is less horrible than the bug when backup is stuck in finalizing state - in that case you cannot use the system since the VM is locked. If we look on the full half of the glass, backup is not widely used yet, and this bug affect only cold backup which is not the main flow. Reproduced this issue again with only 1 active host. 
Tested on vdsm-4.40.100.1-1.el8ev.x86_64 / ovirt-engine-4.4.10-0.17.el8ev.noarch I created a VM with 2 disks: - OS qcow2 disk using a template - Empty 10g qcow2 data disk For the "full backup" procedure I got: 3/9 failures (in the 9th out of 10 iterations of the script, the backup was stuck in finalizing). Example of error messages I received: Traceback (most recent call last): File "./backup_vm.py", line 525, in <module> main() File "./backup_vm.py", line 191, in main args.command(args) File "./backup_vm.py", line 207, in cmd_full download_backup(connection, backup.id, args) File "./backup_vm.py", line 407, in download_backup incremental=has_incremental) File "./backup_vm.py", line 425, in download_disk timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy)) File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 241, in create_transfer "Timed out waiting for transfer {}".format(transfer.id)) RuntimeError: Timed out waiting for transfer 8008b4c6-0840-413d-b932-9e8775a120d8 Attaching ovirt-imageio/daemon.log, vdsm.log and engine logs in following comments. Hi Benny, do my reproduction's look like a bitmap-related issue (backup is ready before the add_bitmap jobs are complete)? Please note that we have a finalization issue from another bug (https://bugzilla.redhat.com/show_bug.cgi?id=1985746). In other words, I want to be sure that what I reproduced in the comments above is a result of the new issue mentioned in this bug and not something else. Thanks, Looks the same 2021-11-30 20:02:52,222+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddVolumeBitmapVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [387aad7f-ce02-419f-933e-a41a2d8dcfc0] START, AddVolumeBitmapVDSCommand(HostName = host_mixed_1, VolumeBitmapVDSCommandParameters:{hostId='19738a5d-76d7-4eb9-b702-1c4c6c00c1da', storageDomainId='826ae9c9-4b07-4358-af82-3f76598c2dd2', jobId='f8a2c5d2-d85a-461c-a4f7-873008716632'}), log id: 158354ca 2021-11-30 20:02:52,226+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddVolumeBitmapVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [387aad7f-ce02-419f-933e-a41a2d8dcfc0] FINISH, AddVolumeBitmapVDSCommand, return: , log id: 158354ca 2021-11-30 20:02:52,242+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-15) [387aad7f-ce02-419f-933e-a41a2d8dcfc0] Ready to start image transfers 2021-11-30 20:02:52,499+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] Lock Acquired to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='[]'}' 2021-11-30 20:02:52,505+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] Running command: TransferDiskImageCommand internal: false. 
Entities affected : ID: 627dc8c0-a6c3-4ac7-836f-ec15592d4883 Type: DiskAction group EDIT_DISK_PROPERTIES with role type USER 2021-11-30 20:02:52,505+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] Creating ImageTransfer entity for command '8008b4c6-0840-413d-b932-9e8775a120d8', proxyEnabled: true 2021-11-30 20:02:52,515+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] Successfully added Download disk 'test1949_Disk1' (disk id: '627dc8c0-a6c3-4ac7-836f-ec15592d4883', image id: '79e48285-a7fd-48cc-81f8-2a6d76e02be3') for image transfer command '8008b4c6-0840-413d-b932-9e8775a120d8' 2021-11-30 20:02:52,527+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] START, PrepareImageVDSCommand(HostName = host_mixed_1, PrepareImageVDSCommandParameters:{hostId='19738a5d-76d7-4eb9-b702-1c4c6c00c1da'}), log id: 54e36abc 2021-11-30 20:02:53,020+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] Failed in 'PrepareImageVDS' method, for vds: 'host_mixed_1'; host: 'storage-ge11-vdsm1.lab.eng.tlv2.redhat.com': null 2021-11-30 20:02:53,020+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] Command 'PrepareImageVDSCommand(HostName = host_mixed_1, PrepareImageVDSCommandParameters:{hostId='19738a5d-76d7-4eb9-b702-1c4c6c00c1da'})' execution failed: null 2021-11-30 20:02:53,020+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.PrepareImageVDSCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] FINISH, PrepareImageVDSCommand, return: , log id: 54e36abc 2021-11-30 20:02:53,021+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-14) [b0a9a961-5ed3-4dcc-8a7d-0e89fccf1bc5] Failed to prepare image for transfer session: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001): org.ovirt.engine.core.common.errors.EngineException: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001) Version: vdsm-4.40.100.2-1.el8ev.x86_64 ovirt-engine-4.4.10.2-0.1.el8ev.noarch Verification flow: I Used both the full and the incremental backup flow mentioned in comments 15 / 12. Also tested this with only 1 vdsm active (I put the others into maintenance mode). Verification Conclusions: The expected output matched the actual output. I checked the logs and didn't find any issues regarding the "AddVolumeBitmapVDSCommand". Bug verified. This bugzilla is included in oVirt 4.4.10 release, published on January 18th 2022. Since the problem described in this bug report should be resolved in oVirt 4.4.10 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report. The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days |
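The verification above checks that the AddVolumeBitmapVDSCommand calls finish before image transfers start. A minimal log-scan sketch for that check, assuming the default engine.log location (adjust the path to the collected logs):

    import re

    # Assumed default location of the engine log; adjust as needed.
    LOG = "/var/log/ovirt-engine/engine.log"

    # Print bitmap-job completion lines and the "ready" marker so their
    # relative order can be inspected, as in the log excerpt above.
    pattern = re.compile(
        r"FINISH, AddVolumeBitmapVDSCommand|Ready to start image transfers")

    with open(LOG) as log:
        for line in log:
            if pattern.search(line):
                print(line.rstrip())

With the fix, the FINISH lines for a backup's bitmap jobs are expected to appear before its "Ready to start image transfers" line.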
Created attachment 1843966 [details]
Vdsm log showing several cold backup failures

Description of problem:

Cold backup fails randomly most of the time because engine reports the backup as ready before the add_bitmap jobs complete.

When running cold backup, engine adds a new bitmap to the top volume. Engine has this loop:

    for (DiskImage diskImage : vmBackup.getDisks()) {
        ...
        ActionReturnValue returnValue = runInternalActionWithTasksContext(
                ActionType.AddVolumeBitmap, parameters);

Once the internal action returns, engine switches the state to READY and the user can start the backup process. However, when the user starts the backup, the add_bitmap storage jobs are still running.

The add_bitmap job is doing:
1. Activate logical volume including parent logical volumes
2. List bitmaps in qcow2 image
3. Add bitmap to qcow2 image
4. Deactivate logical volume including parent logical volumes

Disk backup is doing:
1. Activate logical volume including parent logical volumes
2. For incremental backup, prepare overlay on top of original image
3. Merge bitmap from entire chain to overlay
4. Start qemu-nbd with logical volume (full backup) or overlay (incremental backup)
5. Delete the overlay if doing incremental backup

When an add_bitmap job runs at the same time as the backup, we see various failures:

## Failure 1: qemu-img fails to open the volume

The add_bitmap job deactivates the volume right after the volume was activated for preparing a backup.

/var/log/vdsm/vdsm.log:

2021-11-28 22:30:11,446+0200 INFO (jsonrpc/7) [vdsm.api] FINISH start_nbd_server error=Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '--backing-chain', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840'] failed with rc=1 out=b'' err=b"qemu-img: Could not open '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/9d63f782-7467-4243-af1e-5c1f8b49c111': Could not open '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/9d63f782-7467-4243-af1e-5c1f8b49c111': No such file or directory\n" from=::ffff:192.168.122.11,53010, flow_id=22b1fe4d-3498-4a63-adfc-c4ff99043585, task_id=6caa126e-1493-47fd-bba8-bf9d7ee5cb93 (api:52)

## Failure 2: qemu-nbd fails to accept connections

The add_bitmap job deactivates parent volumes after qemu-nbd was started, so the client fails to connect to qemu-nbd.
/var/log/ovirt-imageio/daemon.log: 2021-11-28 22:24:18,376 INFO (Thread-15) [images] [::ffff:192.168.122.1] OPTIONS ticket=4cebb41f-e0e2-40b8-8968-e4420ddb868 7 2021-11-28 22:24:18,376 ERROR (Thread-15) [http] Server error Traceback (most recent call last): File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/__init__.py", line 66, in get return ticket.get_context(req.connection_id) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 146, in get_context return self._connections[con_id] KeyError: 15 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 785, in __call__ self.dispatch(req, resp) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 830, in dispatch return method(req, resp, *match.groups()) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 84, in wrapper return func(self, req, resp, *args) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/images.py", line 246, in options ctx = backends.get(req, ticket, self.config) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/__init__.py", line 85, in get cafile=ca_file) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/nbd.py", line 42, in open client = nbd.open(url, dirty=dirty) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 321, in open return Client(address, name, dirty=dirty) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 408, in __init__ self._sock = self._connect(address) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 483, in _connect return self._create_unix_connection(address) File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 513, in _create_unix_connection sock.connect(address) ConnectionRefusedError: [Errno 111] Connection refused 2021-11-28 22:24:18,377 INFO (Thread-15) [http] CLOSE connection=15 client=::ffff:192.168.122.1 [connection 1 ops, 0.001430 s] [dispatch 1 ops, 0.000770 s] ## Failure 3: "qemu-img bitmap" fail to lock the image Because "qemu-img bitmap" run by add_bitmap storage job holds a write lock. 
2021-11-19 00:18:42,065+0200 ERROR (jsonrpc/3) [storage.dispatcher] FINISH start_nbd_server error=Command ['/usr/bin/qemu-img' , 'bitmap', '--merge', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', '-F', 'qcow2', '-b', '/rhev/data-center/mnt/blockSD/feab3738-c1 58-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', '/var/lib/vdsm/st orage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca 51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n' (dispatcher:87) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper result = ctask.prepare(func, *args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper return m(self, *a, **kw) File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1190, in prepare raise self.error File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run return fn(*args, **kargs) File "<decorator-gen-231>", line 2, in start_nbd_server File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method ret = func(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3757, in start_nbd_server url = nbd.start_server(server_id, config) File "/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py", line 130, in start_server path = _create_overlay(server_id, vol.volumePath, cfg.bitmap) File "/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py", line 207, in _create_overlay src_img, bitmap, "qcow2", overlay, bitmap).run() File "/usr/lib/python3.6/site-packages/vdsm/storage/operation.py", line 83, in run self._finalize(out, err) File "/usr/lib/python3.6/site-packages/vdsm/storage/operation.py", line 179, in _finalize raise cmdutils.Error(self._cmd, rc, out, err) vdsm.common.cmdutils.Error: Command ['/usr/bin/qemu-img', 'bitmap', '--merge', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', '-F', ' qcow2', '-b', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/ 44d498a1-54a5-4371-8eda-02d839d7c840', '/var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay', ' 0345dff0-cd6a-4a5f-bf3a-bf230318c5c0'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD /feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Fail ed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95 c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n' ## Failure 4: "qemu-img bitmap" fail to lock the image This is a variant of 3 - "qemu-img bitmap" run by add_bitmap storage job fails because "qemu-img bimtap" run by start_nbd_server code took the lock. In this case the current backup may succeed, but the next incremental backup will fail since bitmap was not added. The root cause is not waiting for the storage jobs to complete before reporting the backup as READY. 
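The intended ordering is spelled out in the next paragraph: start every add_bitmap job first, wait for all of them to complete successfully, and only then report the backup as READY. As a rough, language-neutral illustration (a Python sketch with hypothetical names; the actual fix lives in the Java engine code):

    from concurrent.futures import ThreadPoolExecutor, wait

    def add_bitmap(disk):
        """Placeholder for the AddVolumeBitmap storage job on one disk."""
        ...

    def prepare_backup(disks, report_ready):
        with ThreadPoolExecutor() as executor:
            jobs = [executor.submit(add_bitmap, disk) for disk in disks]
            # Block until every add_bitmap job has finished before exposing
            # the backup to clients.
            done, _ = wait(jobs)
            for job in done:
                job.result()  # re-raise failures instead of reporting READY
        report_ready()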
Version-Release number of selected component (if applicable):
4.5.0-0.0.master.20211118145311.el8

How reproducible:
Randomly, but mostly fails.

Steps to Reproduce:
1. Start full or incremental backup

Actual results:
Backup fails in various ways, logs showing that add_bitmap jobs were running when backup was started.

Expected results:
Backup succeeds, logs show that add_bitmap jobs completed before backup was started.

Additional info:

## Complete flow showing failure 3:

Starting incremental backup succeeded...

$ ./backup_vm.py -c engine-dev incremental --from-checkpoint-uuid 0345dff0-cd6a-4a5f-bf3a-bf230318c5c0 --backup-dir /data/scratch/nsoffer/backups 6e95d38f-d9b8-4955-878c-da6d631d0ab2
[ 0.0 ] Starting incremental backup for VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2'
[ 0.3 ] Waiting until backup '9b66cc4e-4814-4162-8200-aba58aa54b18' is ready
[ 1.3 ] Created checkpoint 'be87e5a5-5ed3-4aa5-915a-84f992a7bcf5' (to use in --from-checkpoint-uuid for the next incremental backup)
[ 1.4 ] Downloading incremental backup for disk '242c082f-ca51-4cf4-82b9-fd3a3c92f38b'

But the download was stuck, and finally timed out:

[ 62.6 ] Finalizing backup
Traceback (most recent call last):
  File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 525, in <module>
    main()
  File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 191, in main
    args.command(args)
  File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 225, in cmd_incremental
    download_backup(connection, backup.id, args, incremental=True)
  File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 405, in download_backup
    download_disk(
  File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/./backup_vm.py", line 420, in download_disk
    transfer = imagetransfer.create_transfer(
  File "/home/nsoffer/src/python-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 240, in create_transfer
    raise RuntimeError(
RuntimeError: Timed out waiting for transfer e97a0008-b57a-4931-8153-d87113da9431

In engine event log we see this error:

VDSM host4 command StartNbdServerVDS failed: Command ['/usr/bin/qemu-img', 'bitmap', '--merge', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', '-F', 'qcow2', '-b', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', '/var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n'

It looks like another process was accessing the image while we tried to prepare the disk for cold backup; when this happens, starting the NBD server fails.
Looking in vdsm log we see:

2021-11-19 00:18:41,805+0200 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'NBD.start_server' in bridge with {'server_id': 'fbf383f0-8de9-4e14-9a4c-2f9194ade81d', 'config': {'detect_zeroes': True, 'discard': False, 'readonly': True, 'bitmap': '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', 'sd_id': 'feab3738-c158-4d48-8a41-b5a95c057a50', 'img_id': '242c082f-ca51-4cf4-82b9-fd3a3c92f38b', 'vol_id': '44d498a1-54a5-4371-8eda-02d839d7c840', 'backing_chain': True}} (__init__:333)

We created an overlay image using a transient disk:

2021-11-19 00:18:41,941+0200 INFO (jsonrpc/3) [storage.transientdisk] Creating transient disk /var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay (transientdisk:61)

Then we create a qcow2 image in the overlay using volume 44d498a1-54a5-4371-8eda-02d839d7c840 as a backing file:

2021-11-19 00:18:41,941+0200 DEBUG (jsonrpc/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img create -f qcow2 -o compat=1.1 -b /rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840 -F qcow2 /var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay (cwd None) (commands:154)

But at the same time there is another thread accessing the same volume!

2021-11-19 00:18:41,952+0200 DEBUG (tasks/4) [storage.misc.exccmd] SUCCESS: <err> = b'1+0 records in\n1+0 records out\n512 bytes copied, 0.000743612 s, 689 kB/s\n'; <rc> = 0 (commands:247)
2021-11-19 00:18:41,952+0200 DEBUG (tasks/4) [storage.misc] err: [b'1+0 records in', b'1+0 records out', b'512 bytes copied, 0.000743612 s, 689 kB/s'], size: 512 (misc:112)
2021-11-19 00:18:41,952+0200 INFO (tasks/4) [storage.volumemanifest] Starting volume operation on 44d498a1-54a5-4371-8eda-02d839d7c840 (generation=2, set_illegal=False) (volume:691)

Thread tasks/4 is adding a bitmap to the volume used as the backing file of the overlay...

2021-11-19 00:18:41,953+0200 INFO (tasks/4) [storage.bitmaps] Add bitmap be87e5a5-5ed3-4aa5-915a-84f992a7bcf5 to '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840' (bitmaps:168)

Thread tasks/4 starts to add a bitmap to volume 44d498a1-54a5-4371-8eda-02d839d7c840:

2021-11-19 00:18:41,953+0200 DEBUG (tasks/4) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img bitmap --add --enable /rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840 be87e5a5-5ed3-4aa5-915a-84f992a7bcf5 (cwd /rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b) (commands:154)

Thread jsonrpc/3 finished creating the overlay image...

2021-11-19 00:18:42,002+0200 DEBUG (jsonrpc/3) [storage.operation] SUCCESS: <err> = b''; <rc> = 0 (operation:170)

Thread jsonrpc/3 adds a bitmap in the overlay...
2021-11-19 00:18:42,003+0200 DEBUG (jsonrpc/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img bitmap --add --enable /var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay 0345dff0-cd6a-4a5f-bf3a-bf230318c5c0 (cwd /var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d) (commands:154)
2021-11-19 00:18:42,035+0200 DEBUG (jsonrpc/3) [storage.operation] SUCCESS: <err> = b''; <rc> = 0 (operation:170)

Thread jsonrpc/3 tries to merge the checkpoint bitmap into the overlay...

2021-11-19 00:18:42,035+0200 DEBUG (jsonrpc/3) [common.commands] /usr/bin/taskset --cpu-list 0-1 /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img bitmap --merge 0345dff0-cd6a-4a5f-bf3a-bf230318c5c0 -F qcow2 -b /rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840 /var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay 0345dff0-cd6a-4a5f-bf3a-bf230318c5c0 (cwd /rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b) (commands:154)

And it fails, since someone else (thread tasks/4) holds a lock on the backing file:

2021-11-19 00:18:42,063+0200 DEBUG (jsonrpc/3) [storage.operation] FAILED: <err> = b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n'; <rc> = 1 (operation:170)

And tasks/4 finished:

2021-11-19 00:18:42,089+0200 DEBUG (tasks/4) [storage.operation] SUCCESS: <err> = b''; <rc> = 0 (operation:170)

We are modifying the image being backed up - a bad, bad bug.
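The 'Failed to get shared "write" lock' error is qemu's image locking working as intended: two processes cannot use the same qcow2 image concurrently unless they explicitly share it. The conflict can be reproduced outside of vdsm with a small sketch (assumes qemu-img and qemu-nbd are installed; the temporary image and bitmap name are made up for the demo):

import os
import signal
import subprocess
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    img = os.path.join(tmp, "test.qcow2")
    subprocess.run(["qemu-img", "create", "-f", "qcow2", img, "1G"], check=True)

    # Hold the image open read-write with qemu-nbd, which takes the write
    # lock - standing in for the concurrent qemu-img run by the add_bitmap job.
    sock = os.path.join(tmp, "nbd.sock")
    nbd = subprocess.Popen(["qemu-nbd", "--socket", sock, img])
    time.sleep(1)  # crude wait until qemu-nbd holds the lock

    # This fails with "Failed to get ... lock", like the merge in the log above.
    res = subprocess.run(
        ["qemu-img", "bitmap", "--add", img, "demo-bitmap"],
        capture_output=True, text=True)
    print(res.returncode)
    print(res.stderr.strip())

    nbd.send_signal(signal.SIGTERM)
    nbd.wait()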
Now the NBD.start_server request fails, as expected:

2021-11-19 00:18:42,065+0200 ERROR (jsonrpc/3) [storage.dispatcher] FINISH start_nbd_server error=Command ['/usr/bin/qemu-img', 'bitmap', '--merge', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', '-F', 'qcow2', '-b', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', '/var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n' (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1190, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-231>", line 2, in start_nbd_server
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3757, in start_nbd_server
    url = nbd.start_server(server_id, config)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py", line 130, in start_server
    path = _create_overlay(server_id, vol.volumePath, cfg.bitmap)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py", line 207, in _create_overlay
    src_img, bitmap, "qcow2", overlay, bitmap).run()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/operation.py", line 83, in run
    self._finalize(out, err)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/operation.py", line 179, in _finalize
    raise cmdutils.Error(self._cmd, rc, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/qemu-img', 'bitmap', '--merge', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', '-F', 'qcow2', '-b', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', '/var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n'

On engine side, we see the error:

2021-11-19 00:18:42,077+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.StartNbdServerVDSCommand] (default task-1) [ea5f41c1-2fc3-4414-b85b-36c6da71d6af] FINISH, StartNbdServerVDSCommand, return: , log id: 78f11471
2021-11-19 00:18:42,078+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-1) [ea5f41c1-2fc3-4414-b85b-36c6da71d6af] Failed to prepare image for transfer session: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to StartNbdServerVDS, error = Command ['/usr/bin/qemu-img', 'bitmap', '--merge', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', '-F', 'qcow2', '-b', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', '/var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n', code = 100 (Failed with error GeneralException and code 100): org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to StartNbdServerVDS, error = Command ['/usr/bin/qemu-img', 'bitmap', '--merge', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0', '-F', 'qcow2', '-b', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', '/var/lib/vdsm/storage/transient_disks/fbf383f0-8de9-4e14-9a4c-2f9194ade81d/overlay', '0345dff0-cd6a-4a5f-bf3a-bf230318c5c0'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840]?\n', code = 100 (Failed with error GeneralException and code 100)

But the backup was started successfully! Following the events for this backup in engine log:

2021-11-19 00:18:39,868+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-1) [17244d70-d10c-403f-999d-0ce6cf969cdf] Change VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' backup '9b66cc4e-4814-4162-8200-aba58aa54b18' phase from 'INITIALIZING' to 'STARTING'
2021-11-19 00:18:40,585+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.AddVolumeBitmapCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-17) [17244d70-d10c-403f-999d-0ce6cf969cdf] Running command: AddVolumeBitmapCommand internal: true.
2021-11-19 00:18:40,630+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddVolumeBitmapVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-17) [17244d70-d10c-403f-999d-0ce6cf969cdf] FINISH, AddVolumeBitmapVDSCommand, return: , log id: 73d3acaa2

Creating the first bitmap finished in 0.045 seconds.

2021-11-19 00:18:40,647+02 INFO [org.ovirt.engine.core.bll.storage.disk.image.AddVolumeBitmapCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-17) [17244d70-d10c-403f-999d-0ce6cf969cdf] Running command: AddVolumeBitmapCommand internal: true.
2021-11-19 00:18:40,800+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddVolumeBitmapVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-17) [17244d70-d10c-403f-999d-0ce6cf969cdf] START, AddVolumeBitmapVDSCommand(HostName = host4, VolumeBitmapVDSCommandParameters:{hostId='85a424f7-7cd6-4a89-993c-539d6c10fd0c', storageDomainId='feab3738-c158-4d48-8a41-b5a95c057a50', jobId='b2875d2b-ea1c-4cc7-bafb-8ebfd5d81cbc'}), log id: 135c503a

Creating the second bitmap finished in 0.153 seconds.

2021-11-19 00:18:40,820+02 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-17) [17244d70-d10c-403f-999d-0ce6cf969cdf] Change VM '6e95d38f-d9b8-4955-878c-da6d631d0ab2' backup '9b66cc4e-4814-4162-8200-aba58aa54b18' phase from 'STARTING' to 'READY'

In vdsm we see:

2021-11-19 00:18:40,623+0200 INFO (jsonrpc/7) [vdsm.api] START sdm_add_bitmap(job_id='cce515dc-3d31-4178-a2ae-66c5e5eea641', vol_info={'generation': 2, 'sd_id': 'feab3738-c158-4d48-8a41-b5a95c057a50', 'img_id': '242c082f-ca51-4cf4-82b9-fd3a3c92f38b', 'vol_id': '44d498a1-54a5-4371-8eda-02d839d7c840'}, bitmap='be87e5a5-5ed3-4aa5-915a-84f992a7bcf5') from=::ffff:192.168.122.11,45386, flow_id=17244d70-d10c-403f-999d-0ce6cf969cdf, task_id=4c8986ec-fa86-4b7b-8f92-ffc9f3dc0742 (api:48)
2021-11-19 00:18:40,624+0200 DEBUG (jsonrpc/7) [storage.taskmanager] scheduled job add_bitmap for task 4c8986ec-fa86-4b7b-8f92-ffc9f3dc0742 (taskManager:79)
2021-11-19 00:18:40,624+0200 INFO (jsonrpc/7) [vdsm.api] FINISH sdm_add_bitmap return=None from=::ffff:192.168.122.11,45386, flow_id=17244d70-d10c-403f-999d-0ce6cf969cdf, task_id=4c8986ec-fa86-4b7b-8f92-ffc9f3dc0742 (api:54)

The command finished in 1 millisecond - because it did not actually run yet; like all storage jobs, it is scheduled to run later.

2021-11-19 00:18:40,624+0200 INFO (tasks/4) [storage.threadpool.workerthread] START task 4c8986ec-fa86-4b7b-8f92-ffc9f3dc0742 (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f8e47f18048>>, args=None) (threadPool:146)

Yes, this is the tasks/4 we met before. It finished in 1.8 seconds:

2021-11-19 00:18:42,445+0200 INFO (tasks/4) [storage.threadpool.workerthread] FINISH task 4c8986ec-fa86-4b7b-8f92-ffc9f3dc0742 (threadPool:148)


## Complete flow showing failure 1:

1. Job b2bb3f7c-702f-481b-bbbf-a92cb73e29b5 starting

2021-11-28 22:30:10,073+0200 INFO (jsonrpc/0) [vdsm.api] START sdm_add_bitmap(job_id='b2bb3f7c-702f-481b-bbbf-a92cb73e29b5', vol_info={'generation': 7, 'sd_id': 'feab3738-c158-4d48-8a41-b5a95c057a50', 'img_id': '242c082f-ca51-4cf4-82b9-fd3a3c92f38b', 'vol_id': '44d498a1-54a5-4371-8eda-02d839d7c840'}, bitmap='a2993507-d5eb-4191-b769-a5cd943e795b') from=::ffff:192.168.122.11,53010, flow_id=96fec517-538f-46d2-9a88-024240de4824, task_id=9542a71b-8eba-473e-bd0a-92332af9b99b (api:48)

2. Job b2bb3f7c-702f-481b-bbbf-a92cb73e29b5 scheduled task 9542a71b-8eba-473e-bd0a-92332af9b99b

2021-11-28 22:30:10,073+0200 INFO (jsonrpc/0) [vdsm.api] FINISH sdm_add_bitmap return=None from=::ffff:192.168.122.11,53010, flow_id=96fec517-538f-46d2-9a88-024240de4824, task_id=9542a71b-8eba-473e-bd0a-92332af9b99b (api:54)

3. Task 9542a71b-8eba-473e-bd0a-92332af9b99b starting on thread tasks/2

2021-11-28 22:30:10,105+0200 INFO (tasks/2) [storage.threadpool.workerthread] START task 9542a71b-8eba-473e-bd0a-92332af9b99b (cmd=<bound method Task.commit of <vdsm.storage.task.Task object at 0x7f782197cac8>>, args=None) (threadPool:146)
4. Engine prepares image for image transfer

2021-11-28 22:30:10,996+0200 INFO (jsonrpc/4) [vdsm.api] START prepareImage(sdUUID='feab3738-c158-4d48-8a41-b5a95c057a50', spUUID='3f8e08e6-5485-429e-a686-63754906b27b', imgUUID='242c082f-ca51-4cf4-82b9-fd3a3c92f38b', leafUUID='44d498a1-54a5-4371-8eda-02d839d7c840', allowIllegal=True) from=::ffff:192.168.122.11,53010, flow_id=22b1fe4d-3498-4a63-adfc-c4ff99043585, task_id=be70f82f-cab3-49cd-b4ef-84658b2a3fce (api:48)

5. LV was already active (activated by thread tasks/2), so it is refreshed

2021-11-28 22:30:11,103+0200 INFO (jsonrpc/4) [storage.lvm] Refreshing active lvs: vg=feab3738-c158-4d48-8a41-b5a95c057a50 lvs=['9d63f782-7467-4243-af1e-5c1f8b49c111'] (lvm:1678)

6. Thread tasks/2 deactivates the LV

2021-11-28 22:30:11,137+0200 INFO (tasks/2) [storage.lvm] Deactivating lvs: vg=feab3738-c158-4d48-8a41-b5a95c057a50 lvs=['9d63f782-7467-4243-af1e-5c1f8b49c111'] (lvm:1693)

7. Prepare image finished, but the LV was deactivated by tasks/2

2021-11-28 22:30:11,413+0200 INFO (jsonrpc/4) [vdsm.api] FINISH prepareImage return={'path': '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', 'info': {'type': 'block', 'path': '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840'}, 'imgVolumesInfo': [{'domainID': 'feab3738-c158-4d48-8a41-b5a95c057a50', 'imageID': '242c082f-ca51-4cf4-82b9-fd3a3c92f38b', 'volumeID': '44d498a1-54a5-4371-8eda-02d839d7c840', 'path': '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840', 'leasePath': '/dev/feab3738-c158-4d48-8a41-b5a95c057a50/leases', 'leaseOffset': 110100480}, {'domainID': 'feab3738-c158-4d48-8a41-b5a95c057a50', 'imageID': '242c082f-ca51-4cf4-82b9-fd3a3c92f38b', 'volumeID': '9d63f782-7467-4243-af1e-5c1f8b49c111', 'path': '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/9d63f782-7467-4243-af1e-5c1f8b49c111', 'leasePath': '/dev/feab3738-c158-4d48-8a41-b5a95c057a50/leases', 'leaseOffset': 109051904}]} from=::ffff:192.168.122.11,53010, flow_id=22b1fe4d-3498-4a63-adfc-c4ff99043585, task_id=be70f82f-cab3-49cd-b4ef-84658b2a3fce (api:54)

8. Task 9542a71b-8eba-473e-bd0a-92332af9b99b finished

2021-11-28 22:30:11,328+0200 INFO (tasks/2) [storage.threadpool.workerthread] FINISH task 9542a71b-8eba-473e-bd0a-92332af9b99b (threadPool:148)

9. Engine starts the NBD server

2021-11-28 22:30:11,416+0200 INFO (jsonrpc/7) [vdsm.api] START start_nbd_server(server_id='e1e63d2b-1253-4fbb-9b48-5637c7664d15', config={'detect_zeroes': True, 'discard': False, 'readonly': True, 'bitmap': 'd333d467-da40-4169-9963-c490fc7620c0', 'sd_id': 'feab3738-c158-4d48-8a41-b5a95c057a50', 'img_id': '242c082f-ca51-4cf4-82b9-fd3a3c92f38b', 'vol_id': '44d498a1-54a5-4371-8eda-02d839d7c840', 'backing_chain': True}) from=::ffff:192.168.122.11,53010, flow_id=22b1fe4d-3498-4a63-adfc-c4ff99043585, task_id=6caa126e-1493-47fd-bba8-bf9d7ee5cb93 (api:48)

10. Starting the NBD server failed, since the LV was deactivated by tasks/2:
2021-11-28 22:30:11,446+0200 INFO (jsonrpc/7) [vdsm.api] FINISH start_nbd_server error=Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '--backing-chain', '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/44d498a1-54a5-4371-8eda-02d839d7c840'] failed with rc=1 out=b'' err=b"qemu-img: Could not open '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/9d63f782-7467-4243-af1e-5c1f8b49c111': Could not open '/rhev/data-center/mnt/blockSD/feab3738-c158-4d48-8a41-b5a95c057a50/images/242c082f-ca51-4cf4-82b9-fd3a3c92f38b/9d63f782-7467-4243-af1e-5c1f8b49c111': No such file or directory\n" from=::ffff:192.168.122.11,53010, flow_id=22b1fe4d-3498-4a63-adfc-c4ff99043585, task_id=6caa126e-1493-47fd-bba8-bf9d7ee5cb93 (api:52)
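When debugging these failures it helps to check which persistent bitmaps a qcow2 volume actually contains - failure 4, for example, leaves the checkpoint bitmap missing, which only shows up at the next incremental backup. A small sketch using qemu-img; it assumes the volume path is accessible (on block storage the LV must be active), the image is not in use, and the JSON layout printed by recent qemu versions:

import json
import subprocess

def qcow2_bitmaps(path):
    # List the names of the persistent bitmaps recorded in a qcow2 image.
    out = subprocess.run(
        ["qemu-img", "info", "--output", "json", path],
        check=True, capture_output=True).stdout
    info = json.loads(out)
    return [b["name"]
            for b in info.get("format-specific", {})
                         .get("data", {})
                         .get("bitmaps", [])]

# Example (hypothetical path):
# print(qcow2_bitmaps("/path/to/volume"))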