Created attachment 1766039 [details]
host logs

Description of problem:
I can't back up a VM with two disks on oVirt 4.4.5.

Version-Release number of selected component (if applicable):
oVirt 4.4.5.10
vdsm-4.40.50.8-1.el8.x86_64
ovirt-imageio-daemon-2.1.1-1.el8.x86_64
libvirt-daemon-6.6.0-13.el8.x86_64
qemu-kvm-5.1.0-20.el8.x86_64

How reproducible:
VM: Ubuntu 20.04, two disks on virtio-scsi with incremental backup enabled.

Steps to Reproduce:
1. Start the VM backup.
2. Try to create an image transfer.

Actual results:
The image transfer failed with:
'proxy_url' child was not found in image_transfer element
'transfer_url' child was not found in image_transfer element

Expected results:
Image transfer successfully created.

Additional info:
The main issue in the VDSM log:

2021-03-24 18:35:17,780+0100 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='75f2f936-0f81-48d8-bf7c-cf6083ec8f13') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-147>", line 2, in add_image_ticket
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3085, in add_image_ticket
    imagetickets.add_ticket(ticket)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 66, in wrapper
    return func(*args, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 74, in add_ticket
    _request("PUT", ticket["uuid"], body)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 119, in _request
    raise se.ImageDaemonError(res.status, res.reason, error)
2021-03-24 18:35:17,781+0100 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.add_image_ticket failed (error 482) in 0.00 seconds (__init__:312)
(In reply to Yury.Panchenko from comment #0)

> How reproducible:
> VM – ubuntu 20.04 two disks on virtio-scsi with enabled incremental backup

Is this reproducible? How many times did it fail, out of how many attempts?

> Steps to Reproduce:
> 1.Start the vm backup

Was the backup response successful? Can you share the engine response?

> 2.Try to create image transfer
> Image transfer failed with
> 'proxy_url' child was not found in image_transfer element
> 'transfer_url' child was not found in image_transfer element

This means the transfer failed, but you should not fail with this error if you check the transfer phase. You can access proxy_url and transfer_url only when the transfer is in the TRANSFERRING phase.

In the vdsm log we see:

2021-03-24 15:01:02,407+0100 INFO (jsonrpc/0) [vdsm.api] START add_image_ticket(ticket={'dirty': False, 'ops': ['read'], 'size': 68719476736, 'sparse': True, 'transfer_id': '4cb2d4db-2691-4bee-8d4c-1592c0316703', 'uuid': '77921d4c-e592-41be-8bcd-02622d4b12b4', 'timeout': 300, 'url': None}) from=::ffff:172.25.16.44,42988, flow_id=fe63a36f-4325-46eb-b8bb-e7b1df7ccf07, task_id=f5eb1f3b-fd4c-4a9e-9cb4-b18ca3acf18d (api:48)

This is not a valid ticket, since "url" is None, but vdsm only routes the message to imageio...
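The check described here — reading transfer_url and proxy_url only once the transfer reports the TRANSFERRING phase — can be sketched against the raw REST XML responses quoted in this report. This is an illustration, not the reporter's client code; the phase strings are the lowercase values seen in the engine responses in this report:

```python
import xml.etree.ElementTree as ET

def transfer_urls(image_transfer_xml):
    """Return (transfer_url, proxy_url) from an <image_transfer> response,
    or None if the transfer is not in the 'transferring' phase yet."""
    root = ET.fromstring(image_transfer_xml)
    if root.findtext("phase") != "transferring":
        # Not necessarily an error: poll again, or give up if the phase
        # is terminal (e.g. 'cancelled_system' as seen in this report).
        return None
    return root.findtext("transfer_url"), root.findtext("proxy_url")
```

With this shape of check, a transfer that never reaches TRANSFERRING produces a clear "not ready / cancelled" outcome instead of a misleading "'transfer_url' child was not found" error.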
2021-03-24 15:01:02,410+0100 INFO (jsonrpc/0) [vdsm.api] FINISH add_image_ticket error=Image daemon request failed: "status=400, reason=Bad Request, error=Invalid ticket: Invalid value for 'url': None: expecting a <class 'str'> value\n" from=::ffff:172.25.16.44,42988, flow_id=fe63a36f-4325-46eb-b8bb-e7b1df7ccf07, task_id=f5eb1f3b-fd4c-4a9e-9cb4-b18ca3acf18d (api:52)
2021-03-24 15:01:02,411+0100 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='f5eb1f3b-fd4c-4a9e-9cb4-b18ca3acf18d') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-147>", line 2, in add_image_ticket
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3085, in add_image_ticket
    imagetickets.add_ticket(ticket)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 66, in wrapper
    return func(*args, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 74, in add_ticket
    _request("PUT", ticket["uuid"], body)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 119, in _request
    raise se.ImageDaemonError(res.status, res.reason, error)
vdsm.storage.exception.ImageDaemonError: Image daemon request failed: "status=400, reason=Bad Request, error=Invalid ticket: Invalid value for 'url': None: expecting a <class 'str'> value\n"

This fails as expected, since there is no way to serve this request. But the root cause may be earlier in the log; for example, maybe libvirt failed to start the backup and we returned a null disk URL.
In the engine log we see:

2021-03-24 15:01:02,401+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-25) [fe63a36f-4325-46eb-b8bb-e7b1df7ccf07] START, AddImageTicketVDSCommand(HostName = PAN-KVMC, AddImageTicketVDSCommandParameters:{hostId='6c9c9438-17e1-4aa3-83a9-75c793e39f66', ticketId='77921d4c-e592-41be-8bcd-02622d4b12b4', timeout='300', operations='[read]', size='68719476736', url='null', filename='null', sparse='true', transferId='4cb2d4db-2691-4bee-8d4c-1592c0316703', dirty='false'}), log id: 6d52b223

But this is too late; we need to find the log for starting this backup. For that we need the backup id or the disk id for this backup, and they are not logged in this message (should be fixed in engine).

I see this log a few seconds before:

2021-03-24 15:00:48,991+01 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-22) [] Operation Failed: [Cannot backup VM: The following disks are locked: ubunturef_Disk2. Please try again in a few minutes.]

But we don't have the backup id, vm id, or disk id, so I cannot tell if this backup is related to the image transfer with the missing url.

Yuri, please attach a complete vdsm log showing the time the backup request was sent to engine, and:
- the backup id (engine reports it in the response to the backup request)
- the disk id (part of the backup response, or set by your application)

Please include this info in any backup-related bug. This info should be logged by the backup application for any failure so we can track the right backup in our logs.
Hello Nir.

> Is this reproducible? how many times it failed? how many attempts?

Yes. It reproduces very often on two different 4.4.5 setups.

> But this is too late, we need to find the log for starting this backup

That log contains many backup attempts. This backup started at 18:35.

> This means the transfer failed, but you should not fail with
> this error if you check the transfer phase. You can access proxy_url
> and transfer_url only when the transfer is in phase TRANSFERRING.

We will check that.

> Yuri, please attach a complete vdsm log showing the time the backup
> request was sent to engine

I will attach the response log.
Created attachment 1766269 [details]
responses
Small correction: the time in the response log is shifted by one hour. 17:35 in the response log corresponds to 18:35 in the other logs.
Based on attachment 1766269 [details]:

You got a backup *without* a phase:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<backup href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505/backups/2c7aa425-c73f-4692-8276-c25748049a1a" id="2c7aa425-c73f-4692-8276-c25748049a1a">
    <actions>
        <link href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505/backups/2c7aa425-c73f-4692-8276-c25748049a1a/finalize" rel="finalize"/>
    </actions>
    <link href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505/backups/2c7aa425-c73f-4692-8276-c25748049a1a/disks" rel="disks"/>
    <creation_date>2021-03-24T18:35:04.298+01:00</creation_date>
    <host href="/ovirt-engine/api/hosts/6c9c9438-17e1-4aa3-83a9-75c793e39f66" id="6c9c9438-17e1-4aa3-83a9-75c793e39f66"/>
    <vm href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505" id="06088ba6-0f5b-49af-bf94-7e0b3351f505"/>
</backup>

This is an engine bug; engine must report the backup phase.

Then we see this log:

[2021-03-24] [17:35:04.452] [14784] [Warn ] [RHEVClient] No 'phase' child in Backup XML element, assuming it 'ready'

This is not a valid assumption. You cannot continue to use this backup until the backup reports that it is ready.
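The safe client-side interpretation of such a response can be sketched with the standard library; this is an illustration of the rule stated here, not the reporter's code:

```python
import xml.etree.ElementTree as ET

def backup_is_ready(backup_xml):
    """True only when engine reports <phase>ready</phase>.

    A missing <phase> element, as in the response shown in this comment,
    must be treated as "not ready yet" -- never assumed to be ready.
    """
    return ET.fromstring(backup_xml).findtext("phase") == "ready"
```

With this check, the client polls the backup again instead of proceeding to create a transfer that cannot work.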
Then you try to create a transfer:

[2021-03-24] [17:35:17.813] [14784] [Info ] [RestCli] <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<image_transfer href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609" id="964fb58d-827b-4896-8a68-d7dd41e46609">
    <actions>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/resume" rel="resume"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/cancel" rel="cancel"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/finalize" rel="finalize"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/pause" rel="pause"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/extend" rel="extend"/>
    </actions>
    <active>false</active>
    <direction>download</direction>
    <format>raw</format>
    <inactivity_timeout>600</inactivity_timeout>
    <phase>cancelled_system</phase>
    <shallow>false</shallow>
    <timeout_policy>legacy</timeout_policy>
    <transferred>0</transferred>
    <image id="9e0fc750-4df3-471a-b84d-cbc76906ac6e"/>
</image_transfer>

Note the phase: "cancelled_system" - the transfer was canceled by the system.

These warnings are not relevant, since the transfer is not in the TRANSFERRING phase:

[2021-03-24] [17:35:17.813] [14784] [Warn ] [RHEVClient] 'proxy_url' child was not found in image_transfer element
[2021-03-24] [17:35:17.813] [14784] [Error] [RHEVClient] 'transfer_url' child was not found in image_transfer element

I think we have:
- wrong usage: not waiting for the backup to become ready (application bug)
- engine reports a backup without a phase (engine bug)
- creating a transfer for a backup succeeds when the backup is not ready; this should fail early instead of creating a transfer that can never work (possible engine bug)

Yuri, we still need the vdsm log to understand this failure; see my request in comment 1.
Created attachment 1766349 [details]
VDSM logs archive
> engine reports backup without phase (engine bug)

Do you have any ETA for when that fix will be released?
(In reply to Yury.Panchenko from comment #7)
> > engine reports backup without phase (engine bug)
> Do you have any ETA when that fix will be released?

Eyal found the issue; the fix will be in 4.4.6.

The problem is the phase for creating scratch disks, added to support scratch disks on shared storage. Due to an error in the mapping code, this phase is dropped from the REST entity.

But this does not block you in any way: when the backup does not report a phase, you need to poll again until the backup reports the "ready" phase.
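The polling described here can be sketched as follows. This is a hedged illustration: `get_backup_xml` is a hypothetical callable standing in for an authenticated GET on the backup href, the "failed" terminal phase name is an assumption, and the timeout values are arbitrary:

```python
import time
import xml.etree.ElementTree as ET

def wait_for_backup_ready(get_backup_xml, timeout=300, poll_interval=1):
    """Poll the backup until engine reports the 'ready' phase.

    A response without a <phase> element (the 4.4.5 behavior where the
    phase is dropped from the REST entity during scratch-disk creation)
    simply counts as "still initializing", so the loop keeps polling.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = ET.fromstring(get_backup_xml()).findtext("phase")
        if phase == "ready":
            return
        if phase == "failed":  # assumed terminal phase name
            raise RuntimeError("backup failed before becoming ready")
        time.sleep(poll_interval)
    raise TimeoutError("backup did not reach 'ready' within %ss" % timeout)
```

Only after this function returns should the application create image transfers for the backup's disks.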
Looking at the response log (attachment 1766269 [details]):

[2021-03-24] [17:35:04.181] [14784] [Info ] [RestCli] Request data (POST https://kvmc-man.robofish.local/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505/backups):
[2021-03-24] [17:35:04.181] [14784] [Info ] [RestCli] <backup><disks><disk id="f6e2057a-f77b-44a0-8ed7-d074dad8b39e" /><disk id="9e0fc750-4df3-471a-b84d-cbc76906ac6e" /></disks></backup>
[2021-03-24] [17:35:04.374] [14784] [Info ] [RestCli] <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<backup href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505/backups/2c7aa425-c73f-4692-8276-c25748049a1a" id="2c7aa425-c73f-4692-8276-c25748049a1a">
    <actions>
        <link href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505/backups/2c7aa425-c73f-4692-8276-c25748049a1a/finalize" rel="finalize"/>
    </actions>
    <link href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505/backups/2c7aa425-c73f-4692-8276-c25748049a1a/disks" rel="disks"/>
    <creation_date>2021-03-24T18:35:04.298+01:00</creation_date>
    <host href="/ovirt-engine/api/hosts/6c9c9438-17e1-4aa3-83a9-75c793e39f66" id="6c9c9438-17e1-4aa3-83a9-75c793e39f66"/>
    <vm href="/ovirt-engine/api/vms/06088ba6-0f5b-49af-bf94-7e0b3351f505" id="06088ba6-0f5b-49af-bf94-7e0b3351f505"/>
</backup>

Image transfer for the first disk created here:

[2021-03-24] [17:35:17.634] [14784] [Info ] [RestCli] Request data (POST https://kvmc-man.robofish.local/ovirt-engine/api/imagetransfers):
[2021-03-24] [17:35:17.634] [14784] [Info ] [RestCli] <image_transfer><direction>download</direction><format>raw</format><inactivity_timeout>600</inactivity_timeout><backup id="2c7aa425-c73f-4692-8276-c25748049a1a"/><image id="9e0fc750-4df3-471a-b84d-cbc76906ac6e"/></image_transfer>
[2021-03-24] [17:35:17.813] [14784] [Info ] [RestCli] <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<image_transfer href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609" id="964fb58d-827b-4896-8a68-d7dd41e46609">
    <actions>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/resume" rel="resume"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/cancel" rel="cancel"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/finalize" rel="finalize"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/pause" rel="pause"/>
        <link href="/ovirt-engine/api/imagetransfers/964fb58d-827b-4896-8a68-d7dd41e46609/extend" rel="extend"/>
    </actions>
    <active>false</active>
    <direction>download</direction>
    <format>raw</format>
    <inactivity_timeout>600</inactivity_timeout>
    <phase>cancelled_system</phase>
    <shallow>false</shallow>
    <timeout_policy>legacy</timeout_policy>
    <transferred>0</transferred>
    <image id="9e0fc750-4df3-471a-b84d-cbc76906ac6e"/>
</image_transfer>

So these are the relevant ids:

disk id:     f6e2057a-f77b-44a0-8ed7-d074dad8b39e
disk id:     9e0fc750-4df3-471a-b84d-cbc76906ac6e
backup id:   2c7aa425-c73f-4692-8276-c25748049a1a
transfer id: 964fb58d-827b-4896-8a68-d7dd41e46609

Looking at the engine log:

Backup created here:

2021-03-24 18:35:04,306+01 INFO [org.ovirt.engine.core.bll.storage.backup.StartVmBackupCommand] (default task-31) [aed2aff0-a4cc-4bb7-86af-0217935d4b45] Created VmBackup entity '2c7aa425-c73f-4692-8276-c25748049a1a'

Engine creating scratch disks:

2021-03-24 18:35:05,100+01 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [aed2aff0-a4cc-4bb7-86af-0217935d4b45] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='deba1372-88b6-11eb-94e2-00163e29bd5a', ignoreFailoverLimit='false', storageDomainId='8e0fcaca-dab4-4698-9d69-74a1a2a1b257', imageGroupId='d225813a-2157-4aa7-b7fe-35292edeecbe', imageSizeInBytes='2147483648', volumeFormat='COW', newImageId='14a1eeec-1f85-4bdb-962d-39c352fb7b90', imageType='Sparse', newImageDescription='{"DiskAlias":"VM ubunturef backup 2c7aa425-c73f-4692-8276-c25748049a1a scratch disk for ubunturef_Disk2","DiskDescription":"Backup 2c7aa425-c73f-4692-8276-c25748049a1a scratch disk"}', imageInitialSizeInBytes='0', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000', shouldAddBitmaps='false'}), log id: 1d0d7a05

2021-03-24 18:35:05,293+01 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [aed2aff0-a4cc-4bb7-86af-0217935d4b45] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='deba1372-88b6-11eb-94e2-00163e29bd5a', ignoreFailoverLimit='false', storageDomainId='8e0fcaca-dab4-4698-9d69-74a1a2a1b257', imageGroupId='346f5f0e-a0dd-4af4-bb84-023246a7e155', imageSizeInBytes='34359738368', volumeFormat='COW', newImageId='0995cbf4-8113-4670-a3f4-7b398bf0daf6', imageType='Sparse', newImageDescription='{"DiskAlias":"VM ubunturef backup 2c7aa425-c73f-4692-8276-c25748049a1a scratch disk for ubunturef_Disk1","DiskDescription":"Backup 2c7aa425-c73f-4692-8276-c25748049a1a scratch disk"}', imageInitialSizeInBytes='0', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000', shouldAddBitmaps='false'}), log id: 314d9a9b

Scratch disks added:

2021-03-24 18:35:13,701+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [] EVENT_ID: USER_ADD_DISK_FINISHED_SUCCESS(2,021), The disk 'VM ubunturef backup 2c7aa425-c73f-4692-8276-c25748049a1a scratch disk for ubunturef_Disk1' was successfully added.

2021-03-24 18:35:15,865+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-76) [] EVENT_ID: USER_ADD_DISK_FINISHED_SUCCESS(2,021), The disk 'VM ubunturef backup 2c7aa425-c73f-4692-8276-c25748049a1a scratch disk for ubunturef_Disk2' was successfully added.

2021-03-24 18:35:16,886+01 INFO [org.ovirt.engine.core.bll.storage.backup.CreateScratchDisksCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [aed2aff0-a4cc-4bb7-86af-0217935d4b45] All scratch disks created for VM 'ubunturef' backup '2c7aa425-c73f-4692-8276-c25748049a1a'

Backup started on the host:

2021-03-24 18:35:20,030+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.StartVmBackupVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-84) [aed2aff0-a4cc-4bb7-86af-0217935d4b45] START, StartVmBackupVDSCommand(HostName = PAN-KVMC, VmBackupVDSParameters:{hostId='6c9c9438-17e1-4aa3-83a9-75c793e39f66', backupId='2c7aa425-c73f-4692-8276-c25748049a1a', requireConsistency='false'}), log id: 50792b0

2021-03-24 18:35:20,152+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.StartVmBackupVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-84) [aed2aff0-a4cc-4bb7-86af-0217935d4b45] FINISH, StartVmBackupVDSCommand, return: VmBackupInfo:{status='Status [code=0, message=Done]'}

At this point the backup should report the ready phase, and the application may create the transfer. But in this case the transfer started at:

2021-03-24 18:35:17,749+01 INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (default task-29) [9118695e-a0d5-4512-afab-f6b2ff1d354e] Creating ImageTransfer entity for command '964fb58d-827b-4896-8a68-d7dd41e46609', proxyEnabled: true

This is 3 seconds *before* the backup was started, so we don't have the backup url at this point.
The root cause of the failure is incorrect API usage: not waiting for the ready phase. But it exposes issues in engine, so we can use this bug for the engine issues:
- not reporting the backup phase during initialization
- allowing creation of a transfer for a backup which is not ready

If creating the transfer failed with 409 "Backup is not ready", the issue would be easier to debug.
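For reference, the correct ordering can be sketched end to end. This is a hedged illustration, not the reporter's actual client: `post` and `get` are hypothetical callables performing authenticated REST requests and returning XML bodies, and the phase names are the lowercase strings seen in the responses in this report:

```python
import time
import xml.etree.ElementTree as ET

def download_url_for_disk(post, get, vm_id, disk_id, timeout=300, poll_interval=1):
    """Start a backup, wait for 'ready', then create the image transfer
    and wait for 'transferring' before reading transfer_url."""

    def wait_for_phase(fetch, wanted):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            root = ET.fromstring(fetch())
            if root.findtext("phase") == wanted:
                return root
            time.sleep(poll_interval)
        raise TimeoutError("phase %r not reached" % wanted)

    # 1. Start the backup and remember its href/id from the response.
    backup = ET.fromstring(post(
        "/vms/%s/backups" % vm_id,
        '<backup><disks><disk id="%s"/></disks></backup>' % disk_id))

    # 2. Poll until engine reports the backup is ready.
    wait_for_phase(lambda: get(backup.get("href")), "ready")

    # 3. Only now create the transfer for this backup and disk.
    transfer = ET.fromstring(post(
        "/imagetransfers",
        '<image_transfer><direction>download</direction>'
        '<backup id="%s"/><image id="%s"/></image_transfer>'
        % (backup.get("id"), disk_id)))

    # 4. transfer_url is only present once the phase is 'transferring'.
    transfer = wait_for_phase(lambda: get(transfer.get("href")), "transferring")
    return transfer.findtext("transfer_url")
```

The key point is step 2: creating the transfer three seconds before the backup starts, as happened in this log, is exactly what this ordering prevents.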
According to comment #9: "The root cause for the failure is incorrect API usage - not waiting for ready phase. But it exposes issues in engine so we can use this bug for the engine issues."

What should I test on the engine side? That the response to the backup request includes the 'phase' status right away?
(In reply to Ilan Zuckerman from comment #10)
> According comment #9 "The root cause for the failure is incorrect API usage
> - not waiting for ready phase. But it exposes issues in engine so we can use
> this bug for the engine issues."
>
> What should I test on engine side? That the response of backup request
> includes the 'phase' status right away?

The backup entity should have the 'phase' status for each phase of the backup. Before the fix, when the backup command was in CREATING_SCRATCH_DISKS or PREPARING_SCRATCH_DISK, there was no phase attribute in the backup entity. Now, when the command is in those phases, the user will see the INITIALIZING phase when querying the backup entity in the API.
Verified on rhv-4.4.6-4.

After invoking a backup for a particular disk, we invoke a GET with the backup ID from the previous POST request, and we immediately receive a response with a 'phase' element:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<backup href="/ovirt-engine/api/vms/3357c55f-7d0f-41d2-bb7b-5150a6ebd18d/backups/97c61be8-2a40-41d2-a3b1-cfb97aef203e" id="97c61be8-2a40-41d2-a3b1-cfb97aef203e">
    <actions>
        <link href="/ovirt-engine/api/vms/3357c55f-7d0f-41d2-bb7b-5150a6ebd18d/backups/97c61be8-2a40-41d2-a3b1-cfb97aef203e/finalize" rel="finalize"/>
    </actions>
    <link href="/ovirt-engine/api/vms/3357c55f-7d0f-41d2-bb7b-5150a6ebd18d/backups/97c61be8-2a40-41d2-a3b1-cfb97aef203e/disks" rel="disks"/>
    <creation_date>2021-04-13T12:28:53.095+03:00</creation_date>
    <phase>initializing</phase>
    <host href="/ovirt-engine/api/hosts/de275f49-0fe1-4235-874f-c0475fb8fae2" id="de275f49-0fe1-4235-874f-c0475fb8fae2"/>
    <vm href="/ovirt-engine/api/vms/3357c55f-7d0f-41d2-bb7b-5150a6ebd18d" id="3357c55f-7d0f-41d2-bb7b-5150a6ebd18d"/>
</backup>
This bugzilla is included in the oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in the oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.