Created attachment 1476909 [details]
logs

Description of problem:

A migration plan to RHV with 40x100GB VM disks failed on multiple VMs with the following error in the virt-v2v log:

nbdkit: python[1]: error: /var/tmp/rhvupload.QAHE6K/rhv-upload-plugin.py: pwrite: error: [Errno 32] Broken pipe
nbdkit: python[1]: debug: sending error reply: Input/output error
qemu-img: error while writing sector 107293568: Input/output error
nbdkit: python[1]: debug: flush
nbdkit: python[1]: error: /var/tmp/rhvupload.QAHE6K/rhv-upload-plugin.py: flush: error:
nbdkit: python[1]: debug: sending error reply: Input/output error
nbdkit: python[1]: debug: flush
nbdkit: python[1]: error: /var/tmp/rhvupload.QAHE6K/rhv-upload-plugin.py: flush: error:
nbdkit: python[1]: debug: sending error reply: Input/output error
nbdkit: python[1]: debug: client sent disconnect command, closing connection
nbdkit: python[1]: debug: connection cleanup with final status 0
nbdkit: python[1]: debug: close
virt-v2v: error: qemu-img command failed, see earlier errors
rm -rf '/var/tmp/rhvupload.QAHE6K'
rm -rf '/var/tmp/vmx.oBM4LE'
rm -rf '/var/tmp/null.W5hwry'
libguestfs: trace: close
libguestfs: closing guestfs handle 0x11c32e0 (state 0)
nbdkit: python[1]: error: /var/tmp/rhvupload.QAHE6K/rhv-upload-plugin.py: close: error: Fault reason is "Operation Failed". Fault detail is "[Cannot remove Virtual Disk. Related operation is currently in progress. Please try again later.]". HTTP response code is 409.
nbdkit: debug: /usr/lib64/nbdkit/plugins/nbdkit-python-plugin.so: unload

Version-Release number of selected component (if applicable):
ovirt-imageio-daemon-1.4.2-0.el7ev.noarch
CFME 5.9.4.4.20180816162527_c00eb23
RHV 4.2.5.3-0.1.el7ev

How reproducible:
We noticed this error in the past, but no bug was opened because no reproducible steps were found. Now it seems reproducible on this type of plan, but I need to run it one more time to verify.

Steps to Reproduce:
Run a migration plan with 40x100GB VMs on 4 conversion hosts. The RHV provider should be configured for 40 parallel migrations, and hosts should be left at the default (10).

Actual results:
The RHV upload plugin fails with a 'Broken pipe' error.

Expected results:
No error.

Additional info:
Mor, we don't have daemon logs showing this import again.

$ grep transfer.id v2v-import-20180819T113214-10291.log
transfer.id = '2fc56dcb-7509-4402-9474-49dd2d8765e8'

$ grep 2fc56dcb-7509-4402-9474-49dd2d8765e8 daemon.log*
(no output)

$ ls -lht daemon.log*
-rw-r--r--. 1 nsoffer nsoffer 8.5M Aug 19 18:31 daemon.log
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:23 daemon.log.1
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:16 daemon.log.2
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:15 daemon.log.3
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:13 daemon.log.4
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:09 daemon.log.5
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:07 daemon.log.6
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:05 daemon.log.7
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 18:02 daemon.log.8
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 17:51 daemon.log.9
-rw-r--r--. 1 nsoffer nsoffer  10M Aug 19 17:40 daemon.log.10

You are still using 10M logs - we increased the log size to 20M in 1.4.3.

$ head -3 daemon.log.10
2018-08-19 14:32:04,688 INFO    (Thread-309) [images] Writing 2097152 bytes at offset 60535472128 flush False to /rhev/data-center/mnt/blockSD/163943db-3ca8-4c79-a6c1-a1a76e755d9a/images/1dea87c5-447e-4373-bdf4-2e81a935bb2f/cc8d682c-a909-4b1f-b21f-e8dc12da0844 for ticket d376c6ce-79a5-4392-93fd-8858f88f7ceb
2018-08-19 14:32:04,690 INFO    (Thread-4) [web] FINISH [] PUT /images/85051970-9806-437d-affc-c1ca13b81eec: [200] 0 [request=0.010351, operation=0.009398, read=0.001560, write=0.004890]
2018-08-19 14:32:04,694 INFO    (Thread-4) [web] START: [] PATCH /images/85051970-9806-437d-affc-c1ca13b81eec

You are not using the latest version, so you are logging 4 times less data. Please reproduce this with the current version - 1.4.3, which will be available in oVirt 4.2.6.
Found in ovirt-imageio-daemon log:

2018-08-20 11:52:52,605 WARNING (Thread-348) [web] ERROR [^@] PUT /images/9a308235-7159-43b0-85fc-4834592fb571: [403] Ticket u'9a308235-7159-43b0-85fc-4834592fb571' expired [request=0.000521]
Mor, did you try to reproduce with the latest 4.2.6 build? (As Nir suggested: "Please reproduce this with current version - 1.4.3, which will be available in oVirt 4.2.6.")
No, I didn't try to reproduce it with 4.2.6. It was reproduced on 4.2.5, which is the upcoming RHV GA, AFAIK.
Since the error is an expired ticket, I don't think checking with 4.2.6 will change anything, but it is a good idea in general. If imageio 1.4.3 fixes this issue, we don't have to waste time on it.

In the attached Google Drive link, I see only daemon logs (great), but no vdsm or engine logs. We cannot do anything without these logs; please attach them to the bug.
Mor, I downloaded the logs you posted in comment 2, but again, the daemon error does not appear in these logs. Please make sure the logs you upload contain the entire history of the upload that failed.

This probably happened again since you increased the log size to 20M and the number of logs to 20, but this is not enough, because imageio 1.4.2 produces 4 times more logs compared with 1.4.3. If you test with 4.2.5, set the log size to 80M to make sure we don't lose logs.
(In reply to Mor from comment #2)
> Found in ovirt-imageio-daemon log:
> 2018-08-20 11:52:52,605 WARNING (Thread-348) [web] ERROR [^@] PUT
> /images/9a308235-7159-43b0-85fc-4834592fb571: [403] Ticket
> u'9a308235-7159-43b0-85fc-4834592fb571' expired [request=0.000521]

But the v2v log you sent is about:

$ grep transfer.id v2v-import-20180820T082939-43946.log
transfer.id = '91a6e1e3-6a8a-401b-bade-2b25c79c82cb'

Please try to search your logs for 9a308235-7159-43b0-85fc-4834592fb571.
Nir, the migration time window was from 2018-08-20 08:29:39 until around 2018-08-20 11:52:55, when the virt-v2v failure occurred. You have logs since 2018-08-20 08:53:53,265; isn't that enough? I'll attach the engine and vdsm log files.
Created attachment 1477422 [details] vdsm and engine logs
I found the issue in the logs linked from comment 3. The daemon logs contain 3 failures, one of them reported in comment 2. But the attached v2v logs are not relevant to these daemon logs. A mess, but we have enough info in the daemon logs to understand what happened in the daemon.

## Failures found in the daemon logs

$ grep -a expired daemon-all.log
2018-08-20 11:52:47,561 WARNING (Thread-6) [web] ERROR [] PUT /images/12aa891f-ae76-4ec3-a5bb-febb4a58ef0e: [403] Ticket u'12aa891f-ae76-4ec3-a5bb-febb4a58ef0e' expired [request=0.001200]
2018-08-20 11:52:50,384 WARNING (Thread-9) [web] ERROR [] PUT /images/77f73eef-89b9-4071-a315-f534194ce0f2: [403] Ticket u'77f73eef-89b9-4071-a315-f534194ce0f2' expired [request=0.042259]
2018-08-20 11:52:52,605 WARNING (Thread-348) [web] ERROR [] PUT /images/9a308235-7159-43b0-85fc-4834592fb571: [403] Ticket u'9a308235-7159-43b0-85fc-4834592fb571' expired [request=0.000521]

## Transfer 12aa891f-ae76-4ec3-a5bb-febb4a58ef0e failure

1. Last extend before failure

2018-08-20 11:47:47,502 INFO (Thread-3673) [tickets] Extending ticket 12aa891f-ae76-4ec3-a5bb-febb4a58ef0e, new expiration in 4988179

2. Last request before failure

2018-08-20 11:52:47,129 INFO (Thread-6) [images] Writing 2097152 bytes at offset 77011615744 flush False to /rhev/data-center/mnt/blockSD/163943db-3ca8-4c79-a6c1-a1a76e755d9a/images/0c866494-24cf-48c8-8907-547d29df1429/5c9b8c4e-3d06-42b3-872c-6ecea13519bb for ticket 12aa891f-ae76-4ec3-a5bb-febb4a58ef0e

3. Ticket expired (5 minutes after last extend, as expected)

2018-08-20 11:52:47,561 WARNING (Thread-6) [web] ERROR [^@] PUT /images/12aa891f-ae76-4ec3-a5bb-febb4a58ef0e: [403] Ticket u'12aa891f-ae76-4ec3-a5bb-febb4a58ef0e' expired [request=0.001200]

4. Vdsm tried to extend the ticket 3 minutes later

2018-08-20 11:52:51,099 INFO (Thread-3733) [tickets] Extending ticket 12aa891f-ae76-4ec3-a5bb-febb4a58ef0e, new expiration in 4988482

Reason for failure: missing extend at 11:51:47

## Transfer 77f73eef-89b9-4071-a315-f534194ce0f2 failure

1. Last extend

2018-08-20 11:47:51,244 INFO (Thread-3675) [tickets] Extending ticket 77f73eef-89b9-4071-a315-f534194ce0f2, new expiration in 4988182

2. Last request before failure

2018-08-20 11:52:50,110 INFO (Thread-9) [images] Writing 2097152 bytes at offset 91782381568 flush False to /rhev/data-center/mnt/blockSD/163943db-3ca8-4c79-a6c1-a1a76e755d9a/images/6ea8fa2c-e0f6-4eb7-92de-b3e0731d3528/e41fa82c-da7e-444c-afcc-97122b0327cf for ticket 77f73eef-89b9-4071-a315-f534194ce0f2

3. Ticket expired (5 minutes after last extend)

2018-08-20 11:52:50,384 WARNING (Thread-9) [web] ERROR [^@] PUT /images/77f73eef-89b9-4071-a315-f534194ce0f2: [403] Ticket u'77f73eef-89b9-4071-a315-f534194ce0f2' expired [request=0.042259]

Reason for failure: missing extend at 11:51:51

## Transfer 9a308235-7159-43b0-85fc-4834592fb571 failure

1. Last extend

2018-08-20 11:47:52,630 INFO (Thread-3677) [tickets] Extending ticket 9a308235-7159-43b0-85fc-4834592fb571, new expiration in 4988184

2. Last request

2018-08-20 11:52:52,240 INFO (Thread-348) [images] Writing 2097152 bytes at offset 82894127104 flush False to /rhev/data-center/mnt/blockSD/163943db-3ca8-4c79-a6c1-a1a76e755d9a/images/676aaf86-d5f5-43d6-ada8-2fd16951f93e/b65e7534-81dc-463a-a6b7-de34a2bb7108 for ticket 9a308235-7159-43b0-85fc-4834592fb571

3. Ticket expired (5 minutes after last extend)

2018-08-20 11:52:52,605 WARNING (Thread-348) [web] ERROR [^@] PUT /images/9a308235-7159-43b0-85fc-4834592fb571: [403] Ticket u'9a308235-7159-43b0-85fc-4834592fb571' expired [request=0.000521]

Reason for failure: missing extend at 11:51:52

## Summary

Engine or vdsm did not send an extend request around 11:51. We need to understand why. Daniel will check the engine and vdsm logs.
But regardless of the reason engine and vdsm did not send an extend request, I think the way we extend tickets is wrong. The way the system works now is:

- the user performs a request
- the daemon updates the last access time
- engine monitors the ticket idle time (time since last access) reported by the daemon
- engine sends an extend request if the ticket idle time < inactivity timeout for this transfer

So practically, the ticket is extended if the user is active. But we depend on both engine and vdsm being up at the time a ticket should be extended. If one of them is not up (for example, vdsm being fenced by engine, or the user restarting engine), we may miss an extend request, which will cause some tickets to expire while a client is happily transferring data to the daemon.

We will fix this by extending the ticket on every request, so an active client's ticket cannot expire, even if you kill engine and vdsm.
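The extend-on-every-request idea can be sketched as follows. This is a minimal illustration only; the `Ticket` class and `handle_request` function are invented names, not the actual ovirt-imageio code:

```python
import threading
import time


class Ticket:
    """Illustrative transfer ticket; not the real ovirt-imageio class."""

    def __init__(self, ticket_id, timeout=300):
        self.id = ticket_id
        self.timeout = timeout
        self._expires = time.monotonic() + timeout
        self._lock = threading.Lock()

    def expired(self):
        with self._lock:
            return time.monotonic() >= self._expires

    def extend(self):
        # Push the expiration a full timeout window past "now".
        with self._lock:
            self._expires = time.monotonic() + self.timeout


def handle_request(ticket):
    """Extend-on-access: every authorized request refreshes the ticket,
    so an active client cannot expire even if engine/vdsm miss an extend
    call. An idle client still expires after `timeout` seconds."""
    if ticket.expired():
        raise PermissionError("Ticket %r expired" % ticket.id)
    ticket.extend()
    # ... serve the actual PUT/GET here ...
```

With this model, the engine-side extend calls become a safety net rather than the only mechanism keeping an active transfer alive.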
According to the engine log [1], extend ticket (on '12aa891f-ae76-4ec3-a5bb-febb4a58ef0e') was called at 11:47:47 and 11:52:51, which is more than 5 minutes apart. The extend logic in the engine is supposed to extend roughly every 4 minutes, which is apparently too fragile, as we poll the ticket status every minute (see [2]). So, in the engine, we should increase the ticket refresh allowance so we extend every minute (I'll open a separate bug on it). In the daemon, we can extend on each request.

[1]
engine.log-20180821:2018-08-20 11:47:47,493Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [63fd0843-7484-412d-b168-8825ae9a150d] START, ExtendImageTicketVDSCommand(HostName = b02-h25-r620.rhev.openstack.engineering.redhat.com, ExtendImageTicketVDSCommandParameters:{hostId='1ccaa917-c4ce-426f-98c5-34e96bfb6251', ticketId='12aa891f-ae76-4ec3-a5bb-febb4a58ef0e', timeout='300'}), log id: 60843388
engine.log-20180821:2018-08-20 11:47:47,507Z INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [63fd0843-7484-412d-b168-8825ae9a150d] Transfer session with ticket id 12aa891f-ae76-4ec3-a5bb-febb4a58ef0e extended, timeout 300
...
engine.log-20180821:2018-08-20 11:52:51,087Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [63fd0843-7484-412d-b168-8825ae9a150d] START, ExtendImageTicketVDSCommand(HostName = b02-h25-r620.rhev.openstack.engineering.redhat.com, ExtendImageTicketVDSCommandParameters:{hostId='1ccaa917-c4ce-426f-98c5-34e96bfb6251', ticketId='12aa891f-ae76-4ec3-a5bb-febb4a58ef0e', timeout='300'}), log id: 408b5021
engine.log-20180821:2018-08-20 11:52:51,103Z INFO [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-74) [63fd0843-7484-412d-b168-8825ae9a150d] Transfer session with ticket id 12aa891f-ae76-4ec3-a5bb-febb4a58ef0e extended, timeout 300 seconds

[2]
[63fd0843-7484-412d-b168-8825ae9a150d] START, GetImageTicketVDSCommand(HostName = b02-h25-r620.rhev.openstack.engineering.redhat.com, GetImageTicketVDSCommandParameters:{hostId='1ccaa917-c4ce-426f-98c5-34e96bfb6251', ticketId='12aa891f-ae76-4ec3-a5bb-febb4a58ef0e', timeout='null'}), log id: 12b669db
...
engine.log-20180821:2018-08-20 11:46:24,157Z INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [63fd0843-7484-412d-b168-8825ae9a150d] START, GetImageTicketVDSCommand(HostName = b02-h25-r620.rhev.openstack.engineering.redhat.com, GetImageTicketVDSCommandParameters:{hostId='1ccaa917-c4ce-426f-98c5-34e96bfb6251', ticketId='12aa891f-ae76-4ec3-a5bb-febb4a58ef0e', timeout='null'}), log id: 6225f4dd
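The fragility of the threshold-based extend logic can be shown with a tiny simulation. The 300s timeout matches the logs above; the 240s extend threshold and the poll schedule are illustrative assumptions, not the real engine constants:

```python
def ticket_expires(poll_times, timeout=300, extend_threshold=240):
    """Simulate engine-side polling: at each poll time (seconds), extend
    the ticket if it has been idle at least `extend_threshold` seconds
    since the last extend. Return True if the ticket expires before
    some poll gets a chance to extend it."""
    last_extend = 0
    for t in poll_times:
        if t - last_extend >= timeout:
            return True  # ticket already expired when this poll ran
        if t - last_extend >= extend_threshold:
            last_extend = t
    return False
```

With one-minute polls, the polls at 60..240s keep the ticket alive, but dropping just the poll at 240s means the next chance to extend comes at 300s, exactly when the ticket expires. A single missed poll near the threshold is enough to kill an active transfer.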
Mor, would you like to test if the attached patch fixes this issue?

I created a build here:
https://jenkins.ovirt.org/job/ovirt-imageio_standard-check-patch/330/artifact/build-artifacts.el7.x86_64/
I've checked the GetImageTicket execution rate on a local engine env, and it seems it's actually being called every 10 seconds [1].

@Mor - which engine version are you using? Were any config values changed, or were they all kept at defaults?

[1]
2018-08-19 15:22:50,178+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [cf60efb6-82a1-467f-a6e2-ff14cee6b74f] START, GetImageTicketVDSCommand(HostName = host1, GetImageTicketVDSCommandParameters:{hostId='0e7696d2-c45a-4436-ac2c-392581f56e99', ticketId='6a91bc2e-87de-4ab7-bea2-4733c541382b', timeout='null'}), log id: 26d9509f
2018-08-19 15:23:00,384+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [cf60efb6-82a1-467f-a6e2-ff14cee6b74f] START, GetImageTicketVDSCommand(HostName = host1, GetImageTicketVDSCommandParameters:{hostId='0e7696d2-c45a-4436-ac2c-392581f56e99', ticketId='6a91bc2e-87de-4ab7-bea2-4733c541382b', timeout='null'}), log id: 7ad3c290
2018-08-19 15:23:10,409+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetImageTicketVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-82) [cf60efb6-82a1-467f-a6e2-ff14cee6b74f] START, GetImageTicketVDSCommand(HostName = host1, GetImageTicketVDSCommandParameters:{hostId='0e7696d2-c45a-4436-ac2c-392581f56e99', ticketId='6a91bc2e-87de-4ab7-bea2-4733c541382b', timeout='null'}), log id: 6d0b026e
ovirt-engine-4.2.5.3-0.1.el7ev.noarch

Mordechai, did we change anything special on our server?
(In reply to Nir Soffer from comment #13)
> Mor, would you like to test if the attached patch fixes this issue?
>
> I created a build here:
> https://jenkins.ovirt.org/job/ovirt-imageio_standard-check-patch/330/
> artifact/build-artifacts.el7.x86_64/

Sure, but only after we finish with the sprint results.
We need another fix before we can enable auto-extend. Here is an updated build if someone wants to test this fix:
https://jenkins.ovirt.org/job/ovirt-imageio_standard-check-patch/334/
(In reply to Mor from comment #15)
> ovirt-engine-4.2.5.3-0.1.el7ev.noarch
>
> Mordechai, did we changed anything special on our server?

Everything should be default aside from pool connections to postgres.
(In reply to mlehrer from comment #18)
> (In reply to Mor from comment #15)
> > ovirt-engine-4.2.5.3-0.1.el7ev.noarch
> >
> > Mordechai, did we changed anything special on our server?
>
> Everything should be default aside from pool connections to postgres.

Can you please attach the output of 'engine-config --all'? (so we can check what makes the difference)
Created attachment 1477821 [details] engine config
This bug is blocking migration of more than 20 VMs in parallel with SSH.
Fix upstream in:

commit 8dcb7235177f2a4e21b1f4c331f86d04f3e74005
Author: Nir Soffer <nsoffer>
Date:   Tue Aug 21 16:48:16 2018 +0300

    daemon: Extend ticket on every request

    We have seen too many failures to extend tickets recently. Typically a
    single extend request is missing, and the upload fails one minute
    after the missing extend request.

    The current extend logic is complex and fragile. If engine or vdsm are
    down or stuck for some reason when an extend message should be sent,
    active transfers will expire, even if the client and daemon are
    happily transferring data.

    Change the logic so a ticket is extended after every client request.
    A ticket will expire only if a client failed to do one request during
    the ticket timeout period.

    With this change, you can kill both engine and vdsm once a transfer
    has started. They have to be UP only when the client is trying to
    finalize the transfer.

    Change-Id: I78cd266e915aa509ba89d4b191bc0733f74d6d46
    Bug-Url: https://bugzilla.redhat.com/1619019
    Signed-off-by: Nir Soffer <nsoffer>

commit ad94cfc6969d0c0559c468da5057ca214a061050
Author: Nir Soffer <nsoffer>
Date:   Tue Aug 21 23:55:23 2018 +0300

    daemon: Fail OPTIONS request if ticket expired

    We want to allow a client to extend a ticket by issuing an OPTIONS
    request. However, a client should not be able to extend a ticket
    after it has expired. To fix this we need to use tickets.authorize()
    to get a ticket. This method will fail if the ticket has expired.
    Here is an example of a failed upload when a ticket has expired:

        $ ./upload /var/tmp/fedora-27.img https://host:54322/images/test
        Traceback (most recent call last):
          File "./upload", line 263, in <module>
            server_options = options(con, url.path)
          File "./upload", line 240, in options
            raise RuntimeError("options %s failed: %s" % (path, body))
        RuntimeError: options /images/test failed: {"explanation": "Access was denied to this resource.", "code": 403, "detail": "Ticket u'test' expired", "title": "Forbidden"}

    Although this does change OPTIONS behavior, it cannot break a client,
    since even before this change, any useful request (e.g. PUT, GET)
    would fail if a ticket has expired.

    Change-Id: Ia92d42bab8eb6ab56896086a7ac11c4a15bb49d7
    Bug-Url: https://bugzilla.redhat.com/1619019
    Signed-off-by: Nir Soffer <nsoffer>

commit 403b4d8f776bc2bf50a27bc04ab24bdc8509b037
Author: Nir Soffer <nsoffer>
Date:   Tue Aug 21 23:52:28 2018 +0300

    daemon: Add failing test for OPTIONS with expired ticket

    An OPTIONS request on an expired ticket is successful, but any other
    request will fail on an expired ticket, so this is not useful. Add a
    failing test showing the issue.

    Change-Id: I6761d5abbdfdc65fb1b22a6531d67d787120194e
    Bug-Url: https://bugzilla.redhat.com/1619019
    Signed-off-by: Nir Soffer <nsoffer>
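The combined effect of the two daemon commits can be sketched like this. All names here are illustrative stand-ins; in the real daemon, tickets.authorize() plays the role of authorize(), and the exact signatures may differ:

```python
import time


class ExpiredTicket(Exception):
    """Raised when authorizing an expired ticket (maps to HTTP 403)."""


class Ticket:
    def __init__(self, ticket_id, timeout=300):
        self.id = ticket_id
        self.timeout = timeout
        self._expires = time.monotonic() + timeout

    def expired(self):
        return time.monotonic() >= self._expires

    def extend(self):
        self._expires = time.monotonic() + self.timeout


class TicketStore:
    def __init__(self):
        self._tickets = {}

    def add(self, ticket):
        self._tickets[ticket.id] = ticket

    def authorize(self, ticket_id):
        # Fails on an expired ticket, so no request type can be used
        # to resurrect a ticket after expiry.
        ticket = self._tickets[ticket_id]
        if ticket.expired():
            raise ExpiredTicket("Ticket %r expired" % ticket_id)
        return ticket


def do_options(store, ticket_id):
    # Authorization runs first: OPTIONS on an expired ticket now fails,
    # while a live ticket is extended like any other request.
    ticket = store.authorize(ticket_id)
    ticket.extend()
    return {"features": ["flush", "zero"]}
```

This shows why the OPTIONS change cannot break clients: any useful request already goes through the same authorization and would have failed on an expired ticket anyway.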
Can we merge it into 4.2.6?
Nir, which RHV build number includes 1.4.4? We are currently working on 4.2.6.3-0.1.el7ev, and we have 1.4.3.
(In reply to Mor from comment #24)

1.4.4 was built last week; it will be available in today's build, sometime this week. If you want to test it now, you can use the brew build.

Daniel, can you point us to the build?
I prefer to test official builds, not brew builds.
Please move to "on QA" once it's available downstream.
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Correcting state; there is no work on the engineering side. QE should test this once the 4.2.6 build is available.
Nir, are we getting it today in the Final 4.2.6 build?
Verified on: RHV 4.2.6-4