| Summary: | RFE: make seamless migration do not use timeout or timeout on init | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Jaša <djasa> |
| Component: | spice | Assignee: | Default Assignee for SPICE Bugs <rh-spice-bugs> |
| Status: | CLOSED WONTFIX | QA Contact: | SPICE QE bug list <spice-qe-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.0 | CC: | dblechte, desktop-qa-list, djasa, dyuan, jdenemar, juzhang, marcandre.lureau, mzhan, qzhang, rbalakri, rh-spice-bugs, tpelka, ydu, ylavi, zpeng |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | 7.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-02 12:21:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description (David Jaša, 2013-11-29 17:43:18 UTC)

Created attachment 830740 [details]
ping stats over some less stable connections

Just for the record: the attachment contains real-world ping statistics of a WAN connection to work (one of the SPICE use cases). Note the several-seconds-long network hiccups, even with some packet loss. Should one of these hiccups align with the client_migrate_info window, we could end up with hard-to-debug switch_host migrations.
There's indeed no timeout in libvirt anywhere around client_migrate_info. I guess it must be either in qemu or in spice.

---

No timeout in qemu-kvm either: qemu just calls spice_server_migrate_connect(), then waits for the callback from spice-server. Reassigning.

---

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

---

Marc-Andre Lureau (comment #6):

There is a 10 s timeout for migrate_info to complete the seamless destination connection. After that time, the client will receive a switch_host after migration completes (server_migrate_end).

I think the reason was that we don't want to "block" the qemu migration from happening, since seamless migration waits for all channels to be connected before migrating. I guess "waiting" for the destination connections is necessary to have a seamless experience, or else you would have to wait after migration completed and it wouldn't feel seamless anymore...

It seems there is a tradeoff here. I think we shouldn't "block" qemu migration for longer than 10 s. If the client is too slow or stuck, switch_host is a reasonable fallback.

So, I am tempted to close as not a bug. David, please comment, thanks.

---

David Jaša (comment #8), in reply to Marc-Andre Lureau from comment #6:

> There is a 10s timeout for migrate_info to complete seamless destination
> connection. Passed that time, the client will receive a switch host after
> migration complete (server_migrate_end).

I agree, it's good from the resilience POV (in most cases, keeping the VM up is more important than keeping the client connected).

> I think the reason was that we don't want to "block" the qemu migration from
> happening, since seamless migration waits for all channels to be connected
> before migrating. I guess "waiting" for destination connections is necessary
> to have a seamless experience, or else you would have to wait after
> migration completed and it wouldn't feel seamless anymore...

I'd still like this as a fallback to real seamless migration (call it not-so-seamless migration: instead of switch_host, keep the source running, have the client connect to the destination as after client_migrate_info, transfer the spice-server state, switch the client to the destination, and give the go-ahead to stop the source qemu). It wouldn't feel seamless, but things like USB redirection or monitor configuration wouldn't be reset, and RHEV could re-allow connections to the migrating VM.

> It seems there is a tradeoff here. I think we shouldn't "block" qemu
> migration longer than 10s. If the client is too slow or stuck, switch_host is
> a reasonable fallback.

IMO the sweet spot could be in the 10-15 s range. What is the current timeout?

> So, I am tempted to close as not a bug.

---

Marc-Andre Lureau (comment #9), in reply to David Jaša from comment #8:

> I'd still like this as a fallback to real seamless migration (say
> not-so-seamless migration: instead of switch_host, keep source running, have
> client connect to destination like after client_migrate_info, transfer
> spice-server state, switch client to dest, give go-ahead to stop source
> qemu). It wouldn't feel seamless but stuff like USB redirection or monitor
> config wouldn't be reset and RHEV could re-allow connection to migrating VM.

I wish Yonit could comment on that; it might not be doable... And to me this is overkill for little gain. We already have 3 migration paths; that would be another one.

If it's possible to change the behaviour, I would rather change seamless migration to not block the migration start, even if the channels aren't all connected. So if it's possible, get rid of the timeout altogether. A "slow client" perhaps won't feel so seamless, but we would still keep the channel state that you listed.

> > It seems there is a tradeoff here. I think we shouldn't "block" qemu
> > migration longer than 10s.
> > If the client is too slow or stuck, switch_host is a reasonable fallback.
>
> IMO the sweet spot could be in the 10-15 s range. What is the current timeout?

As I said, it's 10 s. It's short for the client to connect, handshake all channels, and reply to the origin server, but it's long for the other side initiating the migration.

---

Marc-Andre Lureau (comment #10):

What about changing the summary to "RFE: make seamless migration do not timeout on init"?

---

David Jaša, in reply to Marc-Andre Lureau from comment #9:

> ...
> If it's possible to change the behaviour, I would rather change seamless to
> not block migration start, even if the channels aren't all connected. So if
> it's possible, get rid of the timeout altogether.

So the timeline would be:

1. client_migrate_info src -> client
2. main channel connects
3. other channels try connecting
4. migration starts, dst qemu accepts no new connections now
5. migration finishes
6. (in order to avoid timeouts) dst qemu orders the client to connect the rest of the channels
7. the rest is just like regular seamless migration

That's pretty much what I requested above, and I'm afraid that it wouldn't be easy either.

> > It seems there is a tradeoff here. I think we shouldn't "block" qemu
> > migration longer than 10s. If the client is too slow or stuck, switch_host is
> > a reasonable fallback.
>
> IMO the sweet spot could be in the 10-15 s range. What is the current timeout?

> As I said, it's 10s. It's short for the client to connect and handshake all
> channels and reply to origin server, but it's long for the other side
> initiating the migration.

OK, 10 s should stay.

---

Marc-Andre Lureau (comment #12), quoting comment #10:

> What about changing to "RFE: make seamless migration do not timeout on init"?
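The seven-step timeline above can be sketched as a small state model. This is a hypothetical illustration only: the class, phase, and method names are invented for this sketch and do not correspond to the real spice-server or qemu interfaces.

```python
# Hypothetical model of the proposed "lazy" seamless migration timeline.
# All names here are illustrative assumptions, not real SPICE/qemu APIs.
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    PRE_MIGRATION = auto()  # steps 1-3: client_migrate_info sent, channels connecting
    MIGRATING = auto()      # step 4: dst qemu accepts no new connections
    DONE = auto()           # steps 5-7: remaining channels connect after migration


class LazySeamlessMigration:
    def __init__(self, channels):
        self.channels = set(channels)
        self.connected = set()
        self.phase = Phase.IDLE

    def client_migrate_info(self):
        # Step 1: src tells the client about dst; step 2: main channel connects.
        self.phase = Phase.PRE_MIGRATION
        self.connected.add("main")

    def channel_connected(self, name):
        # Step 3: other channels connect opportunistically, but only while
        # the destination still accepts new connections.
        if self.phase is Phase.PRE_MIGRATION and name in self.channels:
            self.connected.add(name)

    def start_migration(self):
        # Step 4: migration starts regardless of how many channels made it,
        # so a slow client no longer delays (or times out) migration start.
        self.phase = Phase.MIGRATING

    def finish_migration(self):
        # Steps 5-7: once migration finishes, dst orders the client to connect
        # the remaining channels; the rest is regular seamless migration.
        self.phase = Phase.DONE
        late = self.channels - self.connected
        self.connected |= late
        return late


migration = LazySeamlessMigration({"main", "display", "inputs", "usbredir"})
migration.client_migrate_info()
migration.channel_connected("display")  # "inputs"/"usbredir" are too slow
migration.start_migration()
late_channels = migration.finish_migration()
print(sorted(late_channels))  # -> ['inputs', 'usbredir']
```

Under this model the pre-migration timeout disappears: a slow client just ends up with more channels in the post-migration batch, instead of being demoted to switch_host.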
David Jaša (comment #14), in reply to Marc-Andre Lureau from comment #12:

> > What about changing to "RFE: make seamless migration do not timeout on init"?

That would also be an effective change of the migration mode (I mean channels that miss the client_migrate_info "window" connecting to the destination after migration finishes), wouldn't it?

---

Marc-Andre Lureau, in reply to David Jaša from comment #14:

Yes, an improvement of seamless migration. Eventually, get rid of the initial timeout: AFAIK it doesn't help much to ensure the client is ready before starting the migration. If the client is not ready within 10 s, there is still a good chance it will be ready later, when the migration finishes. If not, it could fall back to switch mode then (I imagine the reason it uses the fallback is to avoid freezing the destination when seamless migration finishes).

---

David Jaša (comment #16):

OK, but then I'd prefer a new RFE for this lazy :) seamless migration.

> Eventually, get rid of the initial timeout:

A timeout has to remain there if the main channel is to connect before migration starts, doesn't it?
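The bounded-wait-then-fallback behaviour debated throughout this thread (wait a limited time for the destination connection, then demote the client to switch_host) can be sketched as follows. Only the 10 s figure comes from the discussion above; the function name, polling interface, and return values are assumptions made for this sketch.

```python
# Hypothetical sketch of the timeout/fallback decision discussed in this bug.
# Only the 10 s default comes from the thread; everything else is invented.
import time

MIGRATE_CONNECT_TIMEOUT = 10.0  # seconds, per comment #6


def wait_for_seamless(all_channels_connected, timeout=MIGRATE_CONNECT_TIMEOUT,
                      poll_interval=0.1, clock=time.monotonic, sleep=time.sleep):
    """Wait up to `timeout` for the client to finish connecting to dst.

    Returns "seamless" if every channel connected in time, otherwise
    "switch_host" -- the fallback the client receives after migration ends.
    """
    deadline = clock() + timeout
    while True:
        if all_channels_connected():
            return "seamless"
        if clock() >= deadline:
            return "switch_host"
        sleep(poll_interval)


# A client that connects immediately keeps the seamless path:
print(wait_for_seamless(lambda: True, timeout=0.5))   # -> seamless

# A stuck client hits the deadline and falls back to switch_host:
print(wait_for_seamless(lambda: False, timeout=0.2))  # -> switch_host
```

The tradeoff in the thread maps directly onto the `timeout` parameter: raising it blocks qemu migration longer for slow clients, while removing it entirely is the "lazy" variant sketched earlier.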
Marc-Andre Lureau, in reply to David Jaša from comment #16:

> OK, but then I'd prefer a new RFE for this lazy :) seamless migration.
>
> > Eventually, get rid of the initial timeout:
>
> A timeout has to remain there if the main channel is to connect before
> migration starts, doesn't it?

Well, I am not convinced. In any case, we should not consider the current behaviour a bug, but propose an RFE. IMHO, this kind of undefined/unclear RFE is better handled upstream. So David, should we close this bug and open a new "RFE: make seamless migration do not timeout on init", or just repurpose this one?

---

Moving to 7.2.

---

Let's close it now. There has been no other occurrence of this bug since it was reported, and it will be even less likely to hit with the rise of local DNS caches.