Bug 1422544 - [RFE] leaving the hosted engine deployment page should save the state, and reconnect when the page is reloaded
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: cockpit-ovirt
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Aviv Turgeman
QA Contact: Wei Wang
URL:
Whiteboard:
Duplicates: 1540255 1760488
Depends On:
Blocks:
Reported: 2017-02-15 14:07 UTC by Michael Burman
Modified: 2023-09-07 18:50 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-03 09:50:47 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:


Attachments
record (10.77 MB, application/x-gzip), 2017-02-15 14:07 UTC, Michael Burman
messages log (50.70 KB, application/x-gzip), 2017-02-22 06:36 UTC, Michael Burman
setup logs (59.77 KB, application/x-gzip), 2017-02-22 06:41 UTC, Michael Burman
sos report (8.63 MB, application/x-xz), 2017-02-22 09:53 UTC, Michael Burman
ovirt-hosted-engine-setup log (597.98 KB, text/plain), 2017-12-11 13:00 UTC, Michael Burman
host1 cockpit connection closed (8.80 MB, application/x-gzip), 2017-12-12 07:07 UTC, Michael Burman

Description Michael Burman 2017-02-15 14:07:44 UTC
Created attachment 1250619 [details]
record

Description of problem:
Cockpit closes the connection to the HE installation during setup. After reconnecting, I can no longer reach the HE deployment and can't get its status.

When running the Hosted Engine deploy via the HE wizard in cockpit, the session gets closed during the management network configuration and there is no way to get back to the HE deploy. After reconnecting, it seems that I need to start all over again (although the deploy seems to continue somewhere in the background).
This means that once I am disconnected from the session, I can't get the status of the HE deployment any more (which is pretty bad).
It seems that the connection timeout is too short for the management network configuration (it takes some time to get the IP from the DHCP server).

I think the timeout should be much longer for the management configuration stage.

Version-Release number of selected component (if applicable):
cockpit-ws-126-1.el7.x86_64
rhvh-4.1-0.20170208.0+1

How reproducible:
On some hosts

Steps to Reproduce:
1. Run HE deploy via the Hosted Engine tab 

Actual results:
During the management network/bridge configuration the connection to the session gets closed (short timeout) and I can't get back to the HE deployment any more. I have no idea what the status of the deployment is. Has it failed? Is it still running?

Expected results:
The session should stay connected for the whole deploy (even when the DHCP server responds slowly), and there must be a way to get back to the session after a disconnection.

Additional info:
Attaching record

Comment 2 Michael Burman 2017-02-20 08:45:57 UTC
Hi Dominik,

Is this bug related to the cockpit component, or should we move it to the right one?

Thanks,

Comment 4 Michael Burman 2017-02-22 06:36:30 UTC
Created attachment 1256348 [details]
messages log

Comment 5 Michael Burman 2017-02-22 06:41:54 UTC
Created attachment 1256349 [details]
setup logs

Comment 6 Michael Burman 2017-02-22 06:43:07 UTC
rhvh-4.1-0.20170208.0+1
rhvm-appliance-4.1.20170221.0-1.el7ev.noarch.rpm
cockpit-ws-126-1.el7.x86_64

Comment 7 Sandro Bonazzola 2017-02-22 08:02:16 UTC
Can you please attach sos report from the host?

Comment 9 Michael Burman 2017-02-22 09:53:29 UTC
Created attachment 1256390 [details]
sos report

Comment 10 Ryan Barry 2017-02-22 13:53:57 UTC
The HE setup in the background dies as soon as you disconnect, so it's not still running.

Early in development, we debated whether we should use "screen" or not, but this doesn't solve the essential issue -- cockpit-ovirt is only a presentation layer. It connects to stdin/stdout of hosted-engine --deploy. 

If we provided a way to reconnect to the session, the wizard would stay waiting for input until some result returned, with no input box available. In many cases (assume you did not disconnect during the network setup), this would never happen.

In theory, we could create a small service to proxy this connection and/or save the last page result, but this is risk prone, because we'd need to implement a significant amount of logic around this.

Given that a complete redesign is targeted for 4.2, the development effort doesn't seem worth the risk, and it will definitely not make 4.1.1.

It would be better to simply see if we can increase the timeout, though I've never encountered this and can't reproduce. That's part of base cockpit, though.

Dominik - is this configurable?

Comment 11 Dominik Perpeet 2017-02-22 15:58:51 UTC
There is a hack to make cockpit ignore transport health checks, but I don't think it's part of the official API.

https://github.com/cockpit-project/cockpit/blob/43b85f0b5fca6e8d58fd164b2622004484896d1f/pkg/networkmanager/interfaces.js#L1955

You set it to start ignoring the timeout with
cockpit.hint("ignore_transport_health_check", { data: true });

and then reset with
cockpit.hint("ignore_transport_health_check", { data: false });

Apart from that there are timeouts on the webservice side that are compiled in.
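
The toggle above could be wrapped so that the health check is always restored, even if the long-running step fails. This is only a sketch: `withHealthCheckIgnored` is a hypothetical helper name, the `cockpit` object is passed in explicitly, and the only Cockpit call used is the `cockpit.hint(...)` call quoted in this comment, which may not be part of the official API.

```javascript
// Hypothetical helper: run an async step while Cockpit ignores transport
// health checks, then restore the default even if the step throws.
async function withHealthCheckIgnored(cockpit, step) {
  // Same call as quoted above; possibly not an official API.
  cockpit.hint("ignore_transport_health_check", { data: true });
  try {
    return await step();
  } finally {
    cockpit.hint("ignore_transport_health_check", { data: false });
  }
}
```

As noted above, the timeouts compiled into the web service side are not affected by this hint.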

Comment 12 cshao 2017-02-28 03:15:50 UTC
Hi yzhao,
It seems we have hit this bug before; could you help confirm it?

Comment 13 Yihui Zhao 2017-03-06 02:30:53 UTC
Usually, deploying the HE via Cockpit works well. But if the session gets closed, we can't get back to the HE deploy any more. After reconnecting, it seems that we need to start all over again (although the deploy seems to continue somewhere in the background).

Comment 24 Michael Burman 2017-12-11 13:00:47 UTC
Created attachment 1365984 [details]
ovirt-hosted-engine-setup log

Comment 25 Ryan Barry 2017-12-11 13:18:09 UTC
I don't see any reason why there would be a failure here.

Is it possible that your engine VM is getting the same IP as the test host?

Simone, do you see anything in the log? It looks like HE setup is running successfully...

Comment 26 Michael Burman 2017-12-11 13:21:50 UTC
(In reply to Michael Burman from comment #24)
> Created attachment 1365984 [details]
> ovirt-hosted-engine-setup log

No, the VM has a different IP. And yes, everything in the log seems to be fine, but I did get disconnected from the HE wizard session in cockpit and had no indication of the actual status of the HE deployment, which is not so nice for the user. Only now, looking at the log, do I see that the setup was successful.

Comment 27 Ryan Barry 2017-12-11 13:28:39 UTC
Once disconnected, can you reconnect?

Comment 28 Michael Burman 2017-12-11 13:33:30 UTC
(In reply to Ryan Barry from comment #27)
> Once disconnected, can you reconnect?

How? :) There is no way in cockpit to resume it after being disconnected, only to start again.
When I log in to cockpit again, it is not aware of anything; it doesn't know whether the deploy is still running or not. It would be great if I could reconnect to the session.

Comment 29 Ryan Barry 2017-12-11 13:59:50 UTC
I would expect, then, that the deployment did not succeed. By "can you reconnect", I meant to cockpit, rather than the wizard. I was wondering whether the firewall rules were being changed out from under us.

Given that the disconnect happens right after the engine comes up, I'm still suspicious that there's something wrong with the environment, especially if this is being tested with nested virtualization. Have you checked the same test scenario over the CLI?

Realistically, we are probably not going to fix this bug.

The HE rewrite in 4.2 is primarily using Ansible for deployment. See comment#15

Also see comment#10. Reworking this to use 'screen' or something else we can attach to is a possibility, but we need to see whether this is still reproducible after the last vestiges of the old wizard are gone.

Lastly, see comment#11. Some of the timeouts are hardcoded into cockpit. We can try Dominik's suggestion in cockpit-ovirt and disable the transport health check, but it may not resolve your issue.

Since this is 100% reproducible on 4.2, it's probably a new bug somewhere in the rewrite. I'm just trying to get some idea of a cause (or a way in which it differs from this bug, which has been present since 4.0, but is not 100% reproducible).

Testing from the CLI would be great, because then we can either determine that it is or is not cockpit-ovirt.

Comment 30 Michael Burman 2017-12-11 14:25:13 UTC
(In reply to Ryan Barry from comment #29)
> I would expect, then, that the deployment did not succeed. By "can you
> reconnect", I meant to cockpit, rather than the wizard. I was wondering
> whether the firewall rules were being changed out from under us.
> 
> Given that the disconnect happens right after the engine comes up, I'm still
> suspicious that there's something wrong with the environment, especially if
> this is being tested with nested virtualization. Have you checked the same
> test scenario over the CLI?
> 
> Realistically, we are probably not going to fix this bug.
> 
> The HE rewrite in 4.2 is primarily using Ansible for deployment. See
> comment#15
> 
> Also see comment#10. Reworking this to use 'screen' or something else we can
> attach to is a possibility, but we need to see whether this is still
> reproducible after the last vestiges of the old wizard are gone.
> 
> Lastly, see comment#11. Some of the timeouts are hardcoded into cockpit. We
> can try Dominik's suggestion in cockpit-ovirt and disable the transport
> health check, but it may not resolve your issue.
> 
> Since this is 100% reproducible on 4.2, it's probably a new bug somewhere in
> the rewrite. I'm just trying to get some idea of a cause (or a way in which
> it differs from this bug, which has been present since 4.0, but is not 100%
> reproducible...
> 
> Testing from the CLI would be great, because then we can either determine
> that it is or is not cockpit-ovirt.

1) It's not nested virtualization, but a bare metal host.
2) I don't know about the CLI; I'm not testing the CLI, so we need to ask someone who does (I'm not an HE tester).
3) The disconnect does not happen when the engine comes up; it happens much earlier, I guess on the DHCP response when configuring the ovirtmgmt bridge on the NIC.
4) Yes, you can reconnect to cockpit itself after being disconnected, but not to the HE wizard.
5) I don't see any difference between this bug, which I reported, and what I face now on 4.2; from my side it's the same issue, in the same deploy stage.
I get disconnected when the ovirtmgmt bridge configuration starts.

Comment 31 Ryan Barry 2017-12-11 14:30:23 UTC
1) Ok, just checking...

2) 'hosted-engine --deploy' follows exactly the same steps as the cockpit wizard did in 4.1 (since it was just a wrapper around hosted-engine --deploy). It should look pretty familiar.

3) I would guess that it actually does not, or that the IP you are assigned after configuring differs somehow. 'hosted-engine --deploy' is terminated when the page unloads. See https://bugzilla.redhat.com/show_bug.cgi?id=1334740

If it continues running to 'Engine is up', this should be just about where it died...

4) What happens if you reconnect? HE status? Or does it ask to start the wizard from scratch?

5) The difference from my perspective is that it's 100% reproducible. And that, since the deployment is using ansible, any solution which involves using 'screen' to spawn 'hosted-engine --deploy' is no longer valid, so it would require a completely different solution.

Comment 32 Michael Burman 2017-12-11 14:38:02 UTC
(In reply to Ryan Barry from comment #31)
> 1) Ok, just checking...
> 
> 2) 'hosted-engine --deploy' follows exactly the same steps as the cockpit
> wizard did in 4.1 (since it was just a wrapper around hosted-engine
> --deploy). It should look pretty familiar.
> 
> 3) I would guess that it actually does not, or that the IP you are assigned
> after configuring differs somehow. 'hosted-engine --deploy' is terminated
> when the page unloads. See
> https://bugzilla.redhat.com/show_bug.cgi?id=1334740
> 
> If it continues running to 'Engine is up', this should be just about where
> it died...
> 
> 4) What happens if you reconnect? HE status? Or does it ask to start the
> wizard from scratch?
> 
> 5) The difference from my perspective is that it's 100% reproducible. And
> that, since the deployment is using ansible, any solution which involves
> using 'screen' to spawn 'hosted-engine --deploy' is no longer valid, so it
> would require a completely different solution.

3) All I know is that I got disconnected at the very beginning of the deploy (the engine VM had not been created yet), while vdsm was configuring the bridge, possibly because of the slow DHCP response and the short timeout in cockpit.

4) It asks me to start again from scratch.

Comment 33 Yedidyah Bar David 2017-12-11 14:38:50 UTC
There is another important difference from 4.1: Now the run is (supposedly) fully-unattended, HE-setup does not ask anything and cockpit does not need to reply. If this is so, we can run HE-setup > some-log-file and make cockpit check the file, instead of a pipe. This way a reconnect should still work, if we always use the same file, or if we have safe and good means to know where this file is.

Comment 34 Yedidyah Bar David 2017-12-11 14:40:26 UTC
(Adding to previous comment) If indeed cockpit has some kind of 'state', and can have a random/variable log file name, it can probably also save somewhere the offset into the file that it already processed, so a reconnect can start from there. This is much much simpler than implementing something like 'screen' for it.
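
The log-file-plus-offset idea from the two comments above can be sketched as follows. This is an illustration under stated assumptions, not existing cockpit-ovirt code: `readNewLines` is a hypothetical helper, the log is assumed to be available as one string, and how the file is read and where the offset is stored are left out.

```javascript
// Hypothetical helper: given the full text of the setup log and the offset
// already processed, return the new complete lines and the next offset.
// A trailing partial line (no "\n" yet) is left for the next call.
// Offsets count string characters here, not bytes on disk.
function readNewLines(logText, offset) {
  const lastNewline = logText.lastIndexOf("\n");
  if (lastNewline < offset) {
    // Nothing complete has been written since the last call.
    return { lines: [], offset };
  }
  const chunk = logText.slice(offset, lastNewline);
  return { lines: chunk.split("\n"), offset: lastNewline + 1 };
}
```

After a reconnect, the page would call the helper with the last saved offset and continue from there, instead of depending on a pipe that closes with the page.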

Comment 35 Simone Tiraboschi 2017-12-11 16:26:28 UTC
(In reply to Ryan Barry from comment #25)
> I don't see any reason why there would be a failure here.

the setup was successful:
2017-12-11 12:16:42,929+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND       **%EventStart STAGE terminate METHOD otopi.plugins.gr_he_common.core.misc.Plugin._terminate (None)
2017-12-11 12:16:42,930+0200 INFO otopi.plugins.gr_he_common.core.misc misc._terminate:251 Hosted Engine successfully deployed

> Is it possible that your engine VM is getting the same IP as the test host?
> 
> Simone, do you see anything in the log? It looks like HE setup is running
> successfully...

However, the latest messages got lost because the output channel was closed:

dialog.__logString:204 DIALOG:Not SENDING, output is closed: %s **%EventEnd STAGE terminate METHOD otopi.plugins.otopi.dialog.machine.Plugin._terminate (None)

dialog.__logString:204 DIALOG:Not SENDING, output is closed: %s **%EventStart STAGE terminate METHOD otopi.plugins.otopi.core.log.Plugin._terminate (None)

Comment 36 Yihui Zhao 2017-12-12 02:34:54 UTC
Michael, from my point of view, your problem is 

Once we close the HE setup wizard session or get disconnected, on 4.1 or 4.2, we cannot get back to the HE setup wizard. We can only clean the EHV and deploy again. Right?


This is different from bug https://bugzilla.redhat.com/show_bug.cgi?id=1522641.

Comment 37 Michael Burman 2017-12-12 06:52:04 UTC
(In reply to Yihui Zhao from comment #36)
> Michael, from my point of view, your problem is 
> 
> Once we close the session about HE setup wizard or disconnect, for 4.1 or
> 4.2, we can not reach the HE setup wizard. Justly, we only clean the EHV and
> deploy again. Right?
> 
> 
> This is different from bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1522641.

Hello Yihui,
Yes, exactly. Once the session is closed, I can't get back to it and can't know what is going on with the deploy. I can of course check the deploy logs or check whether any relevant process is running, but once kicked out of the session, it's very difficult for the user to understand the deploy status.
The thing is that the problem is not in the HE deploy itself; the deploy was actually successful in the end.

It does differ from BZ 1522641, because vdsm is running and the HE deploy ends with success. I can reach my HE VM and everything seems to work as expected regarding the deploy itself.

- I will attach 2 recordings from 2 different machines, showing that the session gets closed and that the HE deploy is actually still running, but I can't reconnect to the session via cockpit. I hope this explains the problem.

Comment 38 Michael Burman 2017-12-12 07:07:13 UTC
Created attachment 1366457 [details]
host1 cockpit connection closed

Comment 39 Ryan Barry 2018-02-01 10:04:47 UTC
From my point of view, this is definitely a duplicate.

Even if ansible is being used instead of otopi, the behavior is identical, and all of the same caveats in bug #1422544 apply.

We can try to serialize the object and reconnect to it, but it's also a UX question. Many web applications, for years, have behaved in the same way. Leaving or refreshing the page allows users to restart the process from the beginning.

If we do not want to do this, we will instead need to leave the cockpit channel open, serialize it, and reconnect.

Note that this will not work and resolve the original bug (#1422544) if the connection actually drops, since it will not be possible to serialize the object to the disk of the server without a connection to cockpit's dbus backend.

IMO, it would be best to pop up an "are you sure" modal when switching tabs to resolve this bug.

Something like:

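// Initial state (set in the constructor): offer reconnect by default
this.state = {
  allowReconnect: true
}

reconnectConfirmation(e) {
  this.setState({allowReconnect: e.confirmed})
  // setState is asynchronous, so test the event value, not this.state
  if (e.confirmed) {
    serializeObject(...)
  }
}

componentWillMount() {
  try {
    loadSavedState(...)
  }
  catch (err) {
    // no usable saved state: do default operation
  }
}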

In the worst case, to resolve the original bug, we can serialize the object once we actually invoke otopi/ansible, so a disconnect will preserve it.

Note that cockpit itself does not do this. If the page is switched while in the RHSM plugin, it starts from the beginning. If the page is refreshed while in the cockpit terminal, it starts from scratch on reload. We should try to match this behavior if possible.

Comment 40 Ryan Barry 2018-02-01 10:07:42 UTC
*** Bug 1540255 has been marked as a duplicate of this bug. ***

Comment 41 Sandro Bonazzola 2018-06-27 09:15:07 UTC
Moving to Phillip, as Ryan is unavailable for the next 2 weeks.

Comment 42 Ryan Barry 2018-07-20 21:19:07 UTC
Maybe we can easily use React's localStorage now that React is updated?
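
A minimal sketch of what that could look like, with some assumptions: `localStorage` is the browser's Web Storage API rather than something React provides, the key name and both helper names are hypothetical, and the wizard state is assumed to be a plain JSON-serializable object. The storage object is passed in so the helpers work with `window.localStorage` or any object exposing `getItem`, `setItem` and `removeItem`.

```javascript
const STATE_KEY = "he-deploy-wizard-state"; // hypothetical key name

// Persist a plain-object snapshot of the wizard state.
function saveWizardState(storage, state) {
  storage.setItem(STATE_KEY, JSON.stringify(state));
}

// Return the saved snapshot, or null if nothing usable is stored.
function loadWizardState(storage) {
  const raw = storage.getItem(STATE_KEY);
  if (raw === null) return null;
  try {
    return JSON.parse(raw);
  } catch (err) {
    storage.removeItem(STATE_KEY); // drop a corrupt entry
    return null;
  }
}
```

This keeps the state in the browser only, so it would cover a page reload or tab switch but would not tell a different browser whether a deployment is still running on the host.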

Comment 43 Sandro Bonazzola 2018-09-21 07:23:57 UTC
Not identified as blocker for 4.2.7, moving to 4.2.8

Comment 44 Ryan Barry 2019-10-11 04:34:22 UTC
*** Bug 1760488 has been marked as a duplicate of this bug. ***

Comment 46 Martin Pitt 2020-07-06 10:29:14 UTC
FYI: In https://github.com/cockpit-project/cockpit/pull/14318 I am adding an example how a Cockpit page can launch and re-attach to a long-running process, using systemd-run (transient service units). Comments appreciated.
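
As a rough illustration of the transient-unit idea (this is not the code from the pull request), the commands involved could be built like this. The unit name and the three helper names are hypothetical; only the basic `systemd-run --unit=`, `systemctl is-active` and `journalctl --unit= --follow` invocations are assumed. How a page would execute them is left out.

```javascript
// Hypothetical unit name; a fixed name is what lets a reloaded page find the job.
const UNIT = "he-deploy-example.service";

// argv that starts the command as a transient systemd service unit
function startArgv(command) {
  return ["systemd-run", "--unit=" + UNIT, ...command];
}

// argv that reports whether the unit is currently active
function statusArgv() {
  return ["systemctl", "is-active", UNIT];
}

// argv that follows the unit's output from the journal
function followArgv() {
  return ["journalctl", "--unit=" + UNIT, "--follow"];
}
```

Because the process belongs to systemd rather than to the page's channel, closing the page should not terminate it; a reloaded page could run the status command first and only offer to start a new deployment if the unit is not active.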

Comment 47 Martin Tessun 2021-02-03 09:50:47 UTC
As cockpit itself is meant to be stateless, and the RHV team does not have capacity to implement statefulness just for ovirt-deploy, closing this issue.

If you really need this functionality in cockpit, please open a BZ with the cockpit team.

