Created attachment 1250619 [details]
record

Description of problem:
Cockpit closes the connection to the Hosted Engine installation during setup, and after reconnecting I can't reach the HE deployment any more or get the status of the deployment.

When running the Hosted Engine deploy via the HE wizard in cockpit, the session gets closed during the management network configuration and there is no way to get back to the HE deploy. When you reconnect, it seems you need to start all over again (although the deployment appears to continue somewhere in the background).

This means that once I am disconnected from the session, I can no longer get the status of the HE deployment (which is pretty bad).

It seems the connection timeout is too short for the management network configuration step, which can take a while (getting an IP from the DHCP server takes some time). I think the timeout should be much longer for this stage.

Version-Release number of selected component (if applicable):
cockpit-ws-126-1.el7.x86_64
rhvh-4.1-0.20170208.0+1

How reproducible:
On some hosts

Steps to Reproduce:
1. Run HE deploy via the Hosted Engine tab

Actual results:
During the management network/bridge configuration the connection to the session gets closed (short timeout) and I can't get back to the HE deployment any more. I have no idea what the status of the deployment is. Has it failed? Is it still running?

Expected results:
The session should survive the deploy (even with a slow response from DHCP), and there must be a way to get back to the session in case of a disconnection.

Additional info:
Attaching record
Hi Dominik,

Is this bug related to the cockpit component, or should we move it to the right one?

Thanks,
Created attachment 1256348 [details]
messages log

Created attachment 1256349 [details]
setup logs
rhvh-4.1-0.20170208.0+1
rhvm-appliance-4.1.20170221.0-1.el7ev.noarch.rpm
cockpit-ws-126-1.el7.x86_64
Can you please attach an sos report from the host?
Created attachment 1256390 [details]
sos report
The HE setup in the background dies as soon as you disconnect, so it's not still running.

Early in development, we debated whether or not to use "screen", but that doesn't solve the essential issue: cockpit-ovirt is only a presentation layer. It connects to the stdin/stdout of 'hosted-engine --deploy'. If we provided a way to reconnect to the session, the wizard would sit waiting for input until some result came back, with no input box available. In many cases (assuming you did not disconnect during the network setup), that would never happen.

In theory, we could create a small service to proxy this connection and/or save the last page result, but this is risk-prone, because we'd need to implement a significant amount of logic around it. Given that a complete redesign is targeted for 4.2, the development effort doesn't seem worth the risk, and it will definitely not make 4.1.1.

It would be better to simply see if we can increase the timeout, though I've never encountered this and can't reproduce it. That's part of base cockpit, though. Dominik - is this configurable?
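To illustrate the coupling, it is roughly this (a minimal sketch assuming cockpit's spawn/stream channel API; the 'wizard' helpers are hypothetical):

  // Sketch: the wizard is just a view over the process's pty.
  const proc = cockpit.spawn(["hosted-engine", "--deploy"],
                             { pty: true, err: "out", superuser: "require" });

  // Output from otopi drives what the wizard renders...
  proc.stream(function (data) {
      wizard.render(data);                 // hypothetical presentation helper
  });

  // ...and wizard answers are written straight back to stdin.
  wizard.onAnswer(function (answer) {      // hypothetical helper
      proc.input(answer + "\n", true);     // 'true' keeps the stream open
  });

  // If the cockpit channel dies, so does the child process -- there is
  // nothing left to reconnect to.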
There is a hack to make cockpit ignore transport health checks, but I don't think it's part of the official API:

https://github.com/cockpit-project/cockpit/blob/43b85f0b5fca6e8d58fd164b2622004484896d1f/pkg/networkmanager/interfaces.js#L1955

You set it to start ignoring the timeout with

  cockpit.hint("ignore_transport_health_check", { data: true });

and then reset it with

  cockpit.hint("ignore_transport_health_check", { data: false });

Apart from that, there are timeouts on the webservice side that are compiled in.
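If cockpit-ovirt wanted to try it, usage might look like this (a sketch only; 'configureBridge' is hypothetical, and it assumes the step returns a standard Promise):

  // Sketch: suspend the transport health check only around the step that is
  // expected to interrupt connectivity, then always restore it.
  function withHealthCheckDisabled(step) {
      cockpit.hint("ignore_transport_health_check", { data: true });
      return step().finally(function () {
          cockpit.hint("ignore_transport_health_check", { data: false });
      });
  }

  // e.g. around the ovirtmgmt bridge setup:
  withHealthCheckDisabled(function () {
      return configureBridge("ovirtmgmt");    // hypothetical helper
  });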
Hi yzhao,

It seems we met this bug before. Could you help double-confirm this?
Usually, deploying the HE via Cockpit works well. But if the session gets closed, we can't get back to the HE deploy any more. After reconnecting we need to start all over again (although the deployment seems to continue somewhere in the background).
Created attachment 1365984 [details]
ovirt-hosted-engine-setup log
I don't see any reason why there would be a failure here. Is it possible that your engine VM is getting the same IP as the test host? Simone, do you see anything in the log? It looks like HE setup is running successfully...
(In reply to Michael Burman from comment #24)
> Created attachment 1365984 [details]
> ovirt-hosted-engine-setup log

No, the VM has a different IP. And yes, everything in the log seems to be just fine, but I did get disconnected from the HE wizard session in cockpit and had no indication of the actual status of the HE deployment, which is not so nice for the user. Only now, looking at the log, do I see that the setup was successful.
Once disconnected, can you reconnect?
(In reply to Ryan Barry from comment #27)
> Once disconnected, can you reconnect?

How? :) There is no way in cockpit to resume it after a disconnect, only to start again. When logging in to cockpit again, it's not aware of anything; it doesn't know whether the deploy is still running or not. It would be great if I could reconnect to the session.
I would expect, then, that the deployment did not succeed. By "can you reconnect", I meant to cockpit, rather than the wizard. I was wondering whether the firewall rules were being changed out from under us.

Given that the disconnect happens right after the engine comes up, I'm still suspicious that there's something wrong with the environment, especially if this is being tested with nested virtualization. Have you checked the same test scenario over the CLI?

Realistically, we are probably not going to fix this bug. The HE rewrite in 4.2 primarily uses Ansible for deployment. See comment#15.

Also see comment#10. Reworking this to use 'screen' or something else we can attach to is a possibility, but we need to see whether this is still reproducible after the last vestiges of the old wizard are gone.

Lastly, see comment#11. Some of the timeouts are hardcoded into cockpit. We can try Dominik's suggestion in cockpit-ovirt and disable the transport health check, but it may not resolve your issue.

Since this is 100% reproducible on 4.2, it's probably a new bug somewhere in the rewrite. I'm just trying to get some idea of a cause (or of a way in which it differs from this bug, which has been present since 4.0 but is not 100% reproducible). Testing from the CLI would be great, because then we can determine whether or not it is cockpit-ovirt.
(In reply to Ryan Barry from comment #29)
> Given that the disconnect happens right after the engine comes up, I'm still
> suspicious that there's something wrong with the environment, especially if
> this is being tested with nested virtualization. Have you checked the same
> test scenario over the CLI?

1) It's not nested virtualization, but a bare metal host.

2) I don't know about the CLI; I'm not testing the CLI and need to ask someone who does (I'm not a HE tester).

3) The disconnect does not happen when the engine comes up; it happens much, much earlier, I guess on the DHCP response when configuring the ovirtmgmt bridge on the NIC.

4) Yes, you can reconnect to cockpit itself after the disconnect, but not to the HE wizard.

5) I don't see any difference between this bug, which I reported, and what I'm facing now on 4.2; from my side it's the same issue, in the same deploy state. I get disconnected when it starts to configure the ovirtmgmt bridge.
1) Ok, just checking...

2) 'hosted-engine --deploy' follows exactly the same steps as the cockpit wizard did in 4.1 (since it was just a wrapper around hosted-engine --deploy). It should look pretty familiar.

3) I would guess that it actually does not, or that the IP you are assigned after configuring differs somehow. 'hosted-engine --deploy' is terminated when the page unloads. See https://bugzilla.redhat.com/show_bug.cgi?id=1334740

If it continues running to 'Engine is up', this should be just about where it died...

4) What happens if you reconnect? HE status? Or does it ask to start the wizard from scratch?

5) The difference from my perspective is that it's 100% reproducible. And, since the deployment is using ansible, any solution which involves using 'screen' to spawn 'hosted-engine --deploy' is no longer valid, so it would require a completely different solution.
(In reply to Ryan Barry from comment #31)
> 3) I would guess that it actually does not, or that the IP you are assigned
> after configuring differs somehow.
>
> 4) What happens if you reconnect? HE status? Or does it ask to start the
> wizard from scratch?

3) All I know is that I got disconnected at the very beginning of the deploy (it hadn't created the engine VM yet), while vdsm was configuring the bridge. Possibly because of the slow DHCP response and the short timeout in cockpit, I got disconnected from the session.

4) It asks me to start again from scratch.
There is another important difference from 4.1: now the run is (supposedly) fully unattended; HE-setup does not ask anything and cockpit does not need to reply. If this is so, we can run 'HE-setup > some-log-file' and make cockpit check the file instead of a pipe. This way a reconnect should still work, if we always use the same file, or if we have a safe and reliable means of knowing where this file is.
(Adding to previous comment) If indeed cockpit has some kind of 'state', and can have a random/variable log file name, it can probably also save somewhere the offset into the file that it has already processed, so a reconnect can start from there (see the sketch below). This is much, much simpler than implementing something like 'screen' for it.
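A minimal sketch of that idea, assuming an unattended run with output redirected to a well-known log file (the path, storage key, and 'render' callback are hypothetical):

  // Sketch: run the unattended setup with output going to a fixed log file,
  // and resume rendering from a saved byte offset after a reconnect.
  var LOG = "/var/log/ovirt-hosted-engine-setup/cockpit-deploy.log";  // hypothetical

  function startDeploy() {
      // 'sh -c' so the redirection happens on the host side
      return cockpit.spawn(["sh", "-c", "hosted-engine --deploy > " + LOG + " 2>&1"],
                           { superuser: "require" });
  }

  function resumeOutput(render) {
      var offset = parseInt(localStorage.getItem("he-log-offset") || "0", 10);
      // 'tail -c +N' starts output at byte N (1-based); '-f' keeps following
      var tail = cockpit.spawn(["tail", "-c", "+" + (offset + 1), "-f", LOG],
                               { superuser: "require" });
      tail.stream(function (data) {
          offset += data.length;           // approximation: counts characters
          localStorage.setItem("he-log-offset", String(offset));
          render(data);                    // append to the wizard's output pane
      });
      return tail;
  }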
(In reply to Ryan Barry from comment #25)
> I don't see any reason why there would be a failure here.

The setup was successful:

2017-12-11 12:16:42,929+0200 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:204 DIALOG:SEND **%EventStart STAGE terminate METHOD otopi.plugins.gr_he_common.core.misc.Plugin._terminate (None)
2017-12-11 12:16:42,930+0200 INFO otopi.plugins.gr_he_common.core.misc misc._terminate:251 Hosted Engine successfully deployed

> Is it possible that your engine VM is getting the same IP as the test host?
>
> Simone, do you see anything in the log? It looks like HE setup is running
> successfully...

Although the latest messages got lost because the output channel was closed:

dialog.__logString:204 DIALOG:Not SENDING, output is closed: %s **%EventEnd STAGE terminate METHOD otopi.plugins.otopi.dialog.machine.Plugin._terminate (None)
dialog.__logString:204 DIALOG:Not SENDING, output is closed: %s **%EventStart STAGE terminate METHOD otopi.plugins.otopi.core.log.Plugin._terminate (None)
Michael, from my point of view, your problem is:

Once we close the session with the HE setup wizard or get disconnected, on 4.1 or 4.2, we cannot reach the HE setup wizard again. We can only clean the environment and deploy again. Right?

This is different from bug https://bugzilla.redhat.com/show_bug.cgi?id=1522641.
(In reply to Yihui Zhao from comment #36)
> Once we close the session with the HE setup wizard or get disconnected, on
> 4.1 or 4.2, we cannot reach the HE setup wizard again. We can only clean the
> environment and deploy again. Right?

Hello Yihui,

Yes, exactly. Once the session is closed, I can't reach it again and can't know what is going on with the deploy. I can of course check the deploy logs or check whether any relevant process is running, but once kicked out of the session, it's very difficult for the user to understand the deploy status.

The thing is that the problem is not in the HE deploy itself; the deploy actually was successful in the end. It is different from BZ 1522641, because vdsm is running and the HE deploy ends with success. I can reach my HE VM and everything seems to work as expected regarding the deploy itself.

- Will attach 2 records from 2 different HWs, showing the session getting closed while the HE deploy actually keeps running, but I can't re-connect to the session via cockpit. Hope this will explain the problem.
Created attachment 1366457 [details]
host1 cockpit connection closed
From my point of view, this is definitely a duplicate. Even if ansible is being used instead of otopi, the behavior is identical, and all of the same caveats in bug #1422544 apply.

We can try to serialize the object and reconnect to it, but it's also a UX question. Many web applications have behaved in the same way for years: leaving or refreshing the page makes users restart the process from the beginning. If we do not want that, we will instead need to leave the cockpit channel open, serialize it, and reconnect. Note that this will not resolve the original bug (#1422544) if the connection actually drops, since it will not be possible to serialize the object to the disk of the server without a connection to cockpit's dbus backend.

IMO, it would be best to pop up an "are you sure" modal when switching tabs to resolve this bug. Something like:

  this.state = { allowReconnect: true };

  reconnectConfirmation(e) {
      // setState is asynchronous, so check the flag in its callback
      this.setState({ allowReconnect: e.confirmed }, () => {
          if (this.state.allowReconnect) {
              serializeObject(...)
          }
      });
  }

  componentWillMount() {
      try {
          loadSavedState(...)
      } catch (err) {
          // do default operation
      }
  }

In the worst case, to resolve the original bug, we can serialize the object once we actually invoke otopi/ansible, so a disconnect will preserve it.

Note that cockpit itself does not do this. If the page is switched away while in the RHSM plugin, it starts from the beginning. If the page is refreshed while in the cockpit terminal, it starts from scratch on reload. We should try to match this behavior if possible.
*** Bug 1540255 has been marked as a duplicate of this bug. ***
Moving to Phillip, as Ryan is unavailable for the next 2 weeks.
Maybe we can easily use the browser's localStorage now that React is updated? Something like the sketch below.
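A minimal sketch, assuming the wizard state is JSON-serializable (the storage key is hypothetical):

  // Sketch: persist the wizard state so a page reload can resume it.
  const STORAGE_KEY = "cockpit-ovirt.he-wizard-state";   // hypothetical

  function saveWizardState(state) {
      localStorage.setItem(STORAGE_KEY, JSON.stringify(state));
  }

  function loadWizardState() {
      try {
          return JSON.parse(localStorage.getItem(STORAGE_KEY)) || null;
      } catch (err) {
          return null;   // corrupt or missing entry: fall back to a fresh wizard
      }
  }

  // In the component, e.g.:
  //   componentDidMount()  { const saved = loadWizardState(); if (saved) this.setState(saved); }
  //   componentDidUpdate() { saveWizardState(this.state); }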
Not identified as blocker for 4.2.7, moving to 4.2.8
*** Bug 1760488 has been marked as a duplicate of this bug. ***
FYI: in https://github.com/cockpit-project/cockpit/pull/14318 I am adding an example of how a Cockpit page can launch and re-attach to a long-running process, using systemd-run (transient service units). Comments appreciated.
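The shape of that pattern, roughly (an illustrative sketch, not the PR's actual code; the unit name is made up):

  // Launch the long-running job detached from the cockpit session...
  function launchDeploy() {
      return cockpit.spawn(
          ["systemd-run", "--unit", "he-deploy",     // "he-deploy" is illustrative
           "hosted-engine", "--deploy"],
          { superuser: "require" });
  }

  // ...and re-attach from any later session by following the unit's journal.
  function attachDeploy(render) {
      var proc = cockpit.spawn(
          ["journalctl", "-u", "he-deploy", "-f", "-o", "cat"],
          { superuser: "require" });
      proc.stream(render);
      return proc;
  }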
As cockpit itself is meant to be stateless, and the RHV team does not have capacity to implement statefulness just for ovirt-deploy, closing this issue. If you really need this functionality in cockpit, please open a BZ with the cockpit team.