Bug 1172511 - Switch to Hosted Engine TUI menu so slowly due to failed to connect to broker
Summary: Switch to Hosted Engine TUI menu so slowly due to failed to connect to broker
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-node-plugin-hosted-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
high
urgent
Target Milestone: ovirt-3.6.2
: 3.6.2
Assignee: Fabian Deutsch
QA Contact: Ying Cui
URL:
Whiteboard:
Depends On: 1259247 1260470 1269176
Blocks: 1222282 1252796
 
Reported: 2014-12-10 09:17 UTC by Ying Cui
Modified: 2016-03-04 08:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1222282 1252796 (view as bug list)
Environment:
Last Closed: 2016-03-04 08:07:23 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:


Attachments
/var/log (267.39 KB, application/x-gzip)
2014-12-10 09:17 UTC, Ying Cui
sosreport-black-vdsb.qa.lab.tlv.redhat.com-20151116144612.tar.xz (5.48 MB, application/x-xz)
2015-11-16 14:55 UTC, Nikolai Sednev


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2016:0422 | 0 | normal | SHIPPED_LIVE | ovirt-hosted-engine-ha bug fix and enhancement update | 2016-03-09 23:58:25 UTC
oVirt gerrit | 36194 | 0 | master | MERGED | HAClient: allow overriding broker connection parameters | Never
oVirt gerrit | 45267 | 0 | ovirt-hosted-engine-ha-1.2 | MERGED | HAClient: allow overriding broker connection parameters | Never
oVirt gerrit | 49227 | 0 | master | ABANDONED | ui: Use a shorter wait time | Never

Description Ying Cui 2014-12-10 09:17:05 UTC
Created attachment 966708 [details]
/var/log

Description of problem:
After the engine VM has been created but the engine VM OS installation is interrupted or fails, switching to the Hosted Engine page in the TUI configuration menu is very slow.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.0 (20141202.0.el7ev)
ovirt-node-3.1.0-0.28.20141126git25ce016.el7.noarch
ovirt-node-plugin-hosted-engine-0.2.0-5.0.el7ev.x86_64
ovirt-hosted-engine-setup-1.2.1-6.el7ev.noarch
ovirt-hosted-engine-ha-1.2.4-2.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Install RHEV-H 7.0 successfully.
2. Log in to the TUI configuration.
3. Switch to the "Hosted Engine" TUI menu.
4. Set up the VM via PXE.
5. Go through the configuration steps one by one.
6. then the VM has been started:
   Install the OS and shut down or reboot it.
     To continue please make a selection:
       (1)Continue setup - VM installation is complete
       (2)Reboot the VM and restart installation
       (3)Abort setup
       (4)Destroy VM ...
7. Select 4, or interrupt the VM OS installation manually.
8. Go back to the TUI menu, then switch to "Hosted Engine".

Actual results:
After the engine VM OS setup fails, going back to the TUI menu and switching to Hosted Engine is very slow.

Expected results:
The switch to the Hosted Engine TUI menu happens at normal speed.

<snip>
2014-12-10 08:19:49,685       INFO Current page is 'Plugins'
2014-12-10 08:19:49,690       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:49,690       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:50,834       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:50,834       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:54,695       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:54,696       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:55,839       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:55,840       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:59,701       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:59,701       INFO Retrying broker connection in '5' seconds
2014-12-10 08:20:00,845      ERROR Failed to connect to broker, the number of errors has exceeded the limit (5)
2014-12-10 08:20:00,849       INFO Current page is 'Hosted Engine'
2014-12-10 08:20:00,933       INFO Current page is 'Plugins'
</snip>
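
The delay can be read off the log excerpt above by comparing the timestamps of the two "Current page is" entries. A minimal sketch, assuming the timestamp layout shown in the excerpt; the helper names are hypothetical and only a few of the log lines are repeated here:

```python
from datetime import datetime

# A few lines copied from the log excerpt in the description.
LOG_EXCERPT = """\
2014-12-10 08:19:49,685       INFO Current page is 'Plugins'
2014-12-10 08:19:49,690       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:20:00,845      ERROR Failed to connect to broker, the number of errors has exceeded the limit (5)
2014-12-10 08:20:00,849       INFO Current page is 'Hosted Engine'
"""


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def page_switch_seconds(log_text):
    """Seconds between the first and last 'Current page is' entries."""
    stamps = [parse_timestamp(line)
              for line in log_text.splitlines()
              if "Current page is" in line]
    return (stamps[-1] - stamps[0]).total_seconds()


elapsed = page_switch_seconds(LOG_EXCERPT)
print("page switch took %.3f s" % elapsed)
```

For this excerpt the switch from 'Plugins' to 'Hosted Engine' took a little over 11 seconds, all of it spent retrying the broker connection.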

Comment 2 Ryan Barry 2014-12-10 16:32:38 UTC
We can probably handle this pretty easily with another worker thread to update the TUI, but...

I'd really prefer to see this handled from the ovirt-engine-ha-broker. It seems like it should shut down and return to an unconfigured state if setup is aborted.
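
The worker-thread idea mentioned above could look roughly like the following. This is only a sketch under stated assumptions: StatusPage, show() and wait() are hypothetical names and not part of the ovirt-node TUI API, and slow_fetch stands in for whatever slow status call the page makes.

```python
import threading
import time


class StatusPage:
    """Hypothetical page model: shows a placeholder until the status arrives."""

    def __init__(self, fetch_status):
        self._fetch_status = fetch_status  # slow call, e.g. a status query
        self._lock = threading.Lock()
        self._worker = None
        self.text = "Loading engine status..."

    def show(self):
        """Return the placeholder at once; fetch the status in the background."""
        with self._lock:
            placeholder = self.text
        self._worker = threading.Thread(target=self._update)
        self._worker.daemon = True
        self._worker.start()
        return placeholder

    def _update(self):
        try:
            result = self._fetch_status()
        except Exception as exc:
            result = "Cannot read engine status: %s" % exc
        with self._lock:
            self.text = result

    def wait(self, timeout=None):
        """Wait for the background fetch and return the current page text."""
        self._worker.join(timeout)
        with self._lock:
            return self.text


def slow_fetch():
    """Stand-in for a status call that blocks and then fails."""
    time.sleep(0.3)
    raise IOError("broker not reachable")


page = StatusPage(slow_fetch)
print(page.show())   # placeholder, returned without blocking
print(page.wait(2))  # final text once the worker has finished
```

The page switch itself no longer blocks on the status call; the trade-off is that the page has to be redrawn when the worker finishes.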

Comment 3 Fabian Deutsch 2014-12-11 08:33:44 UTC
Sandro, what do you think about Ryan's comment 2?

Comment 4 Sandro Bonazzola 2014-12-11 08:36:29 UTC
Fabian, regarding the broker, I'm redirecting the needinfo to Jiri.

Comment 5 Jiri Moskovcak 2014-12-12 13:15:31 UTC
(In reply to Ryan Barry from comment #2)
> We can probably handle this pretty easily with another worker thread to
> update the TUI, but...
> 
> I'd really prefer to see this handled from the ovirt-engine-ha-broker. It
> seems like it should shut down and return to an unconfigured state if setup
> is aborted.

I don't understand, where is this log coming from? Are you trying to connect to the broker while it's not running? How can the broker handle this? If you're using brokerlink.connect, then you can just call it with the retries and wait parameters to shorten the period during which it tries to connect:

connect(retries=1, wait=0) # this will only try once and won't wait
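
To illustrate why the retries and wait values matter, here is a minimal sketch of a retry loop around a Unix socket connection. It is not the brokerlink.connect implementation; connect_with_retries, the sleep parameter and the socket path are assumptions made for this example.

```python
import socket
import time


def connect_with_retries(socket_path, retries=5, wait=5, sleep=time.sleep):
    """Try to connect to a Unix socket up to `retries` times.

    Sleeps `wait` seconds between attempts (not after the last one).
    Returns the connected socket, or raises the last connection error.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(socket_path)
            return sock
        except OSError as exc:
            sock.close()
            last_error = exc
            if attempt < retries:
                sleep(wait)
    raise last_error


# Hypothetical path; with retries=1 and wait=0 a missing socket fails at once.
try:
    connect_with_retries("/nonexistent/broker.socket", retries=1, wait=0)
except OSError as exc:
    print("Failed to connect to broker: %s" % exc)
```

In this sketch the worst-case delay for a missing socket is (retries - 1) * wait seconds, so retries=1 removes the waiting entirely.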

Comment 6 Ryan Barry 2014-12-12 14:54:42 UTC
Jiri -

We're using ovirt_hosted_engine_ha.client.HAClient().get_all_host_stats()

This doesn't occur before the user has attempted to configure hosted engine (and destroyed/aborted it). It throws an exception about no hosted-engine.conf in that case.

Essentially, we're not interacting with the broker at all. The interaction is through ovirt_hosted_engine_ha.HAClient. This seems reasonable, since we don't need to directly interact with the broker and we're just pulling information it should presumably know about.

Taking a quick look through the code, I don't see an easy way for us to modify any of the connection parameters without directly re-implementing parts of the client code in our plugin, which isn't desirable from a maintenance perspective, nor do I think our plugin should need to be aware of the inner workings of a module provided to use as a client.

It looks like it'd be really easy to add optional args to the stat calls which would be passed to brokerlink.connect, though. Is this something you guys would accept?

Comment 7 Ryan Barry 2014-12-12 14:55:53 UTC
Sandro, any thoughts? It's a problem with/questions about ovirt_hosted_engine_ha.client, not the broker specifically.

Comment 8 Jiri Moskovcak 2014-12-12 15:23:04 UTC
Ok, refusing to connect when the hosted-engine.conf does not exist sounds reasonable. I'll take a look at it on Monday.

Comment 9 Ryan Barry 2014-12-12 15:44:23 UTC
Thanks.

It refuses to connect as expected when the conf doesn't exist (on a fresh install), but I'm guessing that destroying the VM and aborting the setup leaves hosted-engine.conf around and it tries to connect anyway.

It seems like there are two things we could do:

Provide optional arguments for connect() through the client (either in HAClient.__init__ or otherwise, doesn't much matter to us) so we can set a fast timeout and only one retry.

hosted-engine-setup shouldn't leave hosted-engine.conf sitting around if the install is aborted.

Comment 10 Eyal Edri 2014-12-14 13:25:38 UTC
bug pushed to 3.5.1, removing from 3.5.0 trackers.

Comment 11 Sandro Bonazzola 2014-12-15 12:10:39 UTC
(In reply to Ryan Barry from comment #9)

> hosted-engine-setup shouldn't leave hosted-engine.conf sitting around if the
> install is aborted.

This conflicts with having the possibility to resume the setup if interrupted in the middle.
At closeup stage when we're running the VM, all config files have been already written and we already exited the setup part covered by rollback.

From this point on, you can just run hosted-engine --vm-start and you should get a running VM so from a config files perspective, the configuration is consistent with the state of the system.

Comment 12 Ryan Barry 2014-12-15 14:42:21 UTC
Is aborting setup and resuming it with "hosted-engine --vm-start" a supported use case? I haven't seen it in the documentation, but I've only read the upstream hosted-engine bits.

Is it possible to have the broker exit early or otherwise flag it if this happens? I'm trying to see why a connect timeout from the broker would be part of a normal, supported workflow, and what we can do to mitigate its usability impact from the node side.

Comment 13 Jiri Moskovcak 2014-12-16 08:28:46 UTC
The timeout is *between* connections; it should help in the situation where someone restarts the broker while the agent is still running.

Comment 14 Jiri Moskovcak 2014-12-16 10:22:13 UTC
The proposed patch exposes the connect options through the HAClient constructor parameters. Can you please test the patch and let me know if it helps with your problem? The default values in the HAClient are set to retries=1, wait=0, so you don't have to change your code; just apply the patch to ovirt-hosted-engine-ha and build it.
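
The shape of such a change can be sketched as follows. This is not the merged patch: SketchClient and FakeBrokerLink are hypothetical stand-ins, and only the defaults (retries=1, wait=0) and the get_all_host_stats name are taken from this thread.

```python
class FakeBrokerLink:
    """Stand-in for the real broker link; records the arguments it receives."""

    def __init__(self):
        self.calls = []

    def connect(self, retries, wait):
        self.calls.append((retries, wait))
        # Simulate a broker socket that does not exist.
        raise IOError("[Errno 2] No such file or directory")


class SketchClient:
    """Hypothetical client that forwards connection settings to its link."""

    def __init__(self, link, retries=1, wait=0):
        self._link = link
        self._retries = retries
        self._wait = wait

    def get_all_host_stats(self):
        # The caller never touches the link; settings come from __init__.
        self._link.connect(retries=self._retries, wait=self._wait)
        return {}


link = FakeBrokerLink()
client = SketchClient(link)  # defaults: one attempt, no waiting
try:
    client.get_all_host_stats()
    status = "ok"
except IOError as exc:
    status = "Cannot connect to HA daemon: %s" % exc
print(status)
```

A caller that wants the old, more patient behaviour would pass larger values to the constructor, while a UI caller such as the TUI plugin can rely on the defaults and fail fast.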

Comment 15 Ryan Barry 2014-12-17 14:59:37 UTC
I'll see if I can get a scratch build out today to test with.

I'll need to build a new ovirt-hosted-engine-ha RPM to test with. Should I cherry pick it off master to 1.2 to build?

Comment 16 Ying Cui 2014-12-18 03:24:22 UTC
Ryan, this bug fix should be in the ovirt-hosted-engine-ha component, not the ovirt-node-plugin-hosted-engine component. Please also remove 'node' from the whiteboard.

Comment 17 Jiri Moskovcak 2014-12-18 08:42:38 UTC
(In reply to Ryan Barry from comment #15)
> I'll see if I can get a scratch build out today to test with.
> 
> I'll need to build a new ovirt-hosted-engine-ha RPM to test with. Should I
> cherry pick it off master to 1.2 to build?

- yes, you have to cherry-pick it, let me know if there are any problems with it, but I just tried and the cherry-pick works without any conflicts

Comment 19 Eyal Edri 2015-02-25 08:43:19 UTC
3.5.1 is already full with bugs (over 80), and since none of these bugs were added as urgent for 3.5.1 release in the tracker bug, moving to 3.5.2

Comment 21 Roy Golan 2015-08-12 08:47:43 UTC
The patch in this bug only adds the ability to set retries and a timeout.
It won't solve the slow response by itself. This needs to be addressed in the TUI.
I think rhevh needs to take this one?

Comment 22 Fabian Deutsch 2015-08-12 09:12:01 UTC
Yes, the TUI can leverage the new functionality (the timeouts) to provide a better experience.

Let's keep this bug to track the HA-side change, and I'll create a clone to track the Node-side change.

Comment 23 Ying Cui 2015-08-19 09:35:34 UTC
(In reply to Fabian Deutsch from comment #22)
> Yes, the TUI can leverage the new functionality (the timeouts) to provide a
> better experience.
> 
> Let's keep this bug to track the ha sided change, and I'll create a clone to
> track the Node side change.

cloned bug to node: https://bugzilla.redhat.com/show_bug.cgi?id=1252796

Comment 24 Ying Cui 2015-08-19 09:47:21 UTC
From the QE point of view, we have to request escalation of this bug, because any action that interrupts the VM (VM setup failed, terminated, quit...) makes the RHEV-H TUI _so_ slow that switching TUI menus takes about 1~2 minutes each time, which is unacceptable. Once this issue happens, the TUI behaves almost as if it had crashed and nothing can be done. Thanks.

Comment 25 Anatoly Litovsky 2015-08-20 12:27:30 UTC
Does restarting the host return the TUI to its original speed?

Comment 26 Ying Cui 2015-08-21 08:08:55 UTC
Answering comment 25 from my side:
After rebooting RHEV-H, switching TUI menus in RHEV-H is back to normal speed.

*BUT* after the RHEV-H reboot, a new Hosted Engine setup started from the TUI fails.

----------

2015-08-21 07:56:56 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/ha/ha_services.py", line 66, in _programs
RuntimeError: Hosted Engine HA services are already running on this system. Hosted Engine cannot be deployed on a host already running those services.
2015-08-21 07:56:56 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Programs detection': Hosted Engine HA services are already running on this system. Hosted Engine cannot be deployed on a host already running those services.
2015-08-21 07:56:56 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN
------------

Test Env: 
rhevh 10.66.8.233  admin and root password: redhat

-----console-----
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 

[screen is terminating]
Hit <Return> to return to the TUI
------------------

Here, pressing Enter to continue interrupts the process and returns to the TUI; the error then appears in the log under /var/log/ovirt-hosted-engine-setup/

For the above error, do we need to split it into a new bug, or does it have the same root cause as this bug?

Comment 27 Ryan Barry 2015-08-21 16:29:16 UTC
(In reply to Ying Cui from comment #26)
> Answer the comment 25 from my side:
> Reboot RHEV-H, then it's back to the normally speed to switch TUI menu in
> RHEV-H.
> 
> *BUT* after rhevh reboot, if new setup Hosted Engine in TUI, it will be
> failed.
> 
> ----------
> 
> 2015-08-21 07:56:56 DEBUG otopi.context context._executeMethod:152 method
> exception
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in
> _executeMethod
>   File
> "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/ha/
> ha_services.py", line 66, in _programs
> RuntimeError: Hosted Engine HA services are already running on this system.
> Hosted Engine cannot be deployed on a host already running those services.
> 2015-08-21 07:56:56 ERROR otopi.context context._executeMethod:161 Failed to
> execute stage 'Programs detection': Hosted Engine HA services are already
> running on this system. Hosted Engine cannot be deployed on a host already
> running those services.
> 2015-08-21 07:56:56 DEBUG otopi.context context.dumpEnvironment:490
> ENVIRONMENT DUMP - BEGIN
> ------------
> 
> Test Env: 
> rhevh 10.66.8.233  admin and root password: redhat
> 
> -----console-----
> [ INFO  ] Stage: Initializing
> [ INFO  ] Generating a temporary VNC password.
> [ INFO  ] Stage: Environment setup
>           Continuing will configure this host for serving as hypervisor and
> create a VM where you have to install oVirt Engine afterwards.
>           Are you sure you want to continue? (Yes, No)[Yes]: 
> 
> [screen is terminating]
> Hit <Return> to return to the TUI
> ------------------
> 
> Here you click on enter to continue, the process is interrupted, need to
> back to TUI, then error in log under /var/log/ovirt-hosted-engine-setup/
> 
> For above error, do we need to spit it to new bug? or the same root cause of
> this bug?

This could be bz#1208489, but ovirt-ha-broker and ovirt-ha-agent are enabled by default on RHEV-H.

If these services detect a configuration file, they start, and hosted-engine-setup won't run because they're already started.

/etc/ovirt-hosted-engine/hosted-engine.conf needs to be deleted and those services restarted before restarting the setup.
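
The cleanup described above can be sketched as a small helper. This is an illustration under assumptions, not a supported procedure: reset_hosted_engine_state is a hypothetical name, the run parameter exists only so the commands can be inspected without executing them, and any RHEV-H specific handling of persisted files is ignored.

```python
import os
import subprocess

# Path and service names as given in this comment.
CONF_PATH = "/etc/ovirt-hosted-engine/hosted-engine.conf"
SERVICES = ("ovirt-ha-broker", "ovirt-ha-agent")


def reset_hosted_engine_state(conf_path=CONF_PATH, run=subprocess.check_call):
    """Remove a leftover configuration file, then restart the HA services.

    Returns True if a file was removed, False if none was present.
    """
    removed = False
    if os.path.exists(conf_path):
        os.remove(conf_path)
        removed = True
    for service in SERVICES:
        run(["systemctl", "restart", service])
    return removed


if __name__ == "__main__":
    # Dry run: collect the commands instead of executing them, and point at
    # a path that does not exist so nothing is deleted.
    planned = []
    reset_hosted_engine_state("/nonexistent/hosted-engine.conf", run=planned.append)
    for command in planned:
        print(" ".join(command))
```

The helper tolerates a missing configuration file, so it can also be used to check whether a leftover file was actually the cause.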

I've had a discussion with the hosted-engine developers before (but I can't find the thread), and not removing the configuration is an intentional decision. I would suggest that any change to that behavior should be a bug filed against ovirt-hosted-engine, since RHEV-H's behavior matches RHEL's (on RHEL, start hosted-engine setup, cancel out of it once it writes the configuration, then try to start it again, and you'll see the same message), but it may be NOTABUG.

Comment 28 Ying Cui 2015-08-24 07:03:11 UTC
There is no hosted-engine.conf file under /etc/ovirt-hosted-engine/

So far we are still looking for a workaround for when the TUI becomes slow: after restarting RHEV-H, how can the HE setup be restarted via the TUI? Thanks.

Comment 29 Eyal Edri 2015-08-30 11:46:54 UTC
both patches are merged, and afaik hosted engine was built,
can this bug move to ON_QA?

Comment 30 Martin Sivák 2015-08-31 11:21:18 UTC
Isn't this a downstream bug? Errata tool moves those from MODIFIED to ON_QA, doesn't it?

Comment 31 Eyal Edri 2015-08-31 15:17:34 UTC
if you added the bug to errata, it should move it once errata is moved to ON_QA.
but this is a 3.6.0 bug, not 3.5.4, and i assume you don't have errata for 3.6 yet and probably not ON_QA.

Comment 33 Nikolai Sednev 2015-09-16 06:21:38 UTC
We haven't received RHEV-H 3.6 yet; I just checked in Foreman.
This bug can't be verified yet.

Comment 34 Eyal Edri 2015-10-08 15:37:14 UTC
Adding Fabian to reply on RHEV-H availability.

Comment 35 Nikolai Sednev 2015-10-26 16:04:38 UTC
We can't proceed to verification of this bug without first being able to deploy HE on RHEV-H, hence adding a new Depends On: 1269176.

Comment 36 Nikolai Sednev 2015-11-16 14:46:08 UTC
Reproduced on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151113.123.el7ev) on these components:
mom-0.5.1-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.17.10.1-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-3.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-selinux-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-rhn-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-legacy-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-1.4.1-0.0.master.el7ev.noarch




       The VM has been started.
          To continue please install OS and shutdown or reboot the VM.
         
          Make a selection from the options below:
          (1) Continue setup - OS installation is complete
          (2) Power off and restart the VM
          (3) Abort setup
          (4) Destroy VM and abort setup
         
          (1, 2, 3, 4)[1]: 4
[ ERROR ] Failed to execute stage 'Closing up': VM destroyed and setup aborted by user
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151116144051.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
Something went wrong setting up hosted engine, or the setup process was cancelled.

Press any key to continue...

Then, following the steps in the bug description, I went back to step 8, and it took a long time until the system responded.

Sosreport from host is attached.

Comment 37 Nikolai Sednev 2015-11-16 14:55:10 UTC
Created attachment 1094958 [details]
sosreport-black-vdsb.qa.lab.tlv.redhat.com-20151116144612.tar.xz

Comment 38 Roy Golan 2015-12-27 09:25:07 UTC
Needs a closer look to see if this is another issue stalling the TUI or this is a completely different issue.

Comment 41 Fabian Deutsch 2016-01-20 09:26:18 UTC
According to my testing there is now long delay anymore.

Test:

1. Used a successful HE deployment
2. TUI: Switch to HE page
3. Console: Stop ovirt-ha-broker service
4. TUI: Switch to HE page
5. Console: Stop ovirt-ha-agent service
6. TUI: Switch to HE page
7. Console: Start ovirt-ha-agent and ovirt-ha-broker service
8. TUI: Switch to HE page

After step 2: Speed ok
After step 4: Speed ok, message on page: "Engine Status: Cannot connect to HA daemon, please check the logs "
After step 6: Speed ok, message on page: "Engine Status: Cannot connect to HA daemon, please check the logs "
After step 8: Speed ok, no message

Comment 42 Fabian Deutsch 2016-01-20 09:28:04 UTC
(In reply to Fabian Deutsch from comment #41)
> According to my testing there is now long delay anymore.

… no long delay anymore

Comment 43 Fabian Deutsch 2016-01-20 10:42:30 UTC
Note: There is still a small glitch when the screen is switched and ha broker and/or agent are down, but this is a different issue.

Comment 44 Ying Cui 2016-01-21 07:39:55 UTC
This bug is changed to ovirt-node-plugin-hosted-engine component, so I set the QA contact to myself. 

The original issue is fixed in the following versions. Switching TUI menus in RHEV-H after the HE VM setup is aborted now happens at normal speed, not slowly.

I also did not encounter the issue from comment 36, where going back to step 8 took a long time.

Test Version:
# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20160120.0.el7ev)
# rpm -qa ovirt-node-plugin-hosted-engine ovirt-hosted-engine-ha
ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-6.el7ev.noarch

Test steps:
1. Install RHEV-H 7.2 successfully.
2. Log in to the TUI configuration.
3. Switch to the "Hosted Engine" TUI menu.
4. Set up the VM via PXE.
5. Go through the configuration steps one by one.
6. then the VM has been started:
   Install the OS and shut down or reboot it.
     To continue please make a selection:
       (1)Continue setup - VM installation is complete
       (2)Reboot the VM and restart installation
       (3)Abort setup
       (4)Destroy VM ...
7. Select 4, or interrupt the VM OS installation manually.
8. Go back to the TUI menu, then switch to "Hosted Engine".
9. Access to the HE TUI is at normal speed, not slow like before.

Marking this bug as verified. For any further new issues, we will open new bugs to track them. For another issue, see bug 1258754: the TUI still shows "Failed to connect to broker, the number of errors has exceeded the limit (1)", but it is no longer slow.

Comment 45 Julie 2016-02-22 05:58:17 UTC
If this bug requires doc text for errata release, please provide draft text in the doc text field in the following format:

Cause:
Consequence:
Fix:
Result:

The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

