1172511 – Switch to Hosted Engine TUI menu so slowly due to failed to connect to broker

Summary: Switch to Hosted Engine TUI menu so slowly due to failed to connect to broker

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-node-plugin-hosted-engine
Sub Component:
Version:	3.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	ovirt-3.6.2
Target Release:	3.6.2
Assignee:	Fabian Deutsch
QA Contact:	Ying Cui
Docs Contact:
URL:
Whiteboard:
Depends On:	1259247 1260470 1269176
Blocks:	1222282 1252796

Reported:	2014-12-10 09:17 UTC by Ying Cui
Modified:	2016-03-04 08:07 UTC (History)
CC List:	23 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1222282 1252796 (view as bug list)
Environment:
Last Closed:	2016-03-04 08:07:23 UTC
oVirt Team:	Node
Target Upstream Version:
Embargoed:

Attachments
/var/log (267.39 KB, application/x-gzip) 2014-12-10 09:17 UTC, Ying Cui	no flags	Details
sosreport-black-vdsb.qa.lab.tlv.redhat.com-20151116144612.tar.xz (5.48 MB, application/x-xz) 2015-11-16 14:55 UTC, Nikolai Sednev	no flags	Details

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2016:0422	normal	SHIPPED_LIVE	ovirt-hosted-engine-ha bug fix and enhancement update	2016-03-09 23:58:25 UTC
oVirt gerrit	36194	master	MERGED	HAClient: allow overriding broker connection parameters	Never
oVirt gerrit	45267	ovirt-hosted-engine-ha-1.2	MERGED	HAClient: allow overriding broker connection parameters	Never
oVirt gerrit	49227	master	ABANDONED	ui: Use a shorter wait time	Never

Description Ying Cui 2014-12-10 09:17:05 UTC

Created attachment 966708 [details]
/var/log

Description of problem:
The engine VM was created, but its OS installation was interrupted or failed. After returning to the TUI configuration menu, switching to the Hosted Engine TUI menu page is very slow.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.0 (20141202.0.el7ev)
ovirt-node-3.1.0-0.28.20141126git25ce016.el7.noarch
ovirt-node-plugin-hosted-engine-0.2.0-5.0.el7ev.x86_64
ovirt-hosted-engine-setup-1.2.1-6.el7ev.noarch
ovirt-hosted-engine-ha-1.2.4-2.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Install RHEV-H 7.0 successfully.
2. Log in to the TUI configuration.
3. Switch to the "Hosted Engine" TUI menu.
4. Use PXE to set up the VM.
5. Step through the configuration settings one by one.
6. The VM is then started:
   Install the OS and shut down or reboot it.
     To continue please make a selection:
       (1)Continue setup - VM installation is complete
       (2)Reboot the VM and restart installation
       (3)Abort setup
       (4)Destroy VM ...
7. Select 4, or interrupt the VM OS installation manually.
8. Go back to the TUI menu, then switch to "Hosted Engine".

Actual results:
After the engine VM OS setup fails, going back to the TUI menu and switching to Hosted Engine is very slow.

Expected results:
Switching to the Hosted Engine TUI menu works at normal speed.

<snip>
2014-12-10 08:19:49,685       INFO Current page is 'Plugins'
2014-12-10 08:19:49,690       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:49,690       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:50,834       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:50,834       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:54,695       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:54,696       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:55,839       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:55,840       INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:59,701       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:59,701       INFO Retrying broker connection in '5' seconds
2014-12-10 08:20:00,845      ERROR Failed to connect to broker, the number of errors has exceeded the limit (5)
2014-12-10 08:20:00,849       INFO Current page is 'Hosted Engine'
2014-12-10 08:20:00,933       INFO Current page is 'Plugins'
</snip>
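
As a rough illustration of how long the page switch took, the elapsed time can be read off the log timestamps above. This is a minimal sketch and not part of any oVirt tool: the helper names parse_ts and page_switch_delay are made up for this example, and the embedded log text is a subset of the lines in the snippet.

```python
from datetime import datetime

# Subset of the log excerpt above (first and last entries of the page switch).
LOG = """\
2014-12-10 08:19:49,685       INFO Current page is 'Plugins'
2014-12-10 08:19:49,690       INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:20:00,845      ERROR Failed to connect to broker, the number of errors has exceeded the limit (5)
2014-12-10 08:20:00,849       INFO Current page is 'Hosted Engine'
"""


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def page_switch_delay(log_text):
    """Seconds between the first two 'Current page is' entries."""
    pages = [l for l in log_text.splitlines() if "Current page is" in l]
    return (parse_ts(pages[1]) - parse_ts(pages[0])).total_seconds()


delay = page_switch_delay(LOG)
```

For this excerpt the delay between leaving 'Plugins' and reaching 'Hosted Engine' comes out at 11.164 seconds, all of it spent on the failed broker connection attempts.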

Comment 2 Ryan Barry 2014-12-10 16:32:38 UTC

We can probably handle this pretty easily with another worker thread to update the TUI, but...

I'd really prefer to see this handled from the ovirt-engine-ha-broker. It seems like it should shut down and return to an unconfigured state if setup is aborted.
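
For illustration only, the worker-thread idea could look roughly like the sketch below. It assumes nothing about the real ovirt-node TUI code: StatusFetcher and its methods are hypothetical names, and the query callable stands in for whatever blocking status call the page makes. The point is that the page can render a placeholder immediately and pick up the real status once the background thread finishes.

```python
import threading


class StatusFetcher:
    """Run a blocking status query in a background thread (hypothetical helper)."""

    def __init__(self, query):
        self._query = query            # blocking callable returning a status string
        self._lock = threading.Lock()
        self._status = "Loading..."    # placeholder shown until the query returns
        self._thread = None

    def start(self):
        # Returns immediately; the UI thread is never blocked by the query.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        result = self._query()
        with self._lock:
            self._status = result

    def current(self):
        # Safe to call from the UI thread at any time.
        with self._lock:
            return self._status

    def join(self, timeout=None):
        self._thread.join(timeout)
```

A page would call start() when it is opened and poll current() on redraw, so a slow or failing broker connection would delay only the status text, not the menu switch.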

Comment 3 Fabian Deutsch 2014-12-11 08:33:44 UTC

Sandro, what do you think about Ryan's comment 2?

Comment 4 Sandro Bonazzola 2014-12-11 08:36:29 UTC

Fabian, regarding the broker, I'm redirecting the needinfo to Jiri.

Comment 5 Jiri Moskovcak 2014-12-12 13:15:31 UTC

(In reply to Ryan Barry from comment #2)
> We can probably handle this pretty easily with another worker thread to
> update the TUI, but...
> 
> I'd really prefer to see this handled from the ovirt-engine-ha-broker. It
> seems like it should shut down and return to an unconfigured state if setup
> is aborted.

I don't understand, where is this log coming from? Are you trying to connect to the broker while it's not running? How can the broker handle this? If you're using brokerlink.connect, then you can just call it with the retries and wait parameters to shorten the period during which it tries to connect:

connect(retries=1, wait=0) # this will only try once and won't wait
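
To make the effect of the two parameters concrete, here is a minimal sketch of retry-with-wait semantics. It is not the actual brokerlink code: connect_with_retries, the attempt callable and the injectable sleep are assumptions made for this example, and the defaults of 5 and 5 are only taken from the values visible in the log in the description.

```python
import time


def connect_with_retries(attempt, retries=5, wait=5, sleep=time.sleep):
    """Call attempt() until it succeeds or `retries` attempts have failed.

    `attempt` is a hypothetical callable that opens the connection and
    raises OSError on failure; `sleep` is injectable to ease testing.
    """
    errors = 0
    while True:
        try:
            return attempt()
        except OSError as exc:
            errors += 1
            if errors >= retries:
                raise ConnectionError(
                    "failed to connect, the number of errors has "
                    "exceeded the limit (%d)" % retries
                ) from exc
            # Only wait when another attempt will follow.
            sleep(wait)
```

In this sketch, retries=5 and wait=5 block for up to 20 seconds of sleeping when the socket is missing, whereas retries=1 and wait=0 fail after a single attempt without sleeping at all.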

Comment 6 Ryan Barry 2014-12-12 14:54:42 UTC

Jiri -

We're using ovirt_hosted_engine_ha.client.HAClient().get_all_host_stats()

This doesn't occur before the user has attempted to configure hosted engine (and destroyed/aborted it). It throws an exception about no hosted-engine.conf in that case.

Essentially, we're not interacting with the broker at all. The interaction is through ovirt_hosted_engine_ha.HAClient. This seems reasonable, since we don't need to directly interact with the broker and we're just pulling information it should presumably know about.

Taking a quick look through the code, I don't see an easy way for us to modify any of the connection parameters without directly re-implementing parts of the client code in our plugin, which isn't desirable from a maintenance perspective, nor do I think our plugin should need to be aware of the inner workings of a module provided to use as a client.

It looks like it'd be really easy to add optional args to the stat calls which would be passed to brokerlink.connect, though. Is this something you guys would accept?

Comment 7 Ryan Barry 2014-12-12 14:55:53 UTC

Sandro, any thoughts? It's a problem with/questions about ovirt_hosted_engine_ha.client, not the broker specifically.

Comment 8 Jiri Moskovcak 2014-12-12 15:23:04 UTC

Ok, refusing to connect when the hosted-engine.conf does not exist sounds reasonable. I'll take a look at it on Monday.

Comment 9 Ryan Barry 2014-12-12 15:44:23 UTC

Thanks.

It refuses to connect as expected when the conf doesn't exist (on a fresh install), but I'm guessing that destroying the VM and aborting the setup leaves hosted-engine.conf around and it tries to connect anyway.

It seems like there are two things we could do:

Provide optional arguments for connect() through the client (either in HAClient.__init__ or otherwise, doesn't much matter to us) so we can set a fast timeout and only one retry.

hosted-engine-setup shouldn't leave hosted-engine.conf sitting around if the install is aborted.

Comment 10 Eyal Edri 2014-12-14 13:25:38 UTC

bug pushed to 3.5.1, removing from 3.5.0 trackers.

Comment 11 Sandro Bonazzola 2014-12-15 12:10:39 UTC

(In reply to Ryan Barry from comment #9)

> hosted-engine-setup shouldn't leave hosted-engine.conf sitting around if the
> install is aborted.

This conflicts with having the possibility to resume the setup if interrupted in the middle.
At closeup stage when we're running the VM, all config files have been already written and we already exited the setup part covered by rollback.

From this point on, you can just run hosted-engine --vm-start and you should get a running VM so from a config files perspective, the configuration is consistent with the state of the system.

Comment 12 Ryan Barry 2014-12-15 14:42:21 UTC

Is aborting setup and resuming it with "hosted-engine --vm-start" a supported use case? I haven't seen it in the documentation, but I've only read the upstream hosted-engine bits.

Is it possible to have the broker exit early or otherwise flag it if this happens? I'm trying to see why a connect timeout from the broker would be part of a normal, supported workflow, and what we can do to mitigate its usability impact from the node side.

Comment 13 Jiri Moskovcak 2014-12-16 08:28:46 UTC

The timeout is *between* connections; it should help in the situation where someone restarts the broker while the agent is still running.

Comment 14 Jiri Moskovcak 2014-12-16 10:22:13 UTC

The proposed patch exposes the connect options through the HAClient constructor parameters. Can you please test the patch and let me know if it helps with your problem? The default values in the HAClient are set to retries=1, wait=0, so you don't have to change your code; just apply the patch to ovirt-hosted-engine-ha and build it.
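
The shape of such a change can be sketched as follows. This is not the merged patch: BrokerLinkStub and ClientSketch are hypothetical stand-ins written for this example, and only the parameter names retries and wait, their defaults, and the method name get_all_host_stats are taken from this thread.

```python
class BrokerLinkStub:
    """Hypothetical stand-in for the broker link; records connect() arguments."""

    def __init__(self):
        self.calls = []

    def connect(self, retries, wait):
        self.calls.append((retries, wait))


class ClientSketch:
    """Hypothetical client that forwards connection parameters to the link."""

    def __init__(self, link, retries=1, wait=0):
        self._link = link
        self._retries = retries    # default: try only once
        self._wait = wait          # default: do not wait between attempts

    def get_all_host_stats(self):
        # Every stats call connects with the parameters given to the constructor.
        self._link.connect(retries=self._retries, wait=self._wait)
        return {}                  # placeholder result for the sketch
```

With the defaults, existing callers get the fast-fail behaviour without code changes, while a caller that wants the slower, more patient behaviour can pass, for example, retries=3 and wait=2 explicitly.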

Comment 15 Ryan Barry 2014-12-17 14:59:37 UTC

I'll see if I can get a scratch build out today to test with.

I'll need to build a new ovirt-hosted-engine-ha RPM to test with. Should I cherry pick it off master to 1.2 to build?

Comment 16 Ying Cui 2014-12-18 03:24:22 UTC

Ryan, it seems this bug fix should be in the ovirt-hosted-engine-ha component, not the ovirt-node-plugin-hosted-engine component, and the 'node' whiteboard should be removed.

Comment 17 Jiri Moskovcak 2014-12-18 08:42:38 UTC

(In reply to Ryan Barry from comment #15)
> I'll see if I can get a scratch build out today to test with.
> 
> I'll need to build a new ovirt-hosted-engine-ha RPM to test with. Should I
> cherry pick it off master to 1.2 to build?

- yes, you have to cherry-pick it, let me know if there are any problems with it, but I just tried and the cherry-pick works without any conflicts

Comment 19 Eyal Edri 2015-02-25 08:43:19 UTC

3.5.1 is already full with bugs (over 80), and since none of these bugs were added as urgent for 3.5.1 release in the tracker bug, moving to 3.5.2

Comment 21 Roy Golan 2015-08-12 08:47:43 UTC

The patch in the bug will just add the ability to set retries and a timeout.
It won't solve the slow response. This needs to be addressed in the TUI.
I think rhevh needs to take this one?

Comment 22 Fabian Deutsch 2015-08-12 09:12:01 UTC

Yes, the TUI can leverage the new functionality (the timeouts) to provide a better experience.

Let's keep this bug to track the ha sided change, and I'll create a clone to track the Node side change.

Comment 23 Ying Cui 2015-08-19 09:35:34 UTC

(In reply to Fabian Deutsch from comment #22)
> Yes, the TUI can leverage the new functionality (the timeouts) to provide a
> better experience.
> 
> Let's keep this bug to track the ha sided change, and I'll create a clone to
> track the Node side change.

cloned bug to node: https://bugzilla.redhat.com/show_bug.cgi?id=1252796

Comment 24 Ying Cui 2015-08-19 09:47:21 UTC

From the QE point of view, we have to request escalation of this bug, because any action that interrupts the VM (VM setup failed, terminated, quit...) makes the RHEV-H TUI _so_ slow that each TUI menu switch takes about 1~2 minutes, which is unacceptable. Once this issue happens, the TUI behaves almost as if it had crashed, and nothing can be done. Thanks.

Comment 25 Anatoly Litovsky 2015-08-20 12:27:30 UTC

Does restarting the host return the TUI to its original speed?

Comment 26 Ying Cui 2015-08-21 08:08:55 UTC

Answering comment 25 from my side:
After rebooting RHEV-H, switching TUI menus in RHEV-H is back to normal speed.

*BUT* after the RHEV-H reboot, a new Hosted Engine setup started from the TUI fails.

----------

2015-08-21 07:56:56 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/ha/ha_services.py", line 66, in _programs
RuntimeError: Hosted Engine HA services are already running on this system. Hosted Engine cannot be deployed on a host already running those services.
2015-08-21 07:56:56 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Programs detection': Hosted Engine HA services are already running on this system. Hosted Engine cannot be deployed on a host already running those services.
2015-08-21 07:56:56 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN
------------

Test Env:
rhevh 10.66.8.233 admin and root password: redhat

-----console-----
[ INFO ] Stage: Initializing
[ INFO ] Generating a temporary VNC password.
[ INFO ] Stage: Environment setup
Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
Are you sure you want to continue? (Yes, No)[Yes]:

[screen is terminating]
Hit <Return> to return to the TUI
------------------

Here, when you press Enter to continue, the process is interrupted and you have to go back to the TUI; the error then appears in the log under /var/log/ovirt-hosted-engine-setup/

For the above error, do we need to split it into a new bug, or does it have the same root cause as this bug?

Comment 27 Ryan Barry 2015-08-21 16:29:16 UTC

(In reply to Ying Cui from comment #26)
> Answer the comment 25 from my side:
> Reboot RHEV-H, then it's back to the normally speed to switch TUI menu in
> RHEV-H.
> 
> *BUT* after rhevh reboot, if new setup Hosted Engine in TUI, it will be
> failed.
> 
> ----------
> 
> 2015-08-21 07:56:56 DEBUG otopi.context context._executeMethod:152 method
> exception
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in
> _executeMethod
>   File
> "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/ha/
> ha_services.py", line 66, in _programs
> RuntimeError: Hosted Engine HA services are already running on this system.
> Hosted Engine cannot be deployed on a host already running those services.
> 2015-08-21 07:56:56 ERROR otopi.context context._executeMethod:161 Failed to
> execute stage 'Programs detection': Hosted Engine HA services are already
> running on this system. Hosted Engine cannot be deployed on a host already
> running those services.
> 2015-08-21 07:56:56 DEBUG otopi.context context.dumpEnvironment:490
> ENVIRONMENT DUMP - BEGIN
> ------------
> 
> Test Env: 
> rhevh 10.66.8.233  admin and root password: redhat
> 
> -----console-----
> [ INFO  ] Stage: Initializing
> [ INFO  ] Generating a temporary VNC password.
> [ INFO  ] Stage: Environment setup
>           Continuing will configure this host for serving as hypervisor and
> create a VM where you have to install oVirt Engine afterwards.
>           Are you sure you want to continue? (Yes, No)[Yes]: 
> 
> [screen is terminating]
> Hit <Return> to return to the TUI
> ------------------
> 
> Here you click on enter to continue, the process is interrupted, need to
> back to TUI, then error in log under /var/log/ovirt-hosted-engine-setup/
> 
> For above error, do we need to split it to new bug? or the same root cause of
> this bug?

This could be bz#1208489, but ovirt-ha-broker and ovirt-ha-agent are enabled by default on RHEV-H.

If these services detect a configuration file, they start, and hosted-engine-setup won't run because they're already started.

/etc/ovirt-hosted-engine/hosted-engine.conf needs to be deleted and those services restarted before restarting the setup.

I've had a discussion with the hosted-engine developers before (but I can't find the thread), and not removing the configuration is an intentional decision. I would suggest that any change to that behavior should be a bug filed against ovirt-hosted-engine, since RHEV-H's behavior matches RHEL's (on RHEL, start hosted-engine setup, cancel out of it once it writes the configuration, then try to start it again, and you'll see the same message), but it may be NOTABUG.

Comment 28 Ying Cui 2015-08-24 07:03:11 UTC

There is no hosted-engine.conf file under /etc/ovirt-hosted-engine/

So far we are still looking for a workaround for when the TUI is slow to access. After restarting RHEV-H, how do we restart the HE setup via the TUI? Thanks.

Comment 29 Eyal Edri 2015-08-30 11:46:54 UTC

both patches are merged, and afaik hosted engine was built,
can this bug move to ON_QA?

Comment 30 Martin Sivák 2015-08-31 11:21:18 UTC

Isn't this a downstream bug? Errata tool moves those from MODIFIED to ON_QA, doesn't it?

Comment 31 Eyal Edri 2015-08-31 15:17:34 UTC

if you added the bug to errata, it should move it once errata is moved to ON_QA.
but this is a 3.6.0 bug, not 3.5.4, and i assume you don't have errata for 3.6 yet and probably not ON_QA.

Comment 33 Nikolai Sednev 2015-09-16 06:21:38 UTC

We haven't received RHEV-H 3.6 yet; I just checked in Foreman.
This bug can't be verified yet.

Comment 34 Eyal Edri 2015-10-08 15:37:14 UTC

adding Fabian to reply on RHEV-H availability

Comment 35 Nikolai Sednev 2015-10-26 16:04:38 UTC

Without first being able to deploy HE on RHEV-H, I can't proceed with verifying this bug, hence adding a new dependency on 1269176.

Comment 36 Nikolai Sednev 2015-11-16 14:46:08 UTC

Reproduced on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151113.123.el7ev) on these components:
mom-0.5.1-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.17.10.1-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-3.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-selinux-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-rhn-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-legacy-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-1.4.1-0.0.master.el7ev.noarch




       The VM has been started.
          To continue please install OS and shutdown or reboot the VM.
         
          Make a selection from the options below:
          (1) Continue setup - OS installation is complete
          (2) Power off and restart the VM
          (3) Abort setup
          (4) Destroy VM and abort setup
         
          (1, 2, 3, 4)[1]: 4
[ ERROR ] Failed to execute stage 'Closing up': VM destroyed and setup aborted by user
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151116144051.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
Something went wrong setting up hosted engine, or the setup process was cancelled.

Press any key to continue...

Then, following the instructions in the bug's description, I went on to step 8, and it took a long time until the system responded.

Sosreport from host is attached.

Comment 37 Nikolai Sednev 2015-11-16 14:55:10 UTC

Created attachment 1094958 [details]
sosreport-black-vdsb.qa.lab.tlv.redhat.com-20151116144612.tar.xz

Comment 38 Roy Golan 2015-12-27 09:25:07 UTC

Needs a closer look to see if this is another issue stalling the TUI or this is a completely different issue.

Comment 41 Fabian Deutsch 2016-01-20 09:26:18 UTC

According to my testing there is now long delay anymore.

Test:

1. Used a successful HE deployment
2. TUI: Switch to HE page
3. Console: Stop ovirt-ha-broker service
4. TUI: Switch to HE page
5. Console: Stop ovirt-ha-agent service
6. TUI: Switch to HE page
7. Console: Start ovirt-ha-agent and ovirt-ha-broker service
8. TUI: Switch to HE page

After step 2: Speed ok
After step 4: Speed ok, message on page: "Engine Status: Cannot connect to HA daemon, please check the logs "
After step 6: Speed ok, message on page: "Engine Status: Cannot connect to HA daemon, please check the logs "
After step 8: Speed ok, no message

Comment 42 Fabian Deutsch 2016-01-20 09:28:04 UTC

(In reply to Fabian Deutsch from comment #41)
> According to my testing there is now long delay anymore.

… no long delay anymore

Comment 43 Fabian Deutsch 2016-01-20 10:42:30 UTC

Note: There is still a small glitch when the screen is switched and ha broker and/or agent are down, but this is a different issue.

Comment 44 Ying Cui 2016-01-21 07:39:55 UTC

This bug was moved to the ovirt-node-plugin-hosted-engine component, so I set the QA contact to myself.

The original issue is fixed in the following versions. Switching TUI menus in RHEV-H after the HE VM setup is aborted now happens at normal speed, not slowly.

I also did not encounter the issue from comment 36, where going back to step 8 took a long time.

Test Version:
# cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20160120.0.el7ev)
# rpm -qa ovirt-node-plugin-hosted-engine ovirt-hosted-engine-ha
ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-6.el7ev.noarch

Test steps:
1. Install RHEV-H 7.2 successfully.
2. Log in to the TUI configuration.
3. Switch to the "Hosted Engine" TUI menu.
4. Use PXE to set up the VM.
5. Step through the configuration settings one by one.
6. The VM is then started:
   Install the OS and shut down or reboot it.
     To continue please make a selection:
       (1)Continue setup - VM installation is complete
       (2)Reboot the VM and restart installation
       (3)Abort setup
       (4)Destroy VM ...
7. Select 4, or interrupt the VM OS installation manually.
8. Go back to the TUI menu, then switch to "Hosted Engine".
9. The speed of accessing the HE TUI is OK, not slow like before.

Marking this bug as verified for now; for any further new issues we will open new bugs to track them. For another issue, see bug 1258754: the TUI still shows "Failed to connect to broker, the number of errors has exceeded the limit (1)", but it is not slow.

Comment 45 Julie 2016-02-22 05:58:17 UTC

If this bug requires doc text for errata release, please provide draft text in the doc text field in the following format:

Cause:
Consequence:
Fix:
Result:

The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.
