Created attachment 966708 [details]
/var/log

Description of problem:
The engine VM was created, but its OS installation was interrupted or failed. After returning to the TUI configuration menu, switching to the Hosted Engine TUI menu page is very slow.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.0 (20141202.0.el7ev)
ovirt-node-3.1.0-0.28.20141126git25ce016.el7.noarch
ovirt-node-plugin-hosted-engine-0.2.0-5.0.el7ev.x86_64
ovirt-hosted-engine-setup-1.2.1-6.el7ev.noarch
ovirt-hosted-engine-ha-1.2.4-2.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install RHEV-H 7.0 successfully.
2. Log in to the TUI configuration.
3. Switch to the "Hosted Engine" TUI menu.
4. Use PXE to set up the VM.
5. Step through the configuration settings one by one.
6. The VM is then started:
   Install the OS and shut down or reboot it. To continue please make a selection:
   (1)Continue setup - VM installation is complete
   (2)Reboot the VM and restart installation
   (3)Abort setup
   (4)Destroy VM ...
7. Select 4, or interrupt the VM OS installation manually.
8. Go back to the TUI menu, then switch to "Hosted Engine".

Actual results:
After the engine VM OS setup fails, going back to the TUI menu and switching to Hosted Engine is very slow.

Expected results:
Switching to the Hosted Engine TUI menu works at normal speed.
<snip>
2014-12-10 08:19:49,685 INFO Current page is 'Plugins'
2014-12-10 08:19:49,690 INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:49,690 INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:50,834 INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:50,834 INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:54,695 INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:54,696 INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:55,839 INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:55,840 INFO Retrying broker connection in '5' seconds
2014-12-10 08:19:59,701 INFO Failed to connect to broker: [Errno 2] No such file or directory
2014-12-10 08:19:59,701 INFO Retrying broker connection in '5' seconds
2014-12-10 08:20:00,845 ERROR Failed to connect to broker, the number of errors has exceeded the limit (5)
2014-12-10 08:20:00,849 INFO Current page is 'Hosted Engine'
2014-12-10 08:20:00,933 INFO Current page is 'Plugins'
</snip>
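To put a number on the stall in this excerpt, the timestamps of the two "Current page" lines can be subtracted directly. This is a minimal Python sketch; the helper name `elapsed_seconds` is only illustrative, and the two timestamps are copied from the log above.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# 'Plugins' page logged first, 'Hosted Engine' page logged once the retries gave up.
stall = elapsed_seconds("2014-12-10 08:19:49,685", "2014-12-10 08:20:00,849")
print("page switch stalled for %.3f s" % stall)
```

For this excerpt the page switch was blocked for roughly 11 seconds while the broker connection was being retried.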
We can probably handle this pretty easily with another worker thread to update the TUI, but... I'd really prefer to see this handled from the ovirt-engine-ha-broker. It seems like it should shut down and return to an unconfigured state if setup is aborted.
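As an illustration of the worker-thread idea, here is a minimal Python sketch. It is not the ovirt-node TUI code: `BackgroundStatus`, `slow_fetch` and the placeholder text are hypothetical names chosen for this example, and the slow broker query is simulated with a short sleep.

```python
import threading
import time


class BackgroundStatus(object):
    """Fetch a slow status value in a worker thread so the UI thread never blocks."""

    def __init__(self, fetch, placeholder="Loading..."):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._value = placeholder
        self._thread = None

    def refresh(self):
        """Start one background fetch; do nothing if one is already running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return
            self._thread = threading.Thread(target=self._run)
            self._thread.daemon = True
            self._thread.start()

    def _run(self):
        # Runs in the worker thread; errors become a displayable message.
        try:
            value = self._fetch()
        except Exception as exc:
            value = "Error: %s" % exc
        with self._lock:
            self._value = value

    def current(self):
        """Return the latest known value without waiting for the fetch."""
        with self._lock:
            return self._value

    def wait(self, timeout=None):
        """Block until the running fetch (if any) has finished."""
        thread = self._thread
        if thread is not None:
            thread.join(timeout)


def slow_fetch():
    time.sleep(0.2)  # stands in for a status query that takes a while
    return "Engine status: up"


status = BackgroundStatus(slow_fetch)
status.refresh()
first = status.current()  # returns at once with the placeholder
status.wait()
final = status.current()  # now holds the fetched value
```

With this shape, the page can be drawn immediately with the placeholder and redrawn once the worker finishes, so a slow or failing connection would not freeze menu switching.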
Sandro, what do you think about Ryan's comment 2?
Fabian, regarding the broker, I'm redirecting the needinfo to Jiri.
(In reply to Ryan Barry from comment #2)
> We can probably handle this pretty easily with another worker thread to
> update the TUI, but...
>
> I'd really prefer to see this handled from the ovirt-engine-ha-broker. It
> seems like it should shut down and return to an unconfigured state if setup
> is aborted.

I don't understand, where is this log coming from? Are you trying to connect to the broker while it's not running? How can the broker handle this?

If you're using brokerlink.connect, then you can just call it with the retries and wait parameters to shorten the period during which it tries to connect:

connect(retries=1, wait=0)  # this will only try once and won't wait
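To make the effect of the two parameters concrete, here is a rough Python sketch of a retry loop with `retries` and `wait` arguments. It is not the actual brokerlink implementation: the function body, the socket path and the defaults of 5 attempts and 5 seconds are assumptions, the defaults being inferred only from the log excerpt earlier in this bug.

```python
import socket
import time


def connect(path, retries=5, wait=5):
    """Try to open a UNIX socket, making at most `retries` attempts.

    `wait` is the pause in seconds between two attempts; there is no
    pause after the last one.
    """
    last_error = None
    for attempt in range(retries):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except socket.error as exc:
            sock.close()
            last_error = exc
            if attempt < retries - 1:
                time.sleep(wait)
    raise RuntimeError(
        "Failed to connect after %d attempt(s): %s" % (retries, last_error))


# Hypothetical path that does not exist, to mimic a broker that is not running.
start = time.time()
try:
    connect("/nonexistent/broker.socket", retries=1, wait=0)
    connected = True
except RuntimeError:
    connected = False
elapsed = time.time() - start
```

With `retries=1, wait=0` the call fails almost immediately instead of sleeping between attempts, which is the behaviour a UI caller wants.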
Jiri -

We're using ovirt_hosted_engine_ha.client.HAClient().get_all_host_stats()

This doesn't occur before the user has attempted to configure hosted engine (and destroyed/aborted it). It throws an exception about no hosted-engine.conf in that case.

Essentially, we're not interacting with the broker at all. The interaction is through ovirt_hosted_engine_ha.HAClient. This seems reasonable, since we don't need to directly interact with the broker and we're just pulling information it should presumably know about.

Taking a quick look through the code, I don't see an easy way for us to modify any of the connection parameters without directly re-implementing parts of the client code in our plugin, which isn't desirable from a maintenance perspective, nor do I think our plugin should need to be aware of the inner workings of a module provided to us as a client.

It looks like it'd be really easy to add optional args to the stat calls which would be passed to brokerlink.connect, though. Is this something you guys would accept?
Sandro, any thoughts? It's a problem with/questions about ovirt_hosted_engine_ha.client, not the broker specifically.
Ok, refusing to connect when the hosted-engine.conf does not exist sounds reasonable. I'll take a look at it on Monday.
Thanks. It refuses to connect as expected when the conf doesn't exist (on a fresh install), but I'm guessing that destroying the VM and aborting the setup leaves hosted-engine.conf around, so it tries to connect anyway.

It seems like there are two things we could do:

Provide optional arguments for connect() through the client (either in HAClient.__init__ or otherwise, doesn't much matter to us) so we can set a fast timeout and only one retry.

hosted-engine-setup shouldn't leave hosted-engine.conf sitting around if the install is aborted.
bug pushed to 3.5.1, removing from 3.5.0 trackers.
(In reply to Ryan Barry from comment #9)
> hosted-engine-setup shouldn't leave hosted-engine.conf sitting around if the
> install is aborted.

This conflicts with the possibility of resuming the setup if it is interrupted in the middle.
At the closeup stage, when we're running the VM, all config files have already been written and we have already exited the part of the setup covered by rollback.
From this point on, you can just run hosted-engine --vm-start and you should get a running VM, so from a config-file perspective the configuration is consistent with the state of the system.
Is aborting setup and resuming it with "hosted-engine --vm-start" a supported use case? I haven't seen it in the documentation, but I've only read the upstream hosted-engine bits.

Is it possible to have the broker exit early or otherwise flag it if this happens? I'm trying to see why a connect timeout from the broker would be part of a normal, supported workflow, and what we can do to mitigate its usability impact from the node side.
The timeout is *between* connection attempts; it should help in situations where someone restarts the broker while the agent is still running.
The proposed patch exposes the connect options through the HAClient constructor parameters. Can you please test the patch and let me know if it helps with your problem? The default values in HAClient are set to retries=1, wait=0, so you don't have to change your code; just apply the patch to ovirt-hosted-engine-ha and build it.
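The general shape of such a change can be sketched as follows. This is only an illustration in Python, not the patch itself: `StatsClient`, `FakeBrokerLink` and `get_stats` are hypothetical stand-ins, and only the parameter names `retries` and `wait` and their defaults of 1 and 0 are taken from the comment above.

```python
class FakeBrokerLink(object):
    """Stand-in for a broker link; records how connect() was called."""

    def __init__(self):
        self.calls = []

    def connect(self, retries, wait):
        self.calls.append((retries, wait))
        raise RuntimeError("broker not running")


class StatsClient(object):
    """Hypothetical client that forwards connection options to its link."""

    def __init__(self, link, retries=1, wait=0):
        self._link = link
        self._retries = retries
        self._wait = wait

    def get_stats(self):
        # The options chosen at construction time reach every connect call.
        self._link.connect(retries=self._retries, wait=self._wait)
        return {}


link = FakeBrokerLink()

# Existing callers pass nothing and get one attempt with no wait.
try:
    StatsClient(link).get_stats()
except RuntimeError:
    pass

# A long-running caller could still opt in to patient retries.
try:
    StatsClient(link, retries=5, wait=5).get_stats()
except RuntimeError:
    pass
```

Keeping the fast settings as constructor defaults means a caller such as a UI plugin needs no code change, while other callers can request longer retries explicitly.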
I'll see if I can get a scratch build out today to test with. I'll need to build a new ovirt-hosted-engine-ha RPM to test with. Should I cherry pick it off master to 1.2 to build?
Ryan, it looks like this bug fix belongs in the ovirt-hosted-engine-ha component, not the ovirt-node-plugin-hosted-engine component. Please also remove 'node' from the whiteboard.
(In reply to Ryan Barry from comment #15)
> I'll see if I can get a scratch build out today to test with.
>
> I'll need to build a new ovirt-hosted-engine-ha RPM to test with. Should I
> cherry pick it off master to 1.2 to build?

- Yes, you have to cherry-pick it. Let me know if there are any problems with it, but I just tried and the cherry-pick works without any conflicts.
3.5.1 is already full with bugs (over 80), and since none of these bugs were added as urgent for 3.5.1 release in the tracker bug, moving to 3.5.2
The patch in this bug only adds the ability to set retries and a timeout. It won't solve the slow response by itself. This needs to be addressed in the TUI. I think rhevh needs to take this one?
Yes, the TUI can leverage the new functionality (the timeouts) to provide a better experience. Let's keep this bug to track the ha sided change, and I'll create a clone to track the Node side change.
(In reply to Fabian Deutsch from comment #22)
> Yes, the TUI can leverage the new functionality (the timeouts) to provide a
> better experience.
>
> Let's keep this bug to track the ha sided change, and I'll create a clone to
> track the Node side change.

Cloned bug to node: https://bugzilla.redhat.com/show_bug.cgi?id=1252796
From the QE point of view, we have to request escalation of this bug, because any action that interrupts the VM (VM setup failed, terminate, quit...) makes the RHEV-H TUI _so_ slow: switching between TUI menus takes about 1~2 minutes each, which is unacceptable. Once this issue happens, the TUI behaves almost as if it had crashed, and nothing can be done. Thanks.
Does restarting the host return the TUI to its original speed?
Answering comment 25 from my side:
After rebooting RHEV-H, switching between TUI menus is back to normal speed.

*BUT* after the RHEV-H reboot, starting a new Hosted Engine setup in the TUI fails.

----------

2015-08-21 07:56:56 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/ha/ha_services.py", line 66, in _programs
RuntimeError: Hosted Engine HA services are already running on this system. Hosted Engine cannot be deployed on a host already running those services.
2015-08-21 07:56:56 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Programs detection': Hosted Engine HA services are already running on this system. Hosted Engine cannot be deployed on a host already running those services.
2015-08-21 07:56:56 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN
------------

Test Env:
rhevh 10.66.8.233 admin and root password: redhat

-----console-----
[ INFO ] Stage: Initializing
[ INFO ] Generating a temporary VNC password.
[ INFO ] Stage: Environment setup
Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
Are you sure you want to continue? (Yes, No)[Yes]:

[screen is terminating]
Hit <Return> to return to the TUI
------------------

Here, after pressing Enter to continue, the process is interrupted and you have to go back to the TUI; the error then appears in the log under /var/log/ovirt-hosted-engine-setup/

For the above error, do we need to split it into a new bug, or does it have the same root cause as this bug?
(In reply to Ying Cui from comment #26)
> Answer the comment 25 from my side:
> Reboot RHEV-H, then it's back to the normally speed to switch TUI menu in
> RHEV-H.
>
> *BUT* after rhevh reboot, if new setup Hosted Engine in TUI, it will be
> failed.
>
> ----------
>
> 2015-08-21 07:56:56 DEBUG otopi.context context._executeMethod:152 method
> exception
> Traceback (most recent call last):
> File "/usr/lib/python2.7/site-packages/otopi/context.py", line 142, in
> _executeMethod
> File
> "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/ha/
> ha_services.py", line 66, in _programs
> RuntimeError: Hosted Engine HA services are already running on this system.
> Hosted Engine cannot be deployed on a host already running those services.
> 2015-08-21 07:56:56 ERROR otopi.context context._executeMethod:161 Failed to
> execute stage 'Programs detection': Hosted Engine HA services are already
> running on this system. Hosted Engine cannot be deployed on a host already
> running those services.
> 2015-08-21 07:56:56 DEBUG otopi.context context.dumpEnvironment:490
> ENVIRONMENT DUMP - BEGIN
> ------------
>
> Test Env:
> rhevh 10.66.8.233 admin and root password: redhat
>
> -----console-----
> [ INFO ] Stage: Initializing
> [ INFO ] Generating a temporary VNC password.
> [ INFO ] Stage: Environment setup
> Continuing will configure this host for serving as hypervisor and
> create a VM where you have to install oVirt Engine afterwards.
> Are you sure you want to continue? (Yes, No)[Yes]:
>
> [screen is terminating]
> Hit <Return> to return to the TUI
> ------------------
>
> Here you click on enter to continue, the process is interrupted, need to
> back to TUI, then error in log under /var/log/ovirt-hosted-engine-setup/
>
> For above error, do we need to spit it to new bug? or the same root cause of
> this bug?

This could be bz#1208489, but ovirt-ha-broker and ovirt-ha-agent are enabled by default on RHEV-H.
If these services detect a configuration file, they start, and hosted-engine-setup won't run because they're already started.

/etc/ovirt-hosted-engine/hosted-engine.conf needs to be deleted and those services restarted before restarting the setup.

I've had a discussion with the hosted-engine developers before (but I can't find the thread), and not removing the configuration is an intentional decision.

I would suggest that any change to that behavior should be a bug filed against ovirt-hosted-engine, since RHEV-H's behavior matches RHEL's (on RHEL, start hosted-engine setup, cancel out of it once it writes the configuration, then try to start it again, and you'll see the same message), but it may be NOTABUG.
There is no hosted-engine.conf file under /etc/ovirt-hosted-engine/

So far we are still looking for a workaround for when the TUI becomes slow: after restarting RHEV-H, how do we restart the HE setup via the TUI? Thanks.
both patches are merged, and afaik hosted engine was built, can this bug move to ON_QA?
Isn't this a downstream bug? Errata tool moves those from MODIFIED to ON_QA, doesn't it?
if you added the bug to errata, it should move it once errata is moved to ON_QA. but this is a 3.6.0 bug, not 3.5.4, and i assume you don't have errata for 3.6 yet and probably not ON_QA.
We haven't received RHEV-H 3.6 yet; I just checked in Foreman. This bug can't be verified yet.
Adding Fabian to reply on RHEV-H availability.
Without being able to first deploy the HE over RHEVH, can't proceed to verification of this bug, hence adding a new depends on 1269176.
Reproduced on Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20151113.123.el7ev) with these components:
mom-0.5.1-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.17.10.1-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-3.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-selinux-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-3.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-rhn-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-legacy-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-1.4.1-0.0.master.el7ev.noarch

The VM has been started. To continue please install OS and shutdown or reboot the VM.

Make a selection from the options below:
(1) Continue setup - OS installation is complete
(2) Power off and restart the VM
(3) Abort setup
(4) Destroy VM and abort setup

(1, 2, 3, 4)[1]: 4
[ ERROR ] Failed to execute stage 'Closing up': VM destroyed and setup aborted by user
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20151116144051.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination

Something went wrong setting up hosted engine, or the setup process was cancelled.

Press any key to continue...

Then, following the instructions in the bug description, I went back to step 8, and it took a long time until the system responded.

Sosreport from the host is attached.
Created attachment 1094958 [details] sosreport-black-vdsb.qa.lab.tlv.redhat.com-20151116144612.tar.xz
Needs a closer look to see if this is another issue stalling the TUI or this is a completely different issue.
According to my testing there is now long delay anymore.

Test:
1. Used a successful HE deployment
2. TUI: Switch to HE page
3. Console: Stop ovirt-ha-broker service
4. TUI: Switch to HE page
5. Console: Stop ovirt-ha-agent service
6. TUI: Switch to HE page
7. Console: Start ovirt-ha-agent and ovirt-ha-broker service
8. TUI: Switch to HE page

After step 2: Speed ok
After step 4: Speed ok, message on page: "Engine Status: Cannot connect to HA daemon, please check the logs "
After step 6: Speed ok, message on page: "Engine Status: Cannot connect to HA daemon, please check the logs "
After step 8: Speed ok, no message
(In reply to Fabian Deutsch from comment #41)
> According to my testing there is now long delay anymore.

… no long delay anymore
Note: There is still a small glitch when the screen is switched and ha broker and/or agent are down, but this is a different issue.
This bug was moved to the ovirt-node-plugin-hosted-engine component, so I set the QA contact to myself.

The original issue is fixed in the following versions. Switching between TUI menus in RHEV-H after the HE-VM setup is aborted now runs at normal speed, not slowly, and I did not encounter the comment 36 issue, where going back to step 8 took a lot of time.

Test Version:
# cat /etc/redhat-release
Red Hat Enterprise Virtualization Hypervisor (Beta) release 7.2 (20160120.0.el7ev)
# rpm -qa ovirt-node-plugin-hosted-engine ovirt-hosted-engine-ha
ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-6.el7ev.noarch

Test steps:
1. Install RHEV-H 7.2 successfully.
2. Log in to the TUI configuration.
3. Switch to the "Hosted Engine" TUI menu.
4. Use PXE to set up the VM.
5. Step through the configuration settings one by one.
6. The VM is then started:
   Install the OS and shut down or reboot it. To continue please make a selection:
   (1)Continue setup - VM installation is complete
   (2)Reboot the VM and restart installation
   (3)Abort setup
   (4)Destroy VM ...
7. Select 4, or interrupt the VM OS installation manually.
8. Go back to the TUI menu, then switch to "Hosted Engine".
9. The speed of accessing the HE TUI is OK, not slow like before.

Marking this bug as verified. For any further new issues, we will open new bugs to track them.

For another issue, see bug 1258754: the TUI still shows "Failed to connect to broker, the number of errors has exceeded the limit (1)", but it is not slow.
If this bug requires doc text for errata release, please provide draft text in the doc text field in the following format:

Cause:
Consequence:
Fix:
Result:

The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.