Bug 1258754 - [Docs] - Add steps for cleaning up a failed HE deployment
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: Documentation
Version: 3.5.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-3.6.6
Target Release: ---
Assigned To: Julie
QA Contact: Byron Gravenorst
Depends On:
Blocks:
Reported: 2015-09-01 04:13 EDT by Ying Cui
Modified: 2016-12-07 03:33 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-12 01:38:07 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Docs
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
sosreport (7.25 MB, application/x-xz)
2015-09-01 04:22 EDT, Ying Cui
varlog.tar.bz2 (556.01 KB, application/x-bzip)
2015-09-01 04:22 EDT, Ying Cui


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2121581 None None None 2016-01-12 14:40 EST

Description Ying Cui 2015-09-01 04:13:21 EDT
Description of problem:
Note: There were two earlier bugs on this: bug 1172511 and bug 1172966.
After the HE setup process is aborted once, the TUI still shows "Failed to connect to broker, the number of errors has exceeded the limit (1)", although the TUI is no longer slow.

Trying to restart HE setup via the TUI then fails: the HE setup process cannot continue and the user has to return to the TUI.

Version-Release number of selected component (if applicable):
# rpm -qa ovirt-hosted-engine-setup ovirt-hosted-engine-ha 
ovirt-hosted-engine-setup-1.2.5.3-1.el6ev.noarch
ovirt-hosted-engine-ha-1.2.6-3.el6ev.noarch
# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150828.0.el6ev)

How reproducible:
100%

Steps to Reproduce:
1. Install rhevh-20150828.0.el6ev using the TUI.
2. Enable the network, set the root password, and enable SSH on RHEV-H.
3. Configure hosted engine following the correct steps.
4. The VM has been started.  Install the OS and shut down or reboot it.  To continue please make a selection:
         
          (1) Continue setup - VM installation is complete
          (2) Reboot the VM and restart installation
          (3) Abort setup
          (4) Destroy VM and abort setup
         
          (1, 2, 3, 4)[1]: 3
5. Setup aborts.
6. Return to the TUI.
7. Restart HE setup via the TUI; it fails.

-----console-----
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 

[screen is terminating]
Hit <Return> to return to the TUI
------------------

Actual results:
HE setup cannot be restarted after the HE setup process has been aborted.

# hosted-engine --vm-status
Failed to connect to broker, the number of errors has exceeded the limit (1)

# /etc/init.d/ovirt-ha-broker status
ovirt-ha-broker is stopped


Expected results:
HE setup can be restarted successfully after the first attempt has been aborted.


Additional info:

# Log snippet from aborting HE setup; the same as in bug 1172966.
<snip>
2015-08-31 15:55:39 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:215 DIALOG:RECEIVE    3
2015-08-31 15:55:39 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/vm/runvm.py", line 175, in _boot_from_install_media
RuntimeError: OS installation aborted by user
2015-08-31 15:55:39 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Closing up': OS installation aborted by user
2015-08-31 15:55:39 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN

</snip>

# Log snippet from the second HE setup attempt via the TUI, after the abort
<snip>
2015-09-01 07:23:13 DEBUG otopi.context context.dumpEnvironment:504 ENVIRONMENT DUMP - END
2015-09-01 07:23:13 DEBUG otopi.context context._executeMethod:138 Stage late_setup METHOD otopi.plugins.ovirt_hosted_engine_setup.vm.configurevm.Plugin._late_setup
2015-09-01 07:23:13 DEBUG otopi.plugins.ovirt_hosted_engine_setup.vm.configurevm configurevm._late_setup:101 [{'status': 'Up', 'vmId': '0555aa86-e7a4-47ab-ae6a-7128a6e62022'}]
2015-09-01 07:23:13 ERROR otopi.plugins.ovirt_hosted_engine_setup.vm.configurevm configurevm._late_setup:108 The following VMs has been found: 0555aa86-e7a4-47ab-ae6a-7128a6e62022
2015-09-01 07:23:13 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/vm/configurevm.py", line 112, in _late_setup
RuntimeError: Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Environment setup': Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN
</snip>
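
For illustration only, the VM that blocks the second attempt can be pulled out of the setup log with a few lines of Python. This is a minimal sketch, assuming the DEBUG line from configurevm._late_setup keeps the format shown in the snippet above; `running_vms`, `VM_LIST_RE`, and `SAMPLE` are hypothetical names and not part of ovirt-hosted-engine-setup.

```python
import ast
import re

# Matches the DEBUG line written by configurevm._late_setup in the log above.
# The list literal at the end of that line holds one dict per VM found on the host.
# Assumption: the line format is exactly as in the snippet in this report.
VM_LIST_RE = re.compile(r"configurevm\._late_setup:\d+ (\[\{.*\}\])\s*$")


def running_vms(log_text):
    """Return (vmId, status) pairs reported in an otopi setup log."""
    found = []
    for line in log_text.splitlines():
        match = VM_LIST_RE.search(line)
        if match:
            # The list is printed as a Python literal, so literal_eval can read it safely.
            for vm in ast.literal_eval(match.group(1)):
                found.append((vm["vmId"], vm["status"]))
    return found


# The DEBUG line copied from the second-attempt log above.
SAMPLE = (
    "2015-09-01 07:23:13 DEBUG otopi.plugins.ovirt_hosted_engine_setup.vm.configurevm "
    "configurevm._late_setup:101 "
    "[{'status': 'Up', 'vmId': '0555aa86-e7a4-47ab-ae6a-7128a6e62022'}]"
)

if __name__ == "__main__":
    for vm_id, status in running_vms(SAMPLE):
        print(f"{vm_id} is {status}")
```

Any VM listed this way is what makes setup stop with "Cannot setup Hosted Engine with other VMs running".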
Comment 1 Ying Cui 2015-09-01 04:18:51 EDT
This issue should affect RHEL as well. On RHEL the user can yum remove the relevant packages, so the impact is small; on RHEV-H, however, we have to reinstall the whole of RHEV-H until we find a workaround to clean up the host.
Comment 2 Ying Cui 2015-09-01 04:22:12 EDT
Created attachment 1068874
sosreport
Comment 3 Ying Cui 2015-09-01 04:22:43 EDT
Created attachment 1068875
varlog.tar.bz2
Comment 4 Yaniv Lavi (Dary) 2015-09-01 05:44:20 EDT
Any suggested workaround?
Comment 5 Simone Tiraboschi 2015-09-01 06:01:27 EDT
We have some hints here:
http://www.ovirt.org/Hosted_Engine_Howto#Recoving_from_failed_install

The issue is here:
2015-09-01 07:23:13 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/vm/configurevm.py", line 112, in _late_setup
RuntimeError: Cannot setup Hosted Engine with other VMs running

For sure the user has to destroy the previous engine vm with
 hosted-engine --vm-poweroff
and manually cleanup the shared storage in order to try again.

On RHEV-H, almost all the config files are persisted only at the end of the deployment process, so nothing should really be left behind and a second attempt should simply work.
Comment 6 Ying Cui 2015-09-01 06:20:24 EDT
(In reply to Simone Tiraboschi from comment #5)
> For sure the user has to destroy the previous engine vm with
>  hosted-engine --vm-poweroff
> and manually cleanup the shared storage in order to try again.

This workaround works well on RHEV-H after aborting HE setup.
Comment 7 Yaniv Lavi (Dary) 2015-09-01 06:52:43 EDT
(In reply to Ying Cui from comment #6)
> (In reply to Simone Tiraboschi from comment #5)
> > For sure the user has to destroy the previous engine vm with
> >  hosted-engine --vm-poweroff
> > and manually cleanup the shared storage in order to try again.
> 
> This workaround works well on RHEV-H after aborting HE setup.

We should add a button in the TUI to fix this, but I think this is not a blocker.
Can you please add a release note for this?
Moran, is a 3.5.5 target correct in your view?
Comment 8 Fabian Deutsch 2015-09-08 08:56:57 EDT
A button will not have the right context in the current page.
I'd rather pull it into 3.6.0, where we have refactored the page and a submenu/dialog is available for these kinds of actions.
Comment 9 Yaniv Lavi (Dary) 2015-09-08 09:18:59 EDT
We have a workaround that requires dropping to a shell, which customers are not supposed to do without GSS. Moran, please review.
Comment 12 Fabian Deutsch 2015-12-22 06:55:00 EST
Simone, from comment 0:

RuntimeError: Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Environment setup': Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN

Should he-setup bring down a VM it has spawned if the HE setup is getting aborted?
Comment 13 Simone Tiraboschi 2015-12-22 07:51:33 EST
(In reply to Fabian Deutsch from comment #12)
> Should he-setup bring down a VM it has spawned if the HE setup is getting
> aborted?

No, in case of a failure it's currently up to the user to do that, with the
 hosted-engine --vm-poweroff
command.

It's also up to the user to clean up the storage if needed.
Comment 14 Fabian Deutsch 2015-12-22 12:07:54 EST
Okay, thanks.

I think we'll stick to how this works on RHEL as well.

Julie, can a note be added to the right documentation to tell the reader that the RHEL guidelines should be followed to clean up a host, in case of an aborted hosted-engine setup?
Comment 15 Julie 2015-12-22 21:05:16 EST
In our docs, we don't tell users how to clean up a failed HE setup. I think the assumed knowledge was to do a fresh installation of RHEL or RHEV-H in case of an HE setup failure. In our testing, if something goes wrong, we've always spun up a fresh install to proceed with HE setup, so I'd like some clarification on the supported way forward: do we tell users to always start from a fresh installation, or do we provide clean-up procedures for RHEL and RHEV-H?
Maybe someone from the support team can weigh in as well?
Comment 16 Marina 2015-12-23 15:04:16 EST
Hi Julie, indeed we are missing some important info in the HE guide. See bug 1293971.

I think it is not clear what to do after a failed deployment, and I personally think it should be mentioned somewhere, either in a KCS article or in the official docs. I could not find anything related in the knowledgebase and I would be happy to create a KCS article, but it is not clear to me what the process should be. (In the long term, of course, it would be preferable to have a clean-up tool.)

Simone, can you please specify the steps for cleaning up a failed HE deployment?
Comment 17 Simone Tiraboschi 2016-01-12 13:37:05 EST
We have more than one RFE and we are working on it for 4.0.

In the meantime:

 hosted-engine --vm-poweroff # to poweroff the engine VM if running
 systemctl stop ovirt-ha-agent; systemctl stop ovirt-ha-broker; systemctl stop vdsmd
 /bin/rm /etc/ovirt-hosted-engine/hosted-engine.conf
 /bin/rm /etc/ovirt-hosted-engine/answers.conf
 /bin/rm /etc/vdsm/vdsm.conf
 /bin/rm /etc/pki/vdsm/*/*.pem
 /bin/rm /etc/pki/CA/cacert.pem
 /bin/rm /etc/pki/libvirt/*.pem
 /bin/rm /etc/pki/libvirt/private/*.pem

This acts only on the single host. The hosted-engine image is on the shared storage (an iSCSI or FC LUN, an NFS share, or a Gluster volume in 3.6), and cleaning that is currently up to the user, since it lives on another system and may also be in use by different hosts.

VDSM also doesn't automatically disconnect the storage server, and the disconnectStorageServer verb, if called explicitly, can leave behind some LVM volumes that could cause issues on the next attempt. So the easiest (but really ugly!!!) way is to reboot the host before the next attempt.
We also have an RFE on this:
https://bugzilla.redhat.com/show_bug.cgi?id=1149738

In theory it could be possible to redeploy with the answer file from previous attempts, but it also contains the LUN UUID and, depending on how the user cleaned the LUN, that might not be valid anymore, so here too restarting from scratch is the safer option.
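
As an illustration, the host-side steps listed in this comment could be collected into a dry-run helper that only prints what it would do. This is a sketch under stated assumptions, not a supported tool: `cleanup_plan`, `SERVICES`, and `PATHS` are hypothetical names, the commands and paths are copied from the list above, and nothing here touches the shared storage.

```python
import glob
import shlex

# Services stopped in the list above, in the same order.
SERVICES = ["ovirt-ha-agent", "ovirt-ha-broker", "vdsmd"]

# Files and glob patterns removed in the list above.
PATHS = [
    "/etc/ovirt-hosted-engine/hosted-engine.conf",
    "/etc/ovirt-hosted-engine/answers.conf",
    "/etc/vdsm/vdsm.conf",
    "/etc/pki/vdsm/*/*.pem",
    "/etc/pki/CA/cacert.pem",
    "/etc/pki/libvirt/*.pem",
    "/etc/pki/libvirt/private/*.pem",
]


def cleanup_plan(expand=glob.glob):
    """Build the ordered list of commands without running any of them.

    `expand` turns a pattern into concrete paths; the default, glob.glob,
    returns only paths that exist, so missing files are skipped.
    """
    plan = [["hosted-engine", "--vm-poweroff"]]  # power off the engine VM if running
    plan += [["systemctl", "stop", service] for service in SERVICES]
    for pattern in PATHS:
        for path in sorted(expand(pattern)):
            plan.append(["/bin/rm", path])
    return plan


if __name__ == "__main__":
    # Dry run: print the commands for review instead of executing them.
    for command in cleanup_plan():
        print(" ".join(shlex.quote(arg) for arg in command))
```

The systemctl calls follow this comment; the el6-based RHEV-H host in the original report manages the same services through /etc/init.d scripts, so the service commands would need adjusting there. A reboot before the next attempt is still advisable, as noted above.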
Comment 19 Lucy Bopf 2016-05-12 01:47:25 EDT
Assigning to Julie for review.

Julie, we'll just need to review the KCS Solution attached to this bug, and then publish it when it's ready.
Comment 20 Marina 2016-05-12 16:55:13 EDT
(In reply to Simone Tiraboschi from comment #17)
> VDSM also doesn't automatically disconnect the storage server, and the
> disconnectStorageServer verb, if called explicitly, can leave behind some
> LVM volumes that could cause issues on the next attempt. So the easiest
> (but really ugly!!!) way is to reboot the host before the next attempt.
> We also have an RFE on this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1149738
I'm not sure a reboot would be sufficient.
For instance, if the storage is iSCSI, we would also need to clear the /var/lib/iscsi directory. It may be worth asking a storage person for advice if we want to publish this solution through the official documentation.
For now, I am publishing this:
https://access.redhat.com/solutions/2121581
Comment 21 Yaniv Lavi (Dary) 2016-05-15 04:49:20 EDT
Please provide the info once you talk with the storage team.
Comment 23 Allon Mureinik 2016-05-24 03:39:41 EDT
Liron - can you take a look please?
Comment 26 Liron Aravot 2016-07-07 03:54:50 EDT
It makes sense to me, Nir - any other opinion?
Comment 28 Nir Soffer 2016-08-29 11:25:43 EDT
(In reply to Liron Aravot from comment #26)
> It makes sense to me, Nir - any other opinion?

No
