Bug 1258754

Summary: [Docs] - Add steps for cleaning up a failed HE deployment
Product: Red Hat Enterprise Virtualization Manager
Reporter: Ying Cui <ycui>
Component: Documentation
Assignee: Julie <juwu>
Status: CLOSED CURRENTRELEASE
QA Contact: Byron Gravenorst <bgraveno>
Severity: high
Docs Contact:
Priority: high
Version: 3.5.4
CC: adahms, amureini, cshao, dfediuck, fdeutsch, gklein, istein, juwu, laravot, lbopf, leiwang, lsurette, mgoldboi, mkalinin, nsednev, nsoffer, rbalakri, rbarry, srevivo, stirabos, ycui, ylavi, yzhao
Target Milestone: ovirt-3.6.6
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-12 05:38:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Docs
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
sosreport (flags: none)
varlog.tar.bz2 (flags: none)

Description Ying Cui 2015-09-01 08:13:21 UTC
Description of problem:
Note: there were two related bugs before this one: bug 1172511 and bug 1172966.
First, abort the HE setup process: the TUI still shows "Failed to connect to broker, the number of errors has exceeded the limit (1)", though no longer slowly.

Then, trying to restart HE setup via the TUI fails: the HE setup process cannot continue, and you are returned to the TUI.

Version-Release number of selected component (if applicable):
# rpm -qa ovirt-hosted-engine-setup ovirt-hosted-engine-ha 
ovirt-hosted-engine-setup-1.2.5.3-1.el6ev.noarch
ovirt-hosted-engine-ha-1.2.6-3.el6ev.noarch
# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150828.0.el6ev)

How reproducible:
100%

Steps to Reproduce:
1. Install rhevh-20150828.0.el6ev with TUI
2. Enable the network, set the root password, and enable SSH on RHEV-H.
3. Configure hosted engine, following the correct steps.
4. The VM has been started.  Install the OS and shut down or reboot it.  To continue please make a selection:
         
          (1) Continue setup - VM installation is complete
          (2) Reboot the VM and restart installation
          (3) Abort setup
          (4) Destroy VM and abort setup
         
          (1, 2, 3, 4)[1]: 3
5. Setup exits after the abort.
6. Return to the TUI.
7. Restart HE setup via the TUI; it fails.

-----console-----
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 

[screen is terminating]
Hit <Return> to return to the TUI
------------------

Actual results:
Cannot restart HE setup after aborting the HE setup process.

# hosted-engine --vm-status
Failed to connect to broker, the number of errors has exceeded the limit (1)

# /etc/init.d/ovirt-ha-broker status
ovirt-ha-broker is stopped


Expected results:
HE setup can be restarted successfully after the first HE setup process is aborted.


Additional info:

# log snip while aborting HE setup; see bug 1172966 for the same.
<snip>
2015-08-31 15:55:39 DEBUG otopi.plugins.otopi.dialog.human dialog.__logString:215 DIALOG:RECEIVE    3
2015-08-31 15:55:39 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/vm/runvm.py", line 175, in _boot_from_install_media
RuntimeError: OS installation aborted by user
2015-08-31 15:55:39 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Closing up': OS installation aborted by user
2015-08-31 15:55:39 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN

</snip>

#  log snip when the second time to setup HE via TUI after abort HE
<snip>
2015-09-01 07:23:13 DEBUG otopi.context context.dumpEnvironment:504 ENVIRONMENT DUMP - END
2015-09-01 07:23:13 DEBUG otopi.context context._executeMethod:138 Stage late_setup METHOD otopi.plugins.ovirt_hosted_engine_setup.vm.configurevm.Plugin._late_setup
2015-09-01 07:23:13 DEBUG otopi.plugins.ovirt_hosted_engine_setup.vm.configurevm configurevm._late_setup:101 [{'status': 'Up', 'vmId': '0555aa86-e7a4-47ab-ae6a-7128a6e62022'}]
2015-09-01 07:23:13 ERROR otopi.plugins.ovirt_hosted_engine_setup.vm.configurevm configurevm._late_setup:108 The following VMs has been found: 0555aa86-e7a4-47ab-ae6a-7128a6e62022
2015-09-01 07:23:13 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/vm/configurevm.py", line 112, in _late_setup
RuntimeError: Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Environment setup': Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN
</snip>

Comment 1 Ying Cui 2015-09-01 08:18:51 UTC
This issue should exist on RHEL as well, but on RHEL the user can "yum remove" the relevant packages, so the impact is limited. On RHEV-H, however, we have to re-install the whole RHEV-H until we find a workaround to clean up the host.

Comment 2 Ying Cui 2015-09-01 08:22:12 UTC
Created attachment 1068874 [details]
sosreport

Comment 3 Ying Cui 2015-09-01 08:22:43 UTC
Created attachment 1068875 [details]
varlog.tar.bz2

Comment 4 Yaniv Lavi 2015-09-01 09:44:20 UTC
Any suggested workaround?

Comment 5 Simone Tiraboschi 2015-09-01 10:01:27 UTC
We have some hints here:
http://www.ovirt.org/Hosted_Engine_Howto#Recoving_from_failed_install

The issue is here:
2015-09-01 07:23:13 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/vm/configurevm.py", line 112, in _late_setup
RuntimeError: Cannot setup Hosted Engine with other VMs running

For sure the user has to destroy the previous engine vm with
 hosted-engine --vm-poweroff
and manually cleanup the shared storage in order to try again.

On RHEV-H, almost all of the config files are persisted only at the end of the deployment process, so nothing should really be left behind and a second attempt should simply work.

Comment 6 Ying Cui 2015-09-01 10:20:24 UTC
(In reply to Simone Tiraboschi from comment #5)
> For sure the user has to destroy the previous engine vm with
>  hosted-engine --vm-poweroff
> and manually cleanup the shared storage in order to try again.

This workaround works well on RHEV-H after aborting HE setup.

Comment 7 Yaniv Lavi 2015-09-01 10:52:43 UTC
(In reply to Ying Cui from comment #6)
> (In reply to Simone Tiraboschi from comment #5)
> > For sure the user has to destroy the previous engine vm with
> >  hosted-engine --vm-poweroff
> > and manually cleanup the shared storage in order to try again.
> 
> This workaround works good on RHEV-H after aborting HE setup.

We should add a button in the TUI to fix this, but I think this is not a blocker.
Can you please add a release note for this?
Moran, is a 3.5.5 target correct in your view?

Comment 8 Fabian Deutsch 2015-09-08 12:56:57 UTC
A button will not have the right context in the current page.
I'd rather pull it into 3.6.0, where we have refactored the page and a submenu/dialog is available for these kinds of actions.

Comment 9 Yaniv Lavi 2015-09-08 13:18:59 UTC
We have a workaround, but it requires dropping to a shell, which customers are not supposed to do without GSS. Moran, please review.

Comment 12 Fabian Deutsch 2015-12-22 11:55:00 UTC
Simone, from comment 0:

RuntimeError: Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Environment setup': Cannot setup Hosted Engine with other VMs running
2015-09-01 07:23:13 DEBUG otopi.context context.dumpEnvironment:490 ENVIRONMENT DUMP - BEGIN

Should he-setup bring down a VM it has spawned if the HE setup is getting aborted?

Comment 13 Simone Tiraboschi 2015-12-22 12:51:33 UTC
(In reply to Fabian Deutsch from comment #12)
> Should he-setup bring down a VM it has spawned if the HE setup is getting
> aborted?

No; in case of a failure it's currently up to the user, via the
 hosted-engine --vm-poweroff
command.

It's also up to the user to cleanup the storage if needed.

Comment 14 Fabian Deutsch 2015-12-22 17:07:54 UTC
Okay, thanks.

I think we'll stick to how this works on RHEL as well.

Julie, can a note be added to the right documentation to tell the reader that the RHEL guidelines should be followed to clean up a host, in case of an aborted hosted-engine setup?

Comment 15 Julie 2015-12-23 02:05:16 UTC
In our docs, we don't tell users how to clean up a failed HE setup. I think the assumed knowledge was to do a fresh installation of RHEL or RHEV-H in case of an HE setup failure. In our testing, if something goes wrong, we've always spun up a fresh install to proceed with HE setup, so I'd like some clarification on the supported way forward: do we tell users to always start from a fresh installation, or do we provide clean-up procedures for RHEL and RHEV-H?
Maybe someone from the support team can weigh in as well?

Comment 16 Marina Kalinin 2015-12-23 20:04:16 UTC
Hi Julie, indeed we are missing some important info in the HE guide. See bug 1293971.

I think it is not clear what to do after a failed deployment, and I personally think it should be documented somewhere, either in a KCS article or in the official docs. I could not find anything related in the knowledge base, and I would be happy to write a KCS article, but the process itself is not clear to me. (In the long term, of course, it would be preferable to have a clean-up tool.)

Simone, can you please specify the steps for cleaning up a failed HE deployment?

Comment 17 Simone Tiraboschi 2016-01-12 18:37:05 UTC
We have more than one RFE and we are working on it for 4.0.

In the meantime:

 hosted-engine --vm-poweroff # to poweroff the engine VM if running
 systemctl stop ovirt-ha-agent; systemctl stop ovirt-ha-broker; systemctl stop vdsmd
 /bin/rm /etc/ovirt-hosted-engine/hosted-engine.conf
 /bin/rm /etc/ovirt-hosted-engine/answers.conf
 /bin/rm /etc/vdsm/vdsm.conf
 /bin/rm /etc/pki/vdsm/*/*.pem
 /bin/rm /etc/pki/CA/cacert.pem
 /bin/rm /etc/pki/libvirt/*.pem
 /bin/rm /etc/pki/libvirt/private/*.pem

Also, this only acts on the single host, while the hosted-engine image is on the shared storage (an iSCSI or FC LUN, an NFS share or a Gluster volume in 3.6); cleaning that up is currently up to the user, from another system, and the storage may also be in use by other hosts.

Also, VDSM doesn't automatically disconnect the storage server, and the disconnectStorageServer verb, even when explicitly called, can finish with some LVM volume leftovers that could cause issues on the next attempt; so the easiest (but really ugly!) way is to reboot the host before the next attempt.
We have an RFE also on this:
https://bugzilla.redhat.com/show_bug.cgi?id=1149738

In theory it could be possible to redeploy with the answer file from previous attempts, but this also contains the LUN UUID and, depending on how the user cleaned the LUN, it may no longer be valid; so here too, restarting from scratch is the safer option.
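
The manual steps above can be consolidated into one script. This is only a sketch based on the commands listed in this comment, not an official tool; the DRY_RUN guard and the run() helper are additions so the destructive steps can be reviewed before anything is executed. The shared storage still has to be cleaned separately, and the host rebooted, before the next attempt.

```shell
#!/bin/sh
# Sketch: consolidated cleanup of a failed hosted-engine deployment on a host,
# based on the steps in this comment. With DRY_RUN=1 (the default) the script
# only prints what it would do.
DRY_RUN=${DRY_RUN:-1}
PLANNED=""

run() {
    PLANNED="$PLANNED$*; "          # record every planned step for review
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Power off the engine VM if it is still running.
run hosted-engine --vm-poweroff

# Stop the HA services and VDSM before removing their configuration.
run systemctl stop ovirt-ha-agent
run systemctl stop ovirt-ha-broker
run systemctl stop vdsmd

# Remove hosted-engine and VDSM configuration and certificates.
for f in \
    /etc/ovirt-hosted-engine/hosted-engine.conf \
    /etc/ovirt-hosted-engine/answers.conf \
    /etc/vdsm/vdsm.conf \
    /etc/pki/vdsm/*/*.pem \
    /etc/pki/CA/cacert.pem \
    /etc/pki/libvirt/*.pem \
    /etc/pki/libvirt/private/*.pem
do
    run /bin/rm -f "$f"
done
```

Run it once with the default DRY_RUN=1, check the printed plan, then rerun with DRY_RUN=0 as root.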

Comment 19 Lucy Bopf 2016-05-12 05:47:25 UTC
Assigning to Julie for review.

Julie, we'll just need to review the KCS Solution attached to this bug, and then publish it when it's ready.

Comment 20 Marina Kalinin 2016-05-12 20:55:13 UTC
(In reply to Simone Tiraboschi from comment #17)
> Then VDSM doesn't automatically disconnect the storage server and also the
> disconnectStorageServer verb if explicitly called can finish with some LVM
> volumes leftovers that could cause issues on the next attempt so the easiest
> (but really ugly!!!) way is to reboot the host before the next attempt.
> We have an RFE also on this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1149738
Not sure a reboot would be sufficient.
For instance, if the storage is iSCSI, we would need to clear the /var/lib/iscsi directory. Maybe it is worth asking a storage person for advice if we want to publish this solution through the official documentation.
At this point I am publishing this:
https://access.redhat.com/solutions/2121581
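
The iSCSI part mentioned above could be sketched as follows. This is a hypothetical sequence, not from the bug or the KCS article: it uses iscsiadm (whose node records live under /var/lib/iscsi) rather than deleting the directory by hand, and reuses the same DRY_RUN/run() review guard so nothing is executed until the plan has been checked.

```shell
#!/bin/sh
# Sketch (assumption, not a verified procedure): clear stale iSCSI sessions
# and node records after a failed HE deployment that used iSCSI storage.
DRY_RUN=${DRY_RUN:-1}
PLANNED=""

run() {
    PLANNED="$PLANNED$*; "          # record every planned step for review
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run iscsiadm -m session             # list active sessions first
run iscsiadm -m node -u             # log out of all recorded nodes
run iscsiadm -m node -o delete      # delete node records under /var/lib/iscsi
```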

Comment 21 Yaniv Lavi 2016-05-15 08:49:20 UTC
Please provide the info once you talk with the storage team.

Comment 23 Allon Mureinik 2016-05-24 07:39:41 UTC
Liron - can you take a look please?

Comment 26 Liron Aravot 2016-07-07 07:54:50 UTC
It makes sense to me, Nir - any other opinion?

Comment 28 Nir Soffer 2016-08-29 15:25:43 UTC
(In reply to Liron Aravot from comment #26)
> It makes sense to me, Nir - any other opinion?

No