1418008 – CFME VM boots into emergency mode

Summary: CFME VM boots into emergency mode

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Quickstart Cloud Installer
Classification:	Red Hat
Component:	Installation - CloudForms
Sub Component:
Version:	1.1
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	1.1
Assignee:	jkim
QA Contact:	Tasos Papaioannou
Docs Contact:	Dan Macpherson
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-01-31 15:23 UTC by James Olin Oden
Modified:	2017-02-28 01:45 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-02-28 01:45:37 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2017:0335	0	normal	SHIPPED_LIVE	Red Hat Quickstart Installer 1.1	2017-02-28 06:36:13 UTC

Description James Olin Oden 2017-01-31 15:23:32 UTC

Description of problem:
I was doing an RHV (self-hosted, 1H) + CFME deployment. The deployment of CFME failed with the following error on the CFME deployment task:

   Failed to update root password on appliance. Retries exhausted. Error: execution expired

I tried to ping the CFME VM, but it did not respond. I then went into the RHV engine's UI and saw that the VM existed, so I connected to its console and found that it was in emergency mode. Unfortunately, I hit CTRL-D by accident and it rebooted.

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170130.t.0

How reproducible:
First time seeing this.

Steps to Reproduce:
1.  Do an RHV (self-hosted, 1H) + CFME deployment.

Actual results:
Failed to deploy CFME.

Expected results:
CFME deployed fine.

Comment 2 Jason Montleon 2017-01-31 15:34:22 UTC

Can you run journalctl and try to determine why it went into emergency mode?

You should be able to use -b 1 to look at the first boot (or -b -1 to look at the previous boot, if you've only rebooted once). Without more info there isn't much we can do to troubleshoot this.
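
For anyone collecting this output later, here is a minimal Python sketch of the journalctl invocations suggested above. The helper names `journalctl_cmd` and `read_boot_log` are hypothetical; the `-b`, `-p` and `--no-pager` options are standard journalctl flags. Note that earlier boots are only available when the journal is stored persistently on the appliance, which is an assumption here.

```python
import subprocess


def journalctl_cmd(boot, priority=None):
    """Build a journalctl argv for one boot.

    boot follows journalctl's -b convention: 1 is the first boot in the
    journal, -1 is the boot before the current one.
    """
    cmd = ["journalctl", "--no-pager", "-b", str(boot)]
    if priority is not None:
        # e.g. "err" limits output to messages of error level and worse
        cmd += ["-p", priority]
    return cmd


def read_boot_log(boot, priority=None):
    """Run journalctl locally (on the appliance) and return its output."""
    result = subprocess.run(
        journalctl_cmd(boot, priority),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `read_boot_log(-1, "err")` on the CFME VM would return the error-level messages from the previous boot, which is where the reason for entering emergency mode would most likely show up.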

Comment 3 James Olin Oden 2017-01-31 15:51:10 UTC

Unfortunately, I could not log in to the system to run journalctl. At this point I've already recycled the nodes and am trying to recreate this bug. I'll be sure to get the journalctl output if it recurs.

Comment 4 James Olin Oden 2017-01-31 18:19:34 UTC

When I did this deployment again I got the following error:

   Failed to update admin password on appliance. Error message: Request Timeout

Note that this time it's the admin password, not the root password, that it failed to update.

This time I can log in as root, but I can't log in to the console.

Comment 5 James Olin Oden 2017-01-31 18:22:11 UTC

I did notice that this appears in the production.log file multiple times:

2017-01-31 10:01:00 [app] [F]
 | ActionView::MissingTemplate (Missing template common/500 with {:locale=>[:en], :formats=>[:json], :variants=>[], :handlers=>[:erb, :builder, :raw, :ruby, :rabl]}. Searched in:
 |   * "/usr/share/foreman/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/redhat_access-1.0.13/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/fusor_ui-1.1.20/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/fusor_server-1.1.34/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.0.90/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.7.14.11/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/foreman_discovery-5.0.0.9/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/bastion-3.2.0.10/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/foreman_bootdisk-6.1.0.3/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/foreman_docker-2.0.1.11/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/foreman_theme_satellite-0.1.36/app/views"
 |   * "/opt/theforeman/tfm/root/usr/share/gems/gems/apipie-rails-0.3.6/app/views"
 | ):
 |   app/controllers/application_controller.rb:295:in `generic_exception'
 |   lib/middleware/catch_json_parse_errors.rb:9:in `call'
 |
 |
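
To put a number on "multiple times", a rough Python sketch like the following could count the matching entries. It assumes the entry layout shown above (a timestamped header line followed by " |" continuation lines); the function name `count_entries` and the regular expression are illustrative, not part of Foreman.

```python
import re

# Header of a production.log entry as shown above,
# e.g. "2017-01-31 10:01:00 [app] [F]".
HEADER = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[\w+\] \[\w\]")


def count_entries(lines, needle):
    """Count log entries (header plus continuation lines) containing needle."""
    entries = []
    for line in lines:
        if HEADER.match(line):
            entries.append([line])       # a new entry starts at each header
        elif entries:
            entries[-1].append(line)     # continuation of the current entry
    return sum(1 for entry in entries if any(needle in part for part in entry))
```

Passing the lines of production.log and the string "ActionView::MissingTemplate" would give the number of times this exception was logged.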

Comment 6 James Olin Oden 2017-01-31 18:26:24 UTC

In the deployment.log there is this message over and over at the end:

D, [2017-01-31T12:49:19.249164 #19861] DEBUG -- : ================ CloudForms::WaitForConsole poll_external_task method ====================
D, [2017-01-31T12:49:19.250946 #19861] DEBUG -- : ================ CloudForms::WaitForConsole is_up method ====================
W, [2017-01-31T12:51:21.874424 #19861]  WARN -- : Net::ReadTimeout
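
As a way to see how long each poll hangs before timing out, here is a small Python sketch that parses these lines and measures the gap between two of them. It assumes the line layout shown above (severity letter, bracketed timestamp and PID, level, message); `parse_line` and `seconds_between` are hypothetical helper names.

```python
import re
from datetime import datetime

# Matches lines such as:
# "W, [2017-01-31T12:51:21.874424 #19861]  WARN -- : Net::ReadTimeout"
LINE = re.compile(r"^[A-Z], \[(\S+) #(\d+)\]\s+(\w+) -- : (.*)$")


def parse_line(line):
    """Return (timestamp, pid, level, message), or None if the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    stamp, pid, level, message = match.groups()
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
    return when, int(pid), level, message


def seconds_between(first, second):
    """Elapsed seconds between two log lines that both match the layout."""
    return (parse_line(second)[0] - parse_line(first)[0]).total_seconds()
```

Applied to the is_up line and the WARN line above, `seconds_between` gives roughly 122.6 seconds, so each poll waits about two minutes before Net::ReadTimeout is raised.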

Comment 10 jkim 2017-02-01 15:17:03 UTC

I was able to successfully deploy this scenario with the same compose. However, one thing that I did differently was to place the Satellite VM's libvirt image file on a separate physical disk. That is, I spawned the Satellite VM's image on one physical disk and the hypervisor VM's image on another physical disk mounted within the same host. Also, please note that I used the Satellite VM's NFS-exported folders for storage.

The reason I separated the Satellite VM's libvirt image file from the rest of the VMs is that we've seen intermittent deployment failures for scenarios configured with MORE than just RHV (i.e. RHV+CFME, RHV+OCP, RHV+CFME+OCP), because those seem to cause high disk I/O loads. Separating the Satellite VM image from the other VM images onto its own physical disk seems to resolve those intermittent failures.

Comment 11 John Matthews 2017-02-02 13:22:04 UTC

James, we believe the issue you saw was caused by the environment lacking sufficient resources.

Comment 12 Tasos Papaioannou 2017-02-06 21:50:38 UTC

Verified on QCI-1.1-RHEL-7-20170202.t.2. Deploying on nested virt with insufficient CPU resources on the RHV host to cover the host OS + CFME can result in the CFME VM booting into emergency mode.

Comment 14 errata-xmlrpc 2017-02-28 01:45:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:0335
