Bug 1258133 - [Docs] RHEV Progress Bar hung at 95.5% (Due to failed engine-setup run deploying RHEV)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Quickstart Cloud Installer
Classification: Red Hat
Component: doc-Technical_Notes
Version: 1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: TP2
Target Release: 1.0
Assignee: Julie
QA Contact: Tasos Papaioannou
URL:
Whiteboard: integration
Depends On: 1066959 1119225
Blocks: 1155425
Reported: 2015-08-29 12:55 UTC by John Matthews
Modified: 2016-05-03 05:09 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A bug in RHEV related to an NFS issue
Consequence: Deployment is in a hung state, will not recover.
Fix: If this error is seen a user may recover from the problem by:
1) Determining the IP address of the rhev engine
2) SSH as root to the rhev engine, using the password entered in the deployment
3) Run "engine-setup" manually
4) Wait for the next puppet run to be invoked on rhev engine which will complete the configuration of the data center
5) Deployment will pick back up and continue executing
Result: Deployment will recover
Clone Of: 1119225
Environment:
Last Closed: 2016-05-03 05:09:25 UTC
Target Upstream Version:
Embargoed:


Links
System        ID     Private  Priority  Status  Summary  Last Updated
oVirt gerrit  40653  0        None      None    None     Never

Description John Matthews 2015-08-29 12:55:45 UTC
+++ This bug was initially created as a clone of Bug #1119225 +++

Description of problem:
The "Execution of setup failed" message is misleading when only the NFS start fails. Setup has in fact completed; only NFS hit a problem.

I think it should state this more clearly:
- setup done
- but there were some issues!

I have no idea what happened; it was a clean install. Anyway, this BZ is about the failure message from engine-setup.
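
To make the request concrete, here is a minimal sketch of how a final status line could separate a fatal failure from "setup done, but with issues". This is not otopi or engine-setup code: `StageFailure`, its `fatal` flag, `summarize`, and the warning wording are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageFailure:
    stage: str   # e.g. "Closing up"
    detail: str  # e.g. "Command '/sbin/service' failed to execute"
    fatal: bool  # hypothetical flag: did this failure leave the engine unconfigured?


def summarize(failures: List[StageFailure]) -> str:
    """Return a final status line that separates fatal from non-fatal failures."""
    if not failures:
        return "[ INFO  ] Execution of setup completed successfully"
    if any(f.fatal for f in failures):
        return "[ ERROR ] Execution of setup failed"
    # Only non-fatal problems: say that setup is done, then list what went wrong.
    issues = "; ".join(f"{f.stage}: {f.detail}" for f in failures)
    return f"[WARNING] Setup done, but there were some issues: {issues}"
```

With the NFS failure from this report recorded as non-fatal, `summarize` would return the warning line rather than "Execution of setup failed".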

~~~

[ INFO  ] Restarting nfs services
[ ERROR ] Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute
[ INFO  ] Stage: Clean up
          Log file is located at /var/log/ovirt-engine/setup/ovirt-engine-setup-20140714115528-ytp8fx.log
[ INFO  ] Generating answer file '/var/lib/ovirt-engine/setup/answers/20140714115734-setup.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Execution of setup failed

--

2014-07-14 11:57:34 DEBUG otopi.plugins.otopi.services.rhel plugin.execute:866 execute-output: ('/sbin/initctl', 'status', 'nfs') stderr:
initctl: Unknown job: nfs

2014-07-14 11:57:34 DEBUG otopi.plugins.otopi.services.rhel plugin.executeRaw:785 execute: ('/sbin/service', 'nfs', 'start'), executable='None', cwd='None', env=None
2014-07-14 11:57:34 DEBUG otopi.plugins.otopi.services.rhel plugin.executeRaw:803 execute-result: ('/sbin/service', 'nfs', 'start'), rc=1
2014-07-14 11:57:34 DEBUG otopi.plugins.otopi.services.rhel plugin.execute:861 execute-output: ('/sbin/service', 'nfs', 'start') stdout:
Starting NFS services:  [  OK  ]
Starting NFS quotas: [  OK  ]
Starting NFS mountd: [  OK  ]
Starting NFS daemon: [FAILED]

2014-07-14 11:57:34 DEBUG otopi.plugins.otopi.services.rhel plugin.execute:866 execute-output: ('/sbin/service', 'nfs', 'start') stderr:


2014-07-14 11:57:34 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/system/nfs.py", line 276, in _closeup
    state=state,
  File "/usr/share/otopi/plugins/otopi/services/rhel.py", line 188, in state
    'start' if state else 'stop'
  File "/usr/share/otopi/plugins/otopi/services/rhel.py", line 96, in _executeServiceCommand
    raiseOnError=raiseOnError
  File "/usr/lib/python2.6/site-packages/otopi/plugin.py", line 871, in execute
    command=args[0],
RuntimeError: Command '/sbin/service' failed to execute
2014-07-14 11:57:34 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Closing up': Command '/sbin/service' failed to execute

--

[root@ovirt ~]# service nfs status
rpc.svcgssd is stopped
rpc.mountd (pid 32015) is running...
nfsd dead but subsys locked
rpc.rquotad (pid 32011) is running...

--

[root@ovirt ~]# service nfs restart
Shutting down NFS daemon:                                  [FAILED]
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS quotas:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]

--

Jul 14 11:57:32 localhost kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Jul 14 11:57:34 localhost kernel: RPC: Registered named UNIX socket transport module.
Jul 14 11:57:34 localhost kernel: RPC: Registered udp transport module.
Jul 14 11:57:34 localhost kernel: RPC: Registered tcp transport module.
Jul 14 11:57:34 localhost kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Jul 14 11:57:34 localhost kernel: Installing knfsd (copyright (C) 1996 okir.de).
Jul 14 11:57:34 localhost rpc.mountd[32015]: Version 1.2.3 starting
Jul 14 11:57:34 localhost kernel: lockd_up: makesock failed, error=-98
Jul 14 11:57:34 localhost kernel: nfsd: last server has exited, flushing export cache
Jul 14 11:57:34 localhost rpc.nfsd[32019]: error starting threads: errno 98 (Address already in use)
Jul 14 12:01:49 localhost ntpd[5672]: 0.0.0.0 0613 03 spike_detect +44.616753 s
Jul 14 12:04:54 localhost rpc.mountd[32015]: Caught signal 15, un-registering and exiting.
Jul 14 12:04:55 localhost rpc.mountd[32425]: Version 1.2.3 starting
Jul 14 12:04:55 localhost kernel: NFSD: Using /var/lib/nfs/v4recovery

~~~
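
The syslog excerpt above shows the actual cause: rpc.nfsd could not start its threads because of errno 98 (Address already in use), which matches the kernel's `error=-98` from lockd. A small sketch for pulling that line out of a messages log follows; the regular expression and the function name `find_nfsd_start_error` are my own and assume the message format seen in this report.

```python
import re
from typing import Optional, Tuple

# Matches lines such as:
#   "... rpc.nfsd[32019]: error starting threads: errno 98 (Address already in use)"
NFSD_ERROR = re.compile(
    r"rpc\.nfsd\[(?P<pid>\d+)\]: error starting threads: "
    r"errno (?P<errno>\d+) \((?P<text>[^)]*)\)"
)


def find_nfsd_start_error(log_text: str) -> Optional[Tuple[int, str]]:
    """Return (errno, message) for the first rpc.nfsd start failure, or None."""
    for line in log_text.splitlines():
        match = NFSD_ERROR.search(line)
        if match:
            return int(match.group("errno")), match.group("text")
    return None


# Two lines copied from the log in this report.
sample = (
    "Jul 14 11:57:34 localhost kernel: lockd_up: makesock failed, error=-98\n"
    "Jul 14 11:57:34 localhost rpc.nfsd[32019]: error starting threads: "
    "errno 98 (Address already in use)\n"
)
print(find_nfsd_start_error(sample))  # (98, 'Address already in use')
```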

Version-Release number of selected component (if applicable):
ovirt-engine-setup-3.5.0-0.0.master.20140629172257.git0b16ed7.el6.noarch

How reproducible:
???

Steps to Reproduce:
1. no idea, just happened
2.
3.

Actual results:
An NFS issue caused a message to appear that could make an admin think the whole setup failed.

Expected results:
Setup is done, but something went wrong.

Additional info:

Comment 1 John Matthews 2015-08-29 13:02:47 UTC
This is an intermittent issue we hit when deploying RHEV.

We do not recover from this failure, so if it is seen the RHCI deployment will be in a failed state.

The progress bars will not reflect the failure; they will be 'stuck' indefinitely.

In Step 5B "Installation Progress":
The RHEV progress bar will remain stuck at 95.5%

The backend task will keep looping, checking whether the RHEV data center has come up.
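
For illustration only, the difference between polling forever and polling with a deadline can be sketched as below. This is not the RHCI backend code; `wait_for_data_center`, `is_up`, and the timeout and interval parameters are hypothetical.

```python
import time
from typing import Callable


def wait_for_data_center(
    is_up: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_up() until it returns True or timeout_s elapses.

    Returns True if the data center came up and False on timeout, so the
    caller can report a failure instead of polling forever.
    """
    deadline = clock() + timeout_s
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller that gets `False` back could mark the task as failed, which would let the progress bar show an error rather than staying at 95.5%.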

Comment 2 John Matthews 2015-08-29 13:26:48 UTC
If this error is seen, a user may recover from the problem as follows:
1) Determine the IP address of the RHEV engine
2) SSH as root to the RHEV engine, using the password entered in the deployment
3) Run "engine-setup" manually
4) Wait for the next puppet run to be invoked on the RHEV engine, which will complete the configuration of the data center
5) The deployment will pick back up and continue executing
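
Steps 2 and 3 above can be sketched as a small helper, assuming root SSH access to the engine is available. The function names are hypothetical, and the `-t` option is my assumption (it asks ssh for a terminal in case engine-setup prompts for input); ssh will ask for the root password entered in the deployment.

```python
import subprocess
from typing import List


def engine_setup_command(engine_ip: str) -> List[str]:
    """Build the ssh command that runs engine-setup on the RHEV engine as root."""
    # -t requests a terminal because engine-setup may prompt interactively (assumption).
    return ["ssh", "-t", f"root@{engine_ip}", "engine-setup"]


def rerun_engine_setup(engine_ip: str) -> int:
    """Run engine-setup over SSH and return its exit code."""
    return subprocess.run(engine_setup_command(engine_ip)).returncode
```

Calling `rerun_engine_setup("192.0.2.10")` would run the command against a placeholder address; substitute the engine IP found in step 1. Steps 4 and 5 need no action beyond waiting for the next puppet run.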

Comment 3 Andrew Dahms 2015-09-01 02:17:50 UTC
Assigning to Julie for review.

Julie - looks like what we need for this is a short troubleshooting section that outlines the above issue and how to resolve it using the procedure in comment #2.

Comment 7 Tasos Papaioannou 2016-01-26 20:40:07 UTC
Verified.

