Bug 1386581

Summary: vdsm fails to start when installing hosted-engine on RHV-H after a failure
Product: [oVirt] ovirt-hosted-engine-setup Reporter: RamaKasturi <knarra>
Component: General    Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED DUPLICATE QA Contact: meital avital <mavital>
Severity: medium Docs Contact:
Priority: medium    
Version: ---    CC: bugs, cshao, dguo, fdeutsch, huzhao, jiawu, knarra, leiwang, qiyuan, rbarry, stirabos, weiwang, yaniwang, ycui, ylavi, yzhao
Target Milestone: ovirt-4.1.1    Flags: fdeutsch: ovirt-4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
cshao: testing_ack+
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-06 10:42:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
screenshot for vdsm failure
none
vdsm.log none

Description RamaKasturi 2016-10-19 09:17:56 UTC
Description of problem:
On a freshly installed node I was able to set up hosted-engine without any issues. My setup had issues, so I had to reinstall hosted-engine. I followed the steps below to clean up the existing hosted-engine installation, and when I try to install again it fails with the error "Failed to execute stage 'Misc configuration': Failed to start service 'vdsmd'".

cleanup steps:
===============
Hosts:

    1) check that the HE VM does not exist on the host - vdsClient -s 0 list table | awk '{print $1}' | xargs vdsClient -s 0 destroy
    2) remove all relevant configuration directories - rm -rf /etc/vdsm/ /etc/ovirt-hosted-engine*
    3) try to unmount all storage from /rhev/data-center/mnt/ - umount -f /rhev/data-center/mnt/ (if this fails with "device is busy", reboot the host)
    4) remove all dirs from /rhev/data-center/mnt/ - rm -rf /rhev/data-center/mnt/*
    
Storage:

        1) stop the relevant gluster volume - gluster volume stop your_storage
        2) clean everything from your gluster storage - rm -rf /gluster_volume/your_storage/* (you need to run this on all replicas)
        3) start the gluster volume - gluster volume start your_storage (a consolidated sketch follows below)
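
(For reference only: a rough consolidation of the host and storage steps above as a single shell sketch. The volume name and brick path are the same placeholders used in the steps, and this is just the sequence as written, not a verified recovery procedure.)

    #!/bin/bash
    # host side: destroy any leftover HE VM, wipe config, unmount and clear the mount dirs
    vdsClient -s 0 list table | awk '{print $1}' | xargs -r vdsClient -s 0 destroy
    rm -rf /etc/vdsm/ /etc/ovirt-hosted-engine*
    umount -f /rhev/data-center/mnt/* || echo "umount failed (device busy?) - reboot the host"
    rm -rf /rhev/data-center/mnt/*

    # storage side: stop the gluster volume, clear its brick on every replica, restart it
    gluster volume stop your_storage
    rm -rf /gluster_volume/your_storage/*   # repeat on every replica
    gluster volume start your_storage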


Version-Release number of selected component (if applicable):
vdsm-4.18.13-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install hosted-engine using the Cockpit UI
2. Reboot the node
3. Use the cleanup steps above to clean up the hosted-engine setup
4. Start deploying hosted-engine again via the Cockpit UI

Actual results:
Hosted-engine installation fails with the error "failed to start vdsmd service".

Expected results:
User should be able to successfully deploy hosted-engine.

Additional info:

Comment 1 RamaKasturi 2016-10-19 09:21:27 UTC
sos report and ovirt-hosted-engine-setup log can be found in the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1386581/

Comment 2 RamaKasturi 2016-10-19 09:22:01 UTC
Created attachment 1212094 [details]
screenshot for vdsm failure

Comment 3 Fabian Deutsch 2016-10-19 09:38:00 UTC
Lowering the priority, because according to comment 0 this bug appears on the second attempt of installing hosted-engine, after the cleanup of a previous attempt.

This is an unusual flow.

Rama, I'd suggest starting completely fresh if you need to re-deploy hosted-engine. AFAIK there is no defined way of cleaning up hosted-engine on either RHEL or RHVH, but Simone is probably able to clarify this.

Simone, do the cleanup steps look correct to you and is there a defined way of cleaning up hosted-engine?

Comment 4 Simone Tiraboschi 2016-10-19 09:50:03 UTC
(In reply to Fabian Deutsch from comment #3)
> Simone, do the cleanup steps look correct to you and is there a defined way
> of cleaning up hosted-engine?

We have manual recovery instructions here:
https://www.ovirt.org/documentation/how-to/hosted-engine/#recoving-from-failed-install

Comment 5 Simone Tiraboschi 2016-10-19 10:06:34 UTC
Here the issue was with supervdsm, which fails due to the lack of /dev/stdout.
Pretty strange...

Oct 19 13:06:55 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Oct 19 13:06:55 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: supervdsmd.service failed.
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com polkitd[2322]: Registered Authentication Agent for unix-process:30433:595554 (system bus name :1.75 [/usr/bin/pkttyagent --notify-fd 6 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Started Auxiliary vdsm service for running helper functions as root.
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Starting Auxiliary vdsm service for running helper functions as root...
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Starting Virtual Desktop Server Manager...
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running mkdirs
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running configure_coredump
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running configure_vdsm_logs
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running wait_for_network
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: Traceback (most recent call last):
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/share/vdsm/supervdsmServer", line 45, in <module>
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: level=logging.DEBUG)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/lib64/python2.7/logging/__init__.py", line 1529, in basicConfig
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: hdlr = FileHandler(filename, mode)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: StreamHandler.__init__(self, self._open())
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: stream = open(self.baseFilename, self.mode)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: IOError: [Errno 6] No such device or address: '/dev/stdout'
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: supervdsmd.service: main process exited, code=exited, status=1/FAILURE
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Unit supervdsmd.service entered failed state.
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: supervdsmd.service failed.
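
For reference, a quick sketch of commands to check whether /dev/stdout is present and what it resolves to (just diagnostics, not a fix):

    ls -l /dev/stdout                 # normally a symlink to /proc/self/fd/1
    readlink -f /dev/stdout
    systemctl status supervdsmd.service
    journalctl -u supervdsmd.service -b --no-pager | tail -n 50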

Comment 6 Simone Tiraboschi 2016-10-19 10:11:14 UTC
Wed 2016-10-19 13:06:52.780372 IST [s=4f085ef25a1347e78a534b14bf0387d5;i=cfd;b=1d7c4901b75a47799f6b8e2f9f88963b;m=3e192624;t=53f32df6ab554;x=8a1b572c8de9af07]
    PRIORITY=6
    _UID=0
    _GID=0
    _SYSTEMD_SLICE=system.slice
    _BOOT_ID=1d7c4901b75a47799f6b8e2f9f88963b
    _MACHINE_ID=f4cd856b7f414b1f9da665403f07b0ae
    _HOSTNAME=rhsqa-grafton1.lab.eng.blr.redhat.com
    SYSLOG_FACILITY=3
    _CAP_EFFECTIVE=1fffffffff
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    _TRANSPORT=stdout
    _EXE=/usr/bin/python2.7
    SYSLOG_IDENTIFIER=daemonAdapter
    _COMM=supervdsmServer
    _CMDLINE=/usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
    _SYSTEMD_CGROUP=/system.slice/supervdsmd.service
    _SYSTEMD_UNIT=supervdsmd.service
    _PID=30283
    MESSAGE=IOError: [Errno 6] No such device or address: '/dev/stdout'

Rama, any idea why /dev/stdout could have disappeared?
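
(The record above is the journal entry in verbose output format; something like the following should show it again, assuming the failing unit is supervdsmd.service as reported:)

    journalctl -u supervdsmd.service -o verbose --no-pager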

Comment 7 RamaKasturi 2016-10-19 10:13:50 UTC
(In reply to Simone Tiraboschi from comment #6)
> Wed 2016-10-19 13:06:52.780372 IST
> [s=4f085ef25a1347e78a534b14bf0387d5;i=cfd;b=1d7c4901b75a47799f6b8e2f9f88963b;
> m=3e192624;t=53f32df6ab554;x=8a1b572c8de9af07]
>     PRIORITY=6
>     _UID=0
>     _GID=0
>     _SYSTEMD_SLICE=system.slice
>     _BOOT_ID=1d7c4901b75a47799f6b8e2f9f88963b
>     _MACHINE_ID=f4cd856b7f414b1f9da665403f07b0ae
>     _HOSTNAME=rhsqa-grafton1.lab.eng.blr.redhat.com
>     SYSLOG_FACILITY=3
>     _CAP_EFFECTIVE=1fffffffff
>     _SELINUX_CONTEXT=system_u:system_r:init_t:s0
>     _TRANSPORT=stdout
>     _EXE=/usr/bin/python2.7
>     SYSLOG_IDENTIFIER=daemonAdapter
>     _COMM=supervdsmServer
>     _CMDLINE=/usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile
> /var/run/vdsm/svdsm.sock
>     _SYSTEMD_CGROUP=/system.slice/supervdsmd.service
>     _SYSTEMD_UNIT=supervdsmd.service
>     _PID=30283
>     MESSAGE=IOError: [Errno 6] No such device or address: '/dev/stdout'
> 
> Rama, any idea why /dev/stdout could have disappeared?

Simone, not sure. I performed the steps listed above and see that supervdsm and vdsm fail to start.

Comment 8 Simone Tiraboschi 2016-10-19 11:02:14 UTC
I'll try to reproduce.

Comment 9 Yihui Zhao 2016-10-19 11:08:20 UTC
Hi all,
    Thanks to RamaKasturi for the details. I followed the steps to reproduce.

Steps:
1. Deploy hosted-engine via the Cockpit UI successfully.
2. Reboot the node.
3. Use the cleanup steps (comment 0) to clean up the hosted-engine setup.
4. Start deploying hosted-engine again via the Cockpit UI.


Errors displayed in Cockpit:

 Failed to execute stage 'Environment setup': Failed to start service 'vdsmd'
 Hosted Engine deployment failed

Version-Release number of selected component (if applicable):
vdsm-4.18.13-1.el7ev.x86_64

Thanks,
Yihui

Comment 10 Fabian Deutsch 2016-10-19 12:27:30 UTC
Yihui Zhao, can you also reproduce this on RHEL-H?

Comment 11 RamaKasturi 2016-10-19 13:29:32 UTC
Hi Fabian,
  As suggested by you in comment 3, I reinstalled the problematic RHV-H node and redeployed hosted-engine. Now when I try adding an additional host, the host addition fails with the error "failed to start vdsmd service".

 Copied the vdsm and host-deploy logs to the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1386581/

Do we need to reinstall all the hosts? Any idea why this happens?

Thanks
kasturi.

Comment 12 Fabian Deutsch 2016-10-20 04:46:24 UTC
From vdsm.log:

Thread-12::ERROR::2016-10-19 12:59:48,658::monitor::484::Storage.Monitor::(_pathChecked) Error checking path /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/metadata
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 482, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 367, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/metadata\': No such file or directory\n"))
Thread-110::ERROR::2016-10-19 12:59:48,743::monitor::425::Storage.Monitor::(_checkDomainStatus) Error checking domain 674c131b-01d2-4912-bd64-e4984e1e0d13
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 656, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 260, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 488, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
…
Thread-110::ERROR::2016-10-19 13:06:48,785::monitor::425::Storage.Monitor::(_checkDomainStatus) Error checking domain 674c131b-01d2-4912-bd64-e4984e1e0d13
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 50, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 125, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/glusterSD.py", line 51, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('674c131b-01d2-4912-bd64-e4984e1e0d13',)
MainThread::DEBUG::2016-10-19 13:06:50,417::vdsm::73::vds::(sigtermHandler) Received signal 15

Thus it looks like there is something up with your storage domain.

Please verify your storage domains.
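
A rough sketch of checks for that, assuming the gluster volume is named "engine" as the mount path in the traceback suggests:

    gluster volume status engine
    gluster volume info engine
    mount | grep glusterSD
    ls /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/
    ls /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/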

Comment 13 RamaKasturi 2016-10-20 09:29:50 UTC
Fabian, I see these errors come up in the vdsm logs without there being any issues with my storage domains. I have already logged a bug for the StorageDomainDoesNotExist error in the vdsm log.

Comment 14 Fabian Deutsch 2016-12-06 16:58:53 UTC
Ying, can you reproduce this issue?

Comment 15 Yihui Zhao 2016-12-07 07:12:26 UTC
Created attachment 1228902 [details]
vdsm.log

Comment 16 Yihui Zhao 2016-12-07 07:22:34 UTC
I followed the steps to test on rhvh-4.0-0.20161206.0+1 and can reproduce this problem.

The vdsm.log is attached at
https://bugzilla.redhat.com/attachment.cgi?id=1228902

PS: Reinstalling HE failed because vdsm is down. I tried to start vdsm manually with the command "systemctl start vdsmd", but it failed.

[root@dell-per510-01 vdsm]# systemctl start vdsmd
A dependency job for vdsmd.service failed. See 'journalctl -xe' for details.

[root@dell-per510-01 vdsm]# journalctl -xe
-- Subject: Unit ovirt-imageio-daemon.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit ovirt-imageio-daemon.service has failed.
-- 
-- The result is failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: ovirt-imageio-daemon.service failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: vdsmd.service holdoff time over, scheduling restart.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: start request repeated too quickly for ovirt-imageio-daemon.service
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: Failed to start oVirt ImageIO Daemon.
-- Subject: Unit ovirt-imageio-daemon.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit ovirt-imageio-daemon.service has failed.
-- 
-- The result is failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: ovirt-imageio-daemon.service failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: Stopping MOM instance configured for VDSM purposes...
-- Subject: Unit mom-vdsm.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit mom-vdsm.service has begun shutting down.
Dec 07 02:21:19 dell-per510-01.lab.eng.pek2.redhat.com momd[6529]: 2016-12-07 02:21:19,506 - mom.HostMonitor - INFO - Host Monitor ending
Dec 07 02:21:20 dell-per510-01.lab.eng.pek2.redhat.com momd[6529]: 2016-12-07 02:21:20,472 - mom.GuestManager - INFO - Guest Manager ending
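
(For reference, a rough sketch of how to drill into the dependency failure shown above; resetting the failed state only clears systemd's start-rate limit, it does not address the underlying problem:)

    # vdsmd fails to start because its dependency ovirt-imageio-daemon cannot start
    systemctl status ovirt-imageio-daemon.service
    journalctl -u ovirt-imageio-daemon.service -b --no-pager | tail -n 50
    # if it is only rate-limited ("start request repeated too quickly"), reset and retry
    systemctl reset-failed ovirt-imageio-daemon.service
    systemctl start vdsmd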



Thanks,
Yihui

Comment 17 cshao 2016-12-07 08:04:22 UTC
Thanks for Yihui's detailed testing steps.
Canceling the needinfo now.

Comment 18 Fabian Deutsch 2016-12-07 08:28:02 UTC
Thanks. To me this sounds as if the cleanup steps are removing too much.

Moving this to HE as it looks like this should happen on RHEL-H as well.

Comment 19 Simone Tiraboschi 2016-12-07 09:08:30 UTC
Yihui, can you please check yum logs for:
 warning: user vdsm does not exist - using root
 warning: group kvm does not exist - using root
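
For example (a rough sketch, assuming the warnings would have been captured in /var/log/yum.log on this image):

    grep -E "user vdsm does not exist|group kvm does not exist" /var/log/yum.log*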

Comment 20 Yihui Zhao 2016-12-07 10:37:01 UTC
(In reply to Simone Tiraboschi from comment #19)
> Yihui, can you please check yum logs for:
>  warning: user vdsm does not exist - using root
>  warning: group kvm does not exist - using root

Hi Simone,
    I checked the yum logs, but there is no content in them.

Thanks,
Yihui

Comment 21 Yaniv Lavi 2017-02-06 10:42:09 UTC
This should be fixed by BZ #1001181.
Please test it and report any bugs if needed.

*** This bug has been marked as a duplicate of bug 1001181 ***

Comment 22 Red Hat Bugzilla 2023-09-14 03:33:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days