Bug 1386581 - vdsm fails to start when installing hosted-engine on RHV-H after a failure
Summary: vdsm fails to start when installing hosted-engine on RHV-H after a failure
Keywords:
Status: CLOSED DUPLICATE of bug 1001181
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.1
Assignee: Simone Tiraboschi
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-19 09:17 UTC by RamaKasturi
Modified: 2023-09-14 03:33 UTC
CC List: 16 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-02-06 10:42:09 UTC
oVirt Team: Integration
Embargoed:
fdeutsch: ovirt-4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
cshao: testing_ack+


Attachments
screenshot for vdsm failure (158.08 KB, image/png), 2016-10-19 09:22 UTC, RamaKasturi
vdsm.log (1.03 MB, application/x-bzip), 2016-12-07 07:12 UTC, Yihui Zhao


Links
Red Hat Bugzilla 1001181 (high, CLOSED): [RFE] Provide clean up script for complete cleaning the hosted engine VM installation after failed installation. Last updated 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1401901 (medium, CLOSED): ovirt-imageio-daemon requires vdsm user but it's not requiring vdsm rpm because vdsm is requiring ovirt-imageio-daemon. Last updated 2021-02-22 00:41:40 UTC

Internal Links: 1001181 1401901

Description RamaKasturi 2016-10-19 09:17:56 UTC
Description of problem:
On a freshly installed node I was able to set up hosted-engine without any issues. My setup then ran into problems, so I had to reinstall hosted-engine. I followed the steps below to clean up the existing hosted-engine installation, and when I try to install again it gives me the error "Failed to execute stage 'Misc configuration': Failed to start service 'vdsmd'".

cleanup steps:
===============
Hosts:

    1) Check that the HE VM does not exist on the host - vdsClient -s 0 list table | awk '{print $1}' | xargs vdsClient -s 0 destroy
    2) Remove all relevant configuration directories - rm -rf /etc/vdsm/ /etc/ovirt-hosted-engine*
    3) Try to unmount all storage from /rhev/data-center/mnt/ - umount -f /rhev/data-center/mnt/ (if it fails with "device is busy", reboot the host),
    then remove all dirs from /rhev/data-center/mnt/ - rm -rf /rhev/data-center/mnt/*
    
Storage:

        1) Stop the relevant gluster storage - gluster volume stop your_storage
        2) Clean everything from your gluster storage - rm -rf /gluster_volume/your_storage/* (you need to run this on all replicas)
        3) Start the gluster storage - gluster volume start your_storage
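
Consolidated, the same cleanup can be sketched as a single script. The volume name, brick path and mount points are placeholders taken from the steps above; treat this as illustrative, not an official cleanup procedure:

#!/bin/bash
# Illustrative consolidation of the manual cleanup steps above.
# your_storage and /gluster_volume are placeholders; adjust for your setup.

# Host side: destroy any leftover HE VM, then remove configuration.
vdsClient -s 0 list table | awk '{print $1}' | xargs -r vdsClient -s 0 destroy
rm -rf /etc/vdsm/ /etc/ovirt-hosted-engine*

# Unmount and clear the data-center mount points (reboot the host if busy).
umount -f /rhev/data-center/mnt/* || echo "umount failed - consider rebooting"
rm -rf /rhev/data-center/mnt/*

# Storage side (run the brick cleanup on every replica):
gluster volume stop your_storage
rm -rf /gluster_volume/your_storage/*
gluster volume start your_storage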


Version-Release number of selected component (if applicable):
vdsm-4.18.13-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install hosted-engine using the Cockpit UI.
2. Reboot the node.
3. Use the cleanup steps above to clean up the hosted-engine setup.
4. Start deploying hosted-engine again via the Cockpit UI.

Actual results:
Hosted-engine installation fails with the error "failed to start vdsmd service".

Expected results:
The user should be able to successfully deploy hosted-engine.

Additional info:

Comment 1 RamaKasturi 2016-10-19 09:21:27 UTC
sos report and ovirt-hosted-engine-setup log can be found in the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1386581/

Comment 2 RamaKasturi 2016-10-19 09:22:01 UTC
Created attachment 1212094 [details]
screenshot for vdsm failure

Comment 3 Fabian Deutsch 2016-10-19 09:38:00 UTC
Lowering the priority, because according to comment 0 this bug appears on the second attempt at installing hosted-engine, after cleaning up a previous attempt.

This is an unusual flow.

Rama, I'd suggest starting completely fresh if you need to redeploy hosted-engine. AFAIK there is no defined way of cleaning up hosted-engine on either RHEL or RHVH. But Simone is probably able to clarify this.

Simone, do the cleanup steps look correct to you and is there a defined way of cleaning up hosted-engine?

Comment 4 Simone Tiraboschi 2016-10-19 09:50:03 UTC
(In reply to Fabian Deutsch from comment #3)
> Simone, do the cleanup steps look correct to you and is there a defined way
> of cleaning up hosted-engine?

We have manual recovery instructions here:
https://www.ovirt.org/documentation/how-to/hosted-engine/#recoving-from-failed-install

Comment 5 Simone Tiraboschi 2016-10-19 10:06:34 UTC
Here the issue is that supervdsm fails to start due to the lack of /dev/stdout.
Pretty strange...

Oct 19 13:06:55 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Job vdsmd.service/start failed with result 'dependency'.
Oct 19 13:06:55 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: supervdsmd.service failed.
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com polkitd[2322]: Registered Authentication Agent for unix-process:30433:595554 (system bus name :1.75 [/usr/bin/pkttyagent --notify-fd 6 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Started Auxiliary vdsm service for running helper functions as root.
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Starting Auxiliary vdsm service for running helper functions as root...
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Starting Virtual Desktop Server Manager...
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running mkdirs
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running configure_coredump
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running configure_vdsm_logs
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com vdsmd_init_common.sh[30439]: vdsm: Running wait_for_network
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: Traceback (most recent call last):
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/share/vdsm/supervdsmServer", line 45, in <module>
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: level=logging.DEBUG)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/lib64/python2.7/logging/__init__.py", line 1529, in basicConfig
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: hdlr = FileHandler(filename, mode)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/lib64/python2.7/logging/__init__.py", line 902, in __init__
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: StreamHandler.__init__(self, self._open())
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: File "/usr/lib64/python2.7/logging/__init__.py", line 925, in _open
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: stream = open(self.baseFilename, self.mode)
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com daemonAdapter[30438]: IOError: [Errno 6] No such device or address: '/dev/stdout'
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: supervdsmd.service: main process exited, code=exited, status=1/FAILURE
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: Unit supervdsmd.service entered failed state.
Oct 19 14:28:46 rhsqa-grafton1.lab.eng.blr.redhat.com systemd[1]: supervdsmd.service failed.
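
If it helps with debugging, a minimal check (a sketch, assuming the standard layout where /dev/stdout is a symlink to /proc/self/fd/1):

# Verify /dev/stdout exists and points at the caller's fd 1.
ls -l /dev/stdout /proc/self/fd/1

# If the symlink is missing it can be recreated, though the root cause
# of its disappearance should still be investigated.
[ -e /dev/stdout ] || ln -s /proc/self/fd/1 /dev/stdout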

Comment 6 Simone Tiraboschi 2016-10-19 10:11:14 UTC
Wed 2016-10-19 13:06:52.780372 IST [s=4f085ef25a1347e78a534b14bf0387d5;i=cfd;b=1d7c4901b75a47799f6b8e2f9f88963b;m=3e192624;t=53f32df6ab554;x=8a1b572c8de9af07]
    PRIORITY=6
    _UID=0
    _GID=0
    _SYSTEMD_SLICE=system.slice
    _BOOT_ID=1d7c4901b75a47799f6b8e2f9f88963b
    _MACHINE_ID=f4cd856b7f414b1f9da665403f07b0ae
    _HOSTNAME=rhsqa-grafton1.lab.eng.blr.redhat.com
    SYSLOG_FACILITY=3
    _CAP_EFFECTIVE=1fffffffff
    _SELINUX_CONTEXT=system_u:system_r:init_t:s0
    _TRANSPORT=stdout
    _EXE=/usr/bin/python2.7
    SYSLOG_IDENTIFIER=daemonAdapter
    _COMM=supervdsmServer
    _CMDLINE=/usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock
    _SYSTEMD_CGROUP=/system.slice/supervdsmd.service
    _SYSTEMD_UNIT=supervdsmd.service
    _PID=30283
    MESSAGE=IOError: [Errno 6] No such device or address: '/dev/stdout'

Rama, any idea why /dev/stdout could have disappeared?

Comment 7 RamaKasturi 2016-10-19 10:13:50 UTC
(In reply to Simone Tiraboschi from comment #6)
> Wed 2016-10-19 13:06:52.780372 IST
> [s=4f085ef25a1347e78a534b14bf0387d5;i=cfd;b=1d7c4901b75a47799f6b8e2f9f88963b;
> m=3e192624;t=53f32df6ab554;x=8a1b572c8de9af07]
>     PRIORITY=6
>     _UID=0
>     _GID=0
>     _SYSTEMD_SLICE=system.slice
>     _BOOT_ID=1d7c4901b75a47799f6b8e2f9f88963b
>     _MACHINE_ID=f4cd856b7f414b1f9da665403f07b0ae
>     _HOSTNAME=rhsqa-grafton1.lab.eng.blr.redhat.com
>     SYSLOG_FACILITY=3
>     _CAP_EFFECTIVE=1fffffffff
>     _SELINUX_CONTEXT=system_u:system_r:init_t:s0
>     _TRANSPORT=stdout
>     _EXE=/usr/bin/python2.7
>     SYSLOG_IDENTIFIER=daemonAdapter
>     _COMM=supervdsmServer
>     _CMDLINE=/usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile
> /var/run/vdsm/svdsm.sock
>     _SYSTEMD_CGROUP=/system.slice/supervdsmd.service
>     _SYSTEMD_UNIT=supervdsmd.service
>     _PID=30283
>     MESSAGE=IOError: [Errno 6] No such device or address: '/dev/stdout'
> 
> Rama, any idea why /dev/stdout could have disappeared?

Simone, not sure. I performed the steps listed above and saw that supervdsm and vdsm fail to start.

Comment 8 Simone Tiraboschi 2016-10-19 11:02:14 UTC
I'll try to reproduce.

Comment 9 Yihui Zhao 2016-10-19 11:08:20 UTC
Hi all,
    Thanks to RamaKasturi for the details. I followed the steps above to reproduce.

Steps:
1. Deploy HostedEngine via the Cockpit UI successfully.
2. Reboot the node.
3. Use the cleanup steps to clean up the hosted-engine setup (comment 0).
4. Start deploying hosted-engine again via the Cockpit UI.


Errors displayed in Cockpit:

 Failed to execute stage 'Environment setup': Failed to start service 'vdsmd'
 Hosted Engine deployment failed

Version-Release number of selected component (if applicable):
vdsm-4.18.13-1.el7ev.x86_64

Thanks,
Yihui

Comment 10 Fabian Deutsch 2016-10-19 12:27:30 UTC
Yihui Zhao, can you also reproduce this on RHEL-H?

Comment 11 RamaKasturi 2016-10-19 13:29:32 UTC
Hi Fabian,
  As you suggested in comment 3, I reinstalled the problematic RHV-H node and redeployed hosted-engine. Now when I try adding an additional host, the host addition fails with the error "failed to start vdsmd service".

 I copied the vdsm and host-deploy logs to the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1386581/

Do we need to reinstall all the hosts? Any idea why this happens?

Thanks
kasturi.

Comment 12 Fabian Deutsch 2016-10-20 04:46:24 UTC
From vdsm.log:

Thread-12::ERROR::2016-10-19 12:59:48,658::monitor::484::Storage.Monitor::(_pathChecked) Error checking path /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/metadata
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 482, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 367, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/metadata\': No such file or directory\n"))
Thread-110::ERROR::2016-10-19 12:59:48,743::monitor::425::Storage.Monitor::(_checkDomainStatus) Error checking domain 674c131b-01d2-4912-bd64-e4984e1e0d13
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 656, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 260, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 488, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
…
Thread-110::ERROR::2016-10-19 13:06:48,785::monitor::425::Storage.Monitor::(_checkDomainStatus) Error checking domain 674c131b-01d2-4912-bd64-e4984e1e0d13
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 50, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 125, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/glusterSD.py", line 51, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('674c131b-01d2-4912-bd64-e4984e1e0d13',)
MainThread::DEBUG::2016-10-19 13:06:50,417::vdsm::73::vds::(sigtermHandler) Received signal 15

Thus it looks like there is something up with your storage domain.

Please verify your storage domains.
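
A few standard commands that could help with that (volume name and UUID are taken from the traceback above; adjust for your environment):

# Check the gluster volume backing the hosted-engine storage domain.
gluster volume status engine

# Check that the domain is actually mounted and that its metadata exists
# (path and UUID copied from the traceback above).
mount | grep glusterSD
ls -l /rhev/data-center/mnt/glusterSD/10.70.36.79:_engine/674c131b-01d2-4912-bd64-e4984e1e0d13/dom_md/metadata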

Comment 13 RamaKasturi 2016-10-20 09:29:50 UTC
Fabian, I see these errors come up in the vdsm logs without there being any issue with my storage domains. I have already logged a bug for the StorageDomainDoesNotExist error in the vdsm log.

Comment 14 Fabian Deutsch 2016-12-06 16:58:53 UTC
Ying, can you reproduce this issue?

Comment 15 Yihui Zhao 2016-12-07 07:12:26 UTC
Created attachment 1228902 [details]
vdsm.log

Comment 16 Yihui Zhao 2016-12-07 07:22:34 UTC
I followed the steps to test on rhvh-4.0-0.20161206.0+1 and can reproduce this problem.

The vdsm.log is attached at
https://bugzilla.redhat.com/attachment.cgi?id=1228902

PS: Reinstalling HE failed because vdsm was down. I tried to start vdsm manually with "systemctl start vdsmd", but it failed.

[root@dell-per510-01 vdsm]# systemctl start vdsmd
A dependency job for vdsmd.service failed. See 'journalctl -xe' for details.

[root@dell-per510-01 vdsm]# journalctl -xe
-- Subject: Unit ovirt-imageio-daemon.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit ovirt-imageio-daemon.service has failed.
-- 
-- The result is failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: ovirt-imageio-daemon.service failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: vdsmd.service holdoff time over, scheduling restart.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: start request repeated too quickly for ovirt-imageio-daemon.service
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: Failed to start oVirt ImageIO Daemon.
-- Subject: Unit ovirt-imageio-daemon.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit ovirt-imageio-daemon.service has failed.
-- 
-- The result is failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: ovirt-imageio-daemon.service failed.
Dec 07 02:21:17 dell-per510-01.lab.eng.pek2.redhat.com systemd[1]: Stopping MOM instance configured for VDSM purposes...
-- Subject: Unit mom-vdsm.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit mom-vdsm.service has begun shutting down.
Dec 07 02:21:19 dell-per510-01.lab.eng.pek2.redhat.com momd[6529]: 2016-12-07 02:21:19,506 - mom.HostMonitor - INFO - Host Monitor ending
Dec 07 02:21:20 dell-per510-01.lab.eng.pek2.redhat.com momd[6529]: 2016-12-07 02:21:20,472 - mom.GuestManager - INFO - Guest Manager ending
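
The journal above points at ovirt-imageio-daemon as the failing dependency; a short sketch of standard systemctl/journalctl commands to dig further (unit names taken from the log above):

# Find out which dependency of vdsmd failed and why.
systemctl list-dependencies vdsmd
systemctl status ovirt-imageio-daemon supervdsmd vdsmd
journalctl -u ovirt-imageio-daemon --no-pager | tail -n 50

# "start request repeated too quickly" means the unit hit its start rate
# limit; clear the failed state before retrying.
systemctl reset-failed ovirt-imageio-daemon
systemctl start ovirt-imageio-daemon vdsmd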



Thanks,
Yihui

Comment 17 cshao 2016-12-07 08:04:22 UTC
Thanks for Yihui's detailed testing steps.
Cancelling the needinfo now.

Comment 18 Fabian Deutsch 2016-12-07 08:28:02 UTC
Thanks. To me this sounds as if the cleanup steps are removing too much.

Moving this to HE as it looks like this should happen on RHEL-H as well.

Comment 19 Simone Tiraboschi 2016-12-07 09:08:30 UTC
Yihui, can you please check yum logs for:
 warning: user vdsm does not exist - using root
 warning: group kvm does not exist - using root
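
For example, something like the following (a sketch; rpm scriptlet warnings may not land in /var/log/yum.log, so the journal is also worth checking):

# Search for the rpm scriptlet warnings mentioned above.
grep -iE "user vdsm does not exist|group kvm does not exist" /var/log/yum.log* 2>/dev/null
journalctl --no-pager | grep -iE "user vdsm does not exist|group kvm does not exist"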

Comment 20 Yihui Zhao 2016-12-07 10:37:01 UTC
(In reply to Simone Tiraboschi from comment #19)
> Yihui, can you please check yum logs for:
>  warning: user vdsm does not exist - using root
>  warning: group kvm does not exist - using root

Hi Simone,
    I checked the yum logs, but there is no such content in them.

Thanks,
Yihui

Comment 21 Yaniv Lavi 2017-02-06 10:42:09 UTC
This should be fixed by BZ #1001181.
Please test it and report any bugs if needed.

*** This bug has been marked as a duplicate of bug 1001181 ***

Comment 22 Red Hat Bugzilla 2023-09-14 03:33:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

