Bug 1259303 - [RHEV-H 7.1] supervdsm service cannot be started
Summary: [RHEV-H 7.1] supervdsm service cannot be started
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.4
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.5
Assignee: Yeela Kaplan
QA Contact: Aharon Canan
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2015-09-02 11:47 UTC by Elad
Modified: 2016-02-10 19:07 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-09-10 08:07:34 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
/var/log/ and 'strace systemctl start supervdsmd' output (617.95 KB, application/x-gzip)
2015-09-02 11:47 UTC, Elad

Description Elad 2015-09-02 11:47:18 UTC
Created attachment 1069382 [details]
/var/log/ and 'strace systemctl start supervdsmd' output

Description of problem:
On RHEV-H 7.1, I tried to deploy hosted-engine. The deployment hangs indefinitely at this phase:

[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup

I then aborted the deployment and tried to start supervdsm manually; it got stuck as well.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization Hypervisor release 7.1 (20150827.1.el7ev)
ovirt-node-plugin-rhn-3.2.3-20.el7.noarch
vdsm-python-4.16.26-1.el7ev.noarch
ovirt-node-branding-rhev-3.2.3-20.el7.noarch
vdsm-jsonrpc-4.16.26-1.el7ev.noarch
vdsm-reg-4.16.26-1.el7ev.noarch
ovirt-node-plugin-snmp-3.2.3-20.el7.noarch
vdsm-python-zombiereaper-4.16.26-1.el7ev.noarch
vdsm-xmlrpc-4.16.26-1.el7ev.noarch
ovirt-host-deploy-1.3.2-1.el7ev.noarch
vdsm-yajsonrpc-4.16.26-1.el7ev.noarch
vdsm-hook-vhostmd-4.16.26-1.el7ev.noarch
ovirt-node-selinux-3.2.3-20.el7.noarch
vdsm-4.16.26-1.el7ev.x86_64
ovirt-host-deploy-offline-1.3.0-3.el7ev.x86_64
ovirt-hosted-engine-ha-1.2.6-3.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch
ovirt-node-plugin-cim-3.2.3-20.el7.noarch
ovirt-node-3.2.3-20.el7.noarch
vdsm-cli-4.16.26-1.el7ev.noarch
vdsm-hook-ethtool-options-4.16.26-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.5.3-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch
libgovirt-0.3.1-3.el7.x86_64


How reproducible:
Not 100%

Steps to Reproduce:
# systemctl start supervdsmd

Actual results:
The service cannot be started.

Snip from 'strace systemctl start supervdsmd':

poll([{fd=4, events=POLLIN}], 1, 25000) = 1 ([{fd=4, revents=POLLIN}])
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1'\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1o\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 277
recvmsg(4, 0x7ffdefff05f0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(4, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\27\0\0\0\2\0\0\0\227\0\0\0\1\1o\0\31\0\0\0/org/fre"..., 168}, {"\22\0\0\0supervdsmd.service\0", 23}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 191
poll([{fd=4, events=POLLIN}], 1, 25000) = 1 ([{fd=4, revents=POLLIN}])
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\0018\0\0\0\2\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1o\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 88
recvmsg(4, 0x7ffdefff05f0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
sendmsg(4, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\0019\0\0\0\3\0\0\0\250\0\0\0\1\1o\0003\0\0\0/org/fre"..., 184}, {"\35\0\0\0org.freedesktop.systemd1.Uni"..., 57}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 241
poll([{fd=4, events=POLLIN}], 1, 25000) = 1 ([{fd=4, revents=POLLIN}])
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\10\0\0\0\3\0\0\0\17\0\0\0\5\1u\0\3\0\0\0\10\1g\0\1v\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 40
recvmsg(4, 0x7ffdefff05f0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}], 1, 4294967295
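
Note the final poll() above: its timeout is 4294967295 ms (2^32-1, i.e. -1 interpreted as unsigned, effectively an infinite wait), so systemctl is blocking indefinitely for a reply. As a purely illustrative sketch, not taken from the bug, the same blocking behaviour in Python:

import select
import socket

# Illustration only: poll() with an infinite timeout blocks until the fd
# becomes readable, which is the state systemctl is stuck in above, where
# strace renders the infinite timeout as the unsigned value 4294967295.
rd, wr = socket.socketpair()
poller = select.poll()
poller.register(rd, select.POLLIN)
wr.send(b"reply")          # without this write, poll() below would never return
print(poller.poll(None))   # None means infinite timeout; returns [(fd, POLLIN)]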



Additional info:
/var/log/ and 'strace systemctl start supervdsmd' output

Comment 1 Yaniv Lavi 2015-09-02 11:56:48 UTC
To my understanding this is a CLI issue, and the CLI is not supported in RHEV-H. Does this happen in TUI usage?

Comment 2 Oved Ourfali 2015-09-02 11:57:56 UTC
Elad - did you see it more than once?
On the same host or not?
What are the hardware details of the host? (RAM?)
How long was it in the setup step before you stopped it?
Why did you start supervdsm directly? Why not restart vdsm instead?

CC-ing also Dan.

Comment 3 Elad 2015-09-02 12:24:51 UTC
Yaniv, it was executed using the TUI (the TUI opens the CLI in a screen session when starting the deployment).

Oved - answered offline

Comment 4 Yeela Kaplan 2015-09-02 15:10:37 UTC
Looks like this bug is not reproducible on any other environment.
It has only happened on this one host, one time.

Maybe someone will run into it in the future.

Elad, in the meantime please don't reboot this host, so we can investigate it further.

When running:
# systemctl start supervdsmd.service

we see that a child process `systemd-tty-ask-password-agent --watch`
is spawned by systemctl:
 
[root@green-vdsc ~]# pstree -p $$
bash(13179)─┬─pstree(13494)
            └─systemctl(13440)───systemd-tty-ask(13441)

On a host that installed successfully, with vdsm running, we see (using 'strace -f') that systemctl closes this child process:

[pid 27711] execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-passwor"..., "--watch"], [/* 31 vars */]) = 0
.
.
.
[pid 27710] recvmsg(4,  <unfinished ...>
[pid 27711] <... read resumed> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000/\0\0\0\0\0\0"..., 832) = 832
[pid 27710] <... recvmsg resumed> 0x7ffec3e09760, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 27711] fstat(3, {st_mode=S_IFREG|0755, st_size=153184, ...}) = 0
[pid 27711] mmap(NULL, 2245240, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0 <unfinished ...>
[pid 27710] close(4 <unfinished ...>
[pid 27711] <... mmap resumed> )        = 0x7f32ce913000
[pid 27710] <... close resumed> )       = 0
[pid 27711] mprotect(0x7f32ce937000, 2093056, PROT_NONE) = 0
[pid 27710] kill(27711, SIGTERM <unfinished ...>
[pid 27711] mmap(0x7f32ceb36000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23000 <unfinished ...>
[pid 27710] <... kill resumed> )        = 0
[pid 27711] <... mmap resumed> )        = 0x7f32ceb36000
[pid 27710] kill(27711, SIGCONT <unfinished ...>
[pid 27711] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=27710, si_uid=0} ---
[pid 27710] <... kill resumed> )        = 0
[pid 27710] waitid(P_PID, 27711,  <unfinished ...>
[pid 27711] +++ killed by SIGTERM +++
<... waitid resumed> {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=27711, si_status=SIGTERM, si_utime=0, si_stime=0}, WEXITED, NULL) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=27711, si_status=SIGTERM, si_utime=0, si_stime=0} ---
exit_group(0)                           = ?

-> We can see that systemctl(27710) received a response from its child systemd-tty-ask-password-agent(27711) and killed it.
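
For reference, the teardown pattern visible in this trace (spawn the agent, do the D-Bus work, then kill and reap it) corresponds roughly to the following Python sketch. systemctl itself is C, so this is only an illustration, and it assumes a systemd host where the agent binary exists:

import signal
import subprocess

# Rough Python rendition of the syscall sequence above (illustration only):
agent = subprocess.Popen(
    ["/usr/bin/systemd-tty-ask-password-agent", "--watch"])
try:
    pass  # ... send the D-Bus request and wait for systemd's reply ...
finally:
    agent.send_signal(signal.SIGTERM)  # kill(pid, SIGTERM) in the trace
    agent.send_signal(signal.SIGCONT)  # kill(pid, SIGCONT), in case it was stopped
    agent.wait()                       # waitid(P_PID, ...) reaps the child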


On a host where supervdsm cannot start:

[pid 14884] execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-passwor"..., "--watch"], [/* 31 vars */]) = 0
.
.
.
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "\242\233\270\f9\257\326\257_\n\30\264\263\17\32\272\0058\331\245\236f\207\202\321(\322`pS\324\376"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "G\262_$\210%*\275Q`G\240\361F\323\306\254.\334\323]\314:!u\343H\200\227\1\206&"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "\272\177J\330D\252\367\2\367w\302\230-\221'\267\365\376\214\217\334\327\245a\311\377\tfc\177\273j"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "\324\230x\231\222\210\271-\22a\2134\202\274\\\373\305\231;\177\244e\246=k\204\216\327\340iG\17"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "3\210\234X\0N\202G\354\22WFt^\331V\344\32\\A\36\323[\302\370\363\371\210\211\t\2129"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "+\235\26\274\373R\34\377Rs9\370\273\370\t),V\276\v`\233^E\256\257lX#\27\23W"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "\346\237\ru5\311\372\362K?2\203\300\246C2K(\4+\20\341t\4S\370\35\25}>\265\240"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "F\25\257j\233q\315W\25\334\306\233\217!\306$\255Y\33\364\0039Qy\2\223@o\235\257hw"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "\324\316\10k*\332A\253\344^\5y\372\10I\375\216\372\271\277\205\2264\0\35>?]`\263\374\275"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "\34P\10w\314e\270\226X\t\212\30\"\17.\334\3471\342B\334\272\315\351\367\324k\273S\363\323f"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "\210\230;\227\31\261\364A\301#\374\311\206?\17\202\2749\340\246\323gm\315\325\361\335K\200\6h\275"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])
read(3, "i\10Dl\5\312\222\376\10l\37\277X\341CIs\\\"\235{\315|\334\227\10>\246a\5,\36"..., 120) = 120
poll([{fd=3, events=POLLIN}], 1, 100)   = 1 ([{fd=3, revents=POLLIN}])


systemctl waits forever for a response from its child process and never closes it.
As a result, it never even tries to run the start command for supervdsm.
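
One way to narrow this down would be to issue the same StartUnit request directly over D-Bus with a bounded timeout, so that a missing reply from systemd surfaces as an exception instead of an endless poll(). A diagnostic sketch, assuming the dbus-python bindings shipped with RHEL 7 are available:

import dbus

# Send StartUnit directly, bypassing systemctl and its password agent.
bus = dbus.SystemBus()
proxy = bus.get_object('org.freedesktop.systemd1',
                       '/org/freedesktop/systemd1')
manager = dbus.Interface(proxy, 'org.freedesktop.systemd1.Manager')
# 'replace' is the usual job mode; dbus-python takes the timeout in seconds.
job = manager.StartUnit('supervdsmd.service', 'replace', timeout=30)
print('queued job:', job)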

Comment 5 Yaniv Lavi 2015-09-02 15:13:19 UTC
Seems like an edge case.
Moving to 3.5.5 for review.

Comment 6 Yaniv Bronhaim 2015-09-08 09:17:23 UTC
Based on comment #4 and the lack of reproduction so far, I assume this is related to an environment issue, so I am closing the bug for now. Please reopen ASAP if the issue comes up again.

Elad, please try to rerun this case and update with your results.

Comment 7 Gil Klein 2015-09-08 16:00:37 UTC
(In reply to Yaniv Bronhaim from comment #6)
> Based on comment #4 and the lack of reproduction so far, I assume this is
> related to an environment issue, so I am closing the bug for now. Please
> reopen ASAP if the issue comes up again.
> 
> Elad, please try to rerun this case and update with your results.
This can be easily reproduced on the environment that was left in this state to allow the investigation.

I'm reopening, because I can still see it happening on the same environment.

Comment 9 Yeela Kaplan 2015-09-08 16:07:13 UTC
Please don't reinstall or reboot this host.
I am talking to the systemd folks to try to figure this out, and the
investigation will be useless without this host. Thanks.

