Created attachment 1069382 [details] /var/log/ and 'strace systemctl start supervdsmd' output Description of problem: On RHEV-H 7.1, tried to deploy hosted-engine. The deployment hangs forever in this phase: [ INFO ] Hardware supports virtualization [ INFO ] Stage: Environment packages setup [ INFO ] Stage: Programs detection [ INFO ] Stage: Environment setup Then, aborted the deployment. Tried to start supervdsm manually and it got stuck. Version-Release number of selected component (if applicable): Red Hat Enterprise Virtualization Hypervisor release 7.1 (20150827.1.el7ev) ovirt-node-plugin-rhn-3.2.3-20.el7.noarch vdsm-python-4.16.26-1.el7ev.noarch ovirt-node-branding-rhev-3.2.3-20.el7.noarch vdsm-jsonrpc-4.16.26-1.el7ev.noarch vdsm-reg-4.16.26-1.el7ev.noarch ovirt-node-plugin-snmp-3.2.3-20.el7.noarch vdsm-python-zombiereaper-4.16.26-1.el7ev.noarch vdsm-xmlrpc-4.16.26-1.el7ev.noarch ovirt-host-deploy-1.3.2-1.el7ev.noarch vdsm-yajsonrpc-4.16.26-1.el7ev.noarch vdsm-hook-vhostmd-4.16.26-1.el7ev.noarch ovirt-node-selinux-3.2.3-20.el7.noarch vdsm-4.16.26-1.el7ev.x86_64 ovirt-host-deploy-offline-1.3.0-3.el7ev.x86_64 ovirt-hosted-engine-ha-1.2.6-3.el7ev.noarch ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch ovirt-node-plugin-cim-3.2.3-20.el7.noarch ovirt-node-3.2.3-20.el7.noarch vdsm-cli-4.16.26-1.el7ev.noarch vdsm-hook-ethtool-options-4.16.26-1.el7ev.noarch ovirt-hosted-engine-setup-1.2.5.3-1.el7ev.noarch ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch libgovirt-0.3.1-3.el7.x86_64 How reproducible: Not 100% Steps to Reproduce: # systemctl start supervdsmd Actual results: The service cannot be started. Snip from 'strace systemctl start supervdsmd': poll([{fd=4, events=POLLIN}], 1, 25000) = 1 ([{fd=4, revents=POLLIN}]) recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1'\0\0\0\1\0\0\0\17\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1o\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 277 recvmsg(4, 0x7ffdefff05f0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) sendmsg(4, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1\27\0\0\0\2\0\0\0\227\0\0\0\1\1o\0\31\0\0\0/org/fre"..., 168}, {"\22\0\0\0supervdsmd.service\0", 23}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 191 poll([{fd=4, events=POLLIN}], 1, 25000) = 1 ([{fd=4, revents=POLLIN}]) recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\0018\0\0\0\2\0\0\0\17\0\0\0\5\1u\0\2\0\0\0\10\1g\0\1o\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 88 recvmsg(4, 0x7ffdefff05f0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) sendmsg(4, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\0019\0\0\0\3\0\0\0\250\0\0\0\1\1o\0003\0\0\0/org/fre"..., 184}, {"\35\0\0\0org.freedesktop.systemd1.Uni"..., 57}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 241 poll([{fd=4, events=POLLIN}], 1, 25000) = 1 ([{fd=4, revents=POLLIN}]) recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\10\0\0\0\3\0\0\0\17\0\0\0\5\1u\0\3\0\0\0\10\1g\0\1v\0\0"..., 2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 40 recvmsg(4, 0x7ffdefff05f0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) poll([{fd=4, events=POLLIN}], 1, 4294967295 Additional info: /var/log/ and 'strace systemctl start supervdsmd' output
To my understanding this is a CLI issue and is not supported in RHEV-H. Does this happen in TUI usage?
Elad - did you see it more than once? On the same host or not? What are the hardware details of the host? (RAM?) How long was it in the setup step before you stopped it? Why did you even start supervdsm? why not to restart vdsm? CC-ing also Dan.
Yaniv, it was executed using TUI. (the opens the CLI with a screen session when starting the deployment) Oved - answered offline
Looks like this is a bug that is not reproducible on any other environment. Only happened on this one host - one time. Maybe someone will run into it in the future. Elad, please in the meantime don't reboot this host so we can try to investigate it further. When running: #systemctl start supervdsmd.service we see that a child process `systemd-tty-ask-password-agent --watch` is spawned by systemctl: [root@green-vdsc ~]# pstree -p $$ bash(13179)─┬─pstree(13494) └─systemctl(13440)───systemd-tty-ask(13441) On a host that installed successfully with vdsm running we see that systemctl will close this child process (using strace -f): [pid 27711] execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-passwor"..., "--watch"], [/* 31 vars */]) = 0 . . . [pid 27710] recvmsg(4, <unfinished ...> [pid 27711] <... read resumed> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000/\0\0\0\0\0\0"..., 832) = 832 [pid 27710] <... recvmsg resumed> 0x7ffec3e09760, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) [pid 27711] fstat(3, {st_mode=S_IFREG|0755, st_size=153184, ...}) = 0 [pid 27711] mmap(NULL, 2245240, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0 <unfinished ...> [pid 27710] close(4 <unfinished ...> [pid 27711] <... mmap resumed> ) = 0x7f32ce913000 [pid 27710] <... close resumed> ) = 0 [pid 27711] mprotect(0x7f32ce937000, 2093056, PROT_NONE) = 0 [pid 27710] kill(27711, SIGTERM <unfinished ...> [pid 27711] mmap(0x7f32ceb36000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23000 <unfinished ...> [pid 27710] <... kill resumed> ) = 0 [pid 27711] <... mmap resumed> ) = 0x7f32ceb36000 [pid 27710] kill(27711, SIGCONT <unfinished ...> [pid 27711] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=27710, si_uid=0} --- [pid 27710] <... kill resumed> ) = 0 [pid 27710] waitid(P_PID, 27711, <unfinished ...> [pid 27711] +++ killed by SIGTERM +++ <... waitid resumed> {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=27711, si_status=SIGTERM, si_utime=0, si_stime=0}, WEXITED, NULL) = 0 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=27711, si_status=SIGTERM, si_utime=0, si_stime=0} --- exit_group(0) = ? -> We can see that systemctl(27710) received response from its child systemd-tty-ask-password-agent(27711) and killed it. On a host where supervdsm cannot start: [pid 14884] execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-passwor"..., "--watch"], [/* 31 vars */]) = 0 . . . poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\242\233\270\f9\257\326\257_\n\30\264\263\17\32\272\0058\331\245\236f\207\202\321(\322`pS\324\376"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "G\262_$\210%*\275Q`G\240\361F\323\306\254.\334\323]\314:!u\343H\200\227\1\206&"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\272\177J\330D\252\367\2\367w\302\230-\221'\267\365\376\214\217\334\327\245a\311\377\tfc\177\273j"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\324\230x\231\222\210\271-\22a\2134\202\274\\\373\305\231;\177\244e\246=k\204\216\327\340iG\17"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "3\210\234X\0N\202G\354\22WFt^\331V\344\32\\A\36\323[\302\370\363\371\210\211\t\2129"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "+\235\26\274\373R\34\377Rs9\370\273\370\t),V\276\v`\233^E\256\257lX#\27\23W"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\346\237\ru5\311\372\362K?2\203\300\246C2K(\4+\20\341t\4S\370\35\25}>\265\240"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "F\25\257j\233q\315W\25\334\306\233\217!\306$\255Y\33\364\0039Qy\2\223@o\235\257hw"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\324\316\10k*\332A\253\344^\5y\372\10I\375\216\372\271\277\205\2264\0\35>?]`\263\374\275"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\34P\10w\314e\270\226X\t\212\30\"\17.\334\3471\342B\334\272\315\351\367\324k\273S\363\323f"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "\210\230;\227\31\261\364A\301#\374\311\206?\17\202\2749\340\246\323gm\315\325\361\335K\200\6h\275"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) read(3, "i\10Dl\5\312\222\376\10l\37\277X\341CIs\\\"\235{\315|\334\227\10>\246a\5,\36"..., 120) = 120 poll([{fd=3, events=POLLIN}], 1, 100) = 1 ([{fd=3, revents=POLLIN}]) systemctl is waiting for a response from its child process forever and does not close it. As a result, it never even tries to run the start command for supervdsm.
Seems like a edge case. Moving for review for 3.5.5.
Due to comment #4 and no reproduction till now I assume it is related to environment issue. Therefore I close the bug for now. please reopen asap if the issue is raised again Elad, please try to rerun this case and update with your results
(In reply to Yaniv Bronhaim from comment #6) > Due to comment #4 and no reproduction till now I assume it is related to > environment issue. Therefore I close the bug for now. please reopen asap if > the issue is raised again > > Elad, please try to rerun this case and update with your results This can be easily reproduced on the environment that was left in this state to allow the investigation. I'm reopening, cause I can still see it happening on the same environment.
Please don't reinstall or reboot this host. I am talking to guys from systemd to try and figure this out. and it will be useless without this host. thanks.