Bug 1267187 - engine-setup hangs indefinitely starting ovirt-websocket-proxy via service using python subprocess module
Summary: engine-setup hangs indefinitely starting ovirt-websocket-proxy via service using python subprocess module
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Services
Version: 3.5.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.5.6
Target Release: 3.5.6
Assignee: Sandro Bonazzola
QA Contact: Karolína Hajná
URL:
Whiteboard: integration
Depends On: 1266881
Blocks:
 
Reported: 2015-09-29 09:23 UTC by Sandro Bonazzola
Modified: 2016-05-20 01:24 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1266881
Environment:
Last Closed: 2015-12-22 13:24:33 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-3.5.z+
rule-engine: blocker+
bmcclain: planning_ack+
sbonazzo: devel_ack+
pnovotny: testing_ack+


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 46769 0 None None None Never
oVirt gerrit 46773 0 None None None Never
oVirt gerrit 46775 0 None None None Never
oVirt gerrit 46781 0 ovirt-engine-3.5 MERGED packaging: pythonlib: service: by default redirect to /dev/null stdout/stderr Never

Description Sandro Bonazzola 2015-09-29 09:23:35 UTC
+++ This bug was initially created as a clone of Bug #1266881 +++

While installing ovirt-engine, the setup gets stuck on "service ovirt-websocket-proxy start".
It looks like the service no longer behaves as a proper daemon.

Workaround: manually stop the service during the setup and restart it when the setup finishes

Workaround: choose not to set up the websocket proxy on the system while configuring ovirt-engine

--- Additional comment from Simone Tiraboschi on 2015-09-28 11:11:59 EDT ---

Adding details:

ovirt-websocket-proxy starts correctly if we invoke service manually:

 [root@c66et1 ~]# /sbin/service ovirt-websocket-proxy start
 Starting oVirt Engine websockets proxy:                    [  OK  ]
 [root@c66et1 ~]# echo $?
 0

but the issue happens if we start the service through otopi.
The Python daemon keeps running and the service command exits, but the Python code never notices it; the service process is left defunct while the setup keeps monitoring it:

29983  2913 29983  1416 pts/0    29983 Z+       0   0:00 [service] <defunct>

These few Python lines are enough to reproduce it (stop the service manually first):
 import subprocess
 # start the service, capturing its stdout/stderr through pipes
 p = subprocess.Popen(('/sbin/service', 'ovirt-websocket-proxy', 'start'), stdin=None, stderr=subprocess.PIPE, stdout=subprocess.PIPE, close_fds=True,)
 # hangs: the daemonized proxy keeps the inherited pipe write ends open, so EOF never arrives
 output = p.communicate()
 print 'output: %s' % str(output)

The service command completes, but this Python script will wait forever on the communicate() call.


If we run it under strace we see:
 poll([{fd=3, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
 read(3, "Starting oVirt Engine websockets"..., 4096) = 40
 poll([{fd=3, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 2, -1) = 1 ([{fd=3, revents=POLLIN}])
 read(3, "\33[60G[\33[0;32m  OK  \33[0;39m]\r\n", 4096) = 29
 poll([{fd=3, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 2, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10928, si_status=0, si_utime=0, si_stime=0} ---
 restart_syscall(<... resuming interrupted call ...>

The Python script correctly receives SIGCHLD when the service process exits, but no handling code runs and it continues to wait. It will wait indefinitely, because at that point the service command has already died.

It could be related to this one:
https://bugzilla.redhat.com/1065537
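
For illustration, a minimal variant of the reproducer above (assuming the root cause described here: the daemonized proxy keeps the inherited pipe write ends open) that avoids the hang by sending the child's output to /dev/null instead of a pipe. This is a sketch, not the actual otopi code:

 import os
 import subprocess

 # Same reproducer, but without handing the forking init script any pipe to
 # inherit: the service command's output goes to /dev/null, so wait() returns
 # as soon as /sbin/service exits, regardless of what the daemon keeps open.
 devnull = open(os.devnull, 'w')
 p = subprocess.Popen(
     ('/sbin/service', 'ovirt-websocket-proxy', 'start'),
     stdin=None,
     stdout=devnull,
     stderr=devnull,
     close_fds=True,
 )
 rc = p.wait()
 devnull.close()
 print 'rc: %d' % rc

The tradeoff is that the init script's output is discarded, which is what the later pythonlib change accepts as the default.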

--- Additional comment from Simone Tiraboschi on 2015-09-28 11:16:27 EDT ---

Seen with python 2.6.6-64.el6

--- Additional comment from Alon Bar-Lev on 2015-09-28 15:05:27 EDT ---

Not sure I understand: if this is a bug in Python and a regression, which Python version last worked, and which is the first that does not?

--- Additional comment from Alon Bar-Lev on 2015-09-28 15:21:52 EDT ---

Check out the service change I submitted; it closes the caller's stdout/stderr and should resolve this issue. I have no el6 environment to test on.

--- Additional comment from Simone Tiraboschi on 2015-09-28 16:18:33 EDT ---

(In reply to Alon Bar-Lev from comment #4)
> Check out the service change I submitted; it closes the caller's stdout/stderr
> and should resolve this issue. I have no el6 environment to test on.

It works on el6, thanks.
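
For context, a hedged sketch of the approach described in comment #4 and in gerrit change 46781 (redirect the child's stdout/stderr to /dev/null by default). This is not the actual otopi pythonlib code; the helper name start_service is illustrative only:

 import os
 import subprocess

 def start_service(name, capture_output=False):
     # Hypothetical helper: only hand the child real pipes when the caller
     # explicitly asks for output; otherwise use /dev/null so a daemonizing
     # init script cannot hold the pipe write ends open.
     # (capture_output=True would reintroduce the hang for init scripts whose
     # daemon inherits the pipes, so it is off by default.)
     devnull = None
     if capture_output:
         stdout = stderr = subprocess.PIPE
     else:
         devnull = open(os.devnull, 'w')
         stdout = stderr = devnull
     try:
         p = subprocess.Popen(
             ('/sbin/service', name, 'start'),
             stdin=None,
             stdout=stdout,
             stderr=stderr,
             close_fds=True,
         )
         out, err = p.communicate()
         return p.returncode, out, err
     finally:
         if devnull is not None:
             devnull.close()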

--- Additional comment from Michal Skrivanek on 2015-09-29 03:45:12 EDT ---

Since this is a regression caused by el6 Python, we need to backport the fix to 3.5.z as well.

Comment 1 Red Hat Bugzilla Rules Engine 2015-09-29 09:33:39 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 2 Karolína Hajná 2015-11-04 11:24:47 UTC
Verified on vt18 (rhevm-3.5.6.1-0.1.el6ev.noarch)

Comment 3 Sandro Bonazzola 2015-12-22 13:24:33 UTC
oVirt 3.5.6 has been released and the bug verified; moving to CLOSED CURRENTRELEASE.

