Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1831824

Summary:	restraintd crashes when running /distribution/virt-install task
Product:	[Retired] Restraint	Reporter:	Jan Tluka <jtluka>
Component:	general	Assignee:	Daniel Rodríguez <danrodri>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	0.2.0	CC:	asavkov, bpeck, breilly, cbeer, cbouchar, olichtne
Target Milestone:	0.2.1
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-10-27 12:43:48 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jan Tluka 2020-05-05 17:24:34 UTC

Description of problem:

Beaker job aborts after external watchdog expires. The following pattern is present in the console log as the last entry:

[ 1248.986557] traps: restraintd[2872] trap int3 ip:5fcdf1 sp:7ffcd7f46100 error:0 in restraintd[400000+6da000] 

https://beaker.engineering.redhat.com/recipes/8130224#task108910255             
https://beaker.engineering.redhat.com/recipes/8199715#task109719702             
https://beaker.engineering.redhat.com/recipes/8106653#task108667650

The common pattern of all these jobs seems to be that the recipes are stuck during /distribution/virt-install execution.

Version-Release number of selected component (if applicable):
restraint.x86_64 0:0.2.0-1.el7bkr

How reproducible:

Not sure, probably schedule a job similar to the linked ones above.

Steps to Reproduce:
1.
2.
3.

Actual results:
Beaker job aborted by external watchdog

Expected results:
Job completes successfully.

Additional info:

Comment 1 Carol Bouchard 2020-05-12 10:02:44 UTC

characteristics of this bug very similar to BZ1832918.  Trap message is the same and they are performing a reboot. However, this case is not readily reproducible.  In fact, was told it is rarely reproducible.

Comment 2 Daniel Rodríguez 2020-05-14 13:39:52 UTC

TL;DR, the issue, restraintd crashing, should be fixed in next release, 0.2.1, with,

https://github.com/beaker-project/restraint/pull/46/
https://github.com/beaker-project/restraint/pull/54/

But we still don't know if there is something else affecting the task. Therefore we are keeping this open until we can confirm that the issue is not present in the next release. As with other issues, workaround here can be to use restraint-0.1.45 or latest one in https://beaker-project.org/nightlies/harness/

As mentioned elsewhere, the message,

[ 1248.986557] traps: restraintd[2872] trap int3 ip:5fcdf1 sp:7ffcd7f46100 error:0 in restraintd[400000+6da000]

Is produced by a call to "g_error ()" in restraint. Per GLib's docs, https://developer.gnome.org/glib/stable/glib-Message-Logging.html#g-error

"Error messages are always fatal, resulting in a call to G_BREAKPOINT() to terminate the application."

Calls to "g_error ()" were introduced in 0.2.0 with the changes to remove the libssh library and they are breaking error handling.

We reverted all of them in,

https://github.com/beaker-project/restraint/pull/46
https://github.com/beaker-project/restraint/pull/47
https://github.com/beaker-project/restraint/pull/54

From these, there are two that will kill restraint if reached. The on in server_io_callback,

https://github.com/beaker-project/restraint/blob/f0fb969cf0052b44e0ebaa0b1473ead1fc76d06c/src/server.c#L165

and the one in task_io_callback,

https://github.com/beaker-project/restraint/blob/f0fb969cf0052b44e0ebaa0b1473ead1fc76d06c/src/task.c#L222

Checking the failed recipes, we can see that restraintd is terminated while tasks are running, so one of these two calls is being reached.

The only similarity with BZ1832918 is that there is a call to "g_error ()", but in a completely different part of the code.

This was confirmed in system logs of https://beaker.engineering.redhat.com/recipes/8238510#task110187225,

restraintd[2952]: use_pty:FALSE /usr/share/restraint/plugins/run_task_plugins /usr/share/restraint/plugins/run_plugins
restraintd[2952]: Invalid file descriptor.
restraintd[2952]: IO error: Bad file descriptor
kernel: traps: restraintd[2952] trap int3 ip:5fcdf1 sp:7ffe7f837eb0 error:0 in restraintd[400000+6da000]
systemd[1]: restraintd.service: main process exited, code=killed, status=5/TRAP
systemd[1]: Unit restraintd.service entered failed state.
systemd[1]: restraintd.service failed.

We are hitting the "G_IO_STATUS_ERROR" case here, and it must be because the stdout of the process running the is closed.

With these "g_error ()" calls removed, restraintd is not going to be terminated, and the error that we are hitting here should be handled. Either it's acceptable and the task succeeded, or the task failed and this should logged as usual.

If the task failed, either there is something wrong in the task/system, or there is some other bug in restraint that we need to fix.

As for now, I'm still not sure about the root cause here, so running these tests with Restraint 0.1.45 may help if the problem is in the task/system. This can be done setting "restraint-rhts-0.1.45" for harness in ksmeta. Using restraint from https://beaker-project.org/nightlies/harness/ may help too.