Bug 1831824 - restraintd crashes when running /distribution/virt-install task
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Restraint
Classification: Retired
Component: general
Version: 0.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: 0.2.1
Assignee: Daniel Rodríguez
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-05 17:24 UTC by Jan Tluka
Modified: 2020-10-27 12:43 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 12:43:48 UTC
Embargoed:



Description Jan Tluka 2020-05-05 17:24:34 UTC
Description of problem:

The Beaker job aborts after the external watchdog expires. The following pattern is present in the console log as the last entry:

[ 1248.986557] traps: restraintd[2872] trap int3 ip:5fcdf1 sp:7ffcd7f46100 error:0 in restraintd[400000+6da000] 

https://beaker.engineering.redhat.com/recipes/8130224#task108910255             
https://beaker.engineering.redhat.com/recipes/8199715#task109719702             
https://beaker.engineering.redhat.com/recipes/8106653#task108667650

The common pattern across all these jobs seems to be that the recipes get stuck during /distribution/virt-install execution.

Version-Release number of selected component (if applicable):
restraint.x86_64 0:0.2.0-1.el7bkr

How reproducible:

Not sure; probably by scheduling a job similar to the ones linked above.

Steps to Reproduce:
1.
2.
3.

Actual results:
Beaker job aborted by external watchdog

Expected results:
Job completes successfully.

Additional info:

Comment 1 Carol Bouchard 2020-05-12 10:02:44 UTC
The characteristics of this bug are very similar to BZ1832918: the trap message is the same, and both cases involve a reboot. However, this case is not readily reproducible; in fact, I was told it is rarely reproducible.

Comment 2 Daniel Rodríguez 2020-05-14 13:39:52 UTC
TL;DR: the issue, restraintd crashing, should be fixed in the next release, 0.2.1, by

https://github.com/beaker-project/restraint/pull/46/
https://github.com/beaker-project/restraint/pull/54/

But we still don't know whether there is something else affecting the task, so we are keeping this open until we can confirm that the issue is not present in the next release. As with other issues, the workaround here can be to use restraint-0.1.45 or the latest build from https://beaker-project.org/nightlies/harness/

--

As mentioned elsewhere, the message,

 [ 1248.986557] traps: restraintd[2872] trap int3 ip:5fcdf1 sp:7ffcd7f46100 error:0 in restraintd[400000+6da000]

is produced by a call to "g_error ()" in restraint. Per GLib's docs, https://developer.gnome.org/glib/stable/glib-Message-Logging.html#g-error

"Error messages are always fatal, resulting in a call to G_BREAKPOINT() to terminate the application."

Calls to "g_error ()" were introduced in 0.2.0 with the changes to remove the libssh library and they are breaking error handling.

We reverted all of them in,

https://github.com/beaker-project/restraint/pull/46
https://github.com/beaker-project/restraint/pull/47
https://github.com/beaker-project/restraint/pull/54

From these, there are two that will kill restraintd if reached. The one in server_io_callback,

https://github.com/beaker-project/restraint/blob/f0fb969cf0052b44e0ebaa0b1473ead1fc76d06c/src/server.c#L165

and the one in task_io_callback,

https://github.com/beaker-project/restraint/blob/f0fb969cf0052b44e0ebaa0b1473ead1fc76d06c/src/task.c#L222

Checking the failed recipes, we can see that restraintd is terminated while tasks are running, so one of these two calls is being reached.
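
To make the failure mode concrete, here is a hypothetical GIOChannel watch callback with the same general shape as those two call sites (a sketch under that assumption, not the actual Restraint code):

    #include <glib.h>

    /* Hypothetical io callback: in 0.2.0, hitting G_IO_STATUS_ERROR calls
     * g_error(), which takes the whole daemon down with SIGTRAP. */
    static gboolean
    io_callback (GIOChannel *io, GIOCondition condition, gpointer user_data)
    {
        gchar buf[4096];
        gsize bytes_read = 0;
        GError *error = NULL;

        (void) condition;
        (void) user_data;

        switch (g_io_channel_read_chars (io, buf, sizeof (buf),
                                         &bytes_read, &error)) {
        case G_IO_STATUS_NORMAL:
        case G_IO_STATUS_AGAIN:
            /* forward the data and keep the watch */
            return TRUE;
        case G_IO_STATUS_ERROR:
            g_error ("IO error: %s", error->message); /* fatal */
            return FALSE; /* never reached */
        default: /* G_IO_STATUS_EOF */
            return FALSE; /* remove the watch */
        }
    }

With the reverts above, the error branch logs the message instead (e.g. with "g_warning ()") and returns FALSE, so only the watch is dropped and restraintd keeps running.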

The only similarity with BZ1832918 is that there is a call to "g_error ()", but in a completely different part of the code.

This was confirmed in the system logs of https://beaker.engineering.redhat.com/recipes/8238510#task110187225,

restraintd[2952]: use_pty:FALSE /usr/share/restraint/plugins/run_task_plugins /usr/share/restraint/plugins/run_plugins
restraintd[2952]: Invalid file descriptor.
restraintd[2952]: IO error: Bad file descriptor
          kernel: traps: restraintd[2952] trap int3 ip:5fcdf1 sp:7ffe7f837eb0 error:0 in restraintd[400000+6da000]
      systemd[1]: restraintd.service: main process exited, code=killed, status=5/TRAP
      systemd[1]: Unit restraintd.service entered failed state.
      systemd[1]: restraintd.service failed.

We are hitting the "G_IO_STATUS_ERROR" case here, and it must be because the stdout of the process running the task is closed.
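
That error path can be reproduced in isolation (an assumption about the mechanism, not code from the failed recipe): reading through a GIOChannel whose underlying descriptor has been closed fails with EBADF, producing both lines seen in the log above.

    #include <glib.h>
    #include <unistd.h>

    int
    main (void)
    {
        int fds[2];
        g_assert (pipe (fds) == 0);

        GIOChannel *io = g_io_channel_unix_new (fds[0]);
        close (fds[0]); /* invalidate the fd behind the channel */

        gchar buf[16];
        gsize bytes_read = 0;
        GError *error = NULL;
        GIOStatus status = g_io_channel_read_chars (io, buf, sizeof (buf),
                                                    &bytes_read, &error);

        /* GLib first warns "Invalid file descriptor." (from
         * g_io_channel_error_from_errno) and then returns
         * G_IO_STATUS_ERROR with g_strerror(EBADF), i.e.
         * "Bad file descriptor" -- matching the system log. */
        g_assert (status == G_IO_STATUS_ERROR);
        g_print ("IO error: %s\n", error->message);
        return 0;
    }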

With these "g_error ()" calls removed, restraintd is not going to be terminated, and the error that we are hitting here should be handled. Either it's acceptable and the task succeeded, or the task failed and this should logged as usual.

If the task failed, either there is something wrong in the task/system, or there is some other bug in restraint that we need to fix.

As of now, I'm still not sure about the root cause here, so running these tests with Restraint 0.1.45 may help determine whether the problem is in the task/system. This can be done by setting "restraint-rhts-0.1.45" as the harness in ksmeta. Using restraint from https://beaker-project.org/nightlies/harness/ may help too.
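
For example, in the Beaker job XML this could look like the following fragment (hypothetical recipe skeleton, assuming the standard "harness" ksmeta variable; only the ks_meta attribute is the relevant part):

    <recipe ks_meta="harness=restraint-rhts-0.1.45">
      <!-- distroRequires, hostRequires, tasks, etc. -->
    </recipe>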

