Description of problem: Beaker job aborts after external watchdog expires. The following pattern is present in the console log as the last entry: [ 1248.986557] traps: restraintd[2872] trap int3 ip:5fcdf1 sp:7ffcd7f46100 error:0 in restraintd[400000+6da000] https://beaker.engineering.redhat.com/recipes/8130224#task108910255 https://beaker.engineering.redhat.com/recipes/8199715#task109719702 https://beaker.engineering.redhat.com/recipes/8106653#task108667650 The common pattern of all these jobs seems to be that the recipes are stuck during /distribution/virt-install execution. Version-Release number of selected component (if applicable): restraint.x86_64 0:0.2.0-1.el7bkr How reproducible: Not sure, probably schedule a job similar to the linked ones above. Steps to Reproduce: 1. 2. 3. Actual results: Beaker job aborted by external watchdog Expected results: Job completes successfully. Additional info:
characteristics of this bug very similar to BZ1832918. Trap message is the same and they are performing a reboot. However, this case is not readily reproducible. In fact, was told it is rarely reproducible.
TL;DR, the issue, restraintd crashing, should be fixed in next release, 0.2.1, with, https://github.com/beaker-project/restraint/pull/46/ https://github.com/beaker-project/restraint/pull/54/ But we still don't know if there is something else affecting the task. Therefore we are keeping this open until we can confirm that the issue is not present in the next release. As with other issues, workaround here can be to use restraint-0.1.45 or latest one in https://beaker-project.org/nightlies/harness/ -- As mentioned elsewhere, the message, [ 1248.986557] traps: restraintd[2872] trap int3 ip:5fcdf1 sp:7ffcd7f46100 error:0 in restraintd[400000+6da000] Is produced by a call to "g_error ()" in restraint. Per GLib's docs, https://developer.gnome.org/glib/stable/glib-Message-Logging.html#g-error "Error messages are always fatal, resulting in a call to G_BREAKPOINT() to terminate the application." Calls to "g_error ()" were introduced in 0.2.0 with the changes to remove the libssh library and they are breaking error handling. We reverted all of them in, https://github.com/beaker-project/restraint/pull/46 https://github.com/beaker-project/restraint/pull/47 https://github.com/beaker-project/restraint/pull/54 From these, there are two that will kill restraint if reached. The on in server_io_callback, https://github.com/beaker-project/restraint/blob/f0fb969cf0052b44e0ebaa0b1473ead1fc76d06c/src/server.c#L165 and the one in task_io_callback, https://github.com/beaker-project/restraint/blob/f0fb969cf0052b44e0ebaa0b1473ead1fc76d06c/src/task.c#L222 Checking the failed recipes, we can see that restraintd is terminated while tasks are running, so one of these two calls is being reached. The only similarity with BZ1832918 is that there is a call to "g_error ()", but in a completely different part of the code. This was confirmed in system logs of https://beaker.engineering.redhat.com/recipes/8238510#task110187225, restraintd[2952]: use_pty:FALSE /usr/share/restraint/plugins/run_task_plugins /usr/share/restraint/plugins/run_plugins restraintd[2952]: Invalid file descriptor. restraintd[2952]: IO error: Bad file descriptor kernel: traps: restraintd[2952] trap int3 ip:5fcdf1 sp:7ffe7f837eb0 error:0 in restraintd[400000+6da000] systemd[1]: restraintd.service: main process exited, code=killed, status=5/TRAP systemd[1]: Unit restraintd.service entered failed state. systemd[1]: restraintd.service failed. We are hitting the "G_IO_STATUS_ERROR" case here, and it must be because the stdout of the process running the is closed. With these "g_error ()" calls removed, restraintd is not going to be terminated, and the error that we are hitting here should be handled. Either it's acceptable and the task succeeded, or the task failed and this should logged as usual. If the task failed, either there is something wrong in the task/system, or there is some other bug in restraint that we need to fix. As for now, I'm still not sure about the root cause here, so running these tests with Restraint 0.1.45 may help if the problem is in the task/system. This can be done setting "restraint-rhts-0.1.45" for harness in ksmeta. Using restraint from https://beaker-project.org/nightlies/harness/ may help too.