Bug 1121558
Summary: | machine sometimes stops communicating with beaker | ||
---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Vladimir Benes <vbenes> |
Component: | general | Assignee: | beaker-dev-list |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | tools-bugs <tools-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 0.17 | CC: | dcallagh, jjelen, jpazdziora, pbunyan, rjoost |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-09-14 07:28:29 UTC | Type: | Bug |
Description
Vladimir Benes
2014-07-21 08:33:50 UTC
I have a workaround now:

    print "upping eth0"
    call("nmcli connection up id eth0", shell=True)
    call('sudo kill $(ps aux|grep -v grep| grep /usr/bin/beah-beaker-backend |awk \'{print $2}\')', shell=True)
    Popen('beah-beaker-backend -H $(hostname) &', shell=True)

(In reply to Vladimir Benes from comment #0)
> example and reproducer:
> https://beaker.engineering.redhat.com/jobs/698631

In the recipe I see that Beah is using IPv6 to talk to the LC. Are you sure that your tests restore network connectivity to a fully working state, including IPv6? Particularly if you are saying that the network seems stuck until you SSH in, and then it starts working again, suggests that something in the kernel's IPv6 neighbour table might be in a bad state. We have certainly seen it before (kernel bug 1065257).

You can try adding "beah_no_ipv6" to ksmeta (force beah to use IPv4 only) to see if the problem is specific to IPv6.

I've tried beah_no_ipv6 but that wasn't helpful at all. Behaviour seems to be the same.

Restarting beah-beaker-backend just after upping eth0 back again seems to lead to some ugly report repetitions so now I restart beah-beaker-backend with 20 secs timeout two or three times during those 220+ tests instead. I do this in times where network is stable and not down or upped again. With this timeout all works at least the way I have all tests reported and test finishes.

I think there should be some check in beah-beaker-backend binary or in service itself that communication works as expected and beah-beaker-backend restarted if something goes wrong.

(In reply to Vladimir Benes from comment #4)
> I think there should be some check in beah-beaker-backend binary or in
> service itself that communication works as expected and beah-beaker-backend
> restarted if something goes wrong.

There already is, it retries XML-RPC calls in a loop with exponential backoff if anything goes wrong with the request. But maybe there is some problem with this retry logic.
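For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of retrying a call in a loop with exponential backoff. It is not Beah's actual retry code; the helper name `call_with_backoff` and its parameters (`retries`, `base_delay`, `max_delay`) are assumptions made up for illustration.

```python
import time
import xmlrpc.client


def call_with_backoff(func, *args, retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(*args), retrying on connection-level failures.

    Hypothetical helper, not part of Beah. The wait between attempts
    doubles after every failure and is capped at max_delay seconds.
    """
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except (OSError, xmlrpc.client.ProtocolError):
            # OSError covers socket errors such as connection refused,
            # unreachable network and timeouts.
            if attempt == retries:
                raise  # out of attempts, let the caller see the error
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With the standard library client this could wrap a bound method of an `xmlrpc.client.ServerProxy` object, for example `call_with_backoff(proxy.some_method, arg)`, where `some_method` stands in for whatever the server exposes. A loop like this only helps if the failing request actually raises. If a request hangs on a half-dead connection with no timeout set, the retry logic is never reached, which may be one way such a scheme appears to "stop communicating".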
I'm not sure if it will be helpful, but my observations are that if the job finishes fast (yesterday, about 3 hours) it is fine, but if it takes longer (4 or 5 hours) this error occurs. The test runs the same tasks, and the duration depends on machine load and on which machine is selected. I'm not even switching interfaces up and down, just messing with firewalld and therefore with iptables. As Vladimir advised, "pkill beah-beaker-backend" worked for me, but it is not ideal.

I have a similar/related issue. I'm doing service beah-beaker-backend restart in multiple places for the backend to re-read resolv.conf changes -- testing DNS on localhost or in a container. With the latest beah (0.17), that seems to start the runtest.sh from the beginning, not something we'd want.

Having either a backend which would handle the underlying changes, or a supported way to tell the backend to respawn itself or re-read the configuration and network setup, would be nice.

(In reply to Jan Pazdziora from comment #7)
> I have a similar/related issue. I'm doing service beah-beaker-backend restart
> in multiple places for the backend to re-read resolv.conf changes -- testing
> DNS on localhost or in a container. With the latest beah (0.17), that seems to
> start the runtest.sh from the beginning, not something we'd want.

Off the top of my head: when you restart beah-beaker-backend, it fetches the recipe XML, sees that the tasks are not done yet, so it resumes execution from the first task that is not done.

> Having either a backend which would handle the underlying changes, or a
> supported way to tell the backend to respawn itself or re-read the
> configuration and network setup, would be nice.

Anything to do with a process restart will introduce the same behavior as above: the task during which the restart happens will be run again from the start. What perhaps can be done is to implement a "beahsh" command which can be used in a test to re-establish the connection to the lab controller without restarting the process itself.
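The restart behaviour described above (fetch the recipe XML, then continue from the first task that is not done) can be sketched roughly as follows. The XML below is a simplified, hypothetical recipe, and the task names, the `status` values and the `first_unfinished_task` helper are assumptions for illustration, not Beah's real data model or code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified recipe XML (assumed layout).
RECIPE_XML = """
<recipe>
  <task name="/distribution/install" status="Completed"/>
  <task name="/example/network-tests" status="Running"/>
  <task name="/example/cleanup" status="Waiting"/>
</recipe>
"""

# Statuses treated as "done" in this sketch (an assumption).
FINISHED = {"Completed", "Aborted", "Cancelled"}


def first_unfinished_task(recipe_xml):
    """Return the name of the first task that is not finished, or None."""
    root = ET.fromstring(recipe_xml.strip())
    for task in root.iter("task"):
        if task.get("status") not in FINISHED:
            return task.get("name")
    return None


print(first_unfinished_task(RECIPE_XML))
```

Under these assumptions the task that was running when the backend was restarted is still not finished, so it is selected again, and its runtest.sh starts over instead of resuming midway.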
Hmm, I think we can close this, as the restraint harness fixed all that. It definitely was a bug, but with restraint it is no longer visible, so it was very likely connected to beah and python.