Bug 1121558

Summary: machine sometimes stops communicating with beaker
Product: [Retired] Beaker
Component: general
Version: 0.17
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: unspecified
Reporter: Vladimir Benes <vbenes>
Assignee: beaker-dev-list
QA Contact: tools-bugs <tools-bugs>
CC: dcallagh, jjelen, jpazdziora, pbunyan, rjoost
Doc Type: Bug Fix
Type: Bug
Last Closed: 2018-09-14 07:28:29 UTC

Description Vladimir Benes 2014-07-21 08:33:50 UTC
Description of problem:
Sometimes while testing NetworkManager, the machine stops communicating back to the Beaker server after an eth0 connection down/up. This leads to an External Watchdog Expired warning and an incomplete report in the Beaker web UI.

While this problem occurs I can ssh into the machine and watch the task progressing further. I was able to fix this in the pre-0.17 era by restarting the beah-beaker-backend service, but that doesn't work anymore as it restarts the task itself too (via its dependency on beah-srv), and then there is a mixture of two tasks running simultaneously.

example and reproducer:
https://beaker.engineering.redhat.com/jobs/698631

Comment 1 Vladimir Benes 2014-07-21 12:09:38 UTC
I have a workaround now:
        print "upping eth0"
        call("nmcli connection up id eth0", shell=True)
        call('sudo kill $(ps aux|grep -v grep| grep /usr/bin/beah-beaker-backend |awk \'{print $2}\')', shell=True)
        Popen('beah-beaker-backend -H $(hostname) &', shell=True)

Comment 3 Dan Callaghan 2014-07-24 02:34:08 UTC
(In reply to Vladimir Benes from comment #0)
> example and reproducer:
> https://beaker.engineering.redhat.com/jobs/698631

In the recipe I see that Beah is using IPv6 to talk to the LC.

Are you sure that your tests restore network connectivity to a fully working state, including IPv6?

In particular, the fact that the network seems stuck until you SSH in, and then starts working again, suggests that something in the kernel's IPv6 neighbour table might be in a bad state. We have certainly seen that before (kernel bug 1065257).

You can try adding "beah_no_ipv6" to ksmeta (which forces Beah to use IPv4 only) to see whether the problem is specific to IPv6.
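
For what it's worth, one way to check from inside the test whether the lab controller is reachable over both IPv4 and IPv6 after eth0 comes back up might be something like this (a minimal Python 2 sketch; the hostname and port 8000 are placeholders for your lab controller):

    import socket

    def reachable(host, port, family):
        try:
            addr = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)[0][4]
            s = socket.socket(family, socket.SOCK_STREAM)
            s.settimeout(5)
            s.connect(addr)
            s.close()
            return True
        except (socket.error, socket.gaierror):
            return False

    lc = 'lab-controller.example.com'  # placeholder, use your lab controller
    print "IPv4 ok:", reachable(lc, 8000, socket.AF_INET)
    print "IPv6 ok:", reachable(lc, 8000, socket.AF_INET6)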

Comment 4 Vladimir Benes 2014-08-01 15:06:19 UTC
I've tried beah_no_ipv6 but that wasn't helpful at all. The behaviour seems to be the same.
Restarting beah-beaker-backend just after bringing eth0 back up seems to lead to some ugly report repetitions, so now I instead restart beah-beaker-backend with a 20-second timeout two or three times during those 220+ tests. I do this at points where the network is stable, not while it is being taken down or brought up again. With this timeout everything works, at least in the sense that all tests get reported and the job finishes.
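
The periodic restart is essentially the comment 1 workaround with a delay in front of it, roughly like this (a sketch, not my exact code; only the 20-second delay and the backend restart are the point, the rest is illustrative):

    from subprocess import call, Popen
    from time import sleep

    def restart_backend(delay=20):
        # wait for a moment when the network is stable, then bounce only
        # the backend process (not beah-srv, which would rerun the task)
        sleep(delay)
        call('sudo pkill -f /usr/bin/beah-beaker-backend', shell=True)
        Popen('beah-beaker-backend -H $(hostname) &', shell=True)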

I think there should be some check in the beah-beaker-backend binary, or in the service itself, that communication is working as expected, with beah-beaker-backend restarted if something goes wrong.

Comment 5 Dan Callaghan 2014-08-04 01:11:26 UTC
(In reply to Vladimir Benes from comment #4)
> I think there should be some check in the beah-beaker-backend binary, or in
> the service itself, that communication is working as expected, with
> beah-beaker-backend restarted if something goes wrong.

There already is: it retries XML-RPC calls in a loop with exponential backoff if anything goes wrong with the request. But maybe there is some problem with this retry logic.
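
The retry behaviour is roughly the following pattern (a minimal sketch for illustration, not Beah's actual code; the starting delay and the cap are made up):

    import time
    import xmlrpclib  # Python 2, as beah itself is

    def call_with_backoff(proxy, method, *args):
        delay = 1
        while True:
            try:
                return getattr(proxy, method)(*args)
            except Exception:
                # e.g. network unreachable while eth0 is down:
                # wait and retry, doubling the delay up to a cap
                time.sleep(delay)
                delay = min(delay * 2, 120)

    # usage (illustrative):
    # proxy = xmlrpclib.ServerProxy('http://lab-controller.example.com:8000/RPC2')
    # call_with_backoff(proxy, 'some_method', 'arg1', 'arg2')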

Comment 6 Jakub Jelen 2014-08-04 05:53:53 UTC
I'm not sure if it will be helpful, but my observation is that if the job finishes fast (yesterday, about 3 hours) it is fine, but if it takes longer (4 or 5 hours) this error occurs. The test runs the same tasks; the duration depends on machine load and which machine is selected.
I'm not even switching interfaces up and down, just messing with firewalld and therefore with iptables. As Vladimir advised, "pkill beah-beaker-backend" worked for me, but it is not ideal.

Comment 7 Jan Pazdziora 2014-09-10 09:04:21 UTC
I have a similar/related issue. I'm doing service beah-beaker-backend restart in multiple places for the backend to re-read resolv.conf changes -- testing DNS on localhost or in a container. With the latest beah (0.17), that seems to start runtest.sh from the beginning, not something we'd want.

Having either a backend which would handle the underlying changes, or a supported way to tell the backend to respawn itself or re-read the configuration and network setup, would be nice.

Comment 8 Amit Saha 2014-09-11 00:08:47 UTC
(In reply to Jan Pazdziora from comment #7)
> I have a similar/related issue. I'm doing service beah-beaker-backend restart
> in multiple places for the backend to re-read resolv.conf changes -- testing
> DNS on localhost or in a container. With the latest beah (0.17), that seems
> to start runtest.sh from the beginning, not something we'd want.

Off the top of my head: when you restart beah-beaker-backend, it fetches the recipe XML, sees that the tasks are not done yet, and so proceeds execution from the first task.

> 
> Having either a backend which would handle the underlying changes, or a
> supported way to tell the backend to respawn itself or re-read the
> configuration and network setup, would be nice.

Anything to do with a process restart will introduce the same behavior as above: the task during which this happens will be restarted. What could perhaps be done is to implement a "beahsh" command which can be used in a test to re-establish the connection to the lab controller without restarting the process itself.
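
Purely to illustrate the idea (a hypothetical sketch, not anything beah implements today; the class, the signal choice and the URL are made up), such a reconnect-without-restart could be as simple as a signal handler in the backend that drops and re-opens the lab controller session:

    import signal
    import xmlrpclib  # Python 2, as beah itself is

    class Backend(object):
        def __init__(self, lc_url):
            self.lc_url = lc_url
            self.session = xmlrpclib.ServerProxy(self.lc_url)

        def handle_reconnect(self, signum, frame):
            # drop the old session and open a new one, picking up any
            # resolv.conf or network changes the test has made
            self.session = xmlrpclib.ServerProxy(self.lc_url)

    backend = Backend('http://lab-controller.example.com:8000/RPC2')  # placeholder URL
    signal.signal(signal.SIGUSR1, backend.handle_reconnect)
    # a test could then run: kill -USR1 $(pgrep -f beah-beaker-backend)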

Comment 9 Vladimir Benes 2018-09-14 07:28:29 UTC
Hmm, I think we can close this, as the restraint harness fixed all that.

Comment 10 Vladimir Benes 2018-10-08 08:15:19 UTC
It definitely was a bug, but now with restraint it's not visible anymore, so it was very likely connected to beah and Python.