Bug 1346068

Summary: [Beaker 22.3] ERROR Remote end send echo without any answer!
Product: [Retired] Beaker Reporter: PaulB <pbunyan>
Component: beahAssignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE QA Contact: Roman Joost <rjoost>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 22CC: asavkov, bpeck, dcallagh, jburke, jstancek, mjia, pbunyan, rjoost
Target Milestone: 25.0Keywords: Patch
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-03-19 04:19:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description PaulB 2016-06-13 20:17:37 UTC
Description of problem:
The following issue was seen during automated testing:
 ERROR Remote end send echo without any answer!

Version-Release number of selected component (if applicable):
Beaker: Beaker 22.3

How reproducible:
 Unknown

.

Actual results:
https://beaker.engineering.redhat.com/recipes/2783924
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2016/06/13649/1364939/2783924/console.log
---<-snip->---
[   70.804728] beah-fwd-backend[4754]: 2016-06-09 16:29:54,811 backend handle_evt_variable_get: ERROR Remote end send echo without any answer! (diagnostic=('echo', True, <beah.backends.forwarder._RemoteBackend object at 0x1003cd65990>, u'cb3bfc9e-8c42-4c47-8bc0-2d9acc0524b5', [u'Event', u'echo', u'3615987a-73bf-4393-9f0d-e98b9a3a67a3', {u'class': u'controller'}, None, {u'message': u'', u'cmd_id': u'cb3bfc9e-8c42-4c47-8bc0-2d9acc0524b5', u'rc': 0}])) 
[   70.825613] beah-fwd-backend[4754]: 2016-06-09 16:29:54,832 backend linfo: INFO BackendFactory: Connected.  Address: IPv4Address(TCP, '10.16.70.81', 12432) 
[   70.825867] beah-fwd-backend[4754]: 2016-06-09 16:29:54,833 backend linfo: INFO BackendFactory: Resetting reconnection delay 
[-- MARK -- Thu Jun  9 20:30:00 2016] 
[   80.806889] beah-fwd-backend[4754]: 2016-06-09 16:30:04,813 backend stop_remote: INFO Closing RemoteBackend for ibm-p8-06-lp1.klab.eng.bos.redhat.com:12432. 
[   80.807943] beah-fwd-backend[4754]: 2016-06-09 16:30:04,815 backend linfo: INFO BackendFactory: Started to connect. 
[   80.808452] beah-fwd-backend[4754]: 2016-06-09 16:30:04,815 backend linfo: INFO BackendFactory: Lost connection.  Reason: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly. 
[   80.808738] beah-fwd-backend[4754]: ] 
[   80.809899] beah-fwd-backend[4754]: 2016-06-09 16:30:04,817 backend linfo: INFO BackendFactory: Connected.  Address: IPv4Address(TCP, '10.16.160.32', 12432) 
[   80.810187] beah-fwd-backend[4754]: 2016-06-09 16:30:04,817 backend linfo: INFO BackendFactory: Resetting reconnection delay 
[   80.810487] beah-srv[4756]: 2016-06-09 16:30:04,817 beah buildProtocol: INFO BackendListener: Connected.  Address: IPv6Address(TCP, '::ffff:10.16.160.32', 55612) 
[   80.811707] beah-fwd-backend[4754]: 2016-06-09 16:30:04,818 backend handle_evt_variable_get: ERROR Remote end send echo without any answer! (diagnostic=('echo', True, <beah.backends.forwarder._RemoteBackend object at 0x1003cd6f350>, u'985f7b5e-397f-4c66-8c31-72fa301743a6', [u'Event', u'echo', u'a885042b-d8e7-4ace-8446-3289a278e22c', {u'class': u'controller'}, None, {u'message': u'', u'cmd_id': u'985f7b5e-397f-4c66-8c31-72fa301743a6', u'rc': 0}])) 
[   90.812340] beah-fwd-backend[4754]: 2016-06-09 16:30:14,819 backend stop_remote: INFO Closing RemoteBackend for ibm-p8-06-lp1.klab.eng.bos.redhat.com:12432. 
---<-snip->---


Expected results:
 successful completion of automates test run


Additional info:

Comment 1 Dan Callaghan 2017-05-29 22:28:07 UTC
Also saw this here, in our self-tests:

https://beaker-devel.app.eng.bos.redhat.com/jobs/12347#set16386

It is definitely *some* kind of network problem between the hosts but I am not sure what exactly. I think the problematic recipe was on hp-dl160gen9-04.rhts.eng.bos.redhat.com and I noticed that it has two NICs so I should double-check what addresses it is getting.

Comment 2 Dan Callaghan 2017-05-30 01:42:46 UTC
(In reply to PaulB from comment #0)
> Actual results:
> https://beaker.engineering.redhat.com/recipes/2783924
> http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2016/06/13649/
> 1364939/2783924/console.log

In this case the problematic host also had multiple NICs cabled up to the same network segment. And the address it is *supposed* to be using according to DNS (10.16.160.32) has a higher metric than one of the others:

10.16.160.0/21 dev eth2  proto kernel  scope link  src 10.16.167.44  metric 100  
10.16.160.0/21 dev eth0  proto kernel  scope link  src 10.16.160.32  metric 101  
10.16.160.0/21 dev eth1  proto kernel  scope link  src 10.16.167.45  metric 102  

So I think that means beah-srv would be connecting off to the other hosts in the recipe set using a different address (10.16.167.44) than what they are expecting (10.16.167.32). I wonder if beah has some source address checks or something?

Comment 3 Dan Callaghan 2017-05-30 03:01:05 UTC
Okay, I think I have spotted it... The problem is not the multiple interfaces themselves, but if the system gets the wrong hostname. For example because it did a reverse lookup on the first IP address (which is not the right one).

So in R:2783924, you can see:

[    0.578855] systemd[1]: Set hostname to <dhcp167-44.klab.eng.bos.redhat.com>. 

which is the hostname matching the random DHCP pool address that eth2 got (10.16.167.44) instead of the "proper" hostname which is on eth0 (10.16.160.32).

Similarly in my reproducer on beaker-devel, it is because hp-dl160gen9-04.rhts.eng.bos.redhat.com comes up with multiple addresses and picks a different hostname for itself based on the secondary NIC.

I'm still not sure why exactly beah *cares* about the hostname (since the real hostname still resolves to an address on the machine) but I do see that beah is passing around hostnames in various parts of its JSON messages so I guess there is something matching happening somewhere.

Comment 4 Dan Callaghan 2017-05-30 03:08:26 UTC
The only hostname stuff that I can find in beah that looks relevant is the _localhosts_aggressive function. And indeed it seems that it does not consider the possibility of multiple NICs with different addresses having different hostnames, which is the exact situation we have here.

Comment 5 Dan Callaghan 2017-05-30 05:49:30 UTC
There is also some non-determinism here. I can't reproduce this 100% of the time with the recipe in comment 1. It seems (at least on Fedora 25 Cloud if not other distros) both NICs on the system are brought up by NetworkManager at the same moment and sent DHCP requests. It seems like sometimes one fails due to never receiving a DHCPOFFER (separate issue?) but whichever one wins and comes up first will determine the system hostname. So the hostname may be correct or incorrect depending on which NIC wins.

This would be related to a recent Anaconda change where it no longer configures the DHCP hostname from the installation as the hostname after installation -- it leaves it unset so that the hostname will be determined from DHCP again on every bootup: bug 1290858.

Comment 6 Dan Callaghan 2017-05-30 06:17:16 UTC
It reproduced in this job:

https://beaker-devel.app.eng.bos.redhat.com/jobs/12365

although on hp-dl120gen9-06.rhts.eng.bos.redhat.com this time, not hp-dl160gen9-04.rhts.eng.bos.redhat.com. (They both have two cabled NICs and therefore a chance of reproducing this.)

This beah patch adds some logging which helps explain it:

https://gerrit.beaker-project.org/5706

In R:18940 console.log we see:

[  202.961678] beah-srv[1177]: 2017-05-30 01:58:37,107 beah _localhosts_aggressive: DEBUG Determined localhost addresses are ['localhost', '::1', '10.16.70.189', 'dhcp70-189.rhts.eng.bos.redhat.com', 'localhost.localdomain', 'localhost6', 'localhost4.localdomain4', 'localhost4', '127.0.0.1', 'localhost6.localdomain6'] 
...
[  225.929773] beah-srv[1177]: 2017-05-30 01:58:45,002 beah _set_local: DEBUG Host hp-dl120gen9-06.rhts.eng.bos.redhat.com is not local 

Compare this to the happy case:

https://beaker-devel.app.eng.bos.redhat.com/jobs/12364

where we see:

[  110.599383] beah-srv[1169]: 2017-05-29 23:37:01,624 beah _localhosts_aggressive: DEBUG Determined localhost addresses are ['10.16.65.164', 'localhost', '::1', 'localhost6', 'localhost.localdomain', 'hp-dl120gen9-06.rhts.eng.bos.redhat.com', 'localhost4.localdomain4', 'localhost4', '127.0.0.1', 'localhost6.localdomain6'] 

so beah knows to answer events for hp-dl120gen9-06.rhts.eng.bos.redhat.com.

This probably means we need to expand the _localhosts_aggressive routine to find addresses for all active interfaces.

Comment 7 Dan Callaghan 2017-05-30 11:06:43 UTC
(In reply to Dan Callaghan from comment #6)
> This probably means we need to expand the _localhosts_aggressive routine to
> find addresses for all active interfaces.

I did that in this patch:

https://gerrit.beaker-project.org/5707

and indeed it does now find a more complete set of addresses for the local system:

[  206.177607] beah-srv[1193]: 2017-05-30 03:40:39,255 beah _localhosts_aggressive: DEBUG Determined localhost addresses are ['fe80::d2ed:6cd4:ed18:7361', '2620:52:0:1040:5a20:b1ff:fe06:7584', '10.16.70.255', 'fe80::5a20:b1ff:fe06:7584', 'fe80::d2ed:6cd4:ed18:7361%eno2', 'localhost', '::1', 'localhost4', 'fe80::5a20:b1ff:fe06:7584%eno1', 'localhost6.localdomain6', '127.0.0.1', 'localhost4.localdomain4', 'localhost6', 'dhcp70-255.rhts.eng.bos.redhat.com', '2620:52:0:1040:3b21:52c8:5936:8d8c', 'localhost.localdomain'] 
[  206.177825] beah-srv[1193]: 2017-05-30 03:40:39,257 beah _set_local: DEBUG Host dhcp70-255.rhts.eng.bos.redhat.com is local 

But it doesn't actually help. If the system hostname is wrong (because it was assigned based on the wrong interface coming up first) then multi-host recipes will still break because they will be expecting $HOSTNAME to actually match what the other systems in the recipe set have.

2017-05-30 03:40:38,545 rhts_task xmlrpc_set: DEBUG XMLRPC: sync.set('16406', '9', 'http://[::1]:7082/cgi-bin/rhts/scheduler_xmlrpc.cgi', 'dhcp70-255.rhts.eng.bos.redhat.com', 'DONE')

So we are basically just back to the same old problem -- if the system hostname is not correct then everything will break.

Comment 8 Dan Callaghan 2017-09-28 06:52:51 UTC
I wonder if we should just go all out, and forcibly set the proper hostname in the kickstart, using "network --hostname=example.com". To avoid all these issues with Anaconda/NetworkManager/dhclient not setting the hostname that we are expecting.

Comment 9 Dan Callaghan 2017-10-05 07:43:53 UTC
(In reply to Dan Callaghan from comment #8)

We can try it for Beaker 25.0.

https://gerrit.beaker-project.org/5859

Comment 11 Dan Callaghan 2017-10-06 07:02:05 UTC
In some ways, this:

https://bugzilla.redhat.com/show_bug.cgi?id=969962#c4

would be a better/more complete fix for this problem. So maybe we should do that too.

Comment 13 Roman Joost 2018-02-15 00:50:48 UTC
I now scheduled three multi host jobs with the fixes which were failing in our self tests. All with a pass now:

https://beaker-devel.app.eng.bos.redhat.com/jobs/13878
https://beaker-devel.app.eng.bos.redhat.com/jobs/13872
https://beaker-devel.app.eng.bos.redhat.com/jobs/13866

I made sure that I used a RHEL version >= 7.3. Setting this to VERIFIED. I think we'll have one last check running our self tests when releasing Beaker 25.

Comment 14 Roman Joost 2018-03-19 04:19:04 UTC
Beaker 25.0 has been released.

Release notes are available upstream: https://beaker-project.org/docs/whats-new/release-25.html