Description of problem: The following issue was seen during automated testing: ERROR Remote end send echo without any answer! Version-Release number of selected component (if applicable): Beaker: Beaker 22.3 How reproducible: Unknown . Actual results: https://beaker.engineering.redhat.com/recipes/2783924 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2016/06/13649/1364939/2783924/console.log ---<-snip->--- [ 70.804728] beah-fwd-backend[4754]: 2016-06-09 16:29:54,811 backend handle_evt_variable_get: ERROR Remote end send echo without any answer! (diagnostic=('echo', True, <beah.backends.forwarder._RemoteBackend object at 0x1003cd65990>, u'cb3bfc9e-8c42-4c47-8bc0-2d9acc0524b5', [u'Event', u'echo', u'3615987a-73bf-4393-9f0d-e98b9a3a67a3', {u'class': u'controller'}, None, {u'message': u'', u'cmd_id': u'cb3bfc9e-8c42-4c47-8bc0-2d9acc0524b5', u'rc': 0}])) [ 70.825613] beah-fwd-backend[4754]: 2016-06-09 16:29:54,832 backend linfo: INFO BackendFactory: Connected. Address: IPv4Address(TCP, '10.16.70.81', 12432) [ 70.825867] beah-fwd-backend[4754]: 2016-06-09 16:29:54,833 backend linfo: INFO BackendFactory: Resetting reconnection delay [-- MARK -- Thu Jun 9 20:30:00 2016] [ 80.806889] beah-fwd-backend[4754]: 2016-06-09 16:30:04,813 backend stop_remote: INFO Closing RemoteBackend for ibm-p8-06-lp1.klab.eng.bos.redhat.com:12432. [ 80.807943] beah-fwd-backend[4754]: 2016-06-09 16:30:04,815 backend linfo: INFO BackendFactory: Started to connect. [ 80.808452] beah-fwd-backend[4754]: 2016-06-09 16:30:04,815 backend linfo: INFO BackendFactory: Lost connection. Reason: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly. [ 80.808738] beah-fwd-backend[4754]: ] [ 80.809899] beah-fwd-backend[4754]: 2016-06-09 16:30:04,817 backend linfo: INFO BackendFactory: Connected. Address: IPv4Address(TCP, '10.16.160.32', 12432) [ 80.810187] beah-fwd-backend[4754]: 2016-06-09 16:30:04,817 backend linfo: INFO BackendFactory: Resetting reconnection delay [ 80.810487] beah-srv[4756]: 2016-06-09 16:30:04,817 beah buildProtocol: INFO BackendListener: Connected. Address: IPv6Address(TCP, '::ffff:10.16.160.32', 55612) [ 80.811707] beah-fwd-backend[4754]: 2016-06-09 16:30:04,818 backend handle_evt_variable_get: ERROR Remote end send echo without any answer! (diagnostic=('echo', True, <beah.backends.forwarder._RemoteBackend object at 0x1003cd6f350>, u'985f7b5e-397f-4c66-8c31-72fa301743a6', [u'Event', u'echo', u'a885042b-d8e7-4ace-8446-3289a278e22c', {u'class': u'controller'}, None, {u'message': u'', u'cmd_id': u'985f7b5e-397f-4c66-8c31-72fa301743a6', u'rc': 0}])) [ 90.812340] beah-fwd-backend[4754]: 2016-06-09 16:30:14,819 backend stop_remote: INFO Closing RemoteBackend for ibm-p8-06-lp1.klab.eng.bos.redhat.com:12432. ---<-snip->--- Expected results: successful completion of automates test run Additional info:
Also saw this here, in our self-tests: https://beaker-devel.app.eng.bos.redhat.com/jobs/12347#set16386 It is definitely *some* kind of network problem between the hosts but I am not sure what exactly. I think the problematic recipe was on hp-dl160gen9-04.rhts.eng.bos.redhat.com and I noticed that it has two NICs so I should double-check what addresses it is getting.
(In reply to PaulB from comment #0) > Actual results: > https://beaker.engineering.redhat.com/recipes/2783924 > http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2016/06/13649/ > 1364939/2783924/console.log In this case the problematic host also had multiple NICs cabled up to the same network segment. And the address it is *supposed* to be using according to DNS (10.16.160.32) has a higher metric than one of the others: 10.16.160.0/21 dev eth2 proto kernel scope link src 10.16.167.44 metric 100 10.16.160.0/21 dev eth0 proto kernel scope link src 10.16.160.32 metric 101 10.16.160.0/21 dev eth1 proto kernel scope link src 10.16.167.45 metric 102 So I think that means beah-srv would be connecting off to the other hosts in the recipe set using a different address (10.16.167.44) than what they are expecting (10.16.167.32). I wonder if beah has some source address checks or something?
Okay, I think I have spotted it... The problem is not the multiple interfaces themselves, but if the system gets the wrong hostname. For example because it did a reverse lookup on the first IP address (which is not the right one). So in R:2783924, you can see: [ 0.578855] systemd[1]: Set hostname to <dhcp167-44.klab.eng.bos.redhat.com>. which is the hostname matching the random DHCP pool address that eth2 got (10.16.167.44) instead of the "proper" hostname which is on eth0 (10.16.160.32). Similarly in my reproducer on beaker-devel, it is because hp-dl160gen9-04.rhts.eng.bos.redhat.com comes up with multiple addresses and picks a different hostname for itself based on the secondary NIC. I'm still not sure why exactly beah *cares* about the hostname (since the real hostname still resolves to an address on the machine) but I do see that beah is passing around hostnames in various parts of its JSON messages so I guess there is something matching happening somewhere.
The only hostname stuff that I can find in beah that looks relevant is the _localhosts_aggressive function. And indeed it seems that it does not consider the possibility of multiple NICs with different addresses having different hostnames, which is the exact situation we have here.
There is also some non-determinism here. I can't reproduce this 100% of the time with the recipe in comment 1. It seems (at least on Fedora 25 Cloud if not other distros) both NICs on the system are brought up by NetworkManager at the same moment and sent DHCP requests. It seems like sometimes one fails due to never receiving a DHCPOFFER (separate issue?) but whichever one wins and comes up first will determine the system hostname. So the hostname may be correct or incorrect depending on which NIC wins. This would be related to a recent Anaconda change where it no longer configures the DHCP hostname from the installation as the hostname after installation -- it leaves it unset so that the hostname will be determined from DHCP again on every bootup: bug 1290858.
It reproduced in this job: https://beaker-devel.app.eng.bos.redhat.com/jobs/12365 although on hp-dl120gen9-06.rhts.eng.bos.redhat.com this time, not hp-dl160gen9-04.rhts.eng.bos.redhat.com. (They both have two cabled NICs and therefore a chance of reproducing this.) This beah patch adds some logging which helps explain it: https://gerrit.beaker-project.org/5706 In R:18940 console.log we see: [ 202.961678] beah-srv[1177]: 2017-05-30 01:58:37,107 beah _localhosts_aggressive: DEBUG Determined localhost addresses are ['localhost', '::1', '10.16.70.189', 'dhcp70-189.rhts.eng.bos.redhat.com', 'localhost.localdomain', 'localhost6', 'localhost4.localdomain4', 'localhost4', '127.0.0.1', 'localhost6.localdomain6'] ... [ 225.929773] beah-srv[1177]: 2017-05-30 01:58:45,002 beah _set_local: DEBUG Host hp-dl120gen9-06.rhts.eng.bos.redhat.com is not local Compare this to the happy case: https://beaker-devel.app.eng.bos.redhat.com/jobs/12364 where we see: [ 110.599383] beah-srv[1169]: 2017-05-29 23:37:01,624 beah _localhosts_aggressive: DEBUG Determined localhost addresses are ['10.16.65.164', 'localhost', '::1', 'localhost6', 'localhost.localdomain', 'hp-dl120gen9-06.rhts.eng.bos.redhat.com', 'localhost4.localdomain4', 'localhost4', '127.0.0.1', 'localhost6.localdomain6'] so beah knows to answer events for hp-dl120gen9-06.rhts.eng.bos.redhat.com. This probably means we need to expand the _localhosts_aggressive routine to find addresses for all active interfaces.
(In reply to Dan Callaghan from comment #6) > This probably means we need to expand the _localhosts_aggressive routine to > find addresses for all active interfaces. I did that in this patch: https://gerrit.beaker-project.org/5707 and indeed it does now find a more complete set of addresses for the local system: [ 206.177607] beah-srv[1193]: 2017-05-30 03:40:39,255 beah _localhosts_aggressive: DEBUG Determined localhost addresses are ['fe80::d2ed:6cd4:ed18:7361', '2620:52:0:1040:5a20:b1ff:fe06:7584', '10.16.70.255', 'fe80::5a20:b1ff:fe06:7584', 'fe80::d2ed:6cd4:ed18:7361%eno2', 'localhost', '::1', 'localhost4', 'fe80::5a20:b1ff:fe06:7584%eno1', 'localhost6.localdomain6', '127.0.0.1', 'localhost4.localdomain4', 'localhost6', 'dhcp70-255.rhts.eng.bos.redhat.com', '2620:52:0:1040:3b21:52c8:5936:8d8c', 'localhost.localdomain'] [ 206.177825] beah-srv[1193]: 2017-05-30 03:40:39,257 beah _set_local: DEBUG Host dhcp70-255.rhts.eng.bos.redhat.com is local But it doesn't actually help. If the system hostname is wrong (because it was assigned based on the wrong interface coming up first) then multi-host recipes will still break because they will be expecting $HOSTNAME to actually match what the other systems in the recipe set have. 2017-05-30 03:40:38,545 rhts_task xmlrpc_set: DEBUG XMLRPC: sync.set('16406', '9', 'http://[::1]:7082/cgi-bin/rhts/scheduler_xmlrpc.cgi', 'dhcp70-255.rhts.eng.bos.redhat.com', 'DONE') So we are basically just back to the same old problem -- if the system hostname is not correct then everything will break.
I wonder if we should just go all out, and forcibly set the proper hostname in the kickstart, using "network --hostname=example.com". To avoid all these issues with Anaconda/NetworkManager/dhclient not setting the hostname that we are expecting.
(In reply to Dan Callaghan from comment #8) We can try it for Beaker 25.0. https://gerrit.beaker-project.org/5859
In some ways, this: https://bugzilla.redhat.com/show_bug.cgi?id=969962#c4 would be a better/more complete fix for this problem. So maybe we should do that too.
I now scheduled three multi host jobs with the fixes which were failing in our self tests. All with a pass now: https://beaker-devel.app.eng.bos.redhat.com/jobs/13878 https://beaker-devel.app.eng.bos.redhat.com/jobs/13872 https://beaker-devel.app.eng.bos.redhat.com/jobs/13866 I made sure that I used a RHEL version >= 7.3. Setting this to VERIFIED. I think we'll have one last check running our self tests when releasing Beaker 25.
Beaker 25.0 has been released. Release notes are available upstream: https://beaker-project.org/docs/whats-new/release-25.html