Description of problem:
During a deployment over IPv6, the IPA agent sends ironic a "callback_url", which ironic then uses to communicate with the agent. In the latest version the agent appears to be picking up the IPv6 link-local address instead of the routable IPv6 address.
This results in ironic failing to contact the IPA agent:
2017-11-16 02:58:08.463 22632 ERROR ironic.drivers.modules.agent_base_vendor IronicException: Error invoking agent command iscsi.start_iscsi_target for node e8b17346-e8b4-4fa2-920e-d50dda81282e. Error: HTTPConnectionPool(host='fe80::f816:3eff:fe4e:27f7', port=9999): Max retries exceeded with url: /v1/commands?wait=true (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x12599110>: Failed to establish a new connection: [Errno 22] EINVAL',))
Version-Release number of selected component (if applicable):
tcpdump shows the wrong IP being sent to ironic
POST /v1/heartbeat/e0c67420-4f66-4571-9db3-1190e0370a7b HTTP/1.1
Accept-Encoding: gzip, deflate
I suspect something has changed with the timing. The IPA agent used to select the SLAAC IP address; perhaps it is now running earlier in the boot process, before the SLAAC address is assigned.
Running some commands on a running IPA agent shows how this can happen if the fd00:1101:... address hasn't been assigned to eth0 yet:
$ ip -o addr
1: lo inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
1: lo inet6 ::1/128 scope host \ valid_lft forever preferred_lft forever
2: eth0 inet6 fd00:1101::f816:3eff:fe4e:27f7/64 scope global mngtmpaddr dynamic \ valid_lft 86379sec preferred_lft 14379sec
2: eth0 inet6 fe80::f816:3eff:fe4e:27f7/64 scope link \ valid_lft forever preferred_lft forever
$ ip route get fd00:1101::1
fd00:1101::1 dev eth0 proto kernel src fd00:1101::f816:3eff:fe4e:27f7 metric 256
# Removing the fd00:1101:.. address shows the link-local address being returned (this is the command that's run by IPA to find the IP address to use)
# see ironic_python_agent/agent.py _get_route_source()
$ ip addr del fd00:1101::f816:3eff:fe4e:27f7/64 dev eth0
$ ip route get fd00:1101::1
fd00:1101::1 dev eth0 proto kernel src fe80::f816:3eff:fe4e:27f7 metric 256
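The check that distinguishes the two outputs above can be sketched in Python. This is only an illustration, not the code in ironic_python_agent/agent.py: the helper names route_source and is_link_local are hypothetical, and the regular expression assumes the "src <address>" field appears as in the output shown here.

```python
import ipaddress
import re


def route_source(ip_route_output):
    """Return the 'src' address from `ip route get` output, or None.

    Hypothetical helper: assumes the output contains 'src <address>'.
    """
    match = re.search(r"\bsrc\s+(\S+)", ip_route_output)
    return match.group(1) if match else None


def is_link_local(address):
    """True if the address is link-local (fe80::/10 for IPv6)."""
    # Drop a zone index such as '%eth0' before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_link_local


good = "fd00:1101::1 dev eth0 proto kernel src fd00:1101::f816:3eff:fe4e:27f7 metric 256"
bad = "fd00:1101::1 dev eth0 proto kernel src fe80::f816:3eff:fe4e:27f7 metric 256"

print(is_link_local(route_source(good)))
print(is_link_local(route_source(bad)))
```

With the two outputs captured above, the first source is a routable address and the second is link-local, so only the first is usable as a callback address for ironic.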
As a test I've inserted a 5 second sleep into the IPA agent code, and deployment now works. So I think the agent is running this command before the interface gets its SLAAC address; in previous versions it must have run after.
Fix proposed upstream
Fix merged downstream - https://code.engineering.redhat.com/gerrit/#/c/123977/
Tested and fixed
Got a successful IPv6 deploy with the following in the logs
Nov 23 09:37:58 localhost.localdomain ironic-python-agent: 2017-11-23 09:37:57.883 479 INFO ironic_python_agent.agent [-] Ignoring link-local source to fd00:1101::0001: fd00:1101::1 dev eth0 proto kernel src fe80::f816:3eff:fed8:b16e metric 256
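For readers who want to see the general shape of such a workaround, here is a minimal sketch of ignoring a link-local source and retrying until a routable one appears. It is not the merged patch. The function name wait_for_routable_source, the lookup callable, and the attempts/delay defaults are all assumptions made for illustration.

```python
import ipaddress
import time


def wait_for_routable_source(lookup, attempts=6, delay=5.0, sleep=time.sleep):
    """Poll lookup() until it returns a non-link-local address.

    lookup is a hypothetical callable returning the current route source
    address as a string (or None). Returns the address, or None if every
    attempt yielded a link-local or missing source.
    """
    for attempt in range(attempts):
        source = lookup()
        if source and not ipaddress.ip_address(source).is_link_local:
            return source
        # Link-local or missing source: SLAAC may not have completed yet.
        if attempt < attempts - 1:
            sleep(delay)
    return None


# Simulated lookups: link-local twice, then the SLAAC address shows up.
answers = iter([
    "fe80::f816:3eff:fed8:b16e",
    "fe80::f816:3eff:fed8:b16e",
    "fd00:1101::f816:3eff:fed8:b16e",
])
result = wait_for_routable_source(lambda: next(answers), sleep=lambda _: None)
print(result)
```

Polling rather than a fixed sleep means the agent proceeds as soon as the SLAAC address is available, and it gives up cleanly if the address never arrives.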
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.