Description of problem:

Overcloud deployment on bare metal is failing. Looking into the logs of the failed compute node (after rebooting it, because it was unreachable) we see:

Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] Using config file at: /etc/os-net-config/config.json
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] Ifcfg net config provider created.
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] nic1 mapped to: em2
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] adding interface: em2
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] adding bridge: br-ex
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [ERROR] Unable to read mac address: nic2
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: Traceback (most recent call last):
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/bin/os-net-config", line 10, in <module>
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     sys.exit(main())
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 185, in main
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     provider.add_object(obj)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 57, in add_object
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     self.add_bridge(obj)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 436, in add_bridge
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     data = self._add_common(bridge)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 192, in _add_common
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     mac = utils.interface_mac(base_opt.primary_interface_name)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/utils.py", line 81, in interface_mac
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     with open('/sys/class/net/%s/address' % name, 'r') as f:
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: IOError: [Errno 2] No such file or directory: '/sys/class/net/nic2/address'
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + RETVAL=1
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + [[ 1 == 2 ]]
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + [[ 1 != 0 ]]
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + echo 'ERROR: os-net-config configuration failed.'
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: ERROR: os-net-config configuration failed.

So it is looking for /sys/class/net/nic2/address, which does not exist.

Version-Release number of selected component (if applicable):
OSP 10, puddle 2017-01-25.1

How reproducible:
This seems to be a race condition.

Steps to Reproduce:
Just try to deploy the overcloud.
Actual results:

2017-01-26 16:28:51Z [overcloud.Compute.0]: CREATE_FAILED CREATE aborted
2017-01-26 16:28:51Z [overcloud.Compute]: CREATE_FAILED Resource CREATE failed: Operation cancelled

Stack overcloud CREATE_FAILED

Expected results:

Stack overcloud CREATE_COMPLETE
sosreport and /etc/os-net-config directory available here: http://file.rdu.redhat.com/~rscarazz/BZ1417103/
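To make the failure mode concrete: per the traceback above, os-net-config resolves an interface name to a MAC by reading sysfs. The following is a minimal sketch of that call (the function name is taken from the traceback; the body is a paraphrase for illustration, not the exact os_net_config source):

def interface_mac(name):
    # os-net-config is expected to have translated an alias such as
    # "nic2" into a real device name before this point; if the alias
    # leaks through unmapped, this sysfs path does not exist.
    with open('/sys/class/net/%s/address' % name, 'r') as f:
        return f.read().rstrip()

# On the failed node, "em2" exists under /sys/class/net, while "nic2"
# is only an alias, so the second lookup raises:
# IOError: [Errno 2] No such file or directory: '/sys/class/net/nic2/address'
for name in ('em2', 'nic2'):
    try:
        print('%s -> %s' % (name, interface_mac(name)))
    except (IOError, OSError) as err:
        print('%s -> %s' % (name, err))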
I hit this issue again while testing the OSP 11 puddle 2017-05-09.2. This was a composable deploy with 9 controller nodes; the one that failed was galera-0:

2017-05-23 20:39:35Z [overcloud.Galera]: CREATE_FAILED CREATE aborted
2017-05-23 20:39:35Z [overcloud]: CREATE_FAILED Create timed out
2017-05-23 20:39:36Z [overcloud.Galera.0]: CREATE_FAILED CREATE aborted
2017-05-23 20:39:36Z [overcloud.Galera]: CREATE_FAILED Resource CREATE failed: Operation cancelled

The error is much the same:

May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 184, in main
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     obj = objects.object_from_json(iface_json)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 42, in object_from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     return OvsBridge.from_json(json)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 448, in from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     json, include_primary=False)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 258, in base_opts_from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     addresses.append(Address.from_json(address))
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 175, in from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     return Address(ip_netmask)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 166, in __init__
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     ip_nw = netaddr.IPNetwork(self.ip_netmask)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/netaddr/ip/__init__.py", line 933, in __init__
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     raise AddrFormatError('invalid IPNetwork %s' % addr)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: netaddr.core.AddrFormatError: invalid IPNetwork /24
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + RETVAL=1
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + [[ 1 == 2 ]]
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + [[ 1 != 0 ]]
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + echo 'ERROR: os-net-config configuration failed.'
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: ERROR: os-net-config configuration failed.

Since the environment I used is dedicated to continuous deployment, in which the same overcloud gets deployed and uninstalled repeatedly, all the IPNetwork definitions are known good, tested, and working. All the sosreports for this new environment can be found here: http://file.rdu.redhat.com/~rscarazz/BZ1417103/osp11-composable-deploy/ - the one for the host in error is sosreport-galera-0.localdomain-20170524043640.tar.xz.
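For reference, the "invalid IPNetwork /24" error reproduces trivially outside the deployment, which shows it is purely an input problem (an empty IP address with a surviving prefix length) rather than anything environmental. A quick check with the same netaddr library the traceback goes through:

import netaddr

for ip_netmask in ('172.20.5.19/24', '/24'):
    try:
        # A well-formed value parses into a network object.
        print('%s -> %s' % (ip_netmask, netaddr.IPNetwork(ip_netmask)))
    except netaddr.core.AddrFormatError as err:
        # A bare prefix raises: invalid IPNetwork /24
        print('%s -> %s' % (ip_netmask, err))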
Hi Raoul,

Looking at the two failures, the 2nd one is quite different and is most likely due to IP address exhaustion because of the way heat does indexing. When deleting nodes and then re-adding them, the indices used for IP addresses monotonically increase and are not reused. If you don't add more addresses to the pool, they will run out. See discussions about this on the mailing list:

http://post-office.corp.redhat.com/archives/rhos-prio-list/2017-June/msg00014.html

and:

http://post-office.corp.redhat.com/archives/rhos-tech/2017-September/msg00072.html (this one has the same error message "netaddr.core.AddrFormatError: invalid IPNetwork /24")

Now, getting back to the original problem of "No such file or directory: '/sys/class/net/nic2/address'", we have not seen that before. The only other thing that would be useful beyond the supplied sosreports is the network environment and nic config files that you used in the deployment.

Are you still seeing this problem? Thanks.
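To illustrate the exhaustion mechanism described above (a toy model for illustration, not Heat's actual allocation code): each node takes the pool entry at its resource index, and indices are never reclaimed after a delete, so repeated delete/re-add cycles walk off the end of a fixed-size pool and the template ends up rendering an empty IP:

pool = ['172.20.5.%d' % host for host in range(10, 14)]  # 4 addresses

def ip_for_index(index):
    # Past the end of the pool the rendered value is empty, so
    # os-net-config receives ip_netmask "/24" and netaddr raises
    # AddrFormatError.
    return pool[index] if index < len(pool) else ''

next_index = 0
for cycle in range(3):        # deploy, delete, re-deploy, ...
    for node in range(2):     # two nodes per deployment
        print('index %d -> %s/24' % (next_index, ip_for_index(next_index)))
        next_index += 1       # never reclaimed after a delete; the
                              # fifth allocation prints the bare "/24"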
From the logs, the error message "No such file or directory: '/sys/class/net/nic2/address'" occurs because nic2 isn't being mapped to a valid interface name, so os-net-config attempts to use the alias (nic2), which has no entry in /sys/class/net.

It looks like os-net-config runs first and maps the interfaces correctly:

Jan 26 10:19:48 host-192-168-24-13 os-collect-config: [2017/01/26 10:19:48 AM] [INFO] nic1 mapped to: em1
Jan 26 10:19:48 host-192-168-24-13 os-collect-config: [2017/01/26 10:19:48 AM] [INFO] nic2 mapped to: em2

and os-net-config successfully configures the interfaces. However, after coming up, em1 is seen going down:

Jan 26 10:20:08 host-192-168-24-13 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
Jan 26 10:20:08 host-192-168-24-13 NetworkManager[1309]: <info> [1485444008.4400] device (em1): link connected
Jan 26 10:20:09 host-192-168-24-13 kernel: tg3 0000:02:00.0 em1: Link is down

os-net-config is run again shortly afterwards, and since it doesn't detect em1 as up it assigns nic1 to em2:

Jan 26 10:20:26 host-192-168-24-13 os-collect-config: [2017/01/26 10:20:26 AM] [INFO] nic1 mapped to: em2

nic2 is therefore not mapped, resulting in this error and the stack trace above:

Jan 26 10:20:26 host-192-168-24-13 os-collect-config: [2017/01/26 10:20:26 AM] [ERROR] Unable to read mac address: nic2

So the issue seems to be em1 going down, as logged by the kernel message at 10:20:09 above. In fact, there seem to be a few other times em1 is detected down:

Jan 26 10:18:52 host-192-168-24-13 kernel: tg3 0000:02:00.0 em1: Link is down
Jan 26 10:20:09 host-192-168-24-13 kernel: tg3 0000:02:00.0 em1: Link is down
Jan 26 11:45:40 overcloud-compute-0 kernel: tg3 0000:02:00.0 em1: Link is down
Jan 26 11:46:44 overcloud-compute-0 kernel: tg3 0000:02:00.0 em1: Link is down

Not sure what is going on with interface em1; perhaps an auto-negotiation issue of some sort.
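A simplified model of the nicN assignment described above (illustrative only, not os-net-config's exact ordering logic): only interfaces whose link is up receive an alias, so a transiently-down em1 shifts the whole mapping by one and leaves nic2 unmapped:

def map_nics(interfaces):
    """interfaces: dict of device name -> link-is-up flag."""
    # Order the interfaces that currently have link, then number the
    # aliases nic1, nic2, ... in that order.
    active = sorted(name for name, up in interfaces.items() if up)
    return {'nic%d' % (i + 1): name for i, name in enumerate(active)}

print(map_nics({'em1': True,  'em2': True}))   # {'nic1': 'em1', 'nic2': 'em2'}
print(map_nics({'em1': False, 'em2': True}))   # {'nic1': 'em2'} -- nic2 is gone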
(In reply to Bob Fournier from comment #3)
> Hi Raoul,
>
> Looking at the two failures, the 2nd one is quite different and is most
> likely due to IP address exhaustion because of the way heat does indexing.
> When deleting nodes and then re-adding them the indices used for IP
> addresses monotonically increase and are not reused. If you don't add more
> addresses to the pool they will run out. See discussions about this on the
> mailing list:
> http://post-office.corp.redhat.com/archives/rhos-prio-list/2017-June/
> msg00014.html
>
> and:
> http://post-office.corp.redhat.com/archives/rhos-tech/2017-September/
> msg00072.html (this one has the same error message
> "netaddr.core.AddrFormatError: invalid IPNetwork /24")
>
> Now, getting back to the original problem of "No such file or directory:
> '/sys/class/net/nic2/address'", we have not seen that before. The only other
> thing that would be useful beyond the supplied sosreports is the network
> environment and nic config files that you used in the deployment.
>
> Are you still seeing this problem? Thanks.

Hi Bob,

I have the same problem as Raoul. We used 3 controller nodes and wanted to replace one controller node with another; the new controller fails with the same error message ("netaddr.core.AddrFormatError: invalid IPNetwork /24"). We also reused the controller IPs.

I wanted to read the archives at http://post-office.corp.redhat.com/archives/rhos-tech/2017-September/, but I cannot access them. Could you explain what the issue is? Thank you :)
Bob, I have been affected by this issue on an RHOSP 10 build with 3 controllers and 3 computes. The build fails on one of the controllers. The symptom is that /etc/os-net-config/config.json has the first IP address set to [{'ip_netmask': '/24'}], which generates the error:

netaddr.core.AddrFormatError: invalid IPNetwork /24

On the other 2 controllers the IP address is set correctly, e.g. [{'ip_netmask': '172.20.5.19/24'}].

This is with os-net-config 5.2.0.3.el7ost. I also cannot access the post-office links. Could you suggest a way to resolve this? Thank you.
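A quick way to spot this symptom on an affected node is to walk the generated config for addresses missing the IP part. This sketch assumes the usual network_config layout (a top-level list of objects, interfaces nested under "members", addresses under "addresses") as seen in the attached /etc/os-net-config/config.json:

import json

def find_empty_ips(objects, path='network_config'):
    for index, obj in enumerate(objects):
        here = '%s[%d]' % (path, index)
        for address in obj.get('addresses', []):
            value = address.get('ip_netmask', '')
            if value.startswith('/'):
                # The IP part was rendered empty, e.g. "/24".
                print('%s: bad ip_netmask %r' % (here, value))
        # Bridges and bonds nest their interfaces under "members".
        find_empty_ips(obj.get('members', []), here + '.members')

with open('/etc/os-net-config/config.json') as f:
    find_empty_ips(json.load(f)['network_config'])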
Miro - were you also deleting and then adding a controller when this problem occurred? Can you provide logs from the controller, specifically journalctl, neutron, and heat, as well as any logs that indicate the error? Thank you.
Also, can you indicate the template files you were using for the deployment? Specifically, are you using ips-from-pool-all.yaml, and are you replacing failed controller nodes? There is an article here describing how to replace a failed node and avoid the issue with "invalid IPNetwork /24" - https://access.redhat.com/solutions/2992741.
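When using ips-from-pool-all.yaml, one way to avoid tripping over stale indices after a node replacement is to confirm the pool is longer than the highest resource index still in use. A rough check (the ControllerIPs parameter and file layout are assumed from the standard tripleo-heat-templates environment file; adjust for your roles):

import yaml

# e.g. the highest Controller resource index reported by Heat,
# including indices left behind by previously deleted nodes.
HIGHEST_INDEX_IN_USE = 3

with open('ips-from-pool-all.yaml') as f:
    env = yaml.safe_load(f)

for network, ips in env['parameter_defaults']['ControllerIPs'].items():
    if len(ips) <= HIGHEST_INDEX_IN_USE:
        print('%s: only %d entries; index %d would render an empty IP'
              % (network, len(ips), HIGHEST_INDEX_IN_USE))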
As it's been 4 months with no response to the requested logs or template files, closing this out. Please reopen with logs/templates if the problem recurs.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days