Description of problem:

Overcloud deployment on bare metal is failing. Looking into the logs of the failed compute node (after rebooting it, because it was unreachable) we see:

Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] Using config file at: /etc/os-net-config/config.json
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] Ifcfg net config provider created.
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] nic1 mapped to: em2
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] adding interface: em2
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [INFO] adding bridge: br-ex
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: [2017/01/26 11:46:59 AM] [ERROR] Unable to read mac address: nic2
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: Traceback (most recent call last):
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/bin/os-net-config", line 10, in <module>
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     sys.exit(main())
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 185, in main
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     provider.add_object(obj)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 57, in add_object
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     self.add_bridge(obj)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 436, in add_bridge
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     data = self._add_common(bridge)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 192, in _add_common
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     mac = utils.interface_mac(base_opt.primary_interface_name)
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/os_net_config/utils.py", line 81, in interface_mac
Jan 26 11:46:59 overcloud-compute-0 os-collect-config:     with open('/sys/class/net/%s/address' % name, 'r') as f:
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: IOError: [Errno 2] No such file or directory: '/sys/class/net/nic2/address'
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + RETVAL=1
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + [[ 1 == 2 ]]
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + [[ 1 != 0 ]]
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: + echo 'ERROR: os-net-config configuration failed.'
Jan 26 11:46:59 overcloud-compute-0 os-collect-config: ERROR: os-net-config configuration failed.

So it is looking for /sys/class/net/nic2/address, which does not exist.

Version-Release number of selected component (if applicable):
OSP 10, puddle 2017-01-25.1

How reproducible:
This seems to be a race condition.

Steps to Reproduce:
Just try to deploy the overcloud.
Actual results:

2017-01-26 16:28:51Z [overcloud.Compute.0]: CREATE_FAILED CREATE aborted
2017-01-26 16:28:51Z [overcloud.Compute]: CREATE_FAILED Resource CREATE failed: Operation cancelled

Stack overcloud CREATE_FAILED

Expected results:

Stack overcloud CREATE_COMPLETE
sosreport and /etc/os-net-config directory available here: http://file.rdu.redhat.com/~rscarazz/BZ1417103/
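To make the failure mode concrete: per the traceback above, os-net-config resolves an interface name to a MAC by reading sysfs. The following is a minimal sketch of that call (the function name is taken from the traceback; the body is a paraphrase for illustration, not the exact os_net_config source):

def interface_mac(name):
    # os-net-config is expected to have translated an alias such as
    # "nic2" into a real device name before this point; if the alias
    # leaks through unmapped, this sysfs path does not exist.
    with open('/sys/class/net/%s/address' % name, 'r') as f:
        return f.read().rstrip()

# On the failed node, "em2" exists under /sys/class/net, while "nic2"
# is only an alias, so the second lookup raises:
# IOError: [Errno 2] No such file or directory: '/sys/class/net/nic2/address'
for name in ('em2', 'nic2'):
    try:
        print('%s -> %s' % (name, interface_mac(name)))
    except (IOError, OSError) as err:
        print('%s -> %s' % (name, err))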
I hit this issue again while testing the OSP 11 puddle 2017-05-09.2. This was a composable deploy with 9 controller nodes; the one that failed was galera-0:

2017-05-23 20:39:35Z [overcloud.Galera]: CREATE_FAILED CREATE aborted
2017-05-23 20:39:35Z [overcloud]: CREATE_FAILED Create timed out
2017-05-23 20:39:36Z [overcloud.Galera.0]: CREATE_FAILED CREATE aborted
2017-05-23 20:39:36Z [overcloud.Galera]: CREATE_FAILED Resource CREATE failed: Operation cancelled

The error is much the same:

May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 184, in main
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     obj = objects.object_from_json(iface_json)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 42, in object_from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     return OvsBridge.from_json(json)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 448, in from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     json, include_primary=False)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 258, in base_opts_from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     addresses.append(Address.from_json(address))
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 175, in from_json
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     return Address(ip_netmask)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/os_net_config/objects.py", line 166, in __init__
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     ip_nw = netaddr.IPNetwork(self.ip_netmask)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:   File "/usr/lib/python2.7/site-packages/netaddr/ip/__init__.py", line 933, in __init__
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]:     raise AddrFormatError('invalid IPNetwork %s' % addr)
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: netaddr.core.AddrFormatError: invalid IPNetwork /24
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + RETVAL=1
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + [[ 1 == 2 ]]
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + [[ 1 != 0 ]]
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: + echo 'ERROR: os-net-config configuration failed.'
May 24 04:12:34 galera-0.localdomain os-collect-config[3018]: ERROR: os-net-config configuration failed.

Since the environment I used is dedicated to continuous deployment, in which the same overcloud gets deployed and uninstalled repeatedly, all the IPNetwork definitions are known good, tested, and working. All the sosreports for this new environment can be found here: http://file.rdu.redhat.com/~rscarazz/BZ1417103/osp11-composable-deploy/ - the one for the host in error is sosreport-galera-0.localdomain-20170524043640.tar.xz.
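For reference, the "invalid IPNetwork /24" error reproduces trivially outside the deployment, which shows it is purely an input problem (an empty IP address with a surviving prefix length) rather than anything environmental. A quick check with the same netaddr library the traceback goes through:

import netaddr

for ip_netmask in ('172.20.5.19/24', '/24'):
    try:
        # A well-formed value parses into a network object.
        print('%s -> %s' % (ip_netmask, netaddr.IPNetwork(ip_netmask)))
    except netaddr.core.AddrFormatError as err:
        # A bare prefix raises: invalid IPNetwork /24
        print('%s -> %s' % (ip_netmask, err))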
Hi Raoul,

Looking at the two failures, the 2nd one is quite different and is most likely due to IP address exhaustion because of the way heat does indexing. When deleting nodes and then re-adding them, the indices used for IP addresses monotonically increase and are not reused. If you don't add more addresses to the pool, they will run out. See discussions about this on the mailing list:

http://post-office.corp.redhat.com/archives/rhos-prio-list/2017-June/msg00014.html

and:

http://post-office.corp.redhat.com/archives/rhos-tech/2017-September/msg00072.html (this one has the same error message "netaddr.core.AddrFormatError: invalid IPNetwork /24")

Now, getting back to the original problem of "No such file or directory: '/sys/class/net/nic2/address'", we have not seen that before. The only other thing that would be useful beyond the supplied sosreports is the network environment and nic config files that you used in the deployment.

Are you still seeing this problem? Thanks.
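To illustrate the exhaustion mechanism described above (a toy model for illustration, not Heat's actual allocation code): each node takes the pool entry at its resource index, and indices are never reclaimed after a delete, so repeated delete/re-add cycles walk off the end of a fixed-size pool and the template ends up rendering an empty IP:

pool = ['172.20.5.%d' % host for host in range(10, 14)]  # 4 addresses

def ip_for_index(index):
    # Past the end of the pool the rendered value is empty, so
    # os-net-config receives ip_netmask "/24" and netaddr raises
    # AddrFormatError.
    return pool[index] if index < len(pool) else ''

next_index = 0
for cycle in range(3):        # deploy, delete, re-deploy, ...
    for node in range(2):     # two nodes per deployment
        print('index %d -> %s/24' % (next_index, ip_for_index(next_index)))
        next_index += 1       # never reclaimed after a delete; the
                              # fifth allocation prints the bare "/24"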
From the logs, the error message "No such file or directory: '/sys/class/net/nic2/address'" occurs because nic2 isn't being mapped to a valid interface name, so os-net-config attempts to use the alias (nic2), which has no entry in /sys/class/net.

It looks like os-net-config runs first and maps the interfaces correctly:

Jan 26 10:19:48 host-192-168-24-13 os-collect-config: [2017/01/26 10:19:48 AM] [INFO] nic1 mapped to: em1
Jan 26 10:19:48 host-192-168-24-13 os-collect-config: [2017/01/26 10:19:48 AM] [INFO] nic2 mapped to: em2

and os-net-config successfully configures the interfaces. However, after coming up, em1 is seen going down:

Jan 26 10:20:08 host-192-168-24-13 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): em1: link becomes ready
Jan 26 10:20:08 host-192-168-24-13 NetworkManager[1309]: <info> [1485444008.4400] device (em1): link connected
Jan 26 10:20:09 host-192-168-24-13 kernel: tg3 0000:02:00.0 em1: Link is down

os-net-config is run again shortly afterwards, and since it doesn't detect em1 as up it assigns nic1 to em2:

Jan 26 10:20:26 host-192-168-24-13 os-collect-config: [2017/01/26 10:20:26 AM] [INFO] nic1 mapped to: em2

nic2 is therefore not mapped, resulting in this error and the stack trace above:

Jan 26 10:20:26 host-192-168-24-13 os-collect-config: [2017/01/26 10:20:26 AM] [ERROR] Unable to read mac address: nic2

So the issue seems to be em1 going down, as logged by the kernel message at 10:20:09 above. In fact, there seem to be a few other times em1 is detected down:

Jan 26 10:18:52 host-192-168-24-13 kernel: tg3 0000:02:00.0 em1: Link is down
Jan 26 10:20:09 host-192-168-24-13 kernel: tg3 0000:02:00.0 em1: Link is down
Jan 26 11:45:40 overcloud-compute-0 kernel: tg3 0000:02:00.0 em1: Link is down
Jan 26 11:46:44 overcloud-compute-0 kernel: tg3 0000:02:00.0 em1: Link is down

Not sure what is going on with interface em1; perhaps an auto-negotiation issue of some sort.
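A simplified model of the nicN assignment described above (illustrative only, not os-net-config's exact ordering logic): only interfaces whose link is up receive an alias, so a transiently-down em1 shifts the whole mapping by one and leaves nic2 unmapped:

def map_nics(interfaces):
    """interfaces: dict of device name -> link-is-up flag."""
    # Order the interfaces that currently have link, then number the
    # aliases nic1, nic2, ... in that order.
    active = sorted(name for name, up in interfaces.items() if up)
    return {'nic%d' % (i + 1): name for i, name in enumerate(active)}

print(map_nics({'em1': True,  'em2': True}))   # {'nic1': 'em1', 'nic2': 'em2'}
print(map_nics({'em1': False, 'em2': True}))   # {'nic1': 'em2'} -- nic2 is gone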
(In reply to Bob Fournier from comment #3)
> Hi Raoul,
>
> Looking at the two failures, the 2nd one is quite different and is most
> likely due to IP address exhaustion because of the way heat does indexing.
> When deleting nodes and then re-adding them the indices used for IP
> addresses monotonically increase and are not reused. If you don't add more
> addresses to the pool they will run out. See discussions about this on the
> mailing list:
> http://post-office.corp.redhat.com/archives/rhos-prio-list/2017-June/
> msg00014.html
>
> and:
> http://post-office.corp.redhat.com/archives/rhos-tech/2017-September/
> msg00072.html (this one has the same error message
> "netaddr.core.AddrFormatError: invalid IPNetwork /24")
>
> Now, getting back to the original problem of "No such file or directory:
> '/sys/class/net/nic2/address'", we have not seen that before. The only other
> thing that would be useful beyond the supplied sosreports is the network
> environment and nic config files that you used in the deployment.
>
> Are you still seeing this problem? Thanks.

Hi Bob,

I have the same problem as Raoul. We used 3 controller nodes and wanted to replace one controller node with another; the new controller fails with the same error message ("netaddr.core.AddrFormatError: invalid IPNetwork /24"). We also reused the controller IPs.

I wanted to read the archives at http://post-office.corp.redhat.com/archives/rhos-tech/2017-September/, but I cannot access them. Could you explain what the issue is? Thank you :)
Bob, I have been affected by this issue on an RHOSP 10 build with 3 controllers and 3 computes. The build fails on one of the controllers. The symptom is that /etc/os-net-config/config.json has the first IP address set to [{'ip_netmask': '/24'}], which generates the error:

netaddr.core.AddrFormatError: invalid IPNetwork /24

On the other 2 controllers the IP address is set correctly, e.g. [{'ip_netmask': '172.20.5.19/24'}].

This is with os-net-config 5.2.0.3.el7ost. I also cannot access the post-office links. Could you suggest a way to resolve this? Thank you.
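A quick way to spot this symptom on an affected node is to walk the generated config for addresses missing the IP part. This sketch assumes the usual network_config layout (a top-level list of objects, interfaces nested under "members", addresses under "addresses") as seen in the attached /etc/os-net-config/config.json:

import json

def find_empty_ips(objects, path='network_config'):
    for index, obj in enumerate(objects):
        here = '%s[%d]' % (path, index)
        for address in obj.get('addresses', []):
            value = address.get('ip_netmask', '')
            if value.startswith('/'):
                # The IP part was rendered empty, e.g. "/24".
                print('%s: bad ip_netmask %r' % (here, value))
        # Bridges and bonds nest their interfaces under "members".
        find_empty_ips(obj.get('members', []), here + '.members')

with open('/etc/os-net-config/config.json') as f:
    find_empty_ips(json.load(f)['network_config'])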
Miro - were you also deleting and then adding a controller when this problem occurred? Can you provide logs from the controller, specifically journalctl, neutron, and heat, as well as any logs that indicate the error? Thank you.
Also, can you indicate the template files you were using for the deployment? Specifically, are you using ips-from-pool-all.yaml, and are you replacing failed controller nodes? There is an article here describing how to replace a failed node and avoid the issue with "invalid IPNetwork /24" - https://access.redhat.com/solutions/2992741.
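When using ips-from-pool-all.yaml, one way to avoid tripping over stale indices after a node replacement is to confirm the pool is longer than the highest resource index still in use. A rough check (the ControllerIPs parameter and file layout are assumed from the standard tripleo-heat-templates environment file; adjust for your roles):

import yaml

# e.g. the highest Controller resource index reported by Heat,
# including indices left behind by previously deleted nodes.
HIGHEST_INDEX_IN_USE = 3

with open('ips-from-pool-all.yaml') as f:
    env = yaml.safe_load(f)

for network, ips in env['parameter_defaults']['ControllerIPs'].items():
    if len(ips) <= HIGHEST_INDEX_IN_USE:
        print('%s: only %d entries; index %d would render an empty IP'
              % (network, len(ips), HIGHEST_INDEX_IN_USE))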
As it's been 4 months with no response to the requested logs or template files, closing this out. Please reopen with logs/templates if the problem recurs.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days