Bug 1264226 - unable to scale beyond 35 compute nodes and 1 controller node

Product:          Red Hat OpenStack
Component:        rhosp-director
Version:          7.0 (Kilo)
Target Release:   7.0 (Kilo)
Status:           CLOSED NOTABUG
Severity:         urgent
Priority:         unspecified
Hardware:         Unspecified
OS:               Unspecified
Reporter:         bigswitch <rhosp-bugs-internal>
Assignee:         James Slagle <jslagle>
QA Contact:       yeylon <yeylon>
CC:               calfonso, dblack, jstransk, mburns, morazi, mwagner, rhel-osp-director-maint, rhosp-bugs-internal, srevivo
Doc Type:         Bug Fix
Type:             Bug
Last Closed:      2015-09-29 17:40:45 UTC
Description (bigswitch, 2015-09-18 00:01:37 UTC):

I have uploaded the sosreport and a zip of /var/log/ to https://bigswitch.box.com/s/8yc6u8ca7iq9s91vfqven9cwda1zhnag. The 50 GB root partition isn't big enough to generate the sosreport, so I had to move /var/log to the /home partition instead.

---

Did you increase the keystone token timeout setting? The error is indicative of the deployment exceeding that threshold.

Please provide the output of:

    heat resource-list -n 10 overcloud

after the scale-out failed. That will help us determine which Heat resources failed and how to look into those failures.

---

I did not increase the keystone token expiration timeout; it's set to 14400. I'll collect the heat resource-list output when attempting to scale again.

---

(In reply to chris alfonso from comment #4)
> Did you increase the keystone token timeout setting? The error is
> indicative of the deployment exceeding that threshold.

Looks like BZ 1235908. The token expiration was previously 3600 (1 hour), and that default was bumped to 14400 (4 hours) in instack-undercloud-2.1.2-9.el7ost (errata https://access.redhat.com/errata/RHEA-2015:1549, released 2015-08-05). So are we saying here that 4 hours is still too low? Is this somehow directly related to the attempted scale change from 35 to 40 compute nodes? How long did the deployment take to fail?

---

If it exceeded 4 hours, that would be why 14400 is too low. If it's not just the time to complete the deployment, I would recommend using the troubleshooting steps to find out which resource is failing and what the error is.

---

It took four hours to fail, so it matches the 14400 value. From nova list, all nodes are active and I don't see any nodes stuck in an error state, but on the overcloud only the previously deployed nodes are showing; none of the new nodes show up. I am rebuilding the system and should be able to test this again tomorrow.

---

Ok, so this legitimately looks like a keystone timeout. Increase it and see what happens.

---

Hi Chris, I am trying to scale from 35 to 40 compute nodes. Would adding five more compute nodes require more than 4 hours to complete? What value should I increase the timeout to if we are planning to deploy up to 120 nodes? Thanks.

---

Great point, I thought you were doing them all at once. It most certainly should not take that long to scale 5 nodes. Perhaps look at the os-collect-config logging on the new hosts and see what it's hanging on.

---

Hi Chris, thanks. I'll leave it at the default 14400 value for now, then, and attempt to scale up again. We are doing 10 or 20 nodes at a time. I will collect the logs when we hit this issue again.

---

Please provide the resource-list that was requested in https://bugzilla.redhat.com/show_bug.cgi?id=1264226#c5; this will help us further diagnose the problem.
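If the timeout does need raising, a minimal sketch of bumping it on the undercloud (assuming crudini is available there; the chosen value and the keystone service name may differ by release):

    # Raise the keystone token lifetime (seconds); 21600 = 6 hours.
    sudo crudini --set /etc/keystone/keystone.conf token expiration 21600
    # Restart keystone so the new lifetime takes effect.
    sudo systemctl restart openstack-keystone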
---

    Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: openstack-glance-api-clone [openstack-glance-api]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: delay-clone [delay]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: neutron-server-clone [neutron-server]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: httpd-clone [httpd]
        httpd (systemd:httpd): FAILED overcloud-controller-2 (unmanaged)
        Started: [ overcloud-controller-0 overcloud-controller-1 ]
    Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0
    Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
        Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

    Failed actions:
        httpd_monitor_0 on overcloud-controller-0 'OCF_PENDING' (196): call=180, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 13:45:05 2015', queued=0ms, exec=13ms
        neutron-server_monitor_60000 on overcloud-controller-0 'not running' (7): call=367, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:56:05 2015', queued=0ms, exec=0ms
        neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=373, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:56:09 2015', queued=0ms, exec=0ms
        httpd_monitor_0 on overcloud-controller-1 'OCF_PENDING' (196): call=180, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:21:30 2015', queued=0ms, exec=18ms
        neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=321, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:55:27 2015', queued=0ms, exec=0ms
        neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=322, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 15:02:50 2015', queued=0ms, exec=0ms
        httpd_stop_0 on overcloud-controller-2 'OCF_TIMEOUT' (198): call=211, status=Timed Out, exit-reason='none', last-rc-change='Thu Sep 24 14:31:39 2015', queued=19ms, exec=49ms

Attached is the resource list.

Created attachment 1076756 [details]: resource list
Created attachment 1077195 [details]: resource list output after the scale-out failed
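As an aside to the Pacemaker status above: once the underlying fault is fixed, the recorded failures can be cleared so later status output is readable again. A minimal sketch, with resource names taken from the output above:

    # Clear stale failed actions for the resources that reported failures.
    sudo pcs resource cleanup httpd-clone
    sudo pcs resource cleanup neutron-server-clone
    sudo pcs resource cleanup neutron-openvswitch-agent-clone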
Can you also provide this output?

    for failed_deployment in $(heat resource-list --nested-depth 5 overcloud | grep FAILED | grep 'StructuredDeployment ' | cut -d '|' -f3); do
        heat deployment-show $failed_deployment
    done

Also the output of nova list, and nova show <instance-id> on any instances not in the ACTIVE state.

---

I could not get the output of everything mentioned in comment #18; it gives nothing. So I split it up and changed the filter to SoftwareDeployment:

    [stack@dell-undercloud ~]$ heat resource-list --nested-depth 5 overcloud | grep FAILED
    | Compute               | d3da6de3-13c9-424f-8412-811bf41e4648 | OS::Heat::ResourceGroup         | UPDATE_FAILED | 2015-09-24T21:35:45Z |         |
    | 39                    | 64f6532d-d4bf-4062-bfc1-63f60f012859 | OS::TripleO::Compute            | CREATE_FAILED | 2015-09-24T21:39:47Z | Compute |
    | 35                    | cca56a74-0feb-4d1b-bbb0-72c039d51d89 | OS::TripleO::Compute            | CREATE_FAILED | 2015-09-24T21:40:06Z | Compute |
    | NetworkDeployment     | 275d0de9-e0ee-47cc-8efc-d03788f9a203 | OS::TripleO::SoftwareDeployment | CREATE_FAILED | 2015-09-24T21:40:06Z | 39      |
    | UpdateDeployment      | 0fbc2ad9-db5e-4787-90cd-8fc0eb85cc26 | OS::Heat::SoftwareDeployment    | CREATE_FAILED | 2015-09-24T21:40:06Z | 39      |
    | NovaComputeDeployment | da759b39-1ac4-42a2-9c50-99c43a67c9b3 | OS::TripleO::SoftwareDeployment | CREATE_FAILED | 2015-09-24T21:40:19Z | 35      |

    [stack@dell-undercloud ~]$ for failed_deployment in $(heat resource-list --nested-depth 5 overcloud | grep FAILED | grep 'SoftwareDeployment ' | cut -d '|' -f3); do heat deployment-show $failed_deployment; done
    {
      "status": "IN_PROGRESS",
      "server_id": "a73ec252-b74d-41ba-be73-3eb53e3a1f7f",
      "config_id": "8c21a06a-e15c-4504-907d-b082f221dcf9",
      "output_values": null,
      "creation_time": "2015-09-24T21:44:03Z",
      "input_values": {},
      "action": "CREATE",
      "status_reason": "Deploy data available",
      "id": "275d0de9-e0ee-47cc-8efc-d03788f9a203"
    }
    {
      "status": "IN_PROGRESS",
      "server_id": "a73ec252-b74d-41ba-be73-3eb53e3a1f7f",
      "config_id": "1c575806-3d96-4e4d-8b0b-c52e8da25cb7",
      "output_values": null,
      "creation_time": "2015-09-24T21:43:49Z",
      "input_values": {},
      "action": "CREATE",
      "status_reason": "Deploy data available",
      "id": "0fbc2ad9-db5e-4787-90cd-8fc0eb85cc26"
    }
    {
      "status": "IN_PROGRESS",
      "server_id": "2eb32952-24f0-4f36-8134-b6bebd13ec4c",
      "config_id": "94ebe36d-2680-4ddf-a064-71baa9fbccc1",
      "output_values": null,
      "creation_time": "2015-09-24T21:47:04Z",
      "input_values": {},
      "action": "CREATE",
      "status_reason": "Deploy data available",
      "id": "da759b39-1ac4-42a2-9c50-99c43a67c9b3"
    }

As per comment #19: we do not have any instance ID that is not in the ACTIVE state.

---

Given that the deployments are still IN_PROGRESS in the deployment-show output, but in a CREATE_FAILED state in the resource-list output, they were aborted by Heat due to a timeout. This typically means the puppet apply on the nodes hung for some reason. Can you ssh to the compute nodes that failed (it should be the 35th and 39th compute node in the nova list output) and capture the os-collect-config logs via:

    sudo journalctl -u os-collect-config

As the journal may have been rotated, you may need to look at older journal files under /var/log/journal with the --file argument. If you want to upload the entire contents of /var/log/journal from these nodes somewhere, I'll have a look.
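Reading a rotated journal directly might look like the sketch below (the machine-id directory and the archive file name are placeholders; unit filters such as -u still apply when --file is used):

    # Read os-collect-config entries out of an archived journal file.
    sudo journalctl -u os-collect-config \
        --file /var/log/journal/<machine-id>/system@<rotation-id>.journal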
It is also useful to see what puppet might be blocked on during the scale-out attempt. Run the following, then search for puppet to see what child processes it might be blocked on:

    ps axjf | less

Often, it's blocked on the openstack-nova-compute service starting. Can you also have a look in /var/log/nova/nova-compute.log to see if there are any errors there?

---

Hi James, the logs from both compute-35 and compute-39 are at https://bigswitch.box.com/s/8yc6u8ca7iq9s91vfqven9cwda1zhnag. There is no nova-compute.log on either node; the directory is empty. I think nova-compute wasn't deployed yet.

    [root@overcloud-compute-35 nova]# pwd
    /var/log/nova
    [root@overcloud-compute-35 nova]# ls -ltr
    total 0
    [root@overcloud-compute-39 heat-admin]# cd /var/log/nova/
    [root@overcloud-compute-39 nova]# ls -ltr
    total 0
    [root@overcloud-compute-39 nova]#

---

Here's the error that's causing the failure, from the os-collect-config log. From journal-39:
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: dib-run-parts Mon Sep 28 10:24:20 EDT 2015 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: ++ os-apply-config --key os_net_config --type raw --key-default ''
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: + NET_CONFIG='{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"ovs_options": "bond_mode=balance-tcp lacp=active other-config:l
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: + '[' -n '{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"ovs_options": "bond_mode=balance-tcp lacp=active other-config:lacp-
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: + os-net-config -c /etc/os-net-config/config.json -v
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Using config file at: /etc/os-net-config/config.json
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Ifcfg net config provider created.
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] nic1 mapped to: em1
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] nic2 mapped to: p1p2
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding bridge: br-ex
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding bond: bond1
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding interface: p1p2
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding interface: nic3
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding vlan: vlan3980
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding vlan: vlan3981
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] applying network configs...
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Running ovs-appctl bond/set-active-slave ('bond1', 'p1p2')
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Traceback (most recent call last):
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:   File "/usr/bin/os-net-config", line 10, in <module>
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:     sys.exit(main())
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:   File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 172, in main
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:     activate=not opts.no_activate)
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:   File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 310, in apply
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:     self.bond_primary_ifaces[bond])
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 145, in ovs_appctl
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:     self.execute(msg, '/bin/ovs-appctl', action, *parameters)
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 107, in execute
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:     processutils.execute(cmd, *args, **kwargs)
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:   File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 233, in execute
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]:     cmd=sanitized_cmd)
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Command: /bin/ovs-appctl bond/set-active-slave bond1 p1p2
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Exit code: 2
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Stdout: u''
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Stderr: u'no such bond\novs-appctl: ovs-vswitchd: server returned an error\n'
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015-09-28 10:24:20,475] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015-09-28 10:24:20,475] (os-refresh-config) [ERROR] Aborting...
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: 2015-09-28 10:24:20.478 8457 ERROR os-collect-config [-] Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit st
    Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: 2015-09-28 10:24:20.479 8457 WARNING os-collect-config [-] Sleeping 30.00 seconds before re-exec.
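To poke at this failure by hand on the affected node, one could ask os-net-config what it intends to do without applying anything, and ask Open vSwitch which bonds actually exist. A sketch, assuming the --noop flag of the os-net-config CLI of this era:

    # Show the intended network config without touching any interfaces.
    sudo os-net-config -c /etc/os-net-config/config.json -v --noop
    # List the bonds ovs-vswitchd knows about; the 'no such bond' error
    # above suggests bond1 never made it into this list.
    sudo ovs-appctl bond/show
    # Confirm which ports ended up attached to the br-ex bridge.
    sudo ovs-vsctl list-ports br-ex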
The same failure appears in journal-35:

    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: dib-run-parts Fri Sep 25 04:20:19 EDT 2015 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: ++ os-apply-config --key os_net_config --type raw --key-default ''
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: + NET_CONFIG='{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"members": [{"type": "interface", "name": "nic2", "primary": tru
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: + '[' -n '{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"members": [{"type": "interface", "name": "nic2", "primary": true},
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: + os-net-config -c /etc/os-net-config/config.json -v
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Using config file at: /etc/os-net-config/config.json
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Ifcfg net config provider created.
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] nic1 mapped to: em1
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding bridge: br-ex
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding bond: bond1
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding interface: nic2
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding interface: nic3
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding vlan: vlan3980
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding vlan: vlan3981
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] applying network configs...
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Running ovs-appctl bond/set-active-slave ('bond1', 'nic2')
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Traceback (most recent call last):
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:   File "/usr/bin/os-net-config", line 10, in <module>
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:     sys.exit(main())
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:   File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 172, in main
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:     activate=not opts.no_activate)
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:   File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 310, in apply
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:     self.bond_primary_ifaces[bond])
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 145, in ovs_appctl
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:     self.execute(msg, '/bin/ovs-appctl', action, *parameters)
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:   File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 107, in execute
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:     processutils.execute(cmd, *args, **kwargs)
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:   File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 233, in execute
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]:     cmd=sanitized_cmd)
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Command: /bin/ovs-appctl bond/set-active-slave bond1 nic2
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Exit code: 2
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Stdout: u''
    Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Stderr: u'no such bond\novs-appctl: ovs-vswitchd: server returned an error\n'

So in both cases it's failing to configure the bond. Node 39 mapped two nics (nic1 -> em1, nic2 -> p1p2), but node 35 mapped only one (nic1 -> em1). Do these boxes have similar network interfaces (same number, same naming)? Does the expected nic2 on node 35 have link up? If not, that would explain why os-net-config did not configure it. Typically, the compute role heat templates (yaml files) use a single interface name (nic1, nic2, etc.) for the interface used for bonding or bridging, so that name needs to map to the same interface across all the physical systems you intend to use as compute nodes. However, I'm not sure to what extent you've done any customizations with the templates.
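Checking link state on the suspect interfaces, as asked above, might look like the following sketch (the interface name is taken from the nic mappings in the logs):

    # Link and carrier state for all interfaces; look for NO-CARRIER.
    ip link show
    # Physical link detection for the interface expected to map to nic2.
    sudo ethtool p1p2 | grep 'Link detected'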
---

Hi James, yes, all nodes have the same config; I'm not sure why the link is not up on these two nodes. I can remove these two nodes from ironic and attempt to redeploy again later. Is there any way I can force the bond to be created even when the link is not up?

---

Song, you can't force bonding that way, as it looks like the error is coming from the Open vSwitch side. Even if you could, the deployment would just fail on the following step, since the compute node would not be able to join the cluster.

---

Some time ago, on a debugging session with Xin, we uncovered a similar issue where the deployment failed because one of the nodes had a hardware network connectivity problem (link was down on one of the interfaces). The deployment succeeded after removing the troublesome node, if I recall correctly. So perhaps removing those nodes and trying to redeploy, as suggested above, could help.

---

Hi Jiri, James, thanks for the help. I have identified some nodes that have no active uplink, or only one active uplink, which was causing the deployment to fail. I will spend some time figuring out which nodes have no links, and remove them from the database. So far I have managed to scale to 50 nodes. Thanks.

---

Closing this one out, as we've identified the root cause as not a bug.
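For reference, taking a bad node out of Ironic as discussed above could look like this sketch (Kilo-era python-ironicclient syntax, from memory; the UUID is a placeholder):

    # Flag the node as in maintenance so the scheduler skips it...
    ironic node-set-maintenance <node-uuid> true --reason "no active uplink"
    # ...or remove it from the Ironic database entirely.
    ironic node-delete <node-uuid>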