Bug 1264226

Summary: unable to scale beyond 35 compute nodes and 1 controller node
Product: Red Hat OpenStack Reporter: bigswitch <rhosp-bugs-internal>
Component: rhosp-director    Assignee: James Slagle <jslagle>
Status: CLOSED NOTABUG QA Contact: yeylon <yeylon>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 7.0 (Kilo)    CC: calfonso, dblack, jstransk, mburns, morazi, mwagner, rhel-osp-director-maint, rhosp-bugs-internal, srevivo
Target Milestone: ---   
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-09-29 17:40:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
resource list (flags: none)
Added resource list output after the scale out failed (flags: none)

Description bigswitch 2015-09-18 00:01:37 UTC
Description of problem:
I have a deployment with 70+ nodes. I am not able to successfully deploy a setup with 40 compute nodes and 1 controller node. I have incrementally added up to 35 compute nodes, and when attempting to go beyond that to 40 or 60 nodes, the deployment failed with the error below.
I've increased the Heat maximum resources per stack to 10000 and set the Neutron port quota to -1 (unlimited).
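
For reference, a minimal sketch of how these limits can be raised on the undercloud, assuming the default config paths and that the openstack-utils package providing openstack-config is installed; 10000 and -1 mirror the values above:

sudo openstack-config --set /etc/heat/heat.conf DEFAULT max_resources_per_stack 10000
sudo systemctl restart openstack-heat-engine openstack-heat-api
# -1 removes the per-tenant port limit
neutron quota-update --tenant-id <tenant-id> --port -1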

DEBUG: heatclient.common.http curl -g -i -X GET -H 'X-Auth-Token: {SHA1}e32b3f1cee5fa9636b345025fe51bbf65cad8223' -H 'Content-Type: application/json' -H 'X-Auth-Url: http://192.0.2.1:5000/v2.0' -H 'Accept: application/json' -H 'User-Agent: python-heatclient' http://192.0.2.1:8004/v1/14e5908293e74b87b2d4c9a31504aa1f/stacks/overcloud
DEBUG: heatclient.common.http
HTTP/1.1 401 Unauthorized
content-length: 23
www-authenticate: Keystone uri='http://192.0.2.1:5000/v2.0'
connection: keep-alive
date: Thu, 17 Sep 2015 23:41:05 GMT
content-type: text/plain
x-openstack-request-id: req-517bf34a-e92c-47fb-ab6c-4923b60e1155

Authentication required

ERROR: openstack ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1
Authentication required
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 824, in take_action
    self._deploy_tripleo_heat_templates(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 498, in _deploy_tripleo_heat_templates
    parsed_args.timeout)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 438, in _heat_deploy
    orchestration_client, "overcloud")
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/utils.py", line 144, in wait_for_stack_ready
    stack = orchestration_client.stacks.get(stack_name)
  File "/usr/lib/python2.7/site-packages/heatclient/v1/stacks.py", line 202, in get
    resp, body = self.client.json_request('GET', '/stacks/%s' % stack_id)
  File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 265, in json_request
    resp = self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 217, in _http_request
    'content': resp.content
HTTPUnauthorized: ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1
Authentication required
DEBUG: openstackclient.shell clean_up DeployOvercloud
DEBUG: openstackclient.shell got an error: ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1
Authentication required
ERROR: openstackclient.shell Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 176, in run
    return super(OpenStackShell, self).run(argv)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 230, in run
    result = self.run_subcommand(remainder)
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
    self.take_action(parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 824, in take_action
    self._deploy_tripleo_heat_templates(stack, parsed_args)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 498, in _deploy_tripleo_heat_templates
    parsed_args.timeout)
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 438, in _heat_deploy
    orchestration_client, "overcloud")
  File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/utils.py", line 144, in wait_for_stack_ready
    stack = orchestration_client.stacks.get(stack_name)
  File "/usr/lib/python2.7/site-packages/heatclient/v1/stacks.py", line 202, in get
    resp, body = self.client.json_request('GET', '/stacks/%s' % stack_id)
  File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 265, in json_request
    resp = self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 217, in _http_request
    'content': resp.content
HTTPUnauthorized: ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1
Authentication required

Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
[stack@dell-undercloud ~]$


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Attempt to deploy 40 compute nodes; the deployment fails after 4 hours.

Actual results:


Expected results:


Additional info:

Comment 3 bigswitch 2015-09-18 20:33:31 UTC
I have uploaded the sosreport and a zip of /var/log/ to

https://bigswitch.box.com/s/8yc6u8ca7iq9s91vfqven9cwda1zhnag

The 50 GB root partition isn't big enough to generate the sosreport, so I had to move /var/log to the /home partition instead.

Comment 4 chris alfonso 2015-09-22 16:23:58 UTC
Did you increase the keystone token timeout setting? The error is indicative of the deployment exceeding that threshold.

Comment 5 James Slagle 2015-09-22 16:28:26 UTC
Please provide the output of:

heat resource-list -n 10 overcloud

after the scale-out has failed.

That will help us determine which Heat resources failed and how to look into those failures.

Comment 6 bigswitch 2015-09-22 16:33:15 UTC
I did not increase the keystone token expiration timeout; it's set to 14400. I'll collect the heat resource-list output the next time I attempt to scale.

Comment 7 Dustin Black 2015-09-22 18:55:14 UTC
(In reply to chris alfonso from comment #4)
> Did you increase the keystone token timeout setting? The error is indicative
> of the deployment exceeding that threshold.

Looks like BZ 1235908. The token expiration was previously 3600 (1 hour), and that default was bumped to 14400 (4 hours) in instack-undercloud-2.1.2-9.el7ost (errata https://access.redhat.com/errata/RHEA-2015:1549 released 2015-08-05).

So are we saying here that 4 hours is still too low? Or is this somehow directly related to the attempted scale change from 35 to 40 compute nodes?

Comment 8 chris alfonso 2015-09-22 19:15:36 UTC
How long did the deployment take to fail? If it exceeded 4 hours, that would explain why 14400 is too low. If it's not simply the time needed to complete the deployment, I recommend following the troubleshooting steps to find out which resource is failing and what the error is.

Comment 9 bigswitch 2015-09-22 22:02:00 UTC
It took four hours to fail, so it matches the 14400 value. In nova list, all nodes are active and I don't see any nodes stuck in an error state, but in the overcloud only the previously deployed nodes are showing; none of the new nodes are showing up.
I am rebuilding the system and should be able to test this again tomorrow.

Comment 10 chris alfonso 2015-09-23 16:37:42 UTC
OK, so this legitimately looks like a keystone token timeout. Increase it and see what happens.
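
A minimal sketch of raising the undercloud keystone token lifetime, assuming keystone is configured in /etc/keystone/keystone.conf and runs as the openstack-keystone service; 28800 is only an illustrative value:

sudo openstack-config --set /etc/keystone/keystone.conf token expiration 28800
sudo systemctl restart openstack-keystone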

Comment 11 bigswitch 2015-09-23 16:42:27 UTC
Hi Chris,
I am trying to scale from 35 to 40 compute nodes. Would adding five more compute nodes require more than 4 hours to complete?
What value should I increase the timeout to if we are planning to deploy up to 120 nodes?

thanks

Comment 12 chris alfonso 2015-09-23 16:47:21 UTC
Good point, I thought you were doing them all at once. It certainly should not take that long to scale out by 5 nodes. Perhaps look at the os-collect-config logs on the new hosts and see what they're hanging on.

Comment 13 bigswitch 2015-09-23 16:57:20 UTC
Hi Chris,
Thanks, I'll leave it at the default 14400 value for now then, and attempt to scale up again. We are doing 10 or 20 nodes at a time. I will collect the logs when we hit this issue again.

Comment 14 Mike Orazi 2015-09-24 23:02:54 UTC
Please provide the resource-list that was requested in https://bugzilla.redhat.com/show_bug.cgi?id=1264226#c5

This will help us further diagnose the problem.

Comment 15 bigswitch 2015-09-24 23:31:37 UTC
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     httpd      (systemd:httpd):        FAILED overcloud-controller-2 (unmanaged)
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed actions:
    httpd_monitor_0 on overcloud-controller-0 'OCF_PENDING' (196): call=180, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 13:45:05 2015', queued=0ms, exec=13ms
    neutron-server_monitor_60000 on overcloud-controller-0 'not running' (7): call=367, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:56:05 2015', queued=0ms, exec=0ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=373, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:56:09 2015', queued=0ms, exec=0ms
    httpd_monitor_0 on overcloud-controller-1 'OCF_PENDING' (196): call=180, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:21:30 2015', queued=0ms, exec=18ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=321, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 14:55:27 2015', queued=0ms, exec=0ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=322, status=complete, exit-reason='none', last-rc-change='Thu Sep 24 15:02:50 2015', queued=0ms, exec=0ms
    httpd_stop_0 on overcloud-controller-2 'OCF_TIMEOUT' (198): call=211, status=Timed Out, exit-reason='none', last-rc-change='Thu Sep 24 14:31:39 2015', queued=19ms, exec=49ms
    httpd_stop_0 on overcloud-controller-2 'OCF_TIMEOUT' (198): call=211, status=Timed Out, exit-reason='none', last-rc-change='Thu Sep 24 14:31:39 2015', queued=19ms, exec=49ms


Attached is the resource list.

Comment 16 bigswitch 2015-09-24 23:32:14 UTC
Created attachment 1076756 [details]
resource list

Comment 17 bigswitch 2015-09-25 15:48:39 UTC
Created attachment 1077195 [details]
Added resource list output after the scale out failed

Comment 18 Mike Burns 2015-09-25 21:02:00 UTC
Can you also provide this output?

for failed_deployment in $(heat resource-list --nested-depth 5 overcloud | grep FAILED | grep 'StructuredDeployment ' | cut -d '|' -f3); do heat deployment-show $failed_deployment; done

Comment 19 James Slagle 2015-09-25 21:09:37 UTC
Also provide the output of nova list, and nova show <instance-id> for any instances not in ACTIVE state.
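
A quick way to spot non-ACTIVE instances, as a sketch assuming the undercloud credentials are sourced:

nova list | grep -vE 'ACTIVE|^\+|Status'
nova show <instance-id>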

Comment 20 bigswitch 2015-09-25 22:05:55 UTC
I could not get any output from the full command mentioned in comment 18; it gives nothing. So I split it up and changed the grep pattern to 'SoftwareDeployment':
[stack@dell-undercloud ~]$ heat resource-list --nested-depth 5 overcloud | grep FAILED
| Compute                                     | d3da6de3-13c9-424f-8412-811bf41e4648          | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2015-09-24T21:35:45Z |                                             |
| 39                                          | 64f6532d-d4bf-4062-bfc1-63f60f012859          | OS::TripleO::Compute                              | CREATE_FAILED   | 2015-09-24T21:39:47Z | Compute                                     |
| 35                                          | cca56a74-0feb-4d1b-bbb0-72c039d51d89          | OS::TripleO::Compute                              | CREATE_FAILED   | 2015-09-24T21:40:06Z | Compute                                     |
| NetworkDeployment                           | 275d0de9-e0ee-47cc-8efc-d03788f9a203          | OS::TripleO::SoftwareDeployment                   | CREATE_FAILED   | 2015-09-24T21:40:06Z | 39                                          |
| UpdateDeployment                            | 0fbc2ad9-db5e-4787-90cd-8fc0eb85cc26          | OS::Heat::SoftwareDeployment                      | CREATE_FAILED   | 2015-09-24T21:40:06Z | 39                                          |
| NovaComputeDeployment                       | da759b39-1ac4-42a2-9c50-99c43a67c9b3          | OS::TripleO::SoftwareDeployment                   | CREATE_FAILED   | 2015-09-24T21:40:19Z | 35                                          |
[stack@dell-undercloud ~]$

[stack@dell-undercloud ~]$ for failed_deployment in $(heat resource-list --nested-depth 5 overcloud | grep FAILED | grep 'SoftwareDeployment ' | cut -d '|' -f3); do heat deployment-show $failed_deployment; done
{
  "status": "IN_PROGRESS",
  "server_id": "a73ec252-b74d-41ba-be73-3eb53e3a1f7f",
  "config_id": "8c21a06a-e15c-4504-907d-b082f221dcf9",
  "output_values": null,
  "creation_time": "2015-09-24T21:44:03Z",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "Deploy data available",
  "id": "275d0de9-e0ee-47cc-8efc-d03788f9a203"
}
{
  "status": "IN_PROGRESS",
  "server_id": "a73ec252-b74d-41ba-be73-3eb53e3a1f7f",
  "config_id": "1c575806-3d96-4e4d-8b0b-c52e8da25cb7",
  "output_values": null,
  "creation_time": "2015-09-24T21:43:49Z",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "Deploy data available",
  "id": "0fbc2ad9-db5e-4787-90cd-8fc0eb85cc26"
}
{
  "status": "IN_PROGRESS",
  "server_id": "2eb32952-24f0-4f36-8134-b6bebd13ec4c",
  "config_id": "94ebe36d-2680-4ddf-a064-71baa9fbccc1",
  "output_values": null,
  "creation_time": "2015-09-24T21:47:04Z",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "Deploy data available",
  "id": "da759b39-1ac4-42a2-9c50-99c43a67c9b3"
}


As per comment #19: we do not have any instances that are not in ACTIVE state.

Comment 21 James Slagle 2015-09-28 16:02:57 UTC
Given that the deployments are still IN_PROGRESS in the output of deployment-show, but in a CREATE_FAILED state in the resource-list output, this indicates that they were aborted by Heat due to a timeout.

This typically means the puppet apply on the nodes hung for some reason.

Can you ssh to the compute nodes that failed (they should be the 35th and 39th compute nodes in the nova list output) and capture the os-collect-config logs via:

sudo journalctl -u os-collect-config

As the journal may have been rotated, you may need to look at older journal files under /var/log/journal with the --file argument. If you want to upload the entire contents of /var/log/journal from these nodes somewhere, I'll have a look.
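
For example (a sketch; the machine-id directory and rotated file names will differ on each node):

sudo journalctl --file /var/log/journal/<machine-id>/system@<rotated>.journal -u os-collect-config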

It is also useful to see what puppet might be blocked on during the scale out attempt. Run the following, then search for puppet to see what child processes it might be blocked on:

ps axjf | less

Often, it's blocked on the openstack-nova-compute service starting. Can you also have a look in /var/log/nova/nova-compute.log to see if there are any errors there?

Comment 22 bigswitch 2015-09-28 17:07:32 UTC
Hi James,
The logs from both compute-35 and compute-39 are at

https://bigswitch.box.com/s/8yc6u8ca7iq9s91vfqven9cwda1zhnag

There is no nova-compute.log on either node; the directory is empty. I think nova-compute wasn't deployed yet.

[root@overcloud-compute-35 nova]# pwd
/var/log/nova
[root@overcloud-compute-35 nova]# ls -ltr
total 0

[root@overcloud-compute-39 heat-admin]# cd /var/log/nova/
[root@overcloud-compute-39 nova]# ls -ltr
total 0
[root@overcloud-compute-39 nova]#

Comment 23 James Slagle 2015-09-28 18:12:27 UTC
Here's the error that's causing the failure from the os-collect-config log:

from journal-39:
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: dib-run-parts Mon Sep 28 10:24:20 EDT 2015 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: ++ os-apply-config --key os_net_config --type raw --key-default ''
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: + NET_CONFIG='{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"ovs_options": "bond_mode=balance-tcp lacp=active other-config:l
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: + '[' -n '{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"ovs_options": "bond_mode=balance-tcp lacp=active other-config:lacp-
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: + os-net-config -c /etc/os-net-config/config.json -v
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Using config file at: /etc/os-net-config/config.json
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Ifcfg net config provider created.
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] nic1 mapped to: em1
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] nic2 mapped to: p1p2
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding bridge: br-ex
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding bond: bond1
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding interface: p1p2
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding interface: nic3
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding vlan: vlan3980
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] adding vlan: vlan3981
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] applying network configs...
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015/09/28 10:24:20 AM] [INFO] Running ovs-appctl bond/set-active-slave ('bond1', 'p1p2')
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Traceback (most recent call last):
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: File "/usr/bin/os-net-config", line 10, in <module>
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: sys.exit(main())
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 172, in main
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: activate=not opts.no_activate)
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 310, in apply
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: self.bond_primary_ifaces[bond])
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 145, in ovs_appctl
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: self.execute(msg, '/bin/ovs-appctl', action, *parameters)
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 107, in execute
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: processutils.execute(cmd, *args, **kwargs)
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 233, in execute
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: cmd=sanitized_cmd)
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Command: /bin/ovs-appctl bond/set-active-slave bond1 p1p2
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Exit code: 2
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Stdout: u''
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: Stderr: u'no such bond\novs-appctl: ovs-vswitchd: server returned an error\n'
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015-09-28 10:24:20,475] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: [2015-09-28 10:24:20,475] (os-refresh-config) [ERROR] Aborting...
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: 2015-09-28 10:24:20.478 8457 ERROR os-collect-config [-] Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit st
Sep 28 10:24:20 overcloud-compute-39.localdomain os-collect-config[8457]: 2015-09-28 10:24:20.479 8457 WARNING os-collect-config [-] Sleeping 30.00 seconds before re-exec.


and from journal-35:
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: dib-run-parts Fri Sep 25 04:20:19 EDT 2015 Running /usr/libexec/os-refresh-config/configure.d/20-os-net-config
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: ++ os-apply-config --key os_net_config --type raw --key-default ''
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: + NET_CONFIG='{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"members": [{"type": "interface", "name": "nic2", "primary": tru
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: + '[' -n '{"network_config": [{"type": "ovs_bridge", "name": "br-ex", "members": [{"members": [{"type": "interface", "name": "nic2", "primary": true}, 
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: + os-net-config -c /etc/os-net-config/config.json -v
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Using config file at: /etc/os-net-config/config.json
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Ifcfg net config provider created.
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] nic1 mapped to: em1
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding bridge: br-ex
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding bond: bond1
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding interface: nic2
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding interface: nic3
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding vlan: vlan3980
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] adding vlan: vlan3981
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] applying network configs...
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: [2015/09/25 04:20:19 AM] [INFO] Running ovs-appctl bond/set-active-slave ('bond1', 'nic2')
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Traceback (most recent call last):
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: File "/usr/bin/os-net-config", line 10, in <module>
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: sys.exit(main())
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 172, in main
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: activate=not opts.no_activate)
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 310, in apply
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: self.bond_primary_ifaces[bond])
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 145, in ovs_appctl
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: self.execute(msg, '/bin/ovs-appctl', action, *parameters)
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 107, in execute
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: processutils.execute(cmd, *args, **kwargs)
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 233, in execute
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: cmd=sanitized_cmd)
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Command: /bin/ovs-appctl bond/set-active-slave bond1 nic2
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Exit code: 2
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Stdout: u''
Sep 25 04:20:19 overcloud-compute-35.localdomain os-collect-config[8461]: Stderr: u'no such bond\novs-appctl: ovs-vswitchd: server returned an error\n'


So in both cases it's failing to configure the bond. Node 39 mapped 2 NICs (nic1->em1, nic2->p1p2), but node 35 only mapped 1 NIC (nic1->em1).

Do these boxes have similar network interfaces (same number, same naming)? Does the expected nic2 on node 35 have its link up? If not, that would explain why os-net-config did not configure it. Typically, the compute role heat templates (yaml files) specify a single interface name (nic1, nic2, etc.) for which interface to use for bonding or bridging, so that name needs to map to the same interface across all the physical systems that you intend to use for compute nodes. However, I'm not sure to what extent you've customized the templates.
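
A quick sketch of checking link state and the NIC mapping on a node, using the physical interface names that appear in the logs above:

ip -o link show | grep -E 'em1|p1p2'
sudo ethtool em1 | grep 'Link detected'
cat /etc/os-net-config/mapping.yaml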

Comment 24 bigswitch 2015-09-28 18:21:31 UTC
Hi James,
Yes, all nodes have the same config; I'm not sure why the link is not up on these two nodes. I can remove these two nodes from Ironic and attempt to redeploy again later.
Is there any way I can force the bond to be created even when the link is not up?

Song

Comment 25 James Slagle 2015-09-29 11:58:46 UTC
You can't force bonding that way, as the error is coming from the Open vSwitch side.

Even if you could, the deployment would just fail at the following step since the compute node would not be able to join the cluster.

Comment 26 Jiri Stransky 2015-09-29 15:30:25 UTC
Some time ago, during a debugging session with Xin, we uncovered a similar issue where the deployment failed because one of the nodes had a hardware network connectivity problem (the link was down on one of the interfaces). The deployment succeeded after removing the troublesome node, if I recall correctly. So perhaps removing those nodes and trying to redeploy, as suggested above, could help.

Comment 27 bigswitch 2015-09-29 16:27:02 UTC
Hi Jiri, James,
Thanks for the help. I have identified some nodes that have no active uplink, or only one active uplink, which is causing the deployment to fail.
I will spend some time figuring out which nodes have no links and remove them from the database.
So far I have managed to scale to 50 nodes.
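
A sketch of how a dead node can be pulled out of Ironic with the Kilo-era CLI (the UUID is a placeholder, and any overcloud instance still tied to the node should be cleaned up first):

ironic node-list
ironic node-set-maintenance <node-uuid> true
ironic node-delete <node-uuid>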

Comment 28 James Slagle 2015-09-29 17:40:45 UTC
Thanks. Closing this one out, as we've identified that the root cause is not a bug.