Bug 1329419 - overcloud deploy needs babysitting to complete.
Summary: overcloud deploy needs babysitting to complete.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Angus Thomas
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-21 21:38 UTC by Steve Reichard
Modified: 2020-04-15 14:27 UTC
CC: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-19 22:30:30 UTC
Target Upstream Version:


Attachments

Description Steve Reichard 2016-04-21 21:38:57 UTC
Description of problem:

I have not been able to get an OSP8 install to complete unattended since rc1.

I updated all firmware in hopes that it would help, but it did not appear to make a difference.

So I watched the install for funkiness and rebooted the node(s) experiencing it. Doing this, I have completed the install twice.

So what do I mean by funkiness:
 - It looked like one of my ceph nodes booted an older copy of the OS. I noticed this because it had the same index as another system and its IP didn't match `nova list`. This only happened on the second install.
 - After rebooting this node, it came up with two os-net-config errors. One is a conflicting IP; the other is that determining an IP failed. On the first install I saw this on multiple nodes.

Here is some of the journal log around the os-net-config errors.

[heat-admin@iaas-cephstorage-0 ~]$ journalctl -u os-collect-config | grep -A 25 -i trace
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Traceback (most recent call last):
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/bin/os-net-config", line 10, in <module>
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: sys.exit(main())
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: activate=not opts.no_activate)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 572, in apply
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.ifup(interface)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 163, in ifup
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.execute(msg, '/sbin/ifup', interface)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 143, in execute
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: processutils.execute(cmd, *args, **kwargs)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 275, in execute
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: cmd=sanitized_cmd)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Command: /sbin/ifup em2
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Exit code: 1
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stdout: u'ERROR    : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host already uses address 172.18.13.12.\n'
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stderr: u''
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + RETVAL=1
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 == 2 ]]
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 != 0 ]]
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + echo 'ERROR: os-net-config configuration failed.'
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: ERROR: os-net-config configuration failed.
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + exit 1
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + configure_safe_defaults
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 == 0 ]]
--
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Traceback (most recent call last):
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/bin/os-net-config", line 10, in <module>
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: sys.exit(main())
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: activate=not opts.no_activate)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 572, in apply
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.ifup(interface)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 163, in ifup
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.execute(msg, '/sbin/ifup', interface)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 143, in execute
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: processutils.execute(cmd, *args, **kwargs)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 275, in execute
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: cmd=sanitized_cmd)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Command: /sbin/ifup em4
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Exit code: 1
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stdout: u'\nDetermining IP information for em4... failed.\n'
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stderr: u''
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + RETVAL=1
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 == 2 ]]
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 != 0 ]]
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + echo 'ERROR: configuration of safe defaults failed.'
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: ERROR: configuration of safe defaults failed.
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: [2016-04-21 13:33:09,063] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: [2016-04-21 13:33:09,063] (os-refresh-config) [ERROR] Aborting...
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: 2016-04-21 13:33:09.068 6943 ERROR os-collect-config [-] Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit status 1
[heat-admin@iaas-cephstorage-0 ~]$ 


This is Dell hardware with Intel NICs.
The director is in a VM.

I've generated two sosreports, one for the director node and one for the ceph node that had the os-net-config issues. Both reports will be at:

http://refarch.cloud.lab.eng.bos.redhat.com/pub/tmp/OSPdisFunky/



Version-Release number of selected component (if applicable):

[stack@iaas-inst ~]$ sudo yum list installed | grep -A 1 -e triple -e ironic -e images
This system is not registered with RHN Classic or Red Hat Satellite.
You can use rhn_register to register.
Red Hat Satellite or RHN Classic support will be disabled.
openstack-ironic-api.noarch       1:4.2.2-4.el7ost        @RH7-RHOS-8.0         
openstack-ironic-common.noarch    1:4.2.2-4.el7ost        @RH7-RHOS-8.0         
openstack-ironic-conductor.noarch 1:4.2.2-4.el7ost        @RH7-RHOS-8.0         
openstack-ironic-inspector.noarch 2.2.5-2.el7ost          @RH7-RHOS-8.0-director
openstack-keystone.noarch         1:8.0.1-1.el7ost        @RH7-RHOS-8.0         
--
openstack-tripleo.noarch          0.0.7-1.el7ost          @RH7-RHOS-8.0-director
openstack-tripleo-common.noarch   0.3.1-1.el7ost          @RH7-RHOS-8.0-director
openstack-tripleo-heat-templates.noarch
                                  0.8.14-9.el7ost         @RH7-RHOS-8.0-director
openstack-tripleo-heat-templates-kilo.noarch
                                  0.8.14-9.el7ost         @RH7-RHOS-8.0-director
openstack-tripleo-image-elements.noarch
                                  0.9.9-2.el7ost          @RH7-RHOS-8.0-director
openstack-tripleo-puppet-elements.noarch
                                  0.0.5-1.el7ost          @RH7-RHOS-8.0-director
--
python-ironic-inspector-client.noarch
                                  1.2.0-6.el7ost          @RH7-RHOS-8.0-director
python-ironicclient.noarch        0.8.1-1.el7ost          @RH7-RHOS-8.0         
python-iso8601.noarch             0.1.10-6.1.el7ost       @RH7-RHOS-8.0         
--
python-tripleoclient.noarch       0.3.4-4.el7ost          @RH7-RHOS-8.0-director
python-troveclient.noarch         1.3.0-1.el7ost          @RH7-RHOS-8.0         
--
rhosp-director-images.noarch      8.0-20160415.1.el7ost   @RH7-RHOS-8.0-director
rhosp-director-images-ipa.noarch  8.0-20160415.1.el7ost   @RH7-RHOS-8.0-director
rng-tools.x86_64                  5-7.el7                 @anaconda/7.2         
[stack@iaas-inst ~]$ 





How reproducible:

Seen every time in rc1.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 John Fulton 2016-05-05 15:05:57 UTC
I had ComputeAllNodesValidationDeployment fail for what seems like the same network reasons.

I see that the validation test only does a `ping -c 1 $IP` [0]. Would a `ping -c 4 $IP` give the network more time to come up? 
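A retry loop would accomplish the same thing more explicitly; a minimal sketch of the idea (function name, retry count, and timeouts below are illustrative, not taken from the actual all-nodes.sh validation script):

```shell
# Hedged sketch: retry the ping a few times with a short wait between
# attempts, instead of a single `ping -c 1`, so a slow-to-come-up
# interface gets a grace period before the deployment is failed.
ping_retry() {
  local ip=$1 tries=${2:-4} i
  for i in $(seq 1 "$tries"); do
    if ping -c 1 -w 2 "$ip" > /dev/null 2>&1; then
      echo "SUCCESS"
      return 0
    fi
    sleep 2
  done
  echo "FAILURE"
  return 1
}
```

With 4 tries and a 2-second sleep this gives an interface roughly 10 extra seconds to come up, at the cost of a slower failure on genuinely dead addresses.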

More details:

During an OSP8 deploy [1] on 6 Dell servers, my Heat stack create failed because one of my compute nodes could not ping an address on my API network [2]. I got a list of all the IPs for each box and verified every one of them was pingable within 5 minutes [3] of this issue. I then simply ran the same deploy command again and the stack UPDATE completed successfully [4]. The same deployment command on the same hardware/network worked repeatedly with OSP7 a few hours earlier; I see the order was changed [5], so perhaps this test now happens earlier in the process. I will rebuild OSP8 a few more times the same way to see if this is easily reproducible and update this BZ with my results.

[0] https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=blob;f=validation-scripts/all-nodes.sh;h=38a5a55e10b26337aaed4bd7f916ff684d56db9d;hb=a6861730bd3eee0cd419c959048cac9a48ee8482#l18

[1] 
time openstack overcloud deploy --templates ~/templates/ -e ~/templates/clean_osd.yaml -e ~/templates/environments/puppet-pacemaker.yaml -e ~/templates/advanced-networking.yaml -e ~/templates/environments/puppet-ceph-external.yaml -e ~/templates/extraconfig/pre_deploy/rhel-registration/environment-rhel-registration.yaml -e ~/templates/extraconfig/pre_deploy/rhel-registration/rhel-registration-resource-registry.yaml --control-flavor control --control-scale 3 --compute-flavor compute --compute-scale 3 --log-file overcloud_deployment.log --ntp-server 10.5.26.10 --timeout 90 --neutron-bridge-mappings datacentre:br-ex,tenant:br-tenant --neutron-network-type vlan --neutron-network-vlan-ranges tenant:4051:4060 --neutron-disable-tunneling

[2] 
2016-05-04 02:46:19 [0]: SIGNAL_IN_PROGRESS  Signal: deployment failed (1)
2016-05-04 02:46:19 [0]: CREATE_FAILED  Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-05-04 02:46:20 [0]: SIGNAL_COMPLETE  Unknown
2016-05-04 02:46:20 [overcloud-ComputeAllNodesValidationDeployment-uojcvmz4kfyo]: UPDATE_FAILED  Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-05-04 02:46:21 [NetworkDeployment]: SIGNAL_COMPLETE  Unknown
2016-05-04 02:46:21 [NovaComputeDeployment]: SIGNAL_COMPLETE  Unknown
2016-05-04 02:46:21 [0]: SIGNAL_COMPLETE  Unknown
Stack overcloud CREATE_FAILED
Heat Stack create failed.

real    32m33.021s
user    0m25.289s
sys     0m2.416s
[stack@hci-director ~]$ 

[stack@hci-director ~]$ heat stack-show  8409d34e-7c18-4b75-954a-fd4318d58189

| parameters            | {                                                                                                                                            
|                       |   "OS::project_id": "4a075850f7c2405fbd662676845724bf",    
|                       |   "OS::stack_id": "8409d34e-7c18-4b75-954a-fd4318d58189",  
|                       |   "OS::stack_name": "overcloud-ComputeAllNodesValidationDeployment-uojcvmz4kfyo"
|                       | } 
| parent                | f25ba434-00fa-496d-b911-1b1f3484a38c     
| stack_status          | UPDATE_FAILED    
| stack_status_reason   | Error: resources[0]: Deployment to server failed:  
|                       | deploy_status_code : Deployment exited with non-zero  
|                       | status code: 1               
| updated_time          | 2016-05-04T02:41:08     

[stack@hci-director ~]$ heat deployment-show 1dd25c86-3349-4530-a714-8d3737880086 
{
  "status": "FAILED", 
  "server_id": "1b9a8cd1-b0f0-4de8-a960-7ecafd77acc9", 
  "config_id": "60cb9cb3-c18e-42e2-be94-0e55d1b019bf", 
  "output_values": {
    "deploy_stdout": "Trying to ping 172.16.1.14 for local network 172.16.1.0/24...SUCCESS\nTrying to ping 172.16.2.14 for local network 172.16.2.0/24...SUCCESS\nTrying to ping 192.168.2.15 for local network 192.168.2.0/24...FAILURE\n", 
    "deploy_stderr": "192.168.2.15 is not pingable. Local Network: 192.168.2.0/24\n", 
    "deploy_status_code": 1
  }, 
  "creation_time": "2016-05-04T02:41:11", 
  "updated_time": "2016-05-04T02:46:19", 
  "input_values": {}, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", 
  "id": "1dd25c86-3349-4530-a714-8d3737880086"
}
[stack@hci-director ~]$ 

[3] Get list of IPs and generate cross-product ping test command 
ansible all -b -m shell -a "ip a | egrep '192.168|172.16'"

ansible all -b -m shell -a "ping -c 2 192.168.3.15" 
ansible all -b -m shell -a "ping -c 2 172.16.2.16"
ansible all -b -m shell -a "ping -c 2 172.16.1.16"
ansible all -b -m shell -a "ping -c 2 192.168.2.17"
ansible all -b -m shell -a "ping -c 2 192.168.3.11"
ansible all -b -m shell -a "ping -c 2 172.16.2.12"
ansible all -b -m shell -a "ping -c 2 172.16.1.12"
ansible all -b -m shell -a "ping -c 2 192.168.2.13"
ansible all -b -m shell -a "ping -c 2 172.16.2.14"
ansible all -b -m shell -a "ping -c 2 172.16.1.14"
ansible all -b -m shell -a "ping -c 2 192.168.2.15"
ansible all -b -m shell -a "ping -c 2 192.168.3.10"
ansible all -b -m shell -a "ping -c 2 172.16.2.11"
ansible all -b -m shell -a "ping -c 2 172.16.1.11"
ansible all -b -m shell -a "ping -c 2 192.168.2.12"
ansible all -b -m shell -a "ping -c 2 192.168.3.12"
ansible all -b -m shell -a "ping -c 2 172.16.2.13"
ansible all -b -m shell -a "ping -c 2 172.16.1.13"
ansible all -b -m shell -a "ping -c 2 192.168.2.14"
ansible all -b -m shell -a "ping -c 2 192.168.1.38"
ansible all -b -m shell -a "ping -c 2 192.168.3.14"
ansible all -b -m shell -a "ping -c 2 172.16.2.15"
ansible all -b -m shell -a "ping -c 2 172.16.1.15"
ansible all -b -m shell -a "ping -c 2 192.168.2.16"
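The per-IP commands above can be generated rather than typed out; a small sketch (the helper name is mine; in practice the IP list would come from `nova list` or the `ip a` output gathered above):

```shell
# Hypothetical helper: read IPs one per line on stdin and emit the
# corresponding cross-product ansible ping test commands shown above.
gen_ping_tests() {
  while read -r ip; do
    [ -n "$ip" ] && printf 'ansible all -b -m shell -a "ping -c 2 %s"\n' "$ip"
  done
}

# Example with two of the addresses from the list above:
printf '192.168.2.15\n172.16.1.14\n' | gen_ping_tests
```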

[4] 
2016-05-04 04:19:47 [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.19.139.37:5000/v2.0
Overcloud Deployed

real    15m6.304s
user    0m16.803s
sys     0m1.629s
[stack@hci-director ~]$

[5] https://bugs.launchpad.net/tripleo/+bug/1553243

Comment 8 Amit Ugol 2018-05-02 10:36:43 UTC
closed, no need for needinfo.

