Description of problem:

I have not been able to get OSP8 to install since rc1. I updated all firmware in the hope that it would help, but it did not appear to make a difference. So I watched the install for funkiness and rebooted the node or nodes that were experiencing it. Doing this, I have installed twice.

So what do I mean by funkiness?

- It looked like one of my ceph nodes booted an older copy of the OS. I saw this because it had the same index as another system and its IP didn't match the nova list. This only happened on the second install.
- After rebooting this node, it came up and had two os-net-config errors. One is a conflicting IP; the other is that determining an IP failed. On the first install I saw this on multiple nodes.

Here is some of the journal log around the os-net-config errors.

[heat-admin@iaas-cephstorage-0 ~]$ journalctl -u os-collect-config | grep -A 25 -i trace
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Traceback (most recent call last):
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/bin/os-net-config", line 10, in <module>
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: sys.exit(main())
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: activate=not opts.no_activate)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 572, in apply
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.ifup(interface)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 163, in ifup
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.execute(msg, '/sbin/ifup', interface)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 143, in execute
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: processutils.execute(cmd, *args, **kwargs)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 275, in execute
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: cmd=sanitized_cmd)
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Command: /sbin/ifup em2
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Exit code: 1
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stdout: u'ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host already uses address 172.18.13.12.\n'
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stderr: u''
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + RETVAL=1
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 == 2 ]]
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 != 0 ]]
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + echo 'ERROR: os-net-config configuration failed.'
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: ERROR: os-net-config configuration failed.
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + exit 1
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + configure_safe_defaults
Apr 21 17:32:03 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 == 0 ]]
--
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Traceback (most recent call last):
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/bin/os-net-config", line 10, in <module>
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: sys.exit(main())
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 187, in main
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: activate=not opts.no_activate)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 572, in apply
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.ifup(interface)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 163, in ifup
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: self.execute(msg, '/sbin/ifup', interface)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/os_net_config/__init__.py", line 143, in execute
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: processutils.execute(cmd, *args, **kwargs)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: File "/usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py", line 275, in execute
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: cmd=sanitized_cmd)
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Command: /sbin/ifup em4
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Exit code: 1
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stdout: u'\nDetermining IP information for em4... failed.\n'
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: Stderr: u''
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + RETVAL=1
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 == 2 ]]
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + [[ 1 != 0 ]]
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: + echo 'ERROR: configuration of safe defaults failed.'
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: ERROR: configuration of safe defaults failed.
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: [2016-04-21 13:33:09,063] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: [2016-04-21 13:33:09,063] (os-refresh-config) [ERROR] Aborting...
Apr 21 17:33:09 iaas-cephstorage-0.localdomain os-collect-config[6943]: 2016-04-21 13:33:09.068 6943 ERROR os-collect-config [-] Command failed, will not cache new data. Command 'os-refresh-config' returned non-zero exit status 1
[heat-admin@iaas-cephstorage-0 ~]$

This is Dell hardware with Intel NICs. The director is in a VM.

I've generated two sosreports, one for the director node and one for the ceph node that had the os-net-config issues. Both reports will be at:
http://refarch.cloud.lab.eng.bos.redhat.com/pub/tmp/OSPdisFunky/

Version-Release number of selected component (if applicable):

[stack@iaas-inst ~]$ sudo yum list installed | grep -A 1 -e triple -e ironic -e images
This system is not registered with RHN Classic or Red Hat Satellite.
You can use rhn_register to register.
Red Hat Satellite or RHN Classic support will be disabled.
openstack-ironic-api.noarch 1:4.2.2-4.el7ost @RH7-RHOS-8.0
openstack-ironic-common.noarch 1:4.2.2-4.el7ost @RH7-RHOS-8.0
openstack-ironic-conductor.noarch 1:4.2.2-4.el7ost @RH7-RHOS-8.0
openstack-ironic-inspector.noarch 2.2.5-2.el7ost @RH7-RHOS-8.0-director
openstack-keystone.noarch 1:8.0.1-1.el7ost @RH7-RHOS-8.0
--
openstack-tripleo.noarch 0.0.7-1.el7ost @RH7-RHOS-8.0-director
openstack-tripleo-common.noarch 0.3.1-1.el7ost @RH7-RHOS-8.0-director
openstack-tripleo-heat-templates.noarch 0.8.14-9.el7ost @RH7-RHOS-8.0-director
openstack-tripleo-heat-templates-kilo.noarch 0.8.14-9.el7ost @RH7-RHOS-8.0-director
openstack-tripleo-image-elements.noarch 0.9.9-2.el7ost @RH7-RHOS-8.0-director
openstack-tripleo-puppet-elements.noarch 0.0.5-1.el7ost @RH7-RHOS-8.0-director
--
python-ironic-inspector-client.noarch 1.2.0-6.el7ost @RH7-RHOS-8.0-director
python-ironicclient.noarch 0.8.1-1.el7ost @RH7-RHOS-8.0
python-iso8601.noarch 0.1.10-6.1.el7ost @RH7-RHOS-8.0
--
python-tripleoclient.noarch 0.3.4-4.el7ost @RH7-RHOS-8.0-director
python-troveclient.noarch 1.3.0-1.el7ost @RH7-RHOS-8.0
--
rhosp-director-images.noarch 8.0-20160415.1.el7ost @RH7-RHOS-8.0-director
rhosp-director-images-ipa.noarch 8.0-20160415.1.el7ost @RH7-RHOS-8.0-director
rng-tools.x86_64 5-7.el7 @anaconda/7.2
[stack@iaas-inst ~]$

How reproducible:
Seen every time in rc1.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
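For anyone triaging a similar journal, a rough sketch of pulling out just the failing interfaces and their messages follows. It is written in Python and assumes the log format is exactly what is pasted above (a "Command: /sbin/ifup <iface>" line followed later by a "Stdout: u'...'" line); the function name `ifup_failures` is made up for this illustration and is not part of os-net-config.

```python
import re

# Matches the "Command: /sbin/ifup <iface>" line logged when ifup fails.
COMMAND_RE = re.compile(r"Command: /sbin/ifup (\S+)")
# Matches the later "Stdout: u'...'" line and captures the quoted message.
STDOUT_RE = re.compile(r"Stdout: u'(.*)'\s*$")


def ifup_failures(journal_text):
    """Return a list of (interface, message) pairs for failed ifup calls.

    Hypothetical helper: assumes the journal layout shown in this report.
    """
    failures = []
    current_iface = None
    for line in journal_text.splitlines():
        command = COMMAND_RE.search(line)
        if command:
            current_iface = command.group(1)
            continue
        stdout = STDOUT_RE.search(line)
        if stdout and current_iface is not None:
            # The journal shows literal "\n" sequences; replace them with spaces.
            message = stdout.group(1).replace("\\n", " ").strip()
            failures.append((current_iface, message))
            current_iface = None
    return failures
```

Fed the `journalctl -u os-collect-config` output from the ceph node, this should report em2 with the duplicate-address message and em4 with the "Determining IP information" failure, which makes it easier to compare nodes across installs.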
I had ComputeAllNodesValidationDeployment fail for what seems like the same network reasons. I see that the validation test only does a `ping -c 1 $IP` [0]. Would a `ping -c 4 $IP` give the network more time to come up?

More details:

During an OSP8 deploy [1] on 6 Dell servers my Heat Stack create failed because one of my compute nodes could not ping an address on my API network [2]. I got a list of all the IPs for each box and verified every one of them was pingable within 5 minutes [3] of this issue. I then simply ran the same deploy command again and the Stack UPDATE completed successfully [4]. The same deployment command on the same hardware/network worked repeatedly with OSP7 a few hours earlier; I see the order was changed [5] so perhaps this test happens earlier in the process.

I will rebuild OSP8 a few times again the same way to see if this is easily reproducible and update this BZ with my results.

[0] https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=blob;f=validation-scripts/all-nodes.sh;h=38a5a55e10b26337aaed4bd7f916ff684d56db9d;hb=a6861730bd3eee0cd419c959048cac9a48ee8482#l18

[1]
time openstack overcloud deploy --templates ~/templates/ -e ~/templates/clean_osd.yaml -e ~/templates/environments/puppet-pacemaker.yaml -e ~/templates/advanced-networking.yaml -e ~/templates/environments/puppet-ceph-external.yaml -e ~/templates/extraconfig/pre_deploy/rhel-registration/environment-rhel-registration.yaml -e ~/templates/extraconfig/pre_deploy/rhel-registration/rhel-registration-resource-registry.yaml --control-flavor control --control-scale 3 --compute-flavor compute --compute-scale 3 --log-file overcloud_deployment.log --ntp-server 10.5.26.10 --timeout 90 --neutron-bridge-mappings datacentre:br-ex,tenant:br-tenant --neutron-network-type vlan --neutron-network-vlan-ranges tenant:4051:4060 --neutron-disable-tunneling

[2]
2016-05-04 02:46:19 [0]: SIGNAL_IN_PROGRESS Signal: deployment failed (1)
2016-05-04 02:46:19 [0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-05-04 02:46:20 [0]: SIGNAL_COMPLETE Unknown
2016-05-04 02:46:20 [overcloud-ComputeAllNodesValidationDeployment-uojcvmz4kfyo]: UPDATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-05-04 02:46:21 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-05-04 02:46:21 [NovaComputeDeployment]: SIGNAL_COMPLETE Unknown
2016-05-04 02:46:21 [0]: SIGNAL_COMPLETE Unknown
Stack overcloud CREATE_FAILED
Heat Stack create failed.

real 32m33.021s
user 0m25.289s
sys 0m2.416s
[stack@hci-director ~]$

[stack@hci-director ~]$ heat stack-show 8409d34e-7c18-4b75-954a-fd4318d58189
| parameters | {
| | "OS::project_id": "4a075850f7c2405fbd662676845724bf",
| | "OS::stack_id": "8409d34e-7c18-4b75-954a-fd4318d58189",
| | "OS::stack_name": "overcloud-ComputeAllNodesValidationDeployment-uojcvmz4kfyo"
| | }
| parent | f25ba434-00fa-496d-b911-1b1f3484a38c
| stack_status | UPDATE_FAILED
| stack_status_reason | Error: resources[0]: Deployment to server failed:
| | deploy_status_code : Deployment exited with non-zero
| | status code: 1
| updated_time | 2016-05-04T02:41:08
[stack@hci-director ~]$ heat deployment-show 1dd25c86-3349-4530-a714-8d3737880086
{
  "status": "FAILED",
  "server_id": "1b9a8cd1-b0f0-4de8-a960-7ecafd77acc9",
  "config_id": "60cb9cb3-c18e-42e2-be94-0e55d1b019bf",
  "output_values": {
    "deploy_stdout": "Trying to ping 172.16.1.14 for local network 172.16.1.0/24...SUCCESS\nTrying to ping 172.16.2.14 for local network 172.16.2.0/24...SUCCESS\nTrying to ping 192.168.2.15 for local network 192.168.2.0/24...FAILURE\n",
    "deploy_stderr": "192.168.2.15 is not pingable. Local Network: 192.168.2.0/24\n",
    "deploy_status_code": 1
  },
  "creation_time": "2016-05-04T02:41:11",
  "updated_time": "2016-05-04T02:46:19",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1",
  "id": "1dd25c86-3349-4530-a714-8d3737880086"
}
[stack@hci-director ~]$

[3] Get list of IPs and generate cross-product ping test command
ansible all -b -m shell -a "ip a | egrep '192.168|172.16'"
ansible all -b -m shell -a "ping -c 2 192.168.3.15"
ansible all -b -m shell -a "ping -c 2 172.16.2.16"
ansible all -b -m shell -a "ping -c 2 172.16.1.16"
ansible all -b -m shell -a "ping -c 2 192.168.2.17"
ansible all -b -m shell -a "ping -c 2 192.168.3.11"
ansible all -b -m shell -a "ping -c 2 172.16.2.12"
ansible all -b -m shell -a "ping -c 2 172.16.1.12"
ansible all -b -m shell -a "ping -c 2 192.168.2.13"
ansible all -b -m shell -a "ping -c 2 172.16.2.14"
ansible all -b -m shell -a "ping -c 2 172.16.1.14"
ansible all -b -m shell -a "ping -c 2 192.168.2.15"
ansible all -b -m shell -a "ping -c 2 192.168.3.10"
ansible all -b -m shell -a "ping -c 2 172.16.2.11"
ansible all -b -m shell -a "ping -c 2 172.16.1.11"
ansible all -b -m shell -a "ping -c 2 192.168.2.12"
ansible all -b -m shell -a "ping -c 2 192.168.3.12"
ansible all -b -m shell -a "ping -c 2 172.16.2.13"
ansible all -b -m shell -a "ping -c 2 172.16.1.13"
ansible all -b -m shell -a "ping -c 2 192.168.2.14"
ansible all -b -m shell -a "ping -c 2 192.168.1.38"
ansible all -b -m shell -a "ping -c 2 192.168.3.14"
ansible all -b -m shell -a "ping -c 2 172.16.2.15"
ansible all -b -m shell -a "ping -c 2 172.16.1.15"
ansible all -b -m shell -a "ping -c 2 192.168.2.16"

[4]
2016-05-04 04:19:47 [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.19.139.37:5000/v2.0
Overcloud Deployed

real 15m6.304s
user 0m16.803s
sys 0m1.629s
[stack@hci-director ~]$

[5] https://bugs.launchpad.net/tripleo/+bug/1553243
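On the `ping -c 1` question above: since `ping -c 4` only spreads its packets over a few seconds by default, a retry loop with a pause between attempts might give a slow network more room. Below is a minimal sketch of that idea in Python. It is not the actual validation script (all-nodes.sh is a shell script), and the names `ping_once` and `wait_for_ping` as well as the `attempts` and `delay` defaults are assumptions made for illustration only.

```python
import subprocess
import time


def ping_once(ip):
    """Send a single ICMP echo request; True if ping exits with status 0."""
    result = subprocess.run(
        ["ping", "-c", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_ping(ip, attempts=4, delay=5.0, ping=ping_once, sleep=time.sleep):
    """Try up to `attempts` pings, pausing `delay` seconds after each failure.

    Hypothetical helper: `ping` and `sleep` are injectable so the retry
    logic can be exercised without touching the network.
    """
    for attempt in range(1, attempts + 1):
        if ping(ip):
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

With the defaults, `wait_for_ping("192.168.2.15")` would keep trying for roughly 15 seconds plus ping's own timeouts before giving up, rather than failing the deployment on the first lost packet. Whether that is enough for this environment is something the rebuilds would have to show.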
closed, no need for needinfo.