Bug 1574473 - Failed to scale up upgraded overcloud
Summary: Failed to scale up upgraded overcloud
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Assignee: Emilien Macchi
QA Contact: Gurenko Alex
Reported: 2018-05-03 11:42 UTC by Yurii Prokulevych
Modified: 2018-06-22 12:38 UTC (History)

Last Closed: 2018-06-22 12:38:44 UTC

Description Yurii Prokulevych 2018-05-03 11:42:09 UTC
Description of problem:
-----------------------
An attempt to add an extra compute node after upgrading the overcloud to RHOS-11 fails with a timeout.

openstack overcloud deploy \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 3 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/enable-tls.yaml \
-e /home/stack/virt/inject-trust-anchor.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml
...

2018-05-03 09:58:36Z [overcloud-Compute-2ju3dpbou7mu.2.NetworkDeployment]: CREATE_IN_PROGRESS  state changed
2018-05-03 09:58:38Z [overcloud-Compute-2ju3dpbou7mu.2.NovaComputeConfig]: CREATE_COMPLETE  state changed
2018-05-03 11:33:15Z [Compute]: UPDATE_FAILED  UPDATE aborted
2018-05-03 11:33:16Z [overcloud]: UPDATE_FAILED  Timed out
2018-05-03 11:33:16Z [overcloud-Compute-2ju3dpbou7mu.2]: CREATE_FAILED  CREATE aborted
2018-05-03 11:33:16Z [overcloud-Compute-2ju3dpbou7mu]: UPDATE_FAILED  Operation cancelled

 Stack overcloud UPDATE_FAILED 
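
For reference, the gap between the last successful event and the abort can be read off the timestamps above. The following is only a sketch: the two lines are copied from the event list, while the helper names (event_time, stall_minutes) are made up for illustration. The start time of the deploy command is not shown in this report, so the sketch measures the stall, not the total run time.

```python
from datetime import datetime

# Two lines copied from the stack event output above.
EVENT_LINES = [
    "2018-05-03 09:58:38Z [overcloud-Compute-2ju3dpbou7mu.2.NovaComputeConfig]: CREATE_COMPLETE  state changed",
    "2018-05-03 11:33:15Z [Compute]: UPDATE_FAILED  UPDATE aborted",
]


def event_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SSZ' timestamp of an event line."""
    stamp = " ".join(line.split()[:2])
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%SZ")


def stall_minutes(last_ok_line, first_failed_line):
    """Minutes between the last successful event and the first failure."""
    delta = event_time(first_failed_line) - event_time(last_ok_line)
    return delta.total_seconds() / 60


gap = stall_minutes(*EVENT_LINES)
print(f"{gap:.1f} minutes with no stack events")
```

The stall is shorter than the 100 minutes passed via --timeout, which fits the stack having been updating for a few minutes before the new node stopped reporting progress.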
Version-Release number of selected component (if applicable):

Checking logs on newly added compute:
-------------------------------------
May 03 07:30:22 compute-2 os-collect-config[3052]: /usr/libexec/os-refresh-config/configure.d/20-os-net-config: line 81: /etc/os-net-config/dhcp_all_interfaces.yaml: No such file or directory
May 03 07:30:22 compute-2 os-collect-config[3052]: + os-net-config -c /etc/os-net-config/dhcp_all_interfaces.yaml -v --detailed-exit-codes --cleanup
May 03 07:30:22 compute-2 os-collect-config[3052]: [2018/05/03 07:30:22 AM] [INFO] Using config file at: /etc/os-net-config/dhcp_all_interfaces.yaml
May 03 07:30:22 compute-2 os-collect-config[3052]: [2018/05/03 07:30:22 AM] [INFO] Using mapping file at: /etc/os-net-config/mapping.yaml
May 03 07:30:22 compute-2 os-collect-config[3052]: [2018/05/03 07:30:22 AM] [INFO] Ifcfg net config provider created.
May 03 07:30:22 compute-2 os-collect-config[3052]: [2018/05/03 07:30:22 AM] [ERROR] No config file exists at: /etc/os-net-config/dhcp_all_interfaces.yaml
May 03 07:30:22 compute-2 os-collect-config[3052]: + RETVAL=1
May 03 07:30:22 compute-2 os-collect-config[3052]: + [[ 1 == 2 ]]
May 03 07:30:22 compute-2 os-collect-config[3052]: + [[ 1 != 0 ]]
May 03 07:30:22 compute-2 os-collect-config[3052]: + echo 'ERROR: configuration of safe defaults failed.'
May 03 07:30:22 compute-2 os-collect-config[3052]: ERROR: configuration of safe defaults failed.
May 03 07:30:22 compute-2 os-collect-config[3052]: [2018-05-03 07:30:22,953] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returne
May 03 07:30:22 compute-2 os-collect-config[3052]: [2018-05-03 07:30:22,954] (os-refresh-config) [ERROR] Aborting...
May 03 07:30:22 compute-2 os-collect-config[3052]: Command failed, will not cache new data. Command 'os-refresh-config --timeout 14400' returned non-zero exit status 1
May 03 07:30:22 compute-2 os-collect-config[3052]: Sleeping 1.00 seconds before re-exec.
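
The shell trace above tests RETVAL first against 2 and then against 0. With --detailed-exit-codes, os-net-config documents exit status 2 as "files were modified", so the branching appears to be the following. This is a sketch inferred from the trace only, not the actual 20-os-net-config source, and the function name is hypothetical.

```python
def interpret_retval(retval):
    """Mirror the two tests visible in the trace: '== 2' first, then '!= 0'."""
    if retval == 2:
        # With --detailed-exit-codes, 2 means config files were modified.
        return "files modified"
    if retval != 0:
        # Any other non-zero status is treated as a failure.
        return "ERROR: configuration of safe defaults failed."
    return "no changes"


for code in (0, 1, 2):
    print(code, "->", interpret_retval(code))
```

In this report RETVAL was 1, so the error branch was taken.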

Packages:
---------
openstack-tripleo-heat-templates-6.2.12-2.el7ost.noarch

openstack-tripleo-validations-5.6.4-1.el7ost.noarch
python-tripleoclient-6.2.4-1.el7ost.noarch
openstack-tripleo-common-6.1.5-1.el7ost.noarch
openstack-tripleo-image-elements-6.1.3-1.el7ost.noarch
openstack-tripleo-heat-templates-6.2.12-2.el7ost.noarch
openstack-tripleo-ui-3.2.2-2.el7ost.noarch
openstack-tripleo-puppet-elements-6.2.5-1.el7ost.noarch
puppet-tripleo-6.5.11-1.el7ost.noarch
openstack-tripleo-0.0.8-0.3.4de13b3git.el7ost.noarch

rhosp-director-images-ipa-11.0-20180501.1.el7ost.noarch
rhosp-director-images-10.0-20180501.1.el7ost.noarch
rhosp-director-images-ipa-10.0-20180501.1.el7ost.noarch
rhosp-director-images-11.0-20180501.1.el7ost.noarch



Steps to Reproduce:
--------------------
1. Upgrade the undercloud and overcloud (UC/OC) to RHOS-11 (2018-05-01.2)
2. Scale up the overcloud by adding a compute node


Actual results:
----------------
Scale up failed

Expected results:
-----------------
Scale up succeeds

Comment 3 Bob Fournier 2018-05-09 17:40:12 UTC
Regarding:
May 03 07:30:22 compute-2 os-collect-config[3052]: /usr/libexec/os-refresh-config/configure.d/20-os-net-config: line 81: /etc/os-net-config/dhcp_all_interfaces.yaml: No such file or directory

Since 20-os-net-config itself creates /etc/os-net-config/dhcp_all_interfaces.yaml starting at line 55, the only way that error message could occur is if the /etc/os-net-config/ directory did not exist.  We need to track down where this directory gets created and why it was not created when scaling up in this case.
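
The reasoning about the missing directory can be illustrated with a small sketch. It is not the 20-os-net-config script: it uses a temporary directory as a stand-in for /etc/os-net-config/, a placeholder file body, and a hypothetical helper name. It only shows that creating a file does not create its parent directory, so a missing directory yields the same "No such file or directory" error.

```python
import os
import tempfile


def write_default_config(path, content="network_config: []\n"):
    """Write a placeholder config file, much as a shell '>' redirect would.

    open() creates the file but never its parent directory, so a missing
    directory surfaces as FileNotFoundError (ENOENT).
    """
    try:
        with open(path, "w") as handle:
            handle.write(content)
        return "written"
    except FileNotFoundError:
        return "No such file or directory"


with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-in for /etc/os-net-config/ on the new compute node.
    config_dir = os.path.join(root, "os-net-config")
    target = os.path.join(config_dir, "dhcp_all_interfaces.yaml")

    before = write_default_config(target)  # directory missing
    os.makedirs(config_dir)
    after = write_default_config(target)   # directory present
    print(before, "->", after)
```

On an affected node, the equivalent check is simply whether /etc/os-net-config/ exists before 20-os-net-config runs.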

It appears that the logs for compute-2 are not in the Build artifacts link.  I only see compute-0 and compute-1.  Can you indicate where the logs were obtained in the initial bug comment?

Comment 4 Yurii Prokulevych 2018-05-10 10:30:04 UTC
@Bob, the scale-up failed, so the inventory wasn't updated properly and log collection failed:
fatal: [compute-2]: FAILED! => {
    "changed": false, 
    "module_stderr": "Shared connection to 172.16.0.11 closed.\r\n", 
    "module_stdout": "Please login as the user \"heat-admin\" rather than the user \"root\".\r\n\r\n", 
    "rc": 0
}

Comment 7 Bob Fournier 2018-05-15 14:38:25 UTC
> Bob, you are also talking about old/new format. Can you point me to an example of each one? I don't know which one I'm using.

In Ocata, the NIC config files were changed to use a script instead of os-apply-config to drive os-net-config.  The "old-style" NIC config files can be identified by:
   Software Config to drive os-net-config to configure multiple interfaces
      group: os-apply-config

The "new-style" files don't use os-apply-config; instead they include the script:
       str_replace:
          template:
            get_file: ../../scripts/run-os-net-config.sh

The old-style was still supported until Queens.  In Queens a script is available in $THT/tools/yaml-nic-config-2-script.py to do the conversion.
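
To answer "which one am I using", the two markers described above can be checked mechanically. The following is a rough sketch under stated assumptions: it relies only on the two strings quoted in this comment, does a plain text search rather than parsing the Heat template, and the function name and sample fragments are illustrative, not complete NIC config files.

```python
import re


def classify_nic_config(template_text):
    """Guess the NIC config style from the markers quoted in this comment."""
    uses_script = "run-os-net-config.sh" in template_text
    uses_apply_config = re.search(r"group:\s*os-apply-config", template_text) is not None
    if uses_script and not uses_apply_config:
        return "new-style"
    if uses_apply_config and not uses_script:
        return "old-style"
    return "unknown"


# Abbreviated fragments built from the snippets above (not full templates).
OLD_FRAGMENT = """
description: Software Config to drive os-net-config to configure multiple interfaces
group: os-apply-config
"""

NEW_FRAGMENT = """
str_replace:
  template:
    get_file: ../../scripts/run-os-net-config.sh
"""

print(classify_nic_config(OLD_FRAGMENT))
print(classify_nic_config(NEW_FRAGMENT))
```

A file that matches neither or both markers is reported as "unknown" and should be inspected by hand.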

We've seen some issues when upgrading with the new-style configs, where /etc/os-net-config/config.json was overwritten by os-apply-config, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1514949.

Not sure yet what is going on here or if the problem you are seeing is the same as comment 3. Would be useful to get some logs to help figure out what is going on.

Comment 8 Bob Fournier 2018-05-19 22:27:49 UTC
Yurii - the 2nd set of logs looks quite different from the first; I don't see the os-net-config issues from the initial description.  I do see many connectivity issues when trying to access the metadata server from controller-2.

I also see these libvirt issues in /var/log/messages on controller-2, which may be causing problems.
messages:May 11 13:58:06 controller-2 libvirtd: 2018-05-11 17:58:06.846+0000: 1649: error : logStrToLong_ui:2564 : Failed to convert 'virtio0' to unsigned int
messages:May 11 13:58:06 controller-2 libvirtd: 2018-05-11 17:58:06.849+0000: 1649: error : virPCIGetDeviceAddressFromSysfsLink:2643 : internal error: Failed to parse PCI config address 'virtio0'
messages:May 11 13:58:06 controller-2 libvirtd: 2018-05-11 17:58:06.850+0000: 1649: error : logStrToLong_ui:2564 : Failed to convert 'virtio1' to unsigned int
messages:May 11 13:58:06 controller-2 libvirtd: 2018-05-11 17:58:06.850+0000: 1649: error : virPCIGetDeviceAddressFromSysfsLink:2643 : internal error: Failed to parse PCI config address 'virtio1'
messages:May 11 13:58:06 controller-2 libvirtd: 2018-05-11 17:58:06.852+0000: 1649: error : logStrToLong_ui:2564 : Failed to convert 'virtio2' to unsigned int
messages:May 11 13:58:06 controller-2 libvirtd: 2018-05-11 17:58:06.852+0000: 1649: error : virPCIGetDeviceAddressFromSysfsLink:2643 : internal error: Failed to parse PCI config address 'virtio2'
messages:May 11 21:11:38 controller-2 libvirtd: 2018-05-11 21:11:38.808+0000: 2289: error : logStrToLong_ui:2564 : Failed to convert 'virtio0' to unsigned int
messages:May 11 21:11:38 controller-2 libvirtd: 2018-05-11 21:11:38.809+0000: 2289: error : virPCIGetDeviceAddressFromSysfsLink:2643 : internal error: Failed to parse PCI config address 'virtio0'
messages:May 11 21:11:38 controller-2 libvirtd: 2018-05-11 21:11:38.812+0000: 2289: error : logStrToLong_ui:2564 : Failed to convert 'virtio1' to unsigned int
messages:May 11 21:11:38 controller-2 libvirtd: 2018-05-11 21:11:38.812+0000: 2289: error : virPCIGetDeviceAddressFromSysfsLink:2643 : internal error: Failed to parse PCI config address 'virtio1'
messages:May 11 21:11:38 controller-2 libvirtd: 2018-05-11 21:11:38.813+0000: 2289: error : logStrToLong_ui:2564 : Failed to convert 'virtio2' to unsigned int
messages:May 11 21:11:38 controller-2 libvirtd: 2018-05-11 21:11:38.813+0000: 2289: error : virPCIGetDeviceAddressFromSysfsLink:2643 : internal error: Failed to parse PCI config address 'virtio2'

Comment 9 Scott Lewis 2018-06-22 12:38:44 UTC
OSP11 is now retired, see details at https://access.redhat.com/errata/product/191/ver=11/rhel---7/x86_64/RHBA-2018:1828

