rhel-osp-director: 9.0 minor update fails. During the controller update, one controller becomes unreachable and the update fails.

Environment:
instack-undercloud-4.0.0-15.el7ost.noarch
openstack-puppet-modules-8.1.8-3.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-36.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-36.el7ost.noarch
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
openstack-neutron-openvswitch-8.1.2-5.el7ost.noarch

Steps to reproduce:
Try to minor update the overcloud 9.0 with:

[stack@instack ~]$ openstack overcloud update stack -i overcloud --templates -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml
starting package update on stack overcloud
IN_PROGRESS
WAITING
not_started: [u'overcloud-controller-0', u'overcloud-controller-2', u'overcloud-controller-1', u'overcloud-compute-0']
on_breakpoint: [u'overcloud-cephstorage-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear b36b7201-c1f6-4916-942e-72f125b2daf3), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'overcloud-cephstorage-0']
on_breakpoint: [u'overcloud-controller-0', u'overcloud-controller-2', u'overcloud-controller-1', u'overcloud-compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 771d9d4b-9fd3-4fdc-9799-b50089ad9c12), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'overcloud-cephstorage-0', u'overcloud-compute-0']
on_breakpoint: [u'overcloud-controller-0', u'overcloud-controller-2', u'overcloud-controller-1']
Breakpoint reached, continue?
Regexp or Enter=proceed (will clear 86dba1fe-8ad1-481b-abe0-166599c75554), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS

At this point one of the controllers had already become unreachable. After a long time and many prints of IN_PROGRESS, the update fails.

[stack@instack ~]$ cat network-environment.yaml
resource_registry:
  OS::TripleO::BlockStorage::Net::SoftwareConfig: /home/stack/nic-configs/cinder-storage.yaml
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/nic-configs/compute.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/nic-configs/controller.yaml
  OS::TripleO::ObjectStorage::Net::SoftwareConfig: /home/stack/nic-configs/swift-storage.yaml
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/nic-configs/ceph-storage.yaml

This is the network-environment.yaml file:

parameters:
  # Customize all these values to match the local environment
  NeutronExternalNetworkBridge: "''"
parameter_defaults:
  InternalApiNetCidr: 192.168.100.0/24
  StorageNetCidr: 192.168.110.0/24
  StorageMgmtNetCidr: 192.168.120.0/24
  TenantNetCidr: 192.168.150.0/24
  ExternalNetCidr: 192.168.200.0/24
  InternalApiAllocationPools: [{'start': '192.168.100.10', 'end': '192.168.100.200'}]
  StorageAllocationPools: [{'start': '192.168.110.10', 'end': '192.168.110.200'}]
  StorageMgmtAllocationPools: [{'start': '192.168.120.10', 'end': '192.168.120.200'}]
  TenantAllocationPools: [{'start': '192.168.150.10', 'end': '192.168.150.200'}]
  # Use an External allocation pool which will leave room for floating IPs
  ExternalAllocationPools: [{'start': '192.168.200.180', 'end': '192.168.200.200'}]
  # Set to the router gateway on the external network
  ExternalInterfaceDefaultRoute: 192.168.200.1
  DnsServers: ["10.16.36.29"]
  ControlPlaneSubnetCidr: "24"
  ControlPlaneDefaultRoute: 192.0.2.1
  EC2MetadataIp: 192.0.2.1

[stack@instack ~]$ cat /home/stack/nic-configs/controller.yaml
heat_template_version: 2015-04-30

description: >
  Software Config to drive os-net-config to configure VLANs for the
  controller role.

parameters:
  ManagementIpSubnet: # Only populated when including environments/network-management.yaml
    default: ''
    description: IP address/subnet on the management network
    type: string
  ControlPlaneIp:
    default: ''
    description: IP address/subnet on the ctlplane network
    type: string
  ExternalIpSubnet:
    default: ''
    description: IP address/subnet on the external network
    type: string
  InternalApiIpSubnet:
    default: ''
    description: IP address/subnet on the internal API network
    type: string
  StorageIpSubnet:
    default: ''
    description: IP address/subnet on the storage network
    type: string
  StorageMgmtIpSubnet:
    default: ''
    description: IP address/subnet on the storage mgmt network
    type: string
  TenantIpSubnet:
    default: ''
    description: IP address/subnet on the tenant network
    type: string
  ExternalNetworkVlanID:
    default: 10
    description: Vlan ID for the external network traffic.
    type: number
  InternalApiNetworkVlanID:
    default: 20
    description: Vlan ID for the internal_api network traffic.
    type: number
  StorageNetworkVlanID:
    default: 30
    description: Vlan ID for the storage network traffic.
    type: number
  StorageMgmtNetworkVlanID:
    default: 40
    description: Vlan ID for the storage mgmt network traffic.
    type: number
  TenantNetworkVlanID:
    default: 50
    description: Vlan ID for the tenant network traffic.
    type: number
  ExternalInterfaceDefaultRoute:
    default: '10.0.0.1'
    description: default route for the external network
    type: string
  ControlPlaneSubnetCidr: # Override this via parameter_defaults
    default: '24'
    description: The subnet CIDR of the control plane network.
    type: string
  DnsServers: # Override this via parameter_defaults
    default: []
    description: A list of DNS servers (2 max for some implementations) that will be added to resolv.conf.
    type: json
  EC2MetadataIp: # Override this via parameter_defaults
    description: The IP address of the EC2 metadata server.
    type: string

resources:
  OsNetConfigImpl:
    type: OS::Heat::StructuredConfig
    properties:
      group: os-apply-config
      config:
        os_net_config:
          network_config:
            - type: ovs_bridge
              name: {get_input: bridge_name}
              use_dhcp: false
              dns_servers: {get_param: DnsServers}
              addresses:
                - ip_netmask:
                    list_join:
                      - '/'
                      - - {get_param: ControlPlaneIp}
                        - {get_param: ControlPlaneSubnetCidr}
              routes:
                - ip_netmask: 169.254.169.254/32
                  next_hop: {get_param: EC2MetadataIp}
              members:
                - type: interface
                  name: nic1
                  # force the MAC address of the bridge to this interface
                  primary: true
                - type: vlan
                  vlan_id: {get_param: ExternalNetworkVlanID}
                  addresses:
                    - ip_netmask: {get_param: ExternalIpSubnet}
                  routes:
                    - ip_netmask: 0.0.0.0/0
                      next_hop: {get_param: ExternalInterfaceDefaultRoute}
                - type: vlan
                  vlan_id: {get_param: InternalApiNetworkVlanID}
                  addresses:
                    - ip_netmask: {get_param: InternalApiIpSubnet}
                - type: vlan
                  vlan_id: {get_param: StorageNetworkVlanID}
                  addresses:
                    - ip_netmask: {get_param: StorageIpSubnet}
                - type: vlan
                  vlan_id: {get_param: StorageMgmtNetworkVlanID}
                  addresses:
                    - ip_netmask: {get_param: StorageMgmtIpSubnet}
                - type: vlan
                  vlan_id: {get_param: TenantNetworkVlanID}
                  addresses:
                    - ip_netmask: {get_param: TenantIpSubnet}

outputs:
  OS::stack_id:
    description: The OsNetConfigImpl resource.
    value: {get_resource: OsNetConfigImpl}

Note: An attempt to check the OS on all nodes at this point looks as follows:

192.0.2.7
overcloud-cephstorage-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.8
overcloud-compute-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.10
overcloud-controller-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.9
ssh: connect to host 192.0.2.9 port 22: No route to host
192.0.2.11
overcloud-controller-2.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Before the controller became unreachable it looked as follows:

[stack@instack ~]$ for i in `nova list|awk '/ACTIVE/ {print $(NF-1)}' |awk -F"=" '{print $NF}'`; do echo $i; ssh -o StrictHostKeyChecking=no heat-admin@$i "hostname; cat /etc/redhat-release"; done
192.0.2.7
overcloud-cephstorage-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.8
overcloud-compute-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.10
overcloud-controller-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.9
overcloud-controller-1.localdomain
Red Hat Enterprise Linux Server release 7.3 (Maipo)
192.0.2.11
overcloud-controller-2.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Note that the OS on overcloud-controller-1 switched to RHEL 7.3.
Hi Sasha, I just had a look at the environment you sent via email. It looks like you are hitting BZ 1388543. I looked at controller-0 and see it has openvswitch-2.4.0-1.el7.x86_64. The unreachable node is controller-1, and from comment #0 it looks like it was the node being updated. I checked the tripleo-heat-templates in /usr/share and it looks like you don't have the ovs upgrade workaround we landed for BZ 1388543, so I expect the update tried to upgrade ovs and the node lost connectivity at that point. That BZ is in POST, so we may need to make some noise about getting that into a build.
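For anyone repeating the package check on the remaining nodes, here is a rough sketch that reuses the nova list parsing from comment #0. The function names active_node_ips and report_ovs_versions are made up for illustration, and it assumes the heat-admin user and the ctlplane=<ip> column format shown above.

```shell
#!/bin/bash
# Read `nova list` table output on stdin and print the IP of every ACTIVE node.
# Assumes the networks column looks like "ctlplane=192.0.2.10", as in comment #0.
active_node_ips() {
    awk '/ACTIVE/ { n = split($(NF-1), kv, "="); print kv[n] }'
}

# Hypothetical helper: print hostname and installed openvswitch package per node.
# Run on the undercloud with the stackrc credentials loaded.
report_ovs_versions() {
    local ip
    for ip in $(nova list | active_node_ips); do
        echo "$ip"
        ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 "heat-admin@$ip" \
            "hostname; rpm -q openvswitch" || echo "  unreachable or query failed"
    done
}
```

Calling report_ovs_versions before continuing past a controller breakpoint should show which nodes still carry the older openvswitch package; an unreachable node is reported rather than aborting the loop.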
Created attachment 1221784 [details] output of list_nodes_status for sanity check - no failed heat resources
Created attachment 1221786 [details] /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh
I attached the output from list_nodes_status (I checked for a failed resource or error message, but there is none). I also attached https://bugzilla.redhat.com/attachment.cgi?id=1221786, the yum_update.sh file, which I checked to see whether the fixes from BZ 1388543 for the manual ovs upgrade were applied (they aren't; it should look like https://github.com/openstack/tripleo-heat-templates/blob/fa260860ff8aff868ff62ca25465d9f6eb96a9ee/extraconfig/tasks/yum_update.sh#L65).
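A quick way to repeat this check on another undercloud, as a heuristic only: the sketch below assumes that a yum_update.sh carrying the BZ 1388543 workaround mentions openvswitch somewhere, while an unfixed copy does not. The function name mentions_ovs is hypothetical, and a match does not prove the workaround is complete, so compare against the linked upstream file when in doubt.

```shell
#!/bin/bash
# Heuristic: succeed if the given update script mentions openvswitch at all.
# Assumption (not verified here): the unfixed yum_update.sh has no such mention.
mentions_ovs() {
    grep -qi 'openvswitch' "$1"
}

# Example use on the undercloud:
#   f=/usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh
#   mentions_ovs "$f" && echo "ovs handling present" || echo "ovs handling missing"
```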
*** This bug has been marked as a duplicate of bug 1388543 ***