Bug 1396313 - rhel-osp-director: 9.0 minor update fails. During the controllers update, a controller becomes unreachable and the update fails.
Keywords:
Status: CLOSED DUPLICATE of bug 1388543
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 9.0 (Mitaka)
Assignee: Marios Andreou
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-11-18 00:23 UTC by Alexander Chuzhoy
Modified: 2016-12-29 16:59 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-18 15:46:23 UTC
Target Upstream Version:


Attachments
output of list_nodes_status for sanity check - no failed heat resources (1.41 KB, text/plain)
2016-11-18 09:30 UTC, Marios Andreou
/usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh (2.66 KB, text/plain)
2016-11-18 09:31 UTC, Marios Andreou

Description Alexander Chuzhoy 2016-11-18 00:23:54 UTC
rhel-osp-director: 9.0 minor update fails. During the controllers update, a controller becomes unreachable and the update fails.


Environment:
instack-undercloud-4.0.0-15.el7ost.noarch
openstack-puppet-modules-8.1.8-3.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-36.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-36.el7ost.noarch
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
openstack-neutron-openvswitch-8.1.2-5.el7ost.noarch



Steps to reproduce:

Try to run a minor update of the 9.0 overcloud with:

[stack@instack ~]$ openstack overcloud update stack  -i overcloud --templates -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml
starting package update on stack overcloud
IN_PROGRESS
WAITING
not_started: [u'overcloud-controller-0', u'overcloud-controller-2', u'overcloud-controller-1', u'overcloud-compute-0']
on_breakpoint: [u'overcloud-cephstorage-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear b36b7201-c1f6-4916-942e-72f125b2daf3), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'overcloud-cephstorage-0']
on_breakpoint: [u'overcloud-controller-0', u'overcloud-controller-2', u'overcloud-controller-1', u'overcloud-compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 771d9d4b-9fd3-4fdc-9799-b50089ad9c12), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'overcloud-cephstorage-0', u'overcloud-compute-0']
on_breakpoint: [u'overcloud-controller-0', u'overcloud-controller-2', u'overcloud-controller-1']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 86dba1fe-8ad1-481b-abe0-166599c75554), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS


At this point one of the controllers had already become unreachable.
After a long time and many more IN_PROGRESS lines, the update fails.



[stack@instack ~]$ cat network-environment.yaml
resource_registry:
  OS::TripleO::BlockStorage::Net::SoftwareConfig: /home/stack/nic-configs/cinder-storage.yaml
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/nic-configs/compute.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/nic-configs/controller.yaml
  OS::TripleO::ObjectStorage::Net::SoftwareConfig: /home/stack/nic-configs/swift-storage.yaml
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/nic-configs/ceph-storage.yaml


This is the network-environment.yaml file:

parameters:
  # Customize all these values to match the local environment
  NeutronExternalNetworkBridge: "''"
parameter_defaults:
  InternalApiNetCidr: 192.168.100.0/24
  StorageNetCidr: 192.168.110.0/24
  StorageMgmtNetCidr: 192.168.120.0/24
  TenantNetCidr: 192.168.150.0/24
  ExternalNetCidr: 192.168.200.0/24
  InternalApiAllocationPools: [{'start': '192.168.100.10', 'end': '192.168.100.200'}]
  StorageAllocationPools: [{'start': '192.168.110.10', 'end': '192.168.110.200'}]
  StorageMgmtAllocationPools: [{'start': '192.168.120.10', 'end': '192.168.120.200'}]
  TenantAllocationPools: [{'start': '192.168.150.10', 'end': '192.168.150.200'}]
  # Use an External allocation pool which will leave room for floating IPs
  ExternalAllocationPools: [{'start': '192.168.200.180', 'end': '192.168.200.200'}]
  # Set to the router gateway on the external network
  ExternalInterfaceDefaultRoute: 192.168.200.1
  DnsServers: ["10.16.36.29"]
  ControlPlaneSubnetCidr: "24"
  ControlPlaneDefaultRoute: 192.0.2.1
  EC2MetadataIp: 192.0.2.1



[stack@instack ~]$ cat /home/stack/nic-configs/controller.yaml
heat_template_version: 2015-04-30                             

description: >
  Software Config to drive os-net-config to configure VLANs for the
  controller role.                                                 

parameters:
  ManagementIpSubnet: # Only populated when including environments/network-management.yaml
    default: ''                                                                           
    description: IP address/subnet on the management network                              
    type: string                                                                          
  ControlPlaneIp:                                                                         
    default: ''                                                                           
    description: IP address/subnet on the ctlplane network                                
    type: string                                                                          
  ExternalIpSubnet:                                                                       
    default: ''                                                                           
    description: IP address/subnet on the external network                                
    type: string                                                                          
  InternalApiIpSubnet:                                                                    
    default: ''                                                                           
    description: IP address/subnet on the internal API network                            
    type: string                                                                          
  StorageIpSubnet:                                                                        
    default: ''                                                                           
    description: IP address/subnet on the storage network                                 
    type: string                                                                          
  StorageMgmtIpSubnet:                                                                    
    default: ''                                                                           
    description: IP address/subnet on the storage mgmt network                            
    type: string                                                                          
  TenantIpSubnet:                                                                         
    default: ''                                                                           
    description: IP address/subnet on the tenant network                                  
    type: string                                                                          
  ExternalNetworkVlanID:                                                                  
    default: 10                                                                           
    description: Vlan ID for the external network traffic.                                
    type: number                                                                          
  InternalApiNetworkVlanID:                                                               
    default: 20                                                                           
    description: Vlan ID for the internal_api network traffic.                            
    type: number                                                                          
  StorageNetworkVlanID:                                                                   
    default: 30                                                                           
    description: Vlan ID for the storage network traffic.                                 
    type: number                                                                          
  StorageMgmtNetworkVlanID:                                                               
    default: 40                                                                           
    description: Vlan ID for the storage mgmt network traffic.                            
    type: number                                                                          
  TenantNetworkVlanID:                                                                    
    default: 50                                                                           
    description: Vlan ID for the tenant network traffic.                                  
    type: number                                                                          
  ExternalInterfaceDefaultRoute:                                                          
    default: '10.0.0.1'                                                                   
    description: default route for the external network                                   
    type: string                                                                          
  ControlPlaneSubnetCidr: # Override this via parameter_defaults                          
    default: '24'                                                                         
    description: The subnet CIDR of the control plane network.                            
    type: string                                                                          
  DnsServers: # Override this via parameter_defaults                                      
    default: []                                                                           
    description: A list of DNS servers (2 max for some implementations) that will be added to resolv.conf.
    type: json                                                                                            
  EC2MetadataIp: # Override this via parameter_defaults                                                   
    description: The IP address of the EC2 metadata server.                                               
    type: string                                                                                          

resources:
  OsNetConfigImpl:
    type: OS::Heat::StructuredConfig
    properties:                     
      group: os-apply-config        
      config:                       
        os_net_config:              
          network_config:           
            -                       
              type: ovs_bridge      
              name: {get_input: bridge_name}
              use_dhcp: false               
              dns_servers: {get_param: DnsServers}
              addresses:                          
                -                                 
                  ip_netmask:                     
                    list_join:                    
                      - '/'                       
                      - - {get_param: ControlPlaneIp}
                        - {get_param: ControlPlaneSubnetCidr}
              routes:                                        
                -                                            
                  ip_netmask: 169.254.169.254/32             
                  next_hop: {get_param: EC2MetadataIp}       
              members:                                       
                -
                  type: interface
                  name: nic1
                  # force the MAC address of the bridge to this interface
                  primary: true
                -
                  type: vlan
                  vlan_id: {get_param: ExternalNetworkVlanID}
                  addresses:
                  -
                    ip_netmask: {get_param: ExternalIpSubnet}
                  routes:
                    -
                      ip_netmask: 0.0.0.0/0
                      next_hop: {get_param: ExternalInterfaceDefaultRoute}
                -
                  type: vlan
                  vlan_id: {get_param: InternalApiNetworkVlanID}
                  addresses:
                  -
                    ip_netmask: {get_param: InternalApiIpSubnet}
                -
                  type: vlan
                  vlan_id: {get_param: StorageNetworkVlanID}
                  addresses:
                  -
                    ip_netmask: {get_param: StorageIpSubnet}
                -
                  type: vlan
                  vlan_id: {get_param: StorageMgmtNetworkVlanID}
                  addresses:
                  -
                    ip_netmask: {get_param: StorageMgmtIpSubnet}
                -
                  type: vlan
                  vlan_id: {get_param: TenantNetworkVlanID}
                  addresses:
                  -
                    ip_netmask: {get_param: TenantIpSubnet}

outputs:
  OS::stack_id:
    description: The OsNetConfigImpl resource.
    value: {get_resource: OsNetConfigImpl}




Note:
Checking the OS release on all nodes at this point gives the following:
192.0.2.7
overcloud-cephstorage-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.8
overcloud-compute-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.10
overcloud-controller-0.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.9
ssh: connect to host 192.0.2.9 port 22: No route to host
192.0.2.11
overcloud-controller-2.localdomain
Red Hat Enterprise Linux Server release 7.2 (Maipo)
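
The unreachable node can be picked out of a check like the one above mechanically. The following is a minimal sketch, assuming the check output has been captured as text and that a failed connection shows up as the `ssh: connect to host ... port 22: ...` line seen above. The function name `unreachable_nodes` is hypothetical and not part of any TripleO tooling.

```shell
#!/bin/bash
# Hypothetical helper (not part of TripleO): read the per-node check output
# on stdin and print the address of every node whose ssh connection failed.
unreachable_nodes() {
    # Matches lines such as:
    #   ssh: connect to host 192.0.2.9 port 22: No route to host
    # and prints the fifth whitespace-separated field, which is the address.
    awk '/^ssh: connect to host / { print $5 }'
}
```

It could be used by piping the saved output of the check into it, for example `unreachable_nodes < check-output.txt`, where the file name is only an illustration.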


Before the controller became unreachable, the same check gave the following:
[stack@instack ~]$ for i in `nova list|awk '/ACTIVE/ {print $(NF-1)}' |awk -F"=" '{print $NF}'`; do echo $i; ssh -o StrictHostKeyChecking=no heat-admin@$i "hostname; cat /etc/redhat-release"; done
192.0.2.7                           
overcloud-cephstorage-0.localdomain 
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.8                                                                   
overcloud-compute-0.localdomain                                             
Red Hat Enterprise Linux Server release 7.2 (Maipo)                     
192.0.2.10                                                
overcloud-controller-0.localdomain                        
Red Hat Enterprise Linux Server release 7.2 (Maipo)
192.0.2.9                            
overcloud-controller-1.localdomain           
Red Hat Enterprise Linux Server release 7.3 (Maipo)
192.0.2.11     
overcloud-controller-2.localdomain       
Red Hat Enterprise Linux Server release 7.2 (Maipo)


Note that the OS on overcloud-controller-1 (192.0.2.9) had already switched to RHEL 7.3.
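
Spotting the node that moved to a different release can also be done mechanically. The following is a rough sketch, assuming the check output keeps the address / hostname / release layout shown above and that hostnames end in `localdomain` as they do in this environment. The function name `nodes_off_release` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "address / hostname / release" groups on stdin
# and print "<hostname> <release>" for every node not on the expected release.
# $1 is the expected release, for example 7.2.
nodes_off_release() {
    awk -v want="$1" '
        { sub(/[ \t]+$/, "") }                # drop trailing blanks from pasted output
        /localdomain$/ { host = $0; next }    # remember the most recent hostname line
        /^Red Hat Enterprise Linux/ {
            # The release number is the field right after the word "release".
            for (i = 1; i < NF; i++)
                if ($i == "release") rel = $(i + 1)
            # Compare as strings so that 7.10 and 7.1 are not treated as equal.
            if ((rel "") != (want "")) print host, rel
        }
    '
}
```

Run against the output above with `7.2` as the expected release, this should report only overcloud-controller-1, the node that was being updated.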

Comment 1 Marios Andreou 2016-11-18 09:26:41 UTC
Hi Sasha, I just had a look at the environment you sent via email. It looks like you are hitting BZ 1388543. I looked at controller-0 and see it has openvswitch-2.4.0-1.el7.x86_64. The node that is unreachable is controller-1, and from comment #0 it looks like it was the node being updated.

I checked the tripleo-heat-templates in /usr/share and it looks like you don't have the ovs upgrade workaround we landed for BZ 1388543, so I expect the update may have tried to upgrade ovs and the node lost connectivity at that point. That BZ is in POST, so we may need to make some noise about getting the fix into a build.
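
To see which nodes still carry the older openvswitch, the package strings can be compared directly. The following is a minimal sketch, assuming package strings in the `rpm -q openvswitch` style quoted in this report and a GNU `sort` that supports `-V`. The function names `ovs_version` and `version_older` are hypothetical, and treating the 2.5.0 build listed in the environment as the update target is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: extract the version from an "rpm -q openvswitch"
# style string, e.g. openvswitch-2.4.0-1.el7.x86_64 -> 2.4.0
ovs_version() {
    local rest="${1#openvswitch-}"   # drop the package name prefix
    echo "${rest%%-*}"               # keep everything before the release part
}

# Succeeds when version $1 is strictly older than version $2.
# Relies on GNU sort -V (version sort) to order the two strings.
version_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `version_older "$(ovs_version openvswitch-2.4.0-1.el7.x86_64)" 2.5.0` succeeds, which would mark controller-0 as a node where the openvswitch package has not been updated yet.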

Comment 2 Marios Andreou 2016-11-18 09:30:57 UTC
Created attachment 1221784
output of list_nodes_status for sanity check - no failed heat resources

Comment 3 Marios Andreou 2016-11-18 09:31:52 UTC
Created attachment 1221786
/usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh

Comment 4 Marios Andreou 2016-11-18 09:34:30 UTC
Attached the output from list_nodes_status. I checked it for a failed resource or error message, but there is none.

I also attached https://bugzilla.redhat.com/attachment.cgi?id=1221786, the yum_update.sh file, which I checked to see whether the fixes from BZ 1388543 for the manual ovs upgrade were applied. They aren't; it should look like https://github.com/openstack/tripleo-heat-templates/blob/fa260860ff8aff868ff62ca25465d9f6eb96a9ee/extraconfig/tasks/yum_update.sh#L65
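
A quick first check of an installed yum_update.sh could look like the sketch below. It assumes, without confirmation from this report, that a script carrying the manual ovs upgrade mentions `openvswitch` somewhere and that a script without it does not. The function name `has_ovs_workaround` is hypothetical, and the result is only a hint; comparing the file against the upstream revision linked above remains the authoritative check.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given script mentions openvswitch at all.
# The marker string is an assumption, not taken from the actual fix.
has_ovs_workaround() {
    grep -q -i 'openvswitch' "$1"
}

# Example use against the path named in this report:
#   has_ovs_workaround /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/yum_update.sh \
#       && echo "ovs handling present" || echo "ovs handling missing"
```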

Comment 5 Omri Hochman 2016-11-18 15:46:43 UTC

*** This bug has been marked as a duplicate of bug 1388543 ***

