Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1426253

Summary: Failed Services during Upgrade from OSP 8 to OSP 9, Step: major-upgrade-pacemaker-converge.yaml
Product: Red Hat OpenStack
Reporter: Randy Perryman <randy_perryman>
Component: rhosp-director
Assignee: Angus Thomas <athomas>
Status: CLOSED DUPLICATE
QA Contact: Amit Ugol <augol>
Severity: high
Docs Contact:
Priority: unspecified
Version: 9.0 (Mitaka)
CC: arkady_kanevsky, aschultz, audra_cooper, cdevine, christopher_dearborn, dbecker, dcain, John_walsh, kurt_hey, mburns, michele, morazi, randy_perryman, rhel-osp-director-maint, smerrow, sreichar
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-24 17:59:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1305654    
Attachments:
  sosreport from one of the controllers (flags: none)
  sosreport from one of the controllers part b (flags: none)

Description Randy Perryman 2017-02-23 14:29:54 UTC
Description of problem:
After running the major-upgrade-pacemaker-converge.yaml step of the major upgrade, many services failed to start on the controllers. Prior to running the step they were all running cleanly.
Version-Release number of selected component (if applicable):
 Upgrade from OSP 8 to 9

How reproducible:

Seen this one time so far

Steps to Reproduce:
1. Install OSP 8
2. Complete Minor Update
3. Begin Major Upgrade to OSP 9
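For reference, the converge step named in the summary is normally invoked by re-running the overcloud deploy command with the converge environment file. A minimal sketch (the template path and any additional `-e` files are assumptions based on the general TripleO upgrade flow, not taken from this report; the command is printed rather than executed):

```shell
# Hedged sketch of the OSP 8 -> 9 converge step; adjust the -e list to
# match the environment files used for the original deployment.
THT=/usr/share/openstack-tripleo-heat-templates
CONVERGE_ENV=$THT/environments/major-upgrade-pacemaker-converge.yaml

# Build and print the converge command instead of running it:
CMD="openstack overcloud deploy --templates -e $CONVERGE_ENV"
echo "$CMD"
```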

Actual results:
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Wed Feb 22 23:33:24 2017          Last change: Wed Feb 22 22:22:56 2017 by hacluster via crmd on overcloud-controller-2

3 nodes and 130 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-192.168.190.183     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-192.168.120.184     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.168.170.20      (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 ip-192.168.120.185     (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-192.168.140.21      (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-192.168.140.20      (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 overcloud-controller-1-ipmi    (stonith:fence_ipmilan):        Started overcloud-controller-2
 overcloud-controller-0-ipmi    (stonith:fence_ipmilan):        Started overcloud-controller-1
 overcloud-controller-2-ipmi    (stonith:fence_ipmilan):        Started overcloud-controller-0
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-core-clone [openstack-core]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-api-clone [openstack-sahara-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd]
     openstack-gnocchi-statsd   (systemd:openstack-gnocchi-statsd):     FAILED overcloud-controller-0
     openstack-gnocchi-statsd   (systemd:openstack-gnocchi-statsd):     FAILED overcloud-controller-2
     openstack-gnocchi-statsd   (systemd:openstack-gnocchi-statsd):     FAILED overcloud-controller-1

Failed Actions:
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-0 'not running' (7): call=729, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:30:47 2017', queued=0ms, exec=0ms
* openstack-nova-scheduler_start_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=319, status=Timed Out, exitreason='none',
    last-rc-change='Wed Feb 22 22:10:20 2017', queued=0ms, exec=199987ms
* openstack-cinder-volume_monitor_60000 on overcloud-controller-0 'not running' (7): call=378, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 22:23:10 2017', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-0 'not running' (7): call=727, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:30:44 2017', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_monitor_60000 on overcloud-controller-0 'not running' (7): call=735, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:33:23 2017', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-0 'not running' (7): call=394, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 22:22:57 2017', queued=0ms, exec=90163ms
* openstack-nova-scheduler_start_0 on overcloud-controller-2 'OCF_TIMEOUT' (198): call=311, status=Timed Out, exitreason='none',
    last-rc-change='Wed Feb 22 22:10:20 2017', queued=0ms, exec=199987ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-2 'not running' (7): call=627, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:30:47 2017', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-2 'not running' (7): call=625, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:30:44 2017', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_monitor_60000 on overcloud-controller-2 'not running' (7): call=631, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:33:23 2017', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-2 'not running' (7): call=374, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 22:22:57 2017', queued=0ms, exec=90161ms
* openstack-nova-scheduler_start_0 on overcloud-controller-1 'OCF_TIMEOUT' (198): call=316, status=Timed Out, exitreason='none',
    last-rc-change='Wed Feb 22 22:10:20 2017', queued=0ms, exec=199988ms
* openstack-cinder-scheduler_monitor_60000 on overcloud-controller-1 'not running' (7): call=632, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:30:47 2017', queued=0ms, exec=0ms
* openstack-cinder-api_monitor_60000 on overcloud-controller-1 'not running' (7): call=630, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:30:44 2017', queued=0ms, exec=0ms
* openstack-gnocchi-statsd_monitor_60000 on overcloud-controller-1 'not running' (7): call=636, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 23:33:23 2017', queued=0ms, exec=0ms
* neutron-server_start_0 on overcloud-controller-1 'not running' (7): call=379, status=complete, exitreason='none',
    last-rc-change='Wed Feb 22 22:22:57 2017', queued=0ms, exec=92161ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@overcloud-controller-0 ~]#
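Once the underlying problem is fixed, the stopped clone sets can be read off the status output and retried with `pcs resource cleanup`, which clears the fail counts so Pacemaker attempts the starts again. A sketch that parses a saved excerpt of the status above (the /tmp path is illustrative; in practice pipe `pcs status` directly):

```shell
# Sample excerpt of the `pcs status` output shown above:
cat > /tmp/pcs-status.txt <<'EOF'
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
EOF

# Extract the inner resource name of every clone set reported Stopped:
STOPPED=$(awk '/Clone Set:/ {name=$4}
               /Stopped:/  {gsub(/[\[\]]/, "", name); print name}' /tmp/pcs-status.txt)
echo "$STOPPED"

# After fixing the root cause, retry each one, e.g.:
#   pcs resource cleanup openstack-nova-scheduler
#   pcs resource cleanup neutron-server
```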


Expected results:
No Stopped Services

Additional info:

Comment 1 Randy Perryman 2017-02-23 15:25:32 UTC
Created attachment 1256950 [details]
sosreport from one of the controllers

Comment 2 Randy Perryman 2017-02-23 15:28:43 UTC
Created attachment 1256951 [details]
sosreport from one of the controllers part b

Comment 3 Michele Baldessari 2017-02-23 20:24:29 UTC
So the issue with the failed services is that they cannot connect to Galera.
I am not yet sure why, for example, heat-engine claims 'Started' while the logs still show failures to connect to the DB:
2017-02-22 23:14:25.280 48930 WARNING oslo_db.sqlalchemy.engines [req-84063df0-eb92-491a-a63f-cae43f42a85d - - - - -] SQL connection failed. 5 attempts left.


This seems to me to be some manifestation of server.cnf again having bind-address set to localhost in the mysql config?
# grep -ir bind etc/my.cnf.d/ 
etc/my.cnf.d/galera.cnf:bind-address = overcloud-controller-0
etc/my.cnf.d/galera.cnf.rpmnew:# Override bind-address
etc/my.cnf.d/galera.cnf.rpmnew:# In some systems bind-address defaults to 127.0.0.1, and with mysqldump SST
etc/my.cnf.d/galera.cnf.rpmnew:bind-address=0.0.0.0
etc/my.cnf.d/server.cnf:bind-address = 127.0.0.1

I'll check with Sofer shortly.
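If that theory holds, the precedence can be demonstrated locally: MariaDB reads the `.cnf` files under `/etc/my.cnf.d/` in lexical order and, to my understanding, the last value read for an option wins, so `server.cnf` would silently override `galera.cnf`. A sketch reproducing that precedence with the values from the grep above (the /tmp paths are illustrative):

```shell
# Recreate the two conflicting config files from the grep output:
mkdir -p /tmp/my.cnf.d
printf 'bind-address = overcloud-controller-0\n' > /tmp/my.cnf.d/galera.cnf
printf 'bind-address = 127.0.0.1\n'              > /tmp/my.cnf.d/server.cnf

# Files are read in lexical order (galera.cnf, then server.cnf), so the
# last bind-address seen is the effective one:
EFFECTIVE=$(grep -h 'bind-address' /tmp/my.cnf.d/*.cnf | tail -n1 | awk -F'= *' '{print $2}')
echo "$EFFECTIVE"   # loopback wins, so services on other nodes cannot connect
```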

Comment 4 Audra Cooper 2017-02-24 17:49:46 UTC
(In reply to Michele Baldessari from comment #3)
> So the issue of the failed services is that they can't connect to galera.
> I am not yet sure as to why for example heat-engine claims 'Started' but in
> the logs it is still showing failure to connect to the DB:
> 2017-02-22 23:14:25.280 48930 WARNING oslo_db.sqlalchemy.engines
> [req-84063df0-eb92-491a-a63f-cae43f42a85d - - - - -] SQL connection failed.
> 5 attempts left.
> 
> 
> This seems to me some manifestation of server.conf being set with
> bind-address set to localhost again in the mysql config?
> # grep -ir bind etc/my.cnf.d/ 
> etc/my.cnf.d/galera.cnf:bind-address = overcloud-controller-0
> etc/my.cnf.d/galera.cnf.rpmnew:# Override bind-address
> etc/my.cnf.d/galera.cnf.rpmnew:# In some systems bind-address defaults to
> 127.0.0.1, and with mysqldump SST
> etc/my.cnf.d/galera.cnf.rpmnew:bind-address=0.0.0.0
> etc/my.cnf.d/server.cnf:bind-address = 127.0.0.1
> 
> I'll check with Sofer, shortly

We realized we didn't have all of the patches applied.  After applying them, the upgrade is now successful and all services are Started.

Comment 5 Michele Baldessari 2017-02-24 17:59:23 UTC
Thanks Audra, I'll close this as a duplicate of the other bug so we can also track the proper errata release there.

*** This bug has been marked as a duplicate of bug 1413686 ***