Bug 1351784
| Summary: | osp-director-9: Upgrade from OSP8 to OSP9 causes services on the controllers to fail, which requires a manual resource cleanup in pacemaker and a cluster restart. | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Omri Hochman <ohochman> |
| Component: | rhosp-director | Assignee: | Marios Andreou <mandreou> |
| Status: | CLOSED DUPLICATE | QA Contact: | Omri Hochman <ohochman> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 9.0 (Mitaka) | CC: | dbecker, jason.dobies, jcoufal, jeckersb, jjoyce, jstransk, mandreou, mburns, mcornea, mkrcmari, morazi, ohochman, rhel-osp-director-maint, sasha, sclewis, tvignaud |
| Target Milestone: | ga | Keywords: | Triaged |
| Target Release: | 9.0 (Mitaka) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-07-29 15:32:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1343905, 1353031, 1359760 | | |
| Bug Blocks: | | | |
Description
Omri Hochman
2016-06-30 20:27:35 UTC
Although there's a workaround, I'm setting the blocker flag for PM due to the poor user experience in this case.

To get the PCS status from the undercloud machine, run:

```
# make sure there are no stopped or unmanaged services
ssh heat-admin@overcloud-controller-0 sudo pcs status | grep -i stopped -B2
ssh heat-admin@overcloud-controller-0 sudo pcs status | grep -i unmanaged -B2
```

Steps were according to: http://etherpad.corp.redhat.com/ospd9-upgrade

Reproduced: after the step with "-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml" I checked the pcs resources on a controller. The following services were down:

- rabbitmq-server.service
- openstack-heat-engine

To fix: `pcs resource cleanup`

After the otherwise successful upgrade from OSP8 to OSP9, the heat-engine service was down on the controllers. From heat-engine.log:

```
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     dbapi_connection.rollback()
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 724, in rollback
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     self._read_ok_packet()
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 698, in _read_ok_packet
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     pkt = self._read_packet()
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     packet_header = self._read_bytes(4)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 912, in _read_bytes
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     data = self._rfile.read(num_bytes)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/_socketio.py", line 59, in readinto
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     return self._sock.recv_into(b)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 346, in recv_into
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     timeout_exc=socket.timeout("timed out"))
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 201, in _trampoline
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     mark_as_closed=self._mark_as_closed)
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 144, in trampoline
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service     assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service DBAPIError: (exceptions.AssertionError) do not call blocking functions from the mainloop
2016-06-30 19:20:04.026 5697 ERROR oslo_service.service
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters [req-33c633c4-d079-41a6-bf84-af8a23fcfbac - -] DB exception wrapped.
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     context)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     result.read()
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1138, in read
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     first_packet = self.connection._read_packet()
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     packet_header = self._read_bytes(4)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 912, in _read_bytes
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     data = self._rfile.read(num_bytes)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/pymysql/_socketio.py", line 59, in readinto
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     return self._sock.recv_into(b)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 346, in recv_into
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     timeout_exc=socket.timeout("timed out"))
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 201, in _trampoline
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     mark_as_closed=self._mark_as_closed)
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 144, in trampoline
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters     assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters AssertionError: do not call blocking functions from the mainloop
2016-06-30 19:20:03.989 5696 ERROR oslo_db.sqlalchemy.exc_filters
```

PCS Status (Post-Upgrade):

```
[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Fri Jul 1 13:30:17 2016
Last change: Thu Jun 30 21:37:13 2016 by root via crm_resource on overcloud-controller-0
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 127 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-10.19.184.210       (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-192.168.200.10      (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.168.0.6         (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 ip-10.19.104.11        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-10.19.105.10        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-10.19.104.10        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-core-clone [openstack-core]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-api-clone [openstack-sahara-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-0 'not running' (7): call=1260, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:36:50 2016', queued=0ms, exec=2152ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-0 'not running' (7): call=1111, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:29 2016', queued=0ms, exec=2209ms
* openstack-gnocchi-statsd_start_0 on overcloud-controller-0 'not running' (7): call=1106, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:22 2016', queued=0ms, exec=2455ms
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=1254, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:36:50 2016', queued=0ms, exec=2078ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-1 'not running' (7): call=1102, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:11 2016', queued=0ms, exec=2154ms
* openstack-heat-engine_start_0 on overcloud-controller-2 'not running' (7): call=1240, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:36:50 2016', queued=0ms, exec=2201ms
* openstack-gnocchi-metricd_start_0 on overcloud-controller-2 'not running' (7): call=1095, status=complete, exitreason='none',
    last-rc-change='Thu Jun 30 21:31:11 2016', queued=0ms, exec=2294ms

PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
  overcloud-controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
```

Marios - can you double check that this is covered in Jirka's manual workaround write-up?

o/ jdob... so no, it wasn't, and I've added a note at the controller upgrade step. I initially thought this was the same as bug 1351204, but that is for the aodh migration (there is already a note on the etherpad regarding cluster cleanup for that, though). WRT the deployment: @omri or @sasha, can you verify the heat stack went to UPDATE_COMPLETE? (i.e. the error here is confined to the stopped services, and the heat stack is OK, so the step really did execute completely, is what I'm after.)

(In reply to marios from comment #7)
> WRT the deployment @omri or @sasha can you verify the heat stack went to
> UPDATE_COMPLETE ? (i.e. the error here is confined to the stopped services,
> and the heat stack is OK? so the step really did execute completely, is what
> I'm after)

Marios, that's right - we managed to get the stack into UPDATE_COMPLETE. As you mentioned, we do have notes for it in the etherpad.
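The stopped/unmanaged checks quoted in the description can be wrapped in a small helper run from the undercloud after each upgrade step. This is an illustrative sketch, not part of the original workaround: `check_pcs_clean` is a hypothetical function name, and it assumes `pcs status` output is piped in on stdin (for example via the ssh commands shown above).

```shell
#!/bin/bash
# Read `pcs status` output on stdin and fail if any resource is reported
# as Stopped or unmanaged, echoing the offending lines with two lines of
# leading context (mirrors the grep commands from the bug description).
check_pcs_clean() {
    local bad
    bad=$(grep -i -E 'stopped|unmanaged' -B2)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}
```

Usage would then look like `ssh heat-admin@overcloud-controller-0 sudo pcs status | check_pcs_clean || echo 'cluster needs cleanup'` (hostnames as in this deployment).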
This bug is exactly to track this issue: the requirement to start services and clean up failed resources after each of the upgrade steps causes poor user experience, which we would like to avoid if possible. By failing to clean the resources after each step, the upgrade process will end with UPDATE_FAILED.

heat stack-list reports:

```
| 87838b75-5687-4dce-9a5a-d4750e9e7b48 | overcloud | UPDATE_COMPLETE | 2016-06-28T18:27:58 | 2016-06-29T16:00:39 |
```

On the console I see:

```
2016-06-29 16:28:59 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-06-29 16:29:00 [0]: SIGNAL_COMPLETE Unknown
2016-06-29 16:29:01 [NetworkDeployment]: SIGNAL_COMPLETE Unknown
2016-06-29 16:29:01 [0]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.19.184.180:5000/v2.0
Overcloud Deployed
```

Yet, when I check the pcs resources:

```
[root@overcloud-controller-0 ~]# pcs status | grep -B2 -i stop
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
```

Running 'pcs resource cleanup' doesn't help - unable to start rabbitmq on all controllers.

I believe the heat-engine failure could be related to this bug (it's worth checking): https://bugzilla.redhat.com/show_bug.cgi?id=1357229

WRT rabbitmq: pacemaker does not use the systemd unit file to manage the rabbitmq service; it uses a pacemaker resource agent. So the better way to recover a failed rabbitmq resource is "pcs resource cleanup rabbitmq"; the standard systemd unit file does not support HA mode or recreation of the rabbitmq cluster.

Hi Omri, Sasha, from the original description above and the comments here, the common theme is: for the controllers upgrade step (... -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml), heat says UPDATE_COMPLETE but you see services down. Of particular interest to me is:

```
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
```

which is rabbit not coming back on nodes 1/2 but fine on 0. That sounds to me like it could be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1343905 (whatever the fix ultimately is there). I think the other services failing is likely a consequence of the pcs constraints (since rabbit is down...). What do you gents think? Duplicate of 1343905 and track there? We can reopen if necessary for any specific service that isn't starting (if any) once rabbit is working. Thanks.

Should we mark this one as a duplicate or make it TestOnly?
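As the comment above notes, a failed resource is best recovered per name (e.g. `pcs resource cleanup rabbitmq`). A small parser can pull the names of stopped clone/master sets out of `pcs status` output so each one can be handed to `pcs resource cleanup`. This is an illustrative sketch, not from the bug itself: `stopped_resources` is a hypothetical helper name, and it assumes the default `pcs status` formatting shown in the outputs above.

```shell
#!/bin/bash
# Print the resource name of every Clone Set / Master-Slave Set that has
# a "Stopped:" line in `pcs status` output (read from stdin).
stopped_resources() {
    awk '
        /(Clone|Master\/Slave) Set:/ {   # remember the bracketed resource name
            res = $NF
            gsub(/[\[\]]/, "", res)
        }
        /Stopped:/ && res != "" {        # this set has stopped instances
            print res
            res = ""
        }
    '
}
```

One could then run something like `ssh heat-admin@overcloud-controller-0 sudo pcs status | stopped_resources` and feed each printed name to `sudo pcs resource cleanup <name>` on a controller (a sketch, not a verified recovery procedure; as noted above, cleanup alone did not recover rabbitmq in this reproduction).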
The fixes I had to pull in to get a green upgrade run are already associated with more specific bugzillas: the RabbitMQ rejoin [1], the python-cradox install [2], and the gnocchi pacemaker constraint [3].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1343905
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1359760
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1353031

I'm going to close it as a dupe for now. If there is additional new information that suggests this is a distinct issue, feel free to provide additional reproducer information and reopen.

*** This bug has been marked as a duplicate of bug 1359760 ***