Bug 1315651 - httpd/keystone is restarted during scale up/down [NEEDINFO]
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Michele Baldessari
QA Contact: Omri Hochman
Keywords: ZStream
Depends On:
Blocks:
Reported: 2016-03-08 05:49 EST by Robin Cernin
Modified: 2018-03-06 15:43 EST
CC List: 18 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
The High Availability architecture in this release has been simplified, resulting in a less invasive process when services need to be restarted. During scaling operations, only the necessary services are restarted. Previously, a scaling operation required the entire cluster to be restarted.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-06 15:43:42 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mburns: needinfo? (rcernin)


Attachments: None
Description Robin Cernin 2016-03-08 05:49:12 EST
Hello,

Command 'pcs resource disable openstack-keystone' runs during scale up/down
 
https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_resource_restart.sh#L21

"""
    pcs resource disable httpd
    check_resource httpd stopped 300
>   pcs resource disable openstack-keystone
    check_resource openstack-keystone stopped 1800
"""

This disables openstack-keystone, so the whole cloud is down even during the scale up/down.
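
For illustration, here is a rough sketch of the kind of wait loop a check_resource helper implements (this is not the actual tripleo code, just the idea; the resource names and timeouts are taken from the quoted snippet):

"""
    # Illustrative sketch only: poll pcs until the resource reaches the
    # desired state or the timeout expires.
    check_resource() {
        local resource=$1   # e.g. httpd, openstack-keystone
        local state=$2      # "started" or "stopped"
        local timeout=$3    # seconds to wait
        local waited=0

        while [ "$waited" -lt "$timeout" ]; do
            if [ "$state" = "stopped" ]; then
                # done when no node reports the resource as Started
                pcs status resources | grep -q "$resource.*Started" || return 0
            else
                pcs status resources | grep -q "$resource.*Started" && return 0
            fi
            sleep 5
            waited=$((waited + 5))
        done
        echo "Timed out waiting for $resource to be $state" >&2
        return 1
    }

    # usage mirroring the quoted script:
    #   pcs resource disable openstack-keystone
    #   check_resource openstack-keystone stopped 1800
"""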

Thank you,
Robin Cernin
Comment 4 Mike Burns 2016-04-07 17:14:44 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 6 David Juran 2016-08-15 05:18:29 EDT
I've filed a related BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1316215, suggesting that we should do a rolling restart of the services, i.e. one node at a time.
Also, there is https://bugzilla.redhat.com/show_bug.cgi?id=1339559, which I think is very similar to this bug.
Comment 11 Omri Hochman 2016-11-17 14:55:25 EST
Failed_QA:
-----------
openstack-keystone was replaced by httpd. I will attempt to scale up the environment and verify that the httpd service is not restarted on the controllers during the scale operation.


Checking httpd Uptime on controllers before Scale: 
------------------------------------

[root@overcloud-controller-0 ~]# systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:50:12 UTC; 10h ago


Verifying status of service post scale 
---------------------------------------

[heat-admin@overcloud-controller-0 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 19:02:34 UTC; 44min ago


Results:
It seems that httpd on controller-0 was restarted during the scale operation.
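
For reference, one way to check for restarts on all controllers in a single pass (the node names match this environment; using the ActiveEnterTimestamp property instead of grepping the status output is just a suggestion):

"""
    # Illustrative only: record the httpd start time on each controller
    # before and after the scale operation; if the timestamp changes,
    # the service was restarted.
    for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
        echo -n "$node  "
        ssh heat-admin@$node sudo systemctl show httpd -p ActiveEnterTimestamp
    done
"""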
Comment 12 Omri Hochman 2016-11-17 14:56:45 EST
(In reply to Omri Hochman from comment #11)
> Failed_QA : 

The environment was upgraded from OSP 9 to the latest OSP 10.

Adding SOS reports.
Comment 14 Omri Hochman 2016-11-17 15:07:11 EST
Adding the full list here:

[heat-admin@overcloud-controller-0 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 19:02:34 UTC; 1h 1min ago

[heat-admin@overcloud-controller-1 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:49:02 UTC; 16h ago

[heat-admin@overcloud-controller-2 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:49:29 UTC; 16h ago


As you can see from the list above, the only httpd restart that happened during the scale operation took place on controller-0.

On controller-1 and controller-2 the httpd service was not restarted.
Comment 16 Omri Hochman 2016-11-17 15:27:20 EST
The full SOS report is available at the link below:

 http://rhos-release.virt.bos.redhat.com/log/bz
Comment 17 Michele Baldessari 2016-11-18 05:44:27 EST
1) The httpd resource was removed from pacemaker's control at around:
Nov 17 02:11:17 [46106] overcloud-controller-0.localdomain       crmd:     info: notify_deleted:        Notifying tengine on overcloud-controller-1 that httpd was deleted

As this is an OSP9->OSP10 upgrade, this is expected.

On controller-0 we see the following:
Nov 17 02:41:27 overcloud-controller-0.localdomain os-collect-config[6001]: Thu Nov 17 02:40:03 UTC 2016 cfe15a77-f455-4763-864c-c14a3f6f64bf tripleo-upgrade overcloud-controller-0 Going to check_resource_systemd for httpd to be started
Nov 17 03:31:45 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:31:47 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:31:47 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1418 messages from /system.slice/httpd.service
Nov 17 03:32:01 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:43:02 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:43:03 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:43:04 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:43:18 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:49:56 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:49:57 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:49:59 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:50:12 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.

Nov 17 15:22:49 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 15:22:51 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 15:22:53 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 15:23:06 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 18:54:48 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 18:54:50 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 18:54:51 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 18:55:05 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 19:02:18 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 19:02:20 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 19:02:21 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 19:02:34 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.

on controller-1:
Nov 17 02:41:21 overcloud-controller-1.localdomain os-collect-config[5340]: Thu Nov 17 02:39:56 UTC 2016 4d8265b2-c82e-4a9d-a49b-59818008b4e8 tripleo-upgrade overcloud-controller-1 Going to check_resource_systemd for httpd to be started
Nov 17 03:42:33 overcloud-controller-1.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:42:34 overcloud-controller-1.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:42:35 overcloud-controller-1.localdomain systemd-journal[614]: Suppressed 1418 messages from /system.slice/httpd.service
Nov 17 03:42:49 overcloud-controller-1.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:48:46 overcloud-controller-1.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:48:47 overcloud-controller-1.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:48:48 overcloud-controller-1.localdomain systemd-journal[614]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:49:02 overcloud-controller-1.localdomain systemd[1]: Started The Apache HTTP Server.

So we are interested in the restarts around 15:22, 18:54 and 19:02 on controller-0.

So it seems that it is puppet doing this:
Nov 17 15:21:01 overcloud-controller-0.localdomain os-collect-config[6001]: [2016-11-17 15:21:01,291] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/6a6daad4-0b40-46c5-947f-8d2bc2ecef36.json
Nov 17 15:37:54 overcloud-controller-0.localdomain os-collect-config[6001]: [2016-11-17 15:37:54,869] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet

The reason seems to be that on controller-0 the ports changed:
/Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf]/content: content changed '{
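
A quick way to confirm that correlation on the node is to grep the journal for both events around the same time window; the time range and patterns below are assumptions based on the log excerpts in this comment:

"""
    # Illustrative only: show puppet's ports.conf content change next to the
    # httpd stop/start events to confirm puppet triggered the restart.
    sudo journalctl --since "2016-11-17 15:00" --until "2016-11-17 15:40" \
        | grep -E "ports\.conf.*content changed|(Stopping|Starting|Started) The Apache HTTP Server"
"""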

We see the puppet running on controller-1 roughly at the same time:
Nov 17 15:20:58 overcloud-controller-1.localdomain os-collect-config[5340]: [2016-11-17 15:20:58,400] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/44a16dfd-4b2a-4024-abac-2f51c070550a.json
Nov 17 15:21:31 overcloud-controller-1.localdomain os-collect-config[5340]: [2016-11-17 15:21:31,530] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet

Also, the scale-up is failing for unrelated reasons:
\"/usr/lib/python2.7/site-packages/stevedore/named.py\", line 77, in __init__\u001b[0m\n\u001b[mNotice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 15:22:19.656 
392783 ERROR gnocchi     self._init_plugins(extensions)\u001b[0m\n\u001b[mNotice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 15:22:19.656 392783 ERROR gnocchi  
 File \"/usr/lib/python2.7/site-packages/stevedore/driver.py\", line 98, in _init_plugins\u001b[0m\n\u001b[mNotice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 1
5:22:19.656 392783 ERROR gnocchi                                                                                                                                                             N
ov 17 15:37:54 overcloud-controller-0.localdomain os-collect-config[6001]: (self.namespace, name))\u001b[0m\n\u001b[mNotice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 201
6-11-17 15:22:19.656 392783 ERROR gnocchi NoMatches: No 'gnocchi.indexer' driver found, looking for 'sqlite'\u001b[0m\n\u001b[mNotice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/re
turns: 2016-11-17 15:22:19.656 392783 ERROR gnocchi \u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: Option \"verbose\" from group \"DEFAULT\" is
 deprecated for removal.  Its value may be silently ignored in the future.\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: /usr/lib/python2.7/sit
e-packages/oslo_db/sqlalchemy/enginefacade.py:1171: OsloDBDeprecationWarning: EngineFacade is deprecated; please use oslo_db.sqlalchemy.enginefacade\u001b[0m\n\u001b[mNotice: /Stage[main]/Gl
ance::Db::Sync/Exec[glance-manage db_sync]/returns:   expire_on_commit=expire_on_commit, _conf=conf)\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/retur
ns: 2016-11-17 15:22:23.308 392863 CRITICAL glance [-] OperationalError: (sqlite3.OperationalError) unable to open database file\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[
glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance Traceback (most recent call last):\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_s
ync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File \"/usr/bin/glance-manage\", line 10, in <module>\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage d
b_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     sys.exit(main())\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:2
2:23.308 392863 ERROR glance   File \"/usr/lib/python2.7/site-packages/glance/cmd/manage.py\", line 330, in main\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db
_sync]/returns: 2016-11-17 15                                                                                                                                                                N
ov 17 15:37:54 overcloud-controller-0.localdomain os-collect-config[6001]: :22:23.308 392863 ERROR glance     return CONF.command.action_fn()\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::D
b::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File \"/usr/lib/python2.7/site-packages/glance/cmd/manage.py\", line 190, in sync\u001b[0m\n\u001b[
mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     CONF.command.current_version)\u001b[0m\n\u001b[mNotice: /Stage[mai
n]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File \"/usr/lib/python2.7/site-packages/glance/cmd/manage.py\", line 106, in sync\u001b
[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     migration.db_sync(db_api.get_engine(),\u001b[0m\n\u001
b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File \"/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/api.p
y\", line 98, in get_engine\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     facade = _create_faca
de_lazily()\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File \"/usr/lib/python2.7/site-packages
/glance/db/sqlalchemy/api.py\", line 88, in _create_facade_lazily\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 
ERROR glance     _FACADE = session.EngineFacade.from_config(CONF)\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 
ERROR glance   File \"/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py\", line 1171, in from_config\u001b[0m\n\u001b[mNotice: /Stage[main]/Glance::Db::Sync/Exec[glance-ma


So to sum up:
- I think this specific bug was more about the whole control plane (keystone/httpd) being restarted during a scaleout operation.
- Now during a scaleout we have short-lived restarts of some services (httpd), but they are much less intrusive because 1) they take less time and 2) they happen at different times on each controller, meaning that haproxy will simply reroute requests to a controller that still has the service running.
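
For anyone who wants to confirm the rerouting, the haproxy stats socket shows which backend members are up while one controller's httpd restarts. This is only a sketch: the socket path and the backend names below are assumptions, check haproxy.cfg for the actual values.

"""
    # Illustrative only: list proxy name, server name and status (UP/DOWN)
    # for the keystone/horizon backends. Run it repeatedly while a
    # controller's httpd is restarting to watch members drop and return.
    echo "show stat" | sudo socat stdio /var/lib/haproxy/stats \
        | cut -d, -f1,2,18 | grep -E "keystone|horizon"
"""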

The only thing that leaves me wondering is why the httpd restart happens during scaleout on controller-0 and not on the other ones. That might be worth investigating in a separate bug.

Robin, is my summary in line with your expectations here?
Comment 18 Michele Baldessari 2016-11-18 05:45:19 EST
Omri, also it might be worth investigating the scaleup failures in a separate bug. I kind of assumed this was already the case, but probably worth noting anyway ;)
