Red Hat Bugzilla – Bug 1315651
httpd/keystone is restarted during scale up/down
Last modified: 2018-03-06 15:43:42 EST
Hello,

The command 'pcs resource disable openstack-keystone' runs during scale up/down:
https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_resource_restart.sh#L21

"""
pcs resource disable httpd
check_resource httpd stopped 300
pcs resource disable openstack-keystone
check_resource openstack-keystone stopped 1800
"""

This disables openstack-keystone, so the whole cloud is down even during a scale up/down.

Thank you,
Robin Cernin
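For reference, the check_resource helper called in the snippet above polls the cluster until the resource reaches the expected state or a timeout expires. Below is a hypothetical, simplified sketch of that pattern; the real helper lives in pacemaker_resource_restart.sh and inspects 'pcs status', while resource_state here is a stub so the sketch is self-contained:

```shell
# Hypothetical simplification of check_resource: poll a resource's
# state until it matches the desired one or the timeout (seconds)
# expires.
resource_state() {
    # On a real controller this would be something like:
    #   pcs status resources | grep "$1"
    echo "stopped"
}

check_resource() {
    local resource=$1 state=$2 timeout=$3 waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if resource_state "$resource" | grep -q "$state"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "Timed out waiting for $resource to reach $state" >&2
    return 1
}

pcs_disable_and_wait() {
    # Mirrors the disable + wait sequence from the template script.
    echo "pcs resource disable $1 (illustrative, not executed)"
    check_resource "$1" stopped "$2"
}

pcs_disable_and_wait openstack-keystone 5 && echo "openstack-keystone stopped"
```

The point of the bug is that this stop is cluster-wide: every controller's keystone goes down at once, rather than one node at a time.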
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
I've filed a related BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1316215, suggesting that we should do a rolling restart of the services, i.e. one node at a time. There is also https://bugzilla.redhat.com/show_bug.cgi?id=1339559, which I think is very similar to this bug.
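The rolling restart suggested in bug 1316215 would cycle the service one controller at a time, so haproxy always has at least one live backend. A minimal sketch, assuming undercloud SSH access with the heat-admin user; restart_and_wait is a stand-in for the real ssh/systemctl calls and the hostnames are illustrative:

```shell
# Sketch of a rolling restart: restart httpd on one controller at a
# time and confirm it is back before moving to the next node.
restart_and_wait() {
    # Real version, run from the undercloud:
    #   ssh heat-admin@"$1" "sudo systemctl restart httpd && \
    #       systemctl is-active --quiet httpd"
    echo "restarted httpd on $1"
}

for host in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
    restart_and_wait "$host" || break
done
```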
Failed_QA:
-----------
openstack-keystone was replaced by httpd. I will attempt to scale up the environment and verify that the httpd service is not restarted on the controllers during the scale operation.

Checking httpd uptime on controllers before scale:
------------------------------------
[root@overcloud-controller-0 ~]# systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:50:12 UTC; 10h ago

Verifying status of service post scale:
---------------------------------------
[heat-admin@overcloud-controller-0 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 19:02:34 UTC; 44min ago

Results: it seems that httpd on controller-0 was restarted during the scale operation.
(In reply to Omri Hochman from comment #11)
> Failed_QA :

The environment was upgraded from OSP9 to the latest OSP10. Adding SOS reports.
Adding the full list here:

[heat-admin@overcloud-controller-0 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 19:02:34 UTC; 1h 1min ago

[heat-admin@overcloud-controller-1 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:49:02 UTC; 16h ago

[heat-admin@overcloud-controller-2 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:49:29 UTC; 16h ago

As you can see from the list above, the only httpd restart that happened during the scale took place on controller-0; on controller-1 and controller-2 the httpd service was not restarted.
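Rather than eyeballing 'systemctl status' output, a restart can be detected by comparing the unit's ActiveEnterTimestamp before and after the scale operation. A small sketch; the timestamps used here are the controller-0 values from above, and on a live node each value would come from `sudo systemctl show httpd --property=ActiveEnterTimestamp` (over ssh from the undercloud):

```shell
# Sketch: a changed ActiveEnterTimestamp means the unit was restarted
# between the two samples.
restarted() {
    [ "$1" != "$2" ]
}

before="Thu 2016-11-17 03:50:12 UTC"   # sampled before the scale
after="Thu 2016-11-17 19:02:34 UTC"    # sampled after the scale
if restarted "$before" "$after"; then
    echo "httpd was restarted during the scale operation"
fi
```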
The full SOS report is available at: http://rhos-release.virt.bos.redhat.com/log/bz
1) The httpd resource was removed from pacemaker's control at around:

Nov 17 02:11:17 [46106] overcloud-controller-0.localdomain crmd: info: notify_deleted: Notifying tengine on overcloud-controller-1 that httpd was deleted

As this is an OSP9->OSP10 upgrade, this is expected.

On controller-0 we see the following:

Nov 17 02:41:27 overcloud-controller-0.localdomain os-collect-config[6001]: Thu Nov 17 02:40:03 UTC 2016 cfe15a77-f455-4763-864c-c14a3f6f64bf tripleo-upgrade overcloud-controller-0 Going to check_resource_systemd for httpd to be started
Nov 17 03:31:45 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:31:47 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:31:47 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1418 messages from /system.slice/httpd.service
Nov 17 03:32:01 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:43:02 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:43:03 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:43:04 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:43:18 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:49:56 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:49:57 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:49:59 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:50:12 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 15:22:49 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 15:22:51 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 15:22:53 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 15:23:06 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 18:54:48 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 18:54:50 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 18:54:51 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 18:55:05 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 19:02:18 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 19:02:20 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 19:02:21 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 19:02:34 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.

On controller-1:

Nov 17 02:41:21 overcloud-controller-1.localdomain os-collect-config[5340]: Thu Nov 17 02:39:56 UTC 2016 4d8265b2-c82e-4a9d-a49b-59818008b4e8 tripleo-upgrade overcloud-controller-1 Going to check_resource_systemd for httpd to be started
Nov 17 03:42:33 overcloud-controller-1.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:42:34 overcloud-controller-1.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:42:35 overcloud-controller-1.localdomain systemd-journal[614]: Suppressed 1418 messages from /system.slice/httpd.service
Nov 17 03:42:49 overcloud-controller-1.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:48:46 overcloud-controller-1.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:48:47 overcloud-controller-1.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:48:48 overcloud-controller-1.localdomain systemd-journal[614]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:49:02 overcloud-controller-1.localdomain systemd[1]: Started The Apache HTTP Server.

So we are interested in the restarts around 15:22, 18:54 and 19:02 on controller-0. It seems that it is puppet doing this:

Nov 17 15:21:01 overcloud-controller-0.localdomain os-collect-config[6001]: [2016-11-17 15:21:01,291] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/6a6daad4-0b40-46c5-947f-8d2bc2ecef36.json
Nov 17 15:37:54 overcloud-controller-0.localdomain os-collect-config[6001]: [2016-11-17 15:37:54,869] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet

The reason seems to be that on controller-0 the ports changed:

/Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf]/content: content changed '{

We see puppet running on controller-1 at roughly the same time:

Nov 17 15:20:58 overcloud-controller-1.localdomain os-collect-config[5340]: [2016-11-17 15:20:58,400] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/44a16dfd-4b2a-4024-abac-2f51c070550a.json
Nov 17 15:21:31 overcloud-controller-1.localdomain os-collect-config[5340]: [2016-11-17 15:21:31,530] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet

Also, the scale-up is failing for unrelated reasons. From the puppet log on controller-0 (ANSI escape sequences stripped, repeated "Notice: /Stage[main]/..." prefixes trimmed):

gnocchi-db-sync:
2016-11-17 15:22:19.656 392783 ERROR gnocchi   File "/usr/lib/python2.7/site-packages/stevedore/named.py", line 77, in __init__
2016-11-17 15:22:19.656 392783 ERROR gnocchi     self._init_plugins(extensions)
2016-11-17 15:22:19.656 392783 ERROR gnocchi   File "/usr/lib/python2.7/site-packages/stevedore/driver.py", line 98, in _init_plugins
2016-11-17 15:22:19.656 392783 ERROR gnocchi NoMatches: No 'gnocchi.indexer' driver found, looking for 'sqlite'

glance-manage db_sync:
Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py:1171: OsloDBDeprecationWarning: EngineFacade is deprecated; please use oslo_db.sqlalchemy.enginefacade
2016-11-17 15:22:23.308 392863 CRITICAL glance [-] OperationalError: (sqlite3.OperationalError) unable to open database file
2016-11-17 15:22:23.308 392863 ERROR glance Traceback (most recent call last):
2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/bin/glance-manage", line 10, in <module>
2016-11-17 15:22:23.308 392863 ERROR glance     sys.exit(main())
2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/cmd/manage.py", line 330, in main
2016-11-17 15:22:23.308 392863 ERROR glance     return CONF.command.action_fn()
2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/cmd/manage.py", line 190, in sync
2016-11-17 15:22:23.308 392863 ERROR glance     CONF.command.current_version)
2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/cmd/manage.py", line 106, in sync
2016-11-17 15:22:23.308 392863 ERROR glance     migration.db_sync(db_api.get_engine(),
2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/api.py", line 98, in get_engine
2016-11-17 15:22:23.308 392863 ERROR glance     facade = _create_facade_lazily()
2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/api.py", line 88, in _create_facade_lazily
2016-11-17 15:22:23.308 392863 ERROR glance     _FACADE = session.EngineFacade.from_config(CONF)
2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1171, in from_config
(traceback truncated in the original)

So to sum up:

- I think this specific bug was more about the whole control plane (keystone/httpd) being restarted during a scale-out operation.
- During a scale-out we now have short-lived restarts of some services (httpd), but they are much less intrusive because 1) they take less time and 2) they happen at different times on each controller, meaning that haproxy will just reroute requests to a controller that has the services running.

The only thing that leaves me wondering is why the httpd restart happens during scale-out on controller-0 and not the other ones. That might be worth investigating in a separate bug.

Robin, is my summary in line with your expectations here?
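For the open question about controller-0, counting the stop/start cycles per controller from the journal makes the asymmetry easy to spot. A small sketch; the sample lines are pasted from the controller-0 log above, and on a live node the input would come from something like `journalctl -u httpd`:

```shell
# Count completed restarts in a journal excerpt: every "Started The
# Apache HTTP Server." line closes one stop/start cycle.
count_restarts() {
    grep -c "Started The Apache HTTP Server."
}

log='Nov 17 15:23:06 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 18:55:05 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 19:02:34 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.'

# Three "Started" lines during the scale window on controller-0,
# versus zero on controller-1 and controller-2.
echo "$log" | count_restarts
```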
Omri, it might also be worth investigating the scale-up failures in a separate bug. I kind of assumed this was already the case, but probably worth noting anyway ;)