| Summary: | httpd/keystone is restarted during scale up/down | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Robin Cernin <rcernin> |
| Component: | openstack-tripleo | Assignee: | Michele Baldessari <michele> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Omri Hochman <ohochman> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.0 (Kilo) | CC: | djuran, fdinitto, gfidente, hrybacki, jcoufal, jschluet, jstransk, mburns, michele, oblaut, ohochman, panbalag, rbrady, rcernin, rhel-osp-director-maint, sclewis, shardy, srevivo |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 10.0 (Newton) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | The High Availability architecture in this release has been simplified, making service restarts less invasive. During scaling operations, only the affected services are restarted; previously, a scaling operation restarted the entire cluster. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-03-06 20:43:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Robin Cernin
2016-03-08 10:49:12 UTC
This bug did not make the OSP 8.0 release and is being deferred to OSP 10. I've filed a related BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1316215, suggesting that we should do a rolling restart of the services, i.e. one node at a time. There is also https://bugzilla.redhat.com/show_bug.cgi?id=1339559, which I think is very similar to this bug.

Failed_QA:
-----------
openstack-keystone was replaced by httpd. I will attempt to scale up the environment and verify that the httpd service is not restarted on the controllers during the scale operation.

Checking httpd uptime on the controllers before the scale:
------------------------------------
[root@overcloud-controller-0 ~]# systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:50:12 UTC; 10h ago

Verifying status of the service post scale:
---------------------------------------
[heat-admin@overcloud-controller-0 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 19:02:34 UTC; 44min ago

Results: it seems that httpd on controller-0 was restarted during the scale operation.

(In reply to Omri Hochman from comment #11)
> Failed_QA :

The environment was upgraded from OSP 9 to the latest OSP 10. Adding SOS reports.

Adding the full list here:

[heat-admin@overcloud-controller-0 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 19:02:34 UTC; 1h 1min ago
[heat-admin@overcloud-controller-1 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:49:02 UTC; 16h ago
[heat-admin@overcloud-controller-2 ~]$ sudo systemctl status httpd | grep Active
   Active: active (running) since Thu 2016-11-17 03:49:29 UTC; 16h ago

As you can see from the list above, the only httpd restart during the scale took place on controller-0; on controller-1/2 the httpd service was not restarted.
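The verification above boils down to comparing the "since" timestamp in the `Active:` line before and after the scale operation. A throwaway sketch of that comparison (not part of the bug; the helper names are mine, and the input strings are the exact `systemctl status` lines captured on controller-0):

```python
import re
from datetime import datetime

def active_since(status_line):
    """Parse the 'since' timestamp out of a `systemctl status` Active: line."""
    m = re.search(r"since \w{3} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC", status_line)
    if m is None:
        raise ValueError("no 'since' timestamp found: %r" % status_line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")

def was_restarted(before, after):
    """The unit restarted iff the post-operation 'since' timestamp is later."""
    return active_since(after) > active_since(before)

# Active: lines captured on controller-0 before and after the scale operation
before = "Active: active (running) since Thu 2016-11-17 03:50:12 UTC; 10h ago"
after  = "Active: active (running) since Thu 2016-11-17 19:02:34 UTC; 44min ago"
print(was_restarted(before, after))  # True: httpd came back up during the scale
```

Note that a later "since" timestamp only proves the unit was (re)started, not why; the journal excerpts below are what attribute the restarts to specific events.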
Full SOS report available in the link: http://rhos-release.virt.bos.redhat.com/log/bz

1) The httpd resource was removed from pacemaker's control at around:
Nov 17 02:11:17 [46106] overcloud-controller-0.localdomain crmd: info: notify_deleted: Notifying tengine on overcloud-controller-1 that httpd was deleted
As this is an OSP9->OSP10 upgrade, this is expected.
On controller-0 we see the following:
Nov 17 02:41:27 overcloud-controller-0.localdomain os-collect-config[6001]: Thu Nov 17 02:40:03 UTC 2016 cfe15a77-f455-4763-864c-c14a3f6f64bf tripleo-upgrade overcloud-controller-0 Going to check_resource_systemd for httpd to be started
Nov 17 03:31:45 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:31:47 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:31:47 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1418 messages from /system.slice/httpd.service
Nov 17 03:32:01 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:43:02 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:43:03 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:43:04 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:43:18 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:49:56 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:49:57 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:49:59 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:50:12 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 15:22:49 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 15:22:51 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 15:22:53 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 15:23:06 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 18:54:48 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 18:54:50 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 18:54:51 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 18:55:05 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 19:02:18 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 19:02:20 overcloud-controller-0.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 19:02:21 overcloud-controller-0.localdomain systemd-journal[616]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 19:02:34 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.
on controller-1:
Nov 17 02:41:21 overcloud-controller-1.localdomain os-collect-config[5340]: Thu Nov 17 02:39:56 UTC 2016 4d8265b2-c82e-4a9d-a49b-59818008b4e8 tripleo-upgrade overcloud-controller-1 Going to check_resource_systemd for httpd to be started
Nov 17 03:42:33 overcloud-controller-1.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:42:34 overcloud-controller-1.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:42:35 overcloud-controller-1.localdomain systemd-journal[614]: Suppressed 1418 messages from /system.slice/httpd.service
Nov 17 03:42:49 overcloud-controller-1.localdomain systemd[1]: Started The Apache HTTP Server.
Nov 17 03:48:46 overcloud-controller-1.localdomain systemd[1]: Stopping The Apache HTTP Server...
Nov 17 03:48:47 overcloud-controller-1.localdomain systemd[1]: Starting The Apache HTTP Server...
Nov 17 03:48:48 overcloud-controller-1.localdomain systemd-journal[614]: Suppressed 1417 messages from /system.slice/httpd.service
Nov 17 03:49:02 overcloud-controller-1.localdomain systemd[1]: Started The Apache HTTP Server.
So we want to find out what caused the restarts around 15:22, 18:54 and 19:02 on controller-0.
So it seems that it is puppet doing this:
Nov 17 15:21:01 overcloud-controller-0.localdomain os-collect-config[6001]: [2016-11-17 15:21:01,291] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/6a6daad4-0b40-46c5-947f-8d2bc2ecef36.json
Nov 17 15:37:54 overcloud-controller-0.localdomain os-collect-config[6001]: [2016-11-17 15:37:54,869] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet
The reason seems to be that on controller-0 the ports changed:
/Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf]/content: content changed '{
We see puppet running on controller-1 at roughly the same time:
Nov 17 15:20:58 overcloud-controller-1.localdomain os-collect-config[5340]: [2016-11-17 15:20:58,400] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/44a16dfd-4b2a-4024-abac-2f51c070550a.json
Nov 17 15:21:31 overcloud-controller-1.localdomain os-collect-config[5340]: [2016-11-17 15:21:31,530] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/puppet
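The attribution to puppet is just a timestamp correlation: the 15:22 httpd restart falls inside the 15:21:01–15:37:54 puppet run on controller-0. A minimal sketch of that check (helper names are mine; timestamps are taken from the journal excerpts above):

```python
from datetime import datetime

def journal_ts(line, year=2016):
    # journalctl lines start with e.g. "Nov 17 15:22:49 host ..."; the
    # journal omits the year, so we supply it explicitly
    return datetime.strptime("%d %s" % (year, line[:15]), "%Y %b %d %H:%M:%S")

# puppet run window on controller-0, from the os-collect-config messages above
puppet_start = journal_ts("Nov 17 15:21:01 overcloud-controller-0.localdomain ...")
puppet_end   = journal_ts("Nov 17 15:37:54 overcloud-controller-0.localdomain ...")

journal = [
    "Nov 17 15:22:49 overcloud-controller-0.localdomain systemd[1]: Stopping The Apache HTTP Server...",
    "Nov 17 15:23:06 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.",
    "Nov 17 18:55:05 overcloud-controller-0.localdomain systemd[1]: Started The Apache HTTP Server.",
]

restarts_in_window = [
    line for line in journal
    if "Started The Apache HTTP Server" in line
    and puppet_start <= journal_ts(line) <= puppet_end
]
print(len(restarts_in_window))  # 1: only the 15:23:06 restart falls inside the puppet run
```

This is consistent with the ports.conf content change noted above: in the puppetlabs-apache module a changed ports.conf notifies the httpd service, which produces exactly this kind of restart during the puppet run.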
Also the scale up is failing due to unrelated reasons:
File "/usr/lib/python2.7/site-packages/stevedore/named.py", line 77, in __init__
Notice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 15:22:19.656 392783 ERROR gnocchi     self._init_plugins(extensions)
Notice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 15:22:19.656 392783 ERROR gnocchi   File "/usr/lib/python2.7/site-packages/stevedore/driver.py", line 98, in _init_plugins
Notice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 15:22:19.656 392783 ERROR gnocchi     (self.namespace, name))
Notice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 15:22:19.656 392783 ERROR gnocchi NoMatches: No 'gnocchi.indexer' driver found, looking for 'sqlite'
Notice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-11-17 15:22:19.656 392783 ERROR gnocchi
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: /usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py:1171: OsloDBDeprecationWarning: EngineFacade is deprecated; please use oslo_db.sqlalchemy.enginefacade
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns:   expire_on_commit=expire_on_commit, _conf=conf)
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 CRITICAL glance [-] OperationalError: (sqlite3.OperationalError) unable to open database file
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance Traceback (most recent call last):
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/bin/glance-manage", line 10, in <module>
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     sys.exit(main())
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/cmd/manage.py", line 330, in main
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     return CONF.command.action_fn()
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/cmd/manage.py", line 190, in sync
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     CONF.command.current_version)
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/cmd/manage.py", line 106, in sync
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     migration.db_sync(db_api.get_engine(),
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/api.py", line 98, in get_engine
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     facade = _create_facade_lazily()
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/api.py", line 88, in _create_facade_lazily
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance     _FACADE = session.EngineFacade.from_config(CONF)
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-manage db_sync]/returns: 2016-11-17 15:22:23.308 392863 ERROR glance   File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1171, in from_config
Notice: /Stage[main]/Glance::Db::Sync/Exec[glance-ma
So to sum up:
- I think this specific bug was more about the whole control plane (keystone/httpd) being restarted during a scaleout operation.
- Now during a scaleout we have short-lived restarts of some services (httpd), but they are much less intrusive because 1) they take less time and 2) they happen at different times on each controller, meaning haproxy will simply reroute requests to a controller where the service is running.
The only thing that leaves me wondering is why the httpd restart happens during scaleout on controller-0 and not on the other ones. That might be worth investigating in a separate bug.
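The "different times on each controller" point can be sanity-checked directly from the journals: the httpd downtime windows (Stopping → Started) on the controllers never overlap, so at any instant haproxy still has a live backend. A small sketch using the last restart window per controller from the logs above (helper names are mine):

```python
from datetime import datetime

def ts(s):
    # journal timestamps lack a year; this bug's logs are all from 2016
    return datetime.strptime("2016 " + s, "%Y %b %d %H:%M:%S")

# httpd downtime windows (Stopping -> Started) from the journal excerpts above
downtime = {
    "controller-0": (ts("Nov 17 19:02:18"), ts("Nov 17 19:02:34")),
    "controller-1": (ts("Nov 17 03:48:46"), ts("Nov 17 03:49:02")),
}

def overlap(a, b):
    # two half-open intervals overlap iff each starts before the other ends
    return a[0] < b[1] and b[0] < a[1]

wins = list(downtime.values())
disjoint = not any(overlap(wins[i], wins[j])
                   for i in range(len(wins)) for j in range(i + 1, len(wins)))
print(disjoint)  # True: the controllers' httpd downtime windows never coincide
```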
Robin, is my summary in line with your expectations here?
Omri, it might also be worth investigating the scaleup failures in a separate bug. I kind of assumed this was already the case, but probably worth noting anyway ;)