Bug 1563443 - FFU and regular upgrade from OSP 12: networks and routers created before the upgrade are not functional
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Brent Eagles
QA Contact: Marius Cornea
Duplicates: 1577945
Blocks: 1574950 1578793
 
Reported: 2018-04-04 01:17 UTC by Marius Cornea
Modified: 2018-06-27 13:50 UTC
CC List: 17 users

Fixed In Version: openstack-tripleo-heat-templates-8.0.2-27.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clones: 1574950
Last Closed: 2018-06-27 13:50:03 UTC




Links
System ID Priority Status Summary Last Updated
Launchpad 1772072 None None None 2018-05-18 17:19:58 UTC
OpenStack gerrit 567655 None MERGED Add acl to paths that are shared among related neutron processes 2020-09-16 12:37:53 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:50:53 UTC

Internal Links: 1826981

Description Marius Cornea 2018-04-04 01:17:53 UTC
Description of problem:
FFU: post upgrade, after rebooting the controller node hosting the active router, all controllers show standby ha_state.

Version-Release number of selected component (if applicable):
29-Mar-2018 build

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 3 computes
2. Upgrade to OSP13 by running the FFU procedure
3. Check neutron l3-agent-list-hosting-router for a router created before upgrade:

(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router workload_router_0
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 91f4dcc8-88cd-4c38-877d-1bf29d982da7 | controller-2.localdomain | True           | :-)   | standby  |
| ac66240b-4141-4c6f-95d1-21a734c96692 | controller-0.localdomain | True           | :-)   | active   |
| 713ce0f6-fa24-4c85-9b57-95ac6488633a | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

4. Hard reboot controller-0:
ironic node-set-power-state 59c3cfc5-984b-4014-a20a-68c01c6ae1fb reboot

5. Check neutron l3-agent-list-hosting-router workload_router_0
We can see controller-0 went down:

(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router workload_router_0
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 91f4dcc8-88cd-4c38-877d-1bf29d982da7 | controller-2.localdomain | True           | :-)   | standby  |
| ac66240b-4141-4c6f-95d1-21a734c96692 | controller-0.localdomain | True           | xxx   | active   |
| 713ce0f6-fa24-4c85-9b57-95ac6488633a | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

6. Wait for controller-0 to come back up and check again:
(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router workload_router_0
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 91f4dcc8-88cd-4c38-877d-1bf29d982da7 | controller-2.localdomain | True           | :-)   | standby  |
| ac66240b-4141-4c6f-95d1-21a734c96692 | controller-0.localdomain | True           | :-)   | standby  |
| 713ce0f6-fa24-4c85-9b57-95ac6488633a | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+


Actual results:
No router becomes active and all 3 controllers report standby ha_state.

Expected results:
I would expect that during the reboot one of the 2 remaining controllers becomes active.

Additional info:
Note that the router's IP address is reachable.

Comment 2 Lukas Bezdicka 2018-04-04 14:38:42 UTC
Is this a real networking issue?

Comment 3 Jakub Libosvar 2018-04-05 17:12:07 UTC
It looks like a filesystem issue on controller-2: the router became master but it cannot write to its state file:
2018-04-03 19:10:51.121 558188 INFO neutron.agent.linux.daemon [-] Process runs with uid/gid: 997/994
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change [-] Failed to process or handle event for line 16: ha-3c7c4d6b-7f    inet 169.254.0.1/24 scope global ha-3c7c4d6b-7f\       valid_lft forever preferred_lft forever
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change Traceback (most recent call last):
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/keepalived_state_change.py", line 82, in parse_and_handle_event
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change     # contain bug where gratuitous ARPs are not sent on receiving
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/keepalived_state_change.py", line 99, in write_state_change
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change     resp, content = httplib2.Http().request(
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change IOError: [Errno 13] Permission denied: '/var/lib/neutron/ha_confs/e6d0764b-1cd4-4591-ab63-29a45769fb65/state'
2018-04-04 01:01:49.456 558188 ERROR neutron.agent.l3.keepalived_state_change

Comment 4 Jakub Libosvar 2018-04-05 17:15:33 UTC
Can you please check the permissions and owner of the directory structure in /var/lib/neutron/ on controller-2? I cannot see that in the sosreport as the file's uid is changed.

It's a bug in Neutron; I think Neutron should be resilient to errors like this.

Comment 5 Marius Cornea 2018-04-06 15:18:47 UTC
(In reply to Jakub Libosvar from comment #4)
> Can you please check permissions and owner of the directory structure in
> /var/lib/neutron/ on controller-2? I cannot see that in sosreport as the uid
> of file is changed.
> 
> It's a bug in Neutron, I think Neutron should be resilient to similar errors.

I'm checking this on another deployment (hostname controller-r02-02 in this case):

On host:

[root@controller-r02-02 ~]# ls -ld /var/lib/neutron/
drwxr-xr-x. 7 42435 42435 140 Apr  6 00:20 /var/lib/neutron/

[root@controller-r02-02 ~]# ls -l /var/lib/neutron/
total 8
drwxr-xr-x. 5 42435 42435  138 Apr  6 13:57 dhcp
drwxr-xr-x. 3 42435 42435   18 Apr  5 20:48 external
drwxr-xr-x. 4 42435 42435 4096 Apr  5 20:54 ha_confs
srwxr-xr-x. 1 42435 42435    0 Apr  6 00:20 keepalived-state-change
drwxr-xr-x. 2 42435 42435 4096 Apr  5 20:54 lock
srw-r--r--. 1 42435 42435    0 Apr  6 00:20 metadata_proxy
drwxr-xr-x. 2 42435 42435   55 Apr  6 00:21 ns-metadata-proxy


Inside neutron_l3_agent container:

[root@controller-r02-02 ~]# docker exec neutron_l3_agent ls -ld /var/lib/neutron/
drwxr-xr-x. 7 neutron neutron 140 Apr  6 00:20 /var/lib/neutron/
[root@controller-r02-02 ~]# docker exec neutron_l3_agent ls -ls /var/lib/neutron/
total 8
0 drwxr-xr-x. 5 neutron neutron  138 Apr  6 13:57 dhcp
0 drwxr-xr-x. 3 neutron neutron   18 Apr  5 20:48 external
4 drwxr-xr-x. 4 neutron neutron 4096 Apr  5 20:54 ha_confs
0 srwxr-xr-x. 1 neutron neutron    0 Apr  6 00:20 keepalived-state-change
4 drwxr-xr-x. 2 neutron neutron 4096 Apr  5 20:54 lock
0 srw-r--r--. 1 neutron neutron    0 Apr  6 00:20 metadata_proxy
0 drwxr-xr-x. 2 neutron neutron   55 Apr  6 00:21 ns-metadata-proxy


Additional info:

It looks like keepalived is running on the baremetal (18641af8-3275-4152-a800-eb1010cf1d88 is the router that I tested the failover for):

[root@controller-r02-02 ~]# ps axu | grep 18641af8-3275-4152-a800-eb1010cf1d88 | grep -v grep
neutron   177031  0.0  0.4 320084 52548 ?        S    Apr05   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=18641af8-3275-4152-a800-eb1010cf1d88 --namespace=qrouter-18641af8-3275-4152-a800-eb1010cf1d88 --conf_dir=/var/lib/neutron/ha_confs/18641af8-3275-4152-a800-eb1010cf1d88 --monitor_interface=ha-688a1aeb-9a --monitor_cidr=169.254.0.2/24 --pid_file=/var/lib/neutron/external/pids/18641af8-3275-4152-a800-eb1010cf1d88.monitor.pid --state_path=/var/lib/neutron --user=997 --group=994
root      177539  0.0  0.0 118668  1540 ?        Ss   Apr05   0:01 keepalived -P -f /var/lib/neutron/ha_confs/18641af8-3275-4152-a800-eb1010cf1d88/keepalived.conf -p /var/lib/neutron/ha_confs/18641af8-3275-4152-a800-eb1010cf1d88.pid -r /var/lib/neutron/ha_confs/18641af8-3275-4152-a800-eb1010cf1d88.pid-vrrp
root      177540  0.0  0.0 118668  2076 ?        S    Apr05   0:03 keepalived -P -f /var/lib/neutron/ha_confs/18641af8-3275-4152-a800-eb1010cf1d88/keepalived.conf -p /var/lib/neutron/ha_confs/18641af8-3275-4152-a800-eb1010cf1d88.pid -r /var/lib/neutron/ha_confs/18641af8-3275-4152-a800-eb1010cf1d88.pid-vrrp

[root@controller-r02-02 ~]# pstree -p 177031
neutron-keepali(177031)─┬─ip(177033)
                        └─sudo(734073)───neutron-rootwra(734074)─┬─{neutron-rootwra}(734079)
                                                                 └─{neutron-rootwra}(734082)
[root@controller-r02-02 ~]# pstree -p 177539
keepalived(177539)───keepalived(177540)

[root@controller-r02-02 ~]# pstree -p 177540
keepalived(177540)

Comment 6 Marius Cornea 2018-04-06 15:45:58 UTC
Looks like the issue is caused by the neutron-keepalived-state-change process running on the baremetal under the 'neutron' user:

neutron   160922  0.0  0.4 320084 52540 ?        S    Apr05   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=5d68f8fd-d658-43e1-b4cb-60808c839e0c --namespace=qrouter-5d68f8fd-d658-43e1-b4cb-60808c839e0c --conf_dir=/var/lib/neutron/ha_confs/5d68f8fd-d658-43e1-b4cb-60808c839e0c --monitor_interface=ha-581a1d86-05 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/5d68f8fd-d658-43e1-b4cb-60808c839e0c.monitor.pid --state_path=/var/lib/neutron --user=997 --group=994


while /var/lib/neutron/ is owned by uid 42435 on the baremetal:

drwxr-xr-x. 7 42435 42435 140 Apr  6 00:20 /var/lib/neutron/

[root@controller-r02-02 ~]# ls -l /var/lib/neutron/
total 8
drwxr-xr-x. 5 42435 42435  138 Apr  6 13:57 dhcp
drwxr-xr-x. 3 42435 42435   18 Apr  5 20:48 external
drwxr-xr-x. 4 42435 42435 4096 Apr  5 20:54 ha_confs
srwxr-xr-x. 1 42435 42435    0 Apr  6 00:20 keepalived-state-change
drwxr-xr-x. 2 42435 42435 4096 Apr  5 20:54 lock
srw-r--r--. 1 42435 42435    0 Apr  6 00:20 metadata_proxy
drwxr-xr-x. 2 42435 42435   55 Apr  6 00:21 ns-metadata-proxy
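The mismatch above can be checked mechanically. A minimal sketch (illustrative only: it runs against a scratch directory, with the current user standing in for the host neutron user, rather than the real /var/lib/neutron owned by container uid 42435):

```shell
#!/bin/sh
# Hypothetical diagnostic: compare the owner uid of the state path with the
# uid the host-side process runs as. On an affected controller, state_path
# would be /var/lib/neutron (owned by 42435) and the process would run as
# the host neutron uid (997), so the uids would NOT match.
state_path=$(mktemp -d)

dir_uid=$(stat -c '%u' "$state_path")
proc_uid=$(id -u)

if [ "$dir_uid" -ne "$proc_uid" ]; then
    echo "uid mismatch: $state_path owned by $dir_uid, process runs as $proc_uid"
else
    echo "uids match ($proc_uid): writes under $state_path will succeed"
fi

rm -rf "$state_path"
```

Run against a freshly created scratch directory the uids match; on a post-FFU controller the same comparison against /var/lib/neutron would report the mismatch.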

Comment 7 Marius Cornea 2018-04-06 16:06:27 UTC
One possible solution: run setfacl -m user:neutron:rw /var/lib/neutron/ during the upgrade to allow the neutron-owned processes to write into /var/lib/neutron/.
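As a sketch of what that workaround does (hedged: run here against a throwaway directory, with the current user standing in for neutron; the real command would target /var/lib/neutron/ and requires the acl package):

```shell
#!/bin/sh
# Grant an extra user access via a POSIX ACL without changing the directory's
# owner. 'user' below is a stand-in for the neutron user. Note that to
# traverse into a directory the execute bit is needed too, so rwx is the
# safer grant for a directory even though comment 7 says rw.
dir=$(mktemp -d)
user=$(id -un)

setfacl -m "user:${user}:rwx" "$dir"   # add a named-user ACL entry
getfacl --omit-header "$dir"           # shows e.g. "user:<name>:rwx"

rm -rf "$dir"
```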

Comment 8 Jakub Libosvar 2018-04-09 12:51:40 UTC
(In reply to Marius Cornea from comment #6)
> Looks like the issue is caused by the neutron-keepalived-state-change
> process running on baremetal under 'neutron' user:
> 
> neutron   160922  0.0  0.4 320084 52540 ?        S    Apr05   0:00
> /usr/bin/python2 /bin/neutron-keepalived-state-change
> --router_id=5d68f8fd-d658-43e1-b4cb-60808c839e0c
> --namespace=qrouter-5d68f8fd-d658-43e1-b4cb-60808c839e0c
> --conf_dir=/var/lib/neutron/ha_confs/5d68f8fd-d658-43e1-b4cb-60808c839e0c
> --monitor_interface=ha-581a1d86-05 --monitor_cidr=169.254.0.1/24
> --pid_file=/var/lib/neutron/external/pids/5d68f8fd-d658-43e1-b4cb-
> 60808c839e0c.monitor.pid --state_path=/var/lib/neutron --user=997 --group=994
> 
> 
> while /var/lib/neutron/ has 42435 as owner on the baremetal:
> 
> drwxr-xr-x. 7 42435 42435 140 Apr  6 00:20 /var/lib/neutron/
> 
> [root@controller-r02-02 ~]# ls -l /var/lib/neutron/
> total 8
> drwxr-xr-x. 5 42435 42435  138 Apr  6 13:57 dhcp
> drwxr-xr-x. 3 42435 42435   18 Apr  5 20:48 external
> drwxr-xr-x. 4 42435 42435 4096 Apr  5 20:54 ha_confs
> srwxr-xr-x. 1 42435 42435    0 Apr  6 00:20 keepalived-state-change
> drwxr-xr-x. 2 42435 42435 4096 Apr  5 20:54 lock
> srw-r--r--. 1 42435 42435    0 Apr  6 00:20 metadata_proxy
> drwxr-xr-x. 2 42435 42435   55 Apr  6 00:21 ns-metadata-proxy

What is the 42435 user? Any chance the Neutron user changed its uid during the upgrade to containers? That would affect the OSP12 -> OSP13 upgrade too.

Comment 9 Marius Cornea 2018-04-09 15:13:23 UTC
> What is 42435 user? Any chance the Neutron user has changed its uid during
> the upgrade to containers? That would affect OSP12 -> OSP13 upgrade too.

This is the neutron uid used inside the container:

https://github.com/openstack/kolla/blob/master/kolla/common/config.py#L926-L929

Yes, I believe OSP12 -> OSP13 upgrade is affected by this issue as well.

Comment 10 Jakub Libosvar 2018-04-09 16:12:24 UTC
(In reply to Marius Cornea from comment #9)
> > What is 42435 user? Any chance the Neutron user has changed its uid during
> > the upgrade to containers? That would affect OSP12 -> OSP13 upgrade too.
> 
> This is the neutron uid used inside the container:
> 
> https://github.com/openstack/kolla/blob/master/kolla/common/config.py#L926-
> L929
> 
> Yes, I believe OSP12 -> OSP13 upgrade is affected by this issue as well.

Well, that explains a lot :) The Neutron RPM doesn't have a static value for the group and user; we removed it back when migrating from Grizzly to Havana (Quantum -> Neutron). That means Neutron will have a random uid prior to containers.

Comment 13 Marius Cornea 2018-04-16 14:07:54 UTC
Reaching out to Brent to see if this is something that could be addressed with the sidecar Neutron containers.

In my opinion, if we're operating under the premise that all Neutron services run inside containers after the upgrade, then we should not see the keepalived processes running on the baremetal but inside containers, which should solve the permissions issue.

If we're moving the keepalived processes inside containers during the upgrade, then we need to make sure that network connectivity to the Neutron routers/floating IPs is not affected during this process. @Brent, is this something that can be addressed by the work done for the Neutron sidecar containers? If that's the case, do you think it can be ready within the time frame for the OSP13 GA release date?

Comment 14 Jiri Stransky 2018-04-16 14:51:03 UTC
(In reply to Jakub Libosvar from comment #8)
> What is 42435 user? Any chance the Neutron user has changed its uid during
> the upgrade to containers? That would affect OSP12 -> OSP13 upgrade too.

Yes, generally the UIDs in containers are different than on baremetal. IIRC we had a similar issue with hybrid cinder-api (containerized) and cinder-volume (uncontainerized) and this collision on filesystem rights (we could make it work either for the containerized part of Cinder or for the not-yet-containerized part, but not both at the same time).

I'd suggest making sure that at any given time on a single host we run the Neutron services all uncontainerized or all containerized. I think a temporary hybrid approach across multiple hosts, where the unit of atomicity is a single host, should "just work". But a hybrid approach on a single host within a single component, where the unit of atomicity is just a service, might hit this UID trouble in various places.

In case the all-or-nothing approach is not possible, then I think what Marius suggested with ACLs is worth exploring. We haven't done this anywhere yet AFAIK, and we should make sure to clean up the ACLs after full containerization is done, but I can't think of a better way to make filesystem access work in the hybrid approach. However, I wonder if we'd hit a similar issue somewhere else than just /var/lib/neutron.

Comment 19 Assaf Muller 2018-05-02 13:38:50 UTC
On the Networking DFG triage call - @Marius, we understand that the current path is that you're implementing the ACL change (comment 7) to fix this RHBZ? Is our understanding correct? If so, should we update this RHBZ to 'ON_DEV' with you as the assignee?

Comment 20 Marius Cornea 2018-05-02 14:23:27 UTC
(In reply to Assaf Muller from comment #19)
> On Networking DFG triage call - @Marius we understand that the current path
> is that you're implementing the ACLs change (comment 7) to fix this RHBZ? Is
> our understanding correct? If so should we update this RHBZ to 'ON_DEV' with
> you as the assignee?

No; taking into consideration the point in the release cycle where we're at, I'd like dev to take care of it. Also, I'm not sure changing the ACL is the right path anymore, given that neutron-keepalived-state-change is expected to run inside the keepalived container (according to comment #18). With this in mind I'd expect that after the upgrade neutron-keepalived-state-change also runs inside the keepalived container.

Comment 21 Lukas Bezdicka 2018-05-04 15:33:12 UTC
Can you provide us with either a fix or proper instructions for a workaround? I don't think Marius should be applying the ACL change in his tests.

Comment 22 Jakub Libosvar 2018-05-04 15:58:50 UTC
(In reply to Lukas Bezdicka from comment #21)
> Can you provide us with either fix or proper instructions for workaround? I
> don't think Marius should be applying acl change in his tests.

I need somebody from the upgrades DFG to explain the upgrade steps, in particular why we reboot nodes in the middle of the upgrade process and not after the upgrade is finished. If there is no reason to reboot nodes in the middle, I suggest changing the current upgrade docs draft to reboot nodes after they are upgraded.

Comment 23 Marius Cornea 2018-05-04 16:07:08 UTC
(In reply to Jakub Libosvar from comment #22)
> (In reply to Lukas Bezdicka from comment #21)
> > Can you provide us with either fix or proper instructions for workaround? I
> > don't think Marius should be applying acl change in his tests.
> 
> I need somebody from upgrades DFG to explain the upgrade steps, in
> particular why do we reboot nodes in the middle of upgrade process and not
> after the upgrade is finished. If there is no reason to reboot nodes in the
> middle, I suggest to change current upgrade docs draft to reboot nodes after
> they are upgraded.

The issue reported in this ticket was observed when rebooting the nodes _after_ running the fast forward upgrade procedure. 

Note that this is not a requirement for the FFU procedure, as reboots should happen after the OSP10 minor update. This test was more of a simulation of an expected lifecycle operation (be it intentional, or unintentional and caused by a failure) happening post upgrade.

Comment 25 Jakub Libosvar 2018-05-10 13:34:06 UTC
(In reply to Marius Cornea from comment #23)
> (In reply to Jakub Libosvar from comment #22)
> > (In reply to Lukas Bezdicka from comment #21)
> > > Can you provide us with either fix or proper instructions for workaround? I
> > > don't think Marius should be applying acl change in his tests.
> > 
> > I need somebody from upgrades DFG to explain the upgrade steps, in
> > particular why do we reboot nodes in the middle of upgrade process and not
> > after the upgrade is finished. If there is no reason to reboot nodes in the
> > middle, I suggest to change current upgrade docs draft to reboot nodes after
> > they are upgraded.
> 
> The issue reported in this ticket was observed when rebooting the nodes
> _after_ running the fast forward upgrade procedure. 

So you're saying that keepalived processes get spawned outside of the container after rebooting the node?

> 
> Note that this is not a requirement for the FFU procedure as reboots should
> happen after the OSP10 minor update. 

Yes, that's exactly my question: why should we reboot nodes in the middle of the procedure instead of once the upgrade procedure is completed?

Comment 26 Marius Cornea 2018-05-10 13:46:05 UTC
(In reply to Jakub Libosvar from comment #25)
> (In reply to Marius Cornea from comment #23)
> > (In reply to Jakub Libosvar from comment #22)
> > > (In reply to Lukas Bezdicka from comment #21)
> > > > Can you provide us with either fix or proper instructions for workaround? I
> > > > don't think Marius should be applying acl change in his tests.
> > > 
> > > I need somebody from upgrades DFG to explain the upgrade steps, in
> > > particular why do we reboot nodes in the middle of upgrade process and not
> > > after the upgrade is finished. If there is no reason to reboot nodes in the
> > > middle, I suggest to change current upgrade docs draft to reboot nodes after
> > > they are upgraded.
> > 
> > The issue reported in this ticket was observed when rebooting the nodes
> > _after_ running the fast forward upgrade procedure. 
> 
> So you're saying that keepalived processes get spawned outside of container
> after rebooting the node?

No, the keepalived processes remained outside the container as a result of the upgrade process. After the upgrade finished I rebooted the node where the router was active, so at that point the router tried to transition to active on a different node (which was not rebooted), and it failed because the keepalived processes were running on the baremetal and didn't have access to /var/lib/neutron.

> > 
> > Note that this is not a requirement for the FFU procedure as reboots should
> > happen after the OSP10 minor update. 
> 
> Yes, that's exactly my question: why should we reboot nodes in the middle of
> procedure instead of after once the procedure of upgrade is completed?

The minor update and fast forward upgrade procedures are considered separate and are done in different maintenance windows, so the reason for rebooting the nodes only after the minor update window is to minimize the number of reboots required to upgrade from OSP10 to 13.

Comment 27 Jakub Libosvar 2018-05-10 13:52:32 UTC
(In reply to Marius Cornea from comment #26)
> (In reply to Jakub Libosvar from comment #25)
> > (In reply to Marius Cornea from comment #23)
> > > (In reply to Jakub Libosvar from comment #22)
> > > > (In reply to Lukas Bezdicka from comment #21)
> > > > > Can you provide us with either fix or proper instructions for workaround? I
> > > > > don't think Marius should be applying acl change in his tests.
> > > > 
> > > > I need somebody from upgrades DFG to explain the upgrade steps, in
> > > > particular why do we reboot nodes in the middle of upgrade process and not
> > > > after the upgrade is finished. If there is no reason to reboot nodes in the
> > > > middle, I suggest to change current upgrade docs draft to reboot nodes after
> > > > they are upgraded.
> > > 
> > > The issue reported in this ticket was observed when rebooting the nodes
> > > _after_ running the fast forward upgrade procedure. 
> > 
> > So you're saying that keepalived processes get spawned outside of container
> > after rebooting the node?
> 
> No, the keepalived processes remained outside the container as a result of
> the upgrade process. After the upgrade finished I rebooted the node where
> the router was active so at that point the router tried to transition to
> active on a different node(which was not rebooted) and it failed because the
> keepalived processes were running on the baremetal and didn't have access to
> /var/lib/neutron.

Aha, ok. Now I understand better.
> 
> > > 
> > > Note that this is not a requirement for the FFU procedure as reboots should
> > > happen after the OSP10 minor update. 
> > 
> > Yes, that's exactly my question: why should we reboot nodes in the middle of
> > procedure instead of after once the procedure of upgrade is completed?
> 
> The minor update and fast forward upgrades procedures are considered to be
> different/done in different maintenance windows so the reason for rebooting
> the nodes after the minor update window only is to minimize the number of
> reboots required for upgrading from OSP10 to 13.

I see, thanks a lot for the explanation.

Comment 28 Brent Eagles 2018-05-10 17:44:04 UTC
Let's summarize here a bit:

a.) After a FFU, the neutron agents and servers that are normally run under systemd are running in containers. However, any processes that neutron may have started are still running on the baremetal. This is essential to maintaining a dataplane through FFU. These services may be any of: keepalived, haproxy, dnsmasq, radvd, dibbler_client, neutron-keepalived-state-change. There are potentially multiples depending on the number of routers and subnets.

b.) The essence of this bug is that after a FFU, the shared directories that these services use to sync runtime status and, in the case of keepalived, state have had their permissions changed to the container uid/gid version of neutron, so they can no longer update status. Specifically, the neutron-keepalived-state-change daemon cannot record that it is currently the master, nor write to the domain socket to notify the l3 agent on that host that it is currently master. Besides not updating neutron about the state, the L3 agent on the node that keepalived has selected as master will not configure the node appropriately, and things like metadata retrieval will fail.

c.) Sidecar containers don't actually help here because they only take effect once the baremetal processes have been terminated and neutron tries to restart them. We cannot use this as a solution because it deliberately causes dataplane breakage.

d.) Given the above, the only real solution I see is using setfacl prior to launching the container - probably at the time that we disable the agent in the upgrade_tasks. While this will unfortunately only be needed until the node is rebooted, it has the advantage of being pretty straightforward.

I've added a WIP patch upstream at https://review.openstack.org/#/c/567655/ for testing.
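A hedged sketch of what such an upgrade-time task might look like (illustrative only, not the actual patch in the review above; the scratch directory and current user stand in for /var/lib/neutron and the host neutron user):

```shell
#!/bin/sh
# Sketch: before the containerized agent starts, add ACLs so host-side helper
# processes (keepalived, neutron-keepalived-state-change, ...) keep write
# access to the shared state path. A default ACL entry makes files and
# directories created later under the tree inherit the grant.
state_path=$(mktemp -d)   # stands in for /var/lib/neutron
helper_user=$(id -un)     # stands in for the host 'neutron' user

setfacl -R -m "user:${helper_user}:rwX" "$state_path"      # existing tree
setfacl -R -d -m "user:${helper_user}:rwX" "$state_path"   # inherited default

# A directory created afterwards carries the named-user entry automatically:
mkdir "$state_path/ha_confs"
getfacl --omit-header "$state_path/ha_confs" | grep "^user:${helper_user}:"

rm -rf "$state_path"
```

As comment 28(d) notes, any ACLs applied this way only matter until the node is rebooted and everything runs containerized, after which they should be cleaned up.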

Comment 29 Jakub Libosvar 2018-05-11 10:37:17 UTC
(In reply to Brent Eagles from comment #28)
> Let's summarize here a bit:
> 
> a.) After a FFU, the neutron agents and servers that are normally run under
> systemd are running in containers. However any processes that neutron may
> have started are still running on baremetal. This is essential to
> maintaining a dataplane through FFU. These services may be any of:
> keepalived, haproxy, dnsmasq, radvd, dibbler_client,
> neutron-keepalive-state-change. There are potentially multiples depending on
> number or routers and subnets.
> 
> b.) The essence of this bug is that after a FFU, the shared directories that
> these services use to sync runtime status and, in the case, of keepalived,
> state have had their permissions changed to the container uid/gid version of
> neutron so they can no longer update status. In this case precisely the
> neutron-keepalive-state-change daemon cannot record that it is currently the
> master nor write the domain socket to notify the l3 agent on that host that
> it is currently master. Besides not updating neutron about the state, the L3
> agent on the node that keepalived has selected as master will not configure
> the node appropriately and things like metadata retrieval will fail. 
> 
> c.) Sidecar containers doesn't actually help here because it will only take
> affect once the baremetal processes have been terminated and neutron tries
> to restart them. We cannot use this as a solution because it deliberately
> causes dataplane breakages. 

Good summary, Brent, thanks :) What causes the dataplane disruption? Is it just the HA router failover?

> 
> d.) Given the above the only real solution I see is using "set facl" prior
> to launching the container - probably at the time that we disable the agent
> in the upgrade_tasks. While this unfortunately will only be needed until the
> node is rebooted, it has the advantage of being pretty straightforward.
> 
> I've added a WIP patch u/s https://review.openstack.org/#/c/567655/ for
> testing.

Comment 31 Marius Cornea 2018-05-14 16:52:32 UTC
*** Bug 1577945 has been marked as a duplicate of this bug. ***

Comment 32 Steven Hardy 2018-05-18 16:17:06 UTC
https://review.openstack.org/#/c/567655/ seems like a reasonable workaround to me, so I think the next step is for Marius to confirm that it works; if so, we can go ahead and backport it.

Comment 34 Marius Cornea 2018-06-05 22:55:56 UTC
(qe-Cloud-0) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router 0be9b382-570d-41fb-91f0-e9e39c780346
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 10.0.0.101 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 10.0.0.101 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning



+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 675d7e0e-ac2a-4c00-8231-623804037826 | controller-0.localdomain | True           | :-)   | standby  |
| aa294f32-ea6a-4da0-9147-430719e580dc | controller-2.localdomain | True           | :-)   | active   |
| d73d0f00-c31d-4aa3-91cf-9e23061dce1c | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+


(undercloud) [stack@undercloud-0 ~]$ ironic node-set-power-state d03197ed-8236-4c44-94e1-ef3cd4c623f8 reboot

(qe-Cloud-0) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router 0be9b382-570d-41fb-91f0-e9e39c780346
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 10.0.0.101 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 10.0.0.101 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 675d7e0e-ac2a-4c00-8231-623804037826 | controller-0.localdomain | True           | :-)   | standby  |
| aa294f32-ea6a-4da0-9147-430719e580dc | controller-2.localdomain | True           | :-)   | standby  |
| d73d0f00-c31d-4aa3-91cf-9e23061dce1c | controller-1.localdomain | True           | :-)   | active   |
+--------------------------------------+--------------------------+----------------+-------+----------+

(qe-Cloud-0) [stack@undercloud-0 ~]$ ssh cirros@10.0.0.222 'curl --silent 169.254.169.254'
Warning: Permanently added '10.0.0.222' (RSA) to the list of known hosts.
1.0
2007-01-19
2007-03-01
2007-08-29
2007-10-10
2007-12-15
2008-02-01
2008-09-01
2009-04-04

Comment 36 errata-xmlrpc 2018-06-27 13:50:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

