Bug 1430384

Summary: OSP10 -> OSP11 upgrade on IPv6 deployment gets stuck during major-upgrade-composable-steps
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: unspecified
Version: 11.0 (Ocata)
CC: aschultz, dbecker, jcoufal, jschluet, mburns, mcornea, michele, morazi, rhel-osp-director-maint, sathlang
Target Milestone: rc
Keywords: Triaged
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-6.0.0-0.10.el7ost
Doc Type: If docs needed, set a value
Last Closed: 2017-05-17 20:06:03 UTC
Type: Bug
Bug Blocks: 1394019

Description Marius Cornea 2017-03-08 13:48:05 UTC
Description of problem:
OSP10 -> OSP11 upgrade on IPv6 deployment gets stuck during major-upgrade-composable-steps

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-0.20170222195630.46117f4.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run the OSP10->OSP11 upgrade workflow on an IPv6 deployment with 3 controllers, 2 compute nodes and 3 ceph nodes

Actual results:
Upgrade gets stuck

Expected results:
Upgrade doesn't get blocked.

Additional info:

[stack@undercloud-0 ~]$ openstack stack list --nested | grep PROGRESS
| 52ca5c6b-6353-450f-9ec3-6ffb3df5da96 | overcloud-AllNodesDeploySteps-n7gn2tdu4r7m-AllNodesPostUpgradeSteps-cob3l24fahnm-ControllerDeployment_Step1-gl6bzubujvwb                                                    | CREATE_IN_PROGRESS | 2017-03-08T13:28:05Z | None                 | 175caee2-c83d-4f70-b7f5-e2de6373a70b |
| 175caee2-c83d-4f70-b7f5-e2de6373a70b | overcloud-AllNodesDeploySteps-n7gn2tdu4r7m-AllNodesPostUpgradeSteps-cob3l24fahnm                                                                                            | CREATE_IN_PROGRESS | 2017-03-08T13:27:25Z | None                 | 5d78c2c4-fa2f-4560-aa09-40939044b9bb |
| 5d78c2c4-fa2f-4560-aa09-40939044b9bb | overcloud-AllNodesDeploySteps-n7gn2tdu4r7m                                                                                                                                  | UPDATE_IN_PROGRESS | 2017-03-08T11:52:03Z | 2017-03-08T13:09:57Z | efe081d8-de20-4fef-98d8-12c23c578e6c |
| efe081d8-de20-4fef-98d8-12c23c578e6c | overcloud                                                                                                                                                                   | UPDATE_IN_PROGRESS | 2017-03-08T11:41:47Z | 2017-03-08T13:02:34Z | None                                 |


All the controller nodes show the following in the os-collect-config log:

[root@overcloud-controller-2 heat-admin]# journalctl -fl -u os-collect-config
-- Logs begin at Wed 2017-03-08 11:47:14 UTC. --
Mar 08 13:28:17 overcloud-controller-2.localdomain os-collect-config[4244]: [2017-03-08 13:28:17,211] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/45ec9401-3381-4ed3-8066-5bf0b0a1442e.json
Mar 08 13:28:17 overcloud-controller-2.localdomain os-collect-config[4244]: [2017-03-08 13:28:17,212] (heat-config) [WARNING] Skipping config d4a79a71-3f6c-4ad0-be65-53ee87d38a18, already deployed
Mar 08 13:28:17 overcloud-controller-2.localdomain os-collect-config[4244]: [2017-03-08 13:28:17,212] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/d4a79a71-3f6c-4ad0-be65-53ee87d38a18.json
Mar 08 13:28:17 overcloud-controller-2.localdomain os-collect-config[4244]: [2017-03-08 13:28:17,212] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/puppet < /var/lib/heat-config/deployed/306ad840-29b4-4dfe-825d-0659cce43de8.json
Mar 08 13:28:23 overcloud-controller-2.localdomain su[510690]: (to rabbitmq) root on none
Mar 08 13:28:33 overcloud-controller-2.localdomain su[511048]: (to rabbitmq) root on none
Mar 08 13:28:34 overcloud-controller-2.localdomain su[511217]: (to rabbitmq) root on none
Mar 08 13:28:35 overcloud-controller-2.localdomain su[511396]: (to rabbitmq) root on none
Mar 08 13:28:36 overcloud-controller-2.localdomain su[511564]: (to rabbitmq) root on none
Mar 08 13:28:38 overcloud-controller-2.localdomain usermod[511879]: change user 'hacluster' password


The nodes do not seem to be able to join the cluster:

http://paste.openstack.org/show/601938/

ip6tables rules:
http://paste.openstack.org/show/601939/

It looks like the firewall rules are blocking the nodes from joining the cluster. After running 'ip6tables -F' the deployment was unblocked and the nodes were able to join the cluster.
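
A minimal sketch of how a blocking rule could be spotted in `ip6tables -nL INPUT` output, assuming the column layout printed on these nodes (target, protocol, source, destination). The function name `find_catch_all_block` is hypothetical; it only reads text on stdin and never touches the firewall. A listing without `-v` hides interface matches, so a hit is a hint to investigate rather than proof. The sample rows are copied from the listing in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ip6tables -nL INPUT` output on stdin and prints
# every REJECT/DROP rule that applies to all protocols from ::/0 to ::/0.
# Exit status is 0 when at least one such rule is found, 1 otherwise.
find_catch_all_block() {
    awk '
        ($1 == "REJECT" || $1 == "DROP") && $2 == "all" {
            n = 0
            # Count the any-address fields; starting at field 3 also
            # tolerates an extra "opt" column if one is printed.
            for (i = 3; i <= NF; i++) if ($i == "::/0") n++
            if (n == 2) { print; found = 1 }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Demo on sample rows; on a controller the input would instead come from:
#   ip6tables -nL INPUT | find_catch_all_block
find_catch_all_block <<'EOF'
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp      ::/0                 ::/0                 state NEW tcp dpt:22
REJECT     all      ::/0                 ::/0                 reject-with icmp6-adm-prohibited
EOF
```

Flushing with 'ip6tables -F' removes such a rule only from the running ruleset; the saved rules file is not changed by it.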

Comment 1 Lukas Bezdicka 2017-03-08 14:36:34 UTC
[root@overcloud-controller-1 ~]# iptables -nL
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0           
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state NEW tcp dpt:22
REJECT     all  --  0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all  --  0.0.0.0/0            0.0.0.0/0            reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
[root@overcloud-controller-1 ~]# ip6tables -nL
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all      ::/0                 ::/0                 state RELATED,ESTABLISHED
ACCEPT     icmpv6    ::/0                 ::/0                
ACCEPT     all      ::/0                 ::/0                
ACCEPT     tcp      ::/0                 ::/0                 state NEW tcp dpt:22
ACCEPT     udp      ::/0                 fe80::/64            udp dpt:546 state NEW
REJECT     all      ::/0                 ::/0                 reject-with icmp6-adm-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all      ::/0                 ::/0                 reject-with icmp6-adm-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
[root@overcloud-controller-1 ~]# 


[stack@instack ~]$ heat resource-list overcloud -n5 | grep -v COMPLETE
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
+----------------------------------------------+---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| resource_name                                | physical_resource_id                                                            | resource_type                                                                                                            | resource_status    | updated_time         | stack_name                                                                                                                            |
+----------------------------------------------+---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| AllNodesDeploySteps                          | f9326dbe-bf6e-4aee-8d7b-5ee57095ad66                                            | OS::TripleO::PostDeploySteps                                                                                             | CREATE_IN_PROGRESS | 2017-03-08T13:11:04Z | overcloud                                                                                                                             |
| ControllerDeployment_Step1                   | 0f163e3f-70e0-404b-8004-c9d553e51edb                                            | OS::Heat::StructuredDeploymentGroup                                                                                      | CREATE_IN_PROGRESS | 2017-03-08T13:26:25Z | overcloud-AllNodesDeploySteps-orrronjgqs57                                                                                            |
| 0                                            | 717ee17b-db18-4de0-9fec-08a50a119358                                            | OS::Heat::StructuredDeployment                                                                                           | CREATE_IN_PROGRESS | 2017-03-08T13:27:09Z | overcloud-AllNodesDeploySteps-orrronjgqs57-ControllerDeployment_Step1-bc2ca62qd4bj                                                    |
| 1                                            | b44a6bbb-1bb4-40e0-95af-60aa71e8d0a1                                            | OS::Heat::StructuredDeployment                                                                                           | CREATE_IN_PROGRESS | 2017-03-08T13:27:09Z | overcloud-AllNodesDeploySteps-orrronjgqs57-ControllerDeployment_Step1-bc2ca62qd4bj                                                    |
| 2                                            | eb4f2387-9aeb-4009-a754-452e4ebc1528                                            | OS::Heat::StructuredDeployment                                                                                           | CREATE_IN_PROGRESS | 2017-03-08T13:27:10Z | overcloud-AllNodesDeploySteps-orrronjgqs57-ControllerDeployment_Step1-bc2ca62qd4bj                                                    |
+----------------------------------------------+---------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------+---------------------------------------------------------------------------------------------------------------------------------------+

Comment 2 Michele Baldessari 2017-03-08 14:55:15 UTC
So for the record we had a similar (same) issue in M->N upgrades but with ipv4,
which we worked around with:
https://github.com/openstack/tripleo-heat-templates/commit/ae8aac36143d5dadb08af0d275f513678909dcc7

But in that case we went from having the firewall off by default to applying firewall rules, which created the disruption.

The reason we have blocked traffic is likely this one:
https://bugs.launchpad.net/tripleo/+bug/1657108
(See https://tickets.puppetlabs.com/browse/MODULES-3184) 

What I am not sure about, though, is why /etc/sysconfig/ip[6]tables is populated with stock rules. Those files should already be populated with the right rules after the osp10 deployment.

Marius, would it be possible to get sosreports from controller0 after the osp10 deployment and after the failed osp11 upgrade?

The only theory I can think of is that you deployed osp10 with firewall disabled and we enabled it when moving to the osp11 templates. Would that be a possible theory?

Comment 3 Marius Cornea 2017-03-09 18:05:35 UTC
(In reply to Michele Baldessari from comment #2)

I think what happens is that during the OSP10 deployment the default firewall rules are there, but the ip6tables service is not running, so they're not applied. During the upgrade the ip6tables service gets started, so the rules set in /etc/sysconfig/ip6tables get applied and block the ipv6 traffic:

http://paste.openstack.org/show/602106/
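
A rough sketch of how that state could be checked on a node, assuming the rules file is in iptables-save format and that the stock reject is written as an unconditional `-A INPUT -j REJECT ...` line. Both function names are hypothetical, and `report_ip6tables_state` assumes a systemd host where the unit is called `ip6tables`.

```shell
#!/usr/bin/env bash
# Hypothetical check: does a saved rules file (iptables-save format) contain
# an unconditional REJECT appended to the INPUT chain?
has_unconditional_input_reject() {
    local rules_file=$1
    grep -Eq '^-A INPUT -j REJECT( |$)' "$rules_file"
}

# Hypothetical report combining the file contents with the service state.
report_ip6tables_state() {
    local rules_file=${1:-/etc/sysconfig/ip6tables}
    if has_unconditional_input_reject "$rules_file"; then
        echo "rules file: unconditional INPUT REJECT present"
    else
        echo "rules file: no unconditional INPUT REJECT"
    fi
    # `systemctl is-active` prints the unit state (active, inactive, ...)
    # and returns non-zero when the unit is not active.
    echo "service: $(systemctl is-active ip6tables 2>/dev/null || true)"
}

# On a controller this would be run as root, before and after the upgrade:
#   report_ip6tables_state /etc/sysconfig/ip6tables
```

A reject present in the file together with an inactive service before the upgrade would be consistent with the explanation above.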

Comment 5 Michele Baldessari 2017-03-13 09:37:00 UTC
(In reply to Marius Cornea from comment #3)

Hi Marius,

thanks, yes that explains it fully. I think the bug, though, is that OSP10 does not have ip6tables running with the proper rules configured, no? Unless you are deploying OSP10 with ManageFirewall: false and then setting ManageFirewall: true in the OSP11 upgrade, in which case this problem is probably expected (although I am assuming you are not doing this).

Am I correct in assuming that we do not have https://github.com/openstack/puppet-tripleo/commit/8c990738900cd74c2c5c046435517393d1afb92e in our OSP10 puppet-tripleo packages? If you can confirm that that is indeed the case, then I think we have two options:
A) We backport it to OSP10 and that way we should not hit this issue during upgrades
B) We come up with some hack to open up ip6tables traffic at the beginning of the upgrade, as the rules will be reinstated during the converge.

If you instead confirm that the patch is already in your OSP10 deployments, we should probably investigate why the rules are not populated.

Comment 6 Michele Baldessari 2017-03-13 09:44:40 UTC
Oops, I forgot I have your sosreports ;) I can confirm that in puppet-tripleo-5.5.0-3.el7ost.noarch there is no ipv6 support yet. So the problem is fully explained. Let's discuss today how we should best proceed.

Comment 7 Marius Cornea 2017-03-13 09:57:32 UTC
(In reply to Michele Baldessari from comment #5)
> Hi Marius,
> 
> thanks, yes that explains it fully. I think the bug though is that OSP10
> does not have ip6tables running and with the proper rules configured, no?
> Unless you are deploying OSP10 with ManageFirewall: false and then in the
> OSP11 upgrade you set ManageFirewall: true, in which case this problem is
> probably expected (although I am assuming you are not doing this).

During the OSP10 deployment I'm not manually setting the ManageFirewall parameter, so I guess the default value is used.

> Am I correct in assuming that we do not have
> https://github.com/openstack/puppet-tripleo/commit/
> 8c990738900cd74c2c5c046435517393d1afb92e in our OSP10 puppet-tripleo
> packages? If you can confirm that that is indeed the case, then I think we
> have two options:
> A) We backport it to OSP10 and that way we should not hit this issue during
> upgrades
> B) We come up with some hack to open up ip6tables traffic at the beginning
> of the upgrade as it will be reinstantiated during the converge.

Yes, I can confirm that we don't have the patch in OSP10 (puppet-tripleo-5.5.0-4.el7ost.noarch).

> If you instead confirm that the patch is already in your OSP10 deployments,
> we should probably investigate why the rules are not populated.

Comment 8 Sofer Athlan-Guyot 2017-03-24 12:47:19 UTC
Going with the approach of blanking the previous rules. The OSP10 ipv6 firewall fix should go in another bz with z-stream delivery if required.
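
For illustration only, one way the idea of blanking the previous rules could look on a node. This sketch is not taken from the actual patch: the function name `blank_rules_file` and the `.pre-upgrade` backup suffix are assumptions, and the approach shown (back up the saved rules file, then empty it before the service is started) is just one reading of the comment above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the shipped fix: keep a backup of a saved rules
# file and then empty it, so that starting the firewall service would not
# load the old stock rules.
blank_rules_file() {
    local rules_file=$1
    local backup="${rules_file}.pre-upgrade"   # assumed backup name
    [ -f "$rules_file" ] || return 0           # nothing to blank
    # Truncate only if the backup copy succeeded.
    cp -p "$rules_file" "$backup" && : > "$rules_file"
}

# On a controller this would be called as root before the service starts:
#   blank_rules_file /etc/sysconfig/ip6tables
```

The backup keeps the old rules available for comparison; the managed rules are expected to be written again later in the upgrade, as discussed in the earlier comments.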

Comment 9 Sofer Athlan-Guyot 2017-03-27 09:43:00 UTC
Point to ocata branch.

Comment 12 errata-xmlrpc 2017-05-17 20:06:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245