Bug 1470295 - iptables blocks traffic after upgrade converge
Status: ASSIGNED
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Brent Eagles
QA Contact: Ofer Blaut
Reported: 2017-07-12 12:31 EDT by bigswitch
Modified: 2017-09-16 19:52 EDT
CC: 18 users

Type: Bug

Attachments: None
Description bigswitch 2017-07-12 12:31:49 EDT
Description of problem:
Seen after the final step of the RHOSP 9 to RHOSP 10 upgrade: after running major-upgrade-pacemaker-converge.yaml, all traffic to VMs is blocked. After flushing iptables, traffic is forwarded again.

[root@rhosp10-compute-1 heat-admin]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             /* 000 accept related established rules */ state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere             /* 001 accept all icmp */ state NEW
ACCEPT     all  --  anywhere             anywhere             /* 002 accept all to lo interface */ state NEW
ACCEPT     tcp  --  anywhere             anywhere             multiport dports ssh /* 003 accept ssh */ state NEW
ACCEPT     udp  --  anywhere             anywhere             multiport dports ntp /* 105 ntp */ state NEW
ACCEPT     udp  --  anywhere             anywhere             multiport dports 4789 /* 118 neutron vxlan networks */ state NEW
ACCEPT     udp  --  anywhere             anywhere             multiport dports snmp /* 127 snmp */ state NEW
ACCEPT     gre  --  anywhere             anywhere             /* 136 neutron gre networks */
ACCEPT     tcp  --  anywhere             anywhere             multiport dports 16514,49152:49215,rfb:cvsup /* 200 nova_libvirt */ state NEW
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited
LOG        all  --  anywhere             anywhere             /* 998 log all */ LOG level warning
DROP       all  --  anywhere             anywhere             /* 999 drop all */ state NEW

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination



Version-Release number of selected component (if applicable):
RHOSP 10

How reproducible:
Always

I will upload an sosreport from the compute nodes to Box.
Comment 1 bigswitch 2017-07-12 12:36:31 EDT
box location for sosreport

https://bigswitch.box.com/s/ng5of19727e2oo3k4l6v1epzecg58hhv
Comment 2 Artom Lifshitz 2017-07-16 20:10:15 EDT
Just a few questions in line to clarify the issue.

> Seen after the final step of the RHOSP 9 to RHOSP 10 upgrade: after running
> major-upgrade-pacemaker-converge.yaml, all traffic to VMs is blocked.

Just to be 100% clear, VM in this case means the instances created on the overcloud, correct? And no traffic whatsoever is getting through, or are certain types/sources of traffic still working? For instance, compute host to VM SSH may be blocked, but outside network to VM (through the floating IP) may be working.

> After flushing iptables, traffic is forwarded again.

How did you flush iptables? By running the 'iptables' command on the compute host?

Cheers!
Comment 3 bigswitch 2017-07-17 11:01:05 EDT
Hi,
Yes, the VMs in this case are instances created on overcloud compute nodes.
iptables was flushed using 'iptables -F' to remove all entries.
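
(For the record, a safer version of that workaround, sketched here with an arbitrary backup path; flushing removes all host firewall protection until the rules come back:)

# Save the current rules first so they can be restored without a re-deploy:
iptables-save > /root/iptables.backup
# Flush all rules in the filter table:
iptables -F
# Later, to restore the saved rules:
# iptables-restore < /root/iptables.backup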

Thanks
Comment 4 bigswitch 2017-07-17 11:01:58 EDT
As far as I can tell, everything is blocked. I had an SSH session to one instance as well as a ping to another instance's floating IP. Both disconnect/time out when iptables changes.
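(To put numbers on the outage window, a simple probe loop like this can be left running from outside the cloud; <floating-ip> is a placeholder for an instance's floating IP:)

while true; do
  # Log a timestamped up/down result every 5 seconds
  if ping -c 1 -W 2 <floating-ip> > /dev/null 2>&1; then
    echo "$(date '+%F %T') up"
  else
    echo "$(date '+%F %T') DOWN"
  fi
  sleep 5
done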
Comment 5 bigswitch 2017-07-20 13:29:48 EDT
Hi,
I noticed that about 13 minutes into the upgrade converge, all of the neutron-bsn-agen-<name> chains are removed from iptables for 15 minutes before being restored. Is there some setting we must set to preserve the iptables rules?
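
(A rough loop one could run on the compute node to timestamp when the agent chains disappear and return; this is only an observation aid, not part of the upgrade:)

while true; do
  # Count iptables rules that reference the BigSwitch agent chains
  echo "$(date '+%F %T') bsn-chain rules: $(iptables -S | grep -c neutron-bsn-agen)"
  sleep 30
done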

During upgrade converge:
iptables -L   (Thu Jul 20 10:01:26 2017)

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             /* 000 accept related established rules */ state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere             /* 001 accept all icmp */ state NEW
ACCEPT     all  --  anywhere             anywhere             /* 002 accept all to lo interface */ state NEW
ACCEPT     tcp  --  anywhere             anywhere             multiport dports ssh /* 003 accept ssh */ state NEW
ACCEPT     udp  --  anywhere             anywhere             multiport dports ntp /* 105 ntp */ state NEW
ACCEPT     udp  --  anywhere             anywhere             multiport dports snmp /* 127 snmp */ state NEW
ACCEPT     tcp  --  anywhere             anywhere             multiport dports 16514,49152:49215,rfb:cvsup /* 200 nova_libvirt */ state NEW
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited
LOG        all  --  anywhere             anywhere             /* 998 log all */ LOG level warning
DROP       all  --  anywhere             anywhere             /* 999 drop all */ state NEW

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination


Normal working iptables:
[root@rhosp10-compute-1 heat-admin]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
neutron-bsn-agen-INPUT  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             /* 000 accept related established rules */ state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere             /* 001 accept all icmp */ state NEW
ACCEPT     all  --  anywhere             anywhere             /* 002 accept all to lo interface */ state NEW
ACCEPT     tcp  --  anywhere             anywhere             multiport dports ssh /* 003 accept ssh */ state NEW
ACCEPT     udp  --  anywhere             anywhere             multiport dports ntp /* 105 ntp */ state NEW
ACCEPT     udp  --  anywhere             anywhere             multiport dports snmp /* 127 snmp */ state NEW
ACCEPT     tcp  --  anywhere             anywhere             multiport dports 16514,49152:49215,rfb:cvsup /* 200 nova_libvirt */ state NEW
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited
LOG        all  --  anywhere             anywhere             /* 998 log all */ LOG level warning
DROP       all  --  anywhere             anywhere             /* 999 drop all */ state NEW

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
neutron-filter-top  all  --  anywhere             anywhere
neutron-bsn-agen-FORWARD  all  --  anywhere             anywhere
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
neutron-filter-top  all  --  anywhere             anywhere
neutron-bsn-agen-OUTPUT  all  --  anywhere             anywhere

Chain neutron-bsn-agen-FORWARD (1 references)
target     prot opt source               destination
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap72a0367a-e2 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap72a0367a-e2 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap7e4ba834-72 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap7e4ba834-72 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap85795536-01 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap85795536-01 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tap9b518858-67 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tap9b518858-67 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tapc6306b06-0a --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tapc6306b06-0a --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tapce4928c3-30 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tapce4928c3-30 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tapda733f61-c3 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tapda733f61-c3 --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-out tapf0da11c1-8e --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */
neutron-bsn-agen-sg-chain  all  --  anywhere             anywhere             PHYSDEV match --physdev-in tapf0da11c1-8e --physdev-is-bridged /* Direct traffic from the VM interface to the security group chain. */

Chain neutron-bsn-agen-INPUT (1 references) ... plus the other chains (output truncated)
Comment 6 bigswitch 2017-07-20 13:57:14 EDT
Re-running the same deploy converge a second time, I am not seeing this issue. However, this caused about a 15-minute outage during the upgrade.
Comment 7 Artom Lifshitz 2017-07-21 10:17:28 EDT
Thanks for the extra information! This doesn't look like a nova issue. I'm going to re-target this bug to the Upgrades folks; they should have a better idea of what's going on.
Comment 8 Lee Yarwood 2017-07-24 10:05:19 EDT
(In reply to bigswitch from comment #6)
> Re-running the same deploy converge a second time, I am not seeing this issue.
> However, this caused about a 15-minute outage during the upgrade.

Reading over the associated THT and puppet files for the firewall in Newton, this sounds as if you have ManageFirewall and PurgeFirewallRules, both introduced in 10/Newton, set to True somewhere in your environment. ManageFirewall defaults to True but PurgeFirewallRules defaults to False, so we shouldn't see rules purged. Can you confirm whether either is set in your env or whether they are defaulting to the aforementioned values?
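
Something like the following should surface any explicit settings, assuming your custom templates live under /home/stack/templates:

grep -rn 'ManageFirewall\|PurgeFirewallRules' \
    /home/stack/templates/ \
    /usr/share/openstack-tripleo-heat-templates/environments/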
Comment 9 bigswitch 2017-07-24 12:11:08 EDT
Hi,
I don't see ManageFirewall or PurgeFirewallRules defined in any of the Bigswitch environment files. Also, this happened on the second-to-last step, finalizing the upgrade (chapter 3.4.9 of the RHOSP 10 upgrade guide), where major-upgrade-pacemaker-converge.yaml is introduced.
Prior to that, iptables was fine and there was minimal traffic loss through each upgrade step (about 3-10 sec of ping loss; the 10 sec loss was due to VM live migration).
How do I verify during the converge step whether they are set to the default values?

Thanks
Comment 10 bigswitch 2017-07-24 17:31:51 EDT
Hi Lee,
Do you think that if I create a new yaml file, put ManageFirewall: True and PurgeFirewallRules: False in it, and then include that in the second-to-last step, it would prevent the firewall rules from being purged?

Thanks

Song
Comment 11 bigswitch 2017-07-24 19:30:03 EDT
This is from the overcloud.yaml file, if it helps:

cat /usr/share/openstack-tripleo-heat-templates/compat/overcloud.yaml | grep -i firewa -A4
  ManageFirewall:
    default: false
    description: Whether to manage IPtables rules.
    type: boolean
  PurgeFirewallRules:
    default: false
    description: Whether IPtables rules should be purged before setting up the ones.
    type: boolean
  MysqlInnodbBufferPoolSize:
--
          ManageFirewall: {get_param: ManageFirewall}
          PurgeFirewallRules: {get_param: PurgeFirewallRules}
          EnableGalera: {get_param: EnableGalera}
          EnableCephStorage: {get_param: ControllerEnableCephStorage}
          EnableSwiftStorage: {get_param: ControllerEnableSwiftStorage}
          ExtraConfig: {get_param: ExtraConfig}
Comment 12 Lee Yarwood 2017-07-25 10:18:27 EDT
(In reply to bigswitch from comment #11)
> This is from overcloud.yaml file if it helps:
> 
> cat /usr/share/openstack-tripleo-heat-templates/compat/overcloud.yaml | grep
> -i firewa -A4
>   ManageFirewall:
>     default: false
>     description: Whether to manage IPtables rules.
>     type: boolean
>   PurgeFirewallRules:
>     default: false
>     description: Whether IPtables rules should be purged before setting up
> the ones.
>     type: boolean
>   MysqlInnodbBufferPoolSize:
> --
>           ManageFirewall: {get_param: ManageFirewall}
>           PurgeFirewallRules: {get_param: PurgeFirewallRules}
>           EnableGalera: {get_param: EnableGalera}
>           EnableCephStorage: {get_param: ControllerEnableCephStorage}
>           EnableSwiftStorage: {get_param: ControllerEnableSwiftStorage}
>           ExtraConfig: {get_param: ExtraConfig}

I think I see the issue now: that is the Mitaka overcloud.yaml provided by openstack-tripleo-heat-templates-compat, where ManageFirewall defaulted to False. How are you launching the final deploy?

The following docs for the 9 to 10 upgrade highlight that major-upgrade-pacemaker-converge.yaml is used during a final deploy utilising the current version of openstack-tripleo-heat-templates, not the older compat version:

"The director needs to run through the upgrade finalization to ensure the Overcloud stack is synchronized with the current Heat template collection. This involves an environment file (major-upgrade-pacemaker-converge.yaml), which you include using the openstack overcloud deploy command."

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Finalization

Can you ensure you are not using the compat version of THT?
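
A quick way to compare the defaults between the two template trees on the undercloud (paths as quoted in this bug, assuming both packages are installed):

grep -A2 'ManageFirewall:' \
    /usr/share/openstack-tripleo-heat-templates/compat/overcloud.yaml \
    /usr/share/openstack-tripleo-heat-templates/puppet/services/tripleo-firewall.yaml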
Comment 13 bigswitch 2017-07-25 11:52:30 EDT
Hi
This is the deploy command:

time openstack overcloud deploy --stack rhosp10 \
-e /home/stack/templates/network-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/templates/timezone.yaml \
-e /home/stack/templates/rhel-registration/environment-rhel-registration.yaml \
-e /home/stack/templates/rhel-registration/rhel-registration-resource-registry.yaml \
-e /home/stack/templates/bigswitch-config.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
--neutron-network-type vlan --neutron-network-vlan-ranges rhosp:1001:2000 --neutron-bridge-mappings "rhosp:br-ex" --neutron-disable-tunneling --compute-scale 2 --control-scale 3 --templates --control-flavor control --compute-flavor compute --ntp-server 0.rhel.pool.ntp.org --timeout $timeout --debug 2>&1 

The compute/controller.yaml templates are copied from the previous RHOSP 9 templates and modified for our environment. bigswitch-config.yaml has been updated to the RHOSP 10 format, as has network-environment.yaml. It should not be using the compat version. The overcloud.yaml output above is what I got when grepping for ManageFirewall.
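
(To rule out ambiguity about which overcloud.yaml that grep hit, every copy shipped on the undercloud can be listed:)

find /usr/share/openstack-tripleo-heat-templates -name 'overcloud*.yaml'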
Comment 14 bigswitch 2017-07-25 12:04:25 EDT
btw, do you think that if I add this into a new yaml file

parameter_defaults:
  ManageFirewall: True
  PurgeFirewallRules: False

and then include it when running major-upgrade-pacemaker-converge, would this resolve the iptables issue?
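
Something like this is what I have in mind (the file name firewall-defaults.yaml is just a placeholder; my understanding is that -e files are applied in order, so it would go last to take precedence):

cat > /home/stack/templates/firewall-defaults.yaml <<'EOF'
parameter_defaults:
  ManageFirewall: true
  PurgeFirewallRules: false
EOF

# Same deploy command as in comment 13, with one extra -e appended last:
openstack overcloud deploy --stack rhosp10 --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
  -e /home/stack/templates/firewall-defaults.yaml
# (plus the other -e files and CLI flags from comment 13)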
Comment 15 Lee Yarwood 2017-07-26 05:04:30 EDT
(In reply to bigswitch from comment #14)
> btw, do you think that if I add this into a new yaml file
> 
> parameter_defaults:
>   ManageFirewall: True
>   PurgeFirewallRules: False
> 
> and then include it when running major-upgrade-pacemaker-converge, would
> this resolve the iptables issue?

As you're using the OSP 10 version, this should already be set. Your previous grep just threw me off; apologies for that:

# git grep -A3 ManageFirewall\:
puppet/services/tripleo-firewall.yaml:  ManageFirewall:
puppet/services/tripleo-firewall.yaml-    default: true
puppet/services/tripleo-firewall.yaml-    description: Whether to manage IPtables rules.
puppet/services/tripleo-firewall.yaml-    type: boolean
# git grep -A3 PurgeFirewallRules\:
puppet/services/tripleo-firewall.yaml:  PurgeFirewallRules:
puppet/services/tripleo-firewall.yaml-    default: false
puppet/services/tripleo-firewall.yaml-    description: Whether IPtables rules should be purged before setting up the new ones.
puppet/services/tripleo-firewall.yaml-    type: boolean
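
To confirm what actually landed on a node, you could also grep the generated hieradata on the compute host; the key names below are my assumption based on the tripleo-firewall service template mapping and may differ by release:

grep -rn 'manage_firewall\|purge_firewall_rules' /etc/puppet/hieradata/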

Looking again at the sosreport, this appears to be a known race between the L2 network agent and other host services in enabling and disabling the following sysctl settings:

sosreport-rhosp10-compute-0.localdomain-20170712091515/var/log/messages

39594 Jul 12 00:44:20 rhosp10-compute-0 neutron-enable-bridge-firewall.sh: net.bridge.bridge-nf-call-arptables = 1
39595 Jul 12 00:44:20 rhosp10-compute-0 neutron-enable-bridge-firewall.sh: net.bridge.bridge-nf-call-iptables = 1
39596 Jul 12 00:44:20 rhosp10-compute-0 neutron-enable-bridge-firewall.sh: net.bridge.bridge-nf-call-ip6tables = 1

Can you confirm that you regain connectivity to your instances around the time that the above is seen in /var/log/messages?
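
You can check the current values on the compute node at any point during the converge with:

sysctl net.bridge.bridge-nf-call-arptables \
       net.bridge.bridge-nf-call-iptables \
       net.bridge.bridge-nf-call-ip6tables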
Comment 17 bigswitch 2017-07-26 13:31:45 EDT
The timestamps are off and I didn't monitor the time at which traffic recovered. I did notice there is a 15 min interval during which traffic is lost before it recovers. From the log, is it possible to determine when the iptables rules were flushed and then reinstalled? There should be about a 15 min difference between them.
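
(As a rough starting point, something like this over the sosreport might narrow the window; the patterns are only a guess at what the relevant lines contain:)

grep -En 'neutron-enable-bridge-firewall|bridge-nf-call|bsn' \
    sosreport-*/var/log/messages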
Comment 18 bigswitch 2017-07-28 14:35:10 EDT
Hi Lee,
I re-ran the deployment and don't see that message at the point when the firewall is restored and traffic resumes. The only messages in /var/log/messages when traffic resumed are these:

Jul 28 11:19:37 rhosp10-compute-0 os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
Jul 28 11:19:54 rhosp10-compute-0 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jul 28 11:19:54 rhosp10-compute-0 os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
Jul 28 11:20:25 rhosp10-compute-0 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jul 28 11:20:25 rhosp10-compute-0 os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
Jul 28 11:20:55 rhosp10-compute-0 os-collect-config: /var/lib/os-collect-config/local-data not found. Skipping
Jul 28 11:20:55 rhosp10-compute-0 os-collect-config: No local metadata found (['/var/lib/os-collect-config/local-data'])
Comment 19 Assaf Muller 2017-08-28 09:18:27 EDT
Assigned to Brent for further triage.
Comment 20 Assaf Muller 2017-08-30 10:04:05 EDT
Random brain dump - I noticed:
"Driver configuration doesn't match with enable_security_group" in the BigSwitch agent logs. That warning comes from:
https://github.com/openstack/neutron/blob/stable/newton/neutron/agent/securitygroups_rpc.py#L50

This implies that the security groups configuration is invalid. We'd have to take a closer look at /etc/neutron/* in the sosreport; I ran out of time right now, so I just wanted to add this info to the RHBZ.
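The settings that warning compares should be visible in the sosreport with something like:

grep -Ern 'enable_security_group|firewall_driver' sosreport-*/etc/neutron/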
