Bug 2033570 - [OSP17] Iptables rules on undercloud are missing after reboot
Summary: [OSP17] Iptables rules on undercloud are missing after reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Brendan Shephard
QA Contact: Jason Grosso
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-12-17 09:36 UTC by Mikolaj Ciecierski
Modified: 2022-09-21 12:18 UTC (History)

Fixed In Version: tripleo-ansible-3.3.1-0.20220326002748.9efbca4.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:18:08 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 823893 | 0 | None | MERGED | Ensure firewall rules are saved | 2022-06-22 15:04:58 UTC
Red Hat Issue Tracker | OSP-11873 | 0 | None | None | None | 2021-12-17 09:38:36 UTC
Red Hat Issue Tracker | UPG-4882 | 0 | None | None | None | 2021-12-21 00:05:38 UTC
Red Hat Product Errata | RHEA-2022:6543 | 0 | None | None | None | 2022-09-21 12:18:36 UTC

Description Mikolaj Ciecierski 2021-12-17 09:36:05 UTC
Description of problem:
The iptables nat rules are missing after an undercloud reboot, which prevents compute nodes from reaching the external network.

Version-Release number of selected component (if applicable):


How reproducible: always


Steps to Reproduce:
1. Deploy the undercloud.
2. Check the connection to the external network from a compute node. It should be able to reach the external network.
3. Reboot the undercloud.
4. Check the connection to the external network from the compute node again. It has no access to the external network.

Actual results:
The nat table is empty after the undercloud reboot:
undercloud-0 ~]# iptables-save -t nat
# Generated by iptables-save v1.8.4 on Fri Dec 17 09:28:47 2021
*nat
COMMIT
# Completed on Fri Dec 17 09:28:47 2021


Expected results:
After the undercloud reboot, the nat table still has the rules that let other nodes (e.g. compute nodes) reach the external network. The nat rules should be the same as before the reboot:
undercloud-0 ~]# iptables-save -t nat
# Generated by iptables-save v1.8.4 on Thu Dec 16 14:32:04 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:CNI-HOSTPORT-SETMARK - [0:0]
:CNI-HOSTPORT-MASQ - [0:0]
:CNI-HOSTPORT-DNAT - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j CNI-HOSTPORT-DNAT
-A POSTROUTING -m comment --comment "CNI portfwd requiring masquerade" -j CNI-HOSTPORT-MASQ
-A POSTROUTING -s 192.168.24.0/24 -d 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24 ipv4" -j RETURN
-A POSTROUTING -s 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "138 routed_network masquerade 192.168.24.0/24 ipv4" -j MASQUERADE
-A OUTPUT -m addrtype --dst-type LOCAL -j CNI-HOSTPORT-DNAT
-A CNI-HOSTPORT-SETMARK -m comment --comment "CNI portfwd masquerade mark" -j MARK --set-xmark 0x2000/0x2000
-A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE
COMMIT
# Completed on Thu Dec 16 14:32:04 2021
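
The expectation that the nat rules are the same before and after a reboot can be checked mechanically. A minimal sketch, assuming both dumps were captured with iptables-save -t nat into files (the function names and file paths below are hypothetical): it drops the timestamp comments and resets the chain counters, both of which legitimately change across a reboot, and compares what is left.

```shell
# normalize_nat: read iptables-save output on stdin and keep only what
# should survive a reboot. Comment lines ("# Generated ...") are dropped
# and chain counters such as [139:9155] are reset to [0:0].
normalize_nat() {
    grep -v '^#' | sed -E 's/\[[0-9]+:[0-9]+\][[:space:]]*$/[0:0]/'
}

# nat_rules_match BEFORE AFTER: succeed if both dumps hold the same
# chains and rules once timestamps and counters are ignored.
nat_rules_match() {
    before=$(normalize_nat < "$1")
    after=$(normalize_nat < "$2")
    [ "$before" = "$after" ]
}

# Hypothetical usage (paths are placeholders):
#   sudo iptables-save -t nat > /root/nat-before.txt
#   sudo reboot
#   sudo iptables-save -t nat > /root/nat-after.txt
#   nat_rules_match /root/nat-before.txt /root/nat-after.txt || echo "nat rules changed"
```

With the two dumps in this report, the comparison would fail, because the post-reboot dump contains no chains or rules at all.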
 
Additional info:

Comment 2 Brendan Shephard 2021-12-19 11:33:41 UTC
I'm not able to reproduce this on OSP17 deployed with infrared:


[stack@undercloud-0 ~]$ sudo systemctl restart iptables
[stack@undercloud-0 ~]$ sudo iptables-save -t nat
# Generated by iptables-save v1.8.4 on Sun Dec 19 11:30:03 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [32:1920]
:OUTPUT ACCEPT [32:1920]
-A POSTROUTING -s 192.168.24.0/24 -d 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24 ipv4" -j RETURN
-A POSTROUTING -s 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "138 routed_network masquerade 192.168.24.0/24 ipv4" -j MASQUERADE
COMMIT
# Completed on Sun Dec 19 11:30:03 2021
[stack@undercloud-0 ~]$ sudo reboot
[stack@undercloud-0 ~]$ sudo iptables-save -t nat
# Generated by iptables-save v1.8.4 on Sun Dec 19 11:31:09 2021
*nat
:PREROUTING ACCEPT [3:728]
:INPUT ACCEPT [1:60]
:POSTROUTING ACCEPT [139:9155]
:OUTPUT ACCEPT [139:9155]
-A POSTROUTING -s 192.168.24.0/24 -d 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24 ipv4" -j RETURN
-A POSTROUTING -s 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "138 routed_network masquerade 192.168.24.0/24 ipv4" -j MASQUERADE
COMMIT
# Completed on Sun Dec 19 11:31:09 2021
[stack@undercloud-0 ~]$ uptime
 11:31:45 up 1 min,  1 user,  load average: 2.30, 0.81, 0.29

I'll take a look at the Jenkins job tomorrow morning and see if there is something that stands out.

Comment 3 Brendan Shephard 2021-12-20 00:49:11 UTC
Hmm, this shouldn't still be empty:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/staging/DFG-upgrades-updates-17.0-from-passed_phase1-HA_no_ceph-ipv4/23/undercloud-0/etc/sysconfig/iptables.gz

Which happens here:
❯ grep "Create empty ruleset" undercloud_install.log
2021-12-16 12:55:19.361381 | 52540094-4a29-fbb2-c5fb-0000000001e7 |       TASK | Create empty ruleset in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables
2021-12-16 12:55:19.814658 | 52540094-4a29-fbb2-c5fb-0000000001e7 |    CHANGED | Create empty ruleset in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables | undercloud-0 | item=/etc/sysconfig/iptables
2021-12-16 12:55:19.816327 | 52540094-4a29-fbb2-c5fb-0000000001e7 |     TIMING | tripleo_bootstrap : Create empty ruleset in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables | undercloud-0 | 0:00:22.476563 | 0.45s
2021-12-16 12:55:20.163430 | 52540094-4a29-fbb2-c5fb-0000000001e7 |    CHANGED | Create empty ruleset in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables | undercloud-0 | item=/etc/sysconfig/ip6tables
2021-12-16 12:55:20.164396 | 52540094-4a29-fbb2-c5fb-0000000001e7 |     TIMING | tripleo_bootstrap : Create empty ruleset in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables | undercloud-0 | 0:00:22.824637 | 0.80s
2021-12-16 12:55:20.166099 | 52540094-4a29-fbb2-c5fb-0000000001e7 |     TIMING | tripleo_bootstrap : Create empty ruleset in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables | undercloud-0 | 0:00:22.826348 | 0.80s


Masquerade rules created here:
❯ grep 'routed_network masquerade' undercloud_install.log
        "<13>Dec 16 12:59:32 puppet-user: Notice: /Stage[main]/Tripleo::Masquerade_networks/Tripleo::Firewall::Rule[138 routed_network masquerade 192.168.24.0/24]/Firewall[138 routed_network masquerade 192.168.24.0/24 ipv4]/ensure: created",
        "<13>Dec 16 12:59:32 puppet-user: Notice: /Stage[main]/Tripleo::Masquerade_networks/Tripleo::Firewall::Rule[138 routed_network masquerade 192.168.24.0/24]/Firewall[138 routed_network masquerade 192.168.24.0/24 ipv4]/ensure: created",
        "<13>Dec 16 12:59:32 puppet-user: Notice: /Stage[main]/Tripleo::Masquerade_networks/Tripleo::Firewall::Rule[138 routed_network masquerade 192.168.24.0/24]/Firewall[138 routed_network masquerade 192.168.24.0/24 ipv4]/ensure: created",
        "<13>Dec 16 12:59:32 puppet-user: Notice: /Stage[main]/Tripleo::Masquerade_networks/Tripleo::Firewall::Rule[138 routed_network masquerade 192.168.24.0/24]/Firewall[138 routed_network masquerade 192.168.24.0/24 ipv4]/ensure: created",


A new /etc/sysconfig/iptables file is created as part of this puppet run ^^. Example:

(venv) [stack@undercloud-0 undercloud]$ sudo mv /etc/sysconfig/iptables{,-backup}
(venv) [stack@undercloud-0 undercloud]$ sudo iptables -t nat -v -L POSTROUTING --line-number
Chain POSTROUTING (policy ACCEPT 182K packets, 11M bytes)
num   pkts bytes target     prot opt in     out     source               destination
1     168K   10M RETURN     all  --  any    any     192.168.24.0/24      192.168.24.0/24      state NEW,RELATED,ESTABLISHED /* 137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24 ipv4 */
2       47  3552 MASQUERADE  all  --  any    any     192.168.24.0/24      anywhere             state NEW,RELATED,ESTABLISHED /* 138 routed_network masquerade 192.168.24.0/24 ipv4 */


(venv) [stack@undercloud-0 undercloud]$ sudo iptables -t nat -D POSTROUTING 2
(venv) [stack@undercloud-0 undercloud]$ sudo iptables -t nat -D POSTROUTING 1
(venv) [stack@undercloud-0 undercloud]$ sudo iptables -t nat -v -L POSTROUTING --line-number
Chain POSTROUTING (policy ACCEPT 182K packets, 11M bytes)
num   pkts bytes target     prot opt in     out     source               destination


(venv) [stack@undercloud-0 undercloud]$ cat puppet_apply
    puppet apply -vvv \
    --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules \
    --detailed-exitcodes \
    --summarize \
    --color=true \
    /var/lib/tripleo-config/puppet_step_config.pp

(venv) [stack@undercloud-0 undercloud]$ sudo bash puppet_apply
Notice: Compiled catalog for undercloud-0.redhat.local in environment production in 0.26 seconds
Info: Applying configuration version '1639960881'
Notice: /Stage[main]/Tripleo::Masquerade_networks/Tripleo::Firewall::Rule[137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24]/Firewall[137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24 ipv4]/ensure: created
Notice: /Stage[main]/Tripleo::Masquerade_networks/Tripleo::Firewall::Rule[138 routed_network masquerade 192.168.24.0/24]/Firewall[138 routed_network masquerade 192.168.24.0/24 ipv4]/ensure: created

(venv) [stack@undercloud-0 undercloud]$ sudo iptables -t nat -vL POSTROUTING
Chain POSTROUTING (policy ACCEPT 182K packets, 11M bytes)
 pkts bytes target     prot opt in     out     source               destination
  270 16204 RETURN     all  --  any    any     192.168.24.0/24      192.168.24.0/24      state NEW,RELATED,ESTABLISHED /* 137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24 ipv4 */
    0     0 MASQUERADE  all  --  any    any     192.168.24.0/24      anywhere             state NEW,RELATED,ESTABLISHED /* 138 routed_network masquerade 192.168.24.0/24 ipv4 */

(venv) [stack@undercloud-0 undercloud]$ sudo grep -i nat /etc/sysconfig/iptables
-A FORWARD -d 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "140 routed_network forward destinations 192.168.24.0/24 ipv4" -j ACCEPT
*nat

(venv) [stack@undercloud-0 undercloud]$ sudo iptables-save -t nat
# Generated by iptables-save v1.8.4 on Mon Dec 20 00:44:07 2021
*nat
:PREROUTING ACCEPT [50:4280]
:INPUT ACCEPT [1:60]
:POSTROUTING ACCEPT [182507:10833591]
:OUTPUT ACCEPT [182519:10834311]
-A POSTROUTING -s 192.168.24.0/24 -d 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "137 routed_network return src 192.168.24.0/24 dest 192.168.24.0/24 ipv4" -j RETURN
-A POSTROUTING -s 192.168.24.0/24 -m state --state NEW,RELATED,ESTABLISHED -m comment --comment "138 routed_network masquerade 192.168.24.0/24 ipv4" -j MASQUERADE
COMMIT
# Completed on Mon Dec 20 00:44:07 2021




All that to say that I think there is something weird going on with that particular deployment. Are you able to reproduce that error every time?

Comment 6 Brendan Shephard 2021-12-21 00:13:00 UTC
The issue here is specific to the undercloud update process. Since it isn't executing puppet tasks, it isn't applying the masquerade firewall rules again.

Re-running puppet restores the rules:
    puppet apply -vvv \
    --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules \
    --detailed-exitcodes \
    --summarize \
    --color=true \
    /var/lib/tripleo-config/puppet_step_config.pp

Maybe the solution is to consolidate all of these firewall rules into tripleo-ansible and ensure they are all restored during the update_steps. Having a combination of Puppet and Ansible adding and changing firewall rules probably isn't ideal anyway.
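
A quick way to tell whether an updated undercloud has lost the rule is to look for the masquerade entry in the live nat table. This is only a sketch under stated assumptions: has_masquerade_rule is a hypothetical helper, and 192.168.24.0/24 is the control plane subnet seen in the dumps in this report.

```shell
# has_masquerade_rule SUBNET: read `iptables-save -t nat` output on stdin
# and succeed if a POSTROUTING rule sourced from SUBNET jumps to MASQUERADE.
has_masquerade_rule() {
    grep -F -- "-A POSTROUTING -s $1 " | grep -q -- '-j MASQUERADE'
}

# Hypothetical usage (requires root):
#   sudo iptables-save -t nat | has_masquerade_rule 192.168.24.0/24 \
#       || echo "masquerade rule missing, re-run puppet apply"
```

The RETURN rule (137) for the same subnet does not satisfy the check, since only lines ending in the MASQUERADE target match the second grep.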

Comment 8 Brendan Shephard 2022-01-08 11:46:41 UTC
Oh yeah, I see what you mean:

(undercloud) [stack@tripleo-director ~]$ sudo cat /etc/sysconfig/iptables
# empty ruleset created by deployed-server bootstrap(undercloud) [stack@tripleo-director ~]$

(undercloud) [stack@tripleo-director ~]$ sudo puppet apply -vvv --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules --detailed-exitcodes --summarize --color=true /var/lib/tripleo-config/puppet_step_config.pp
[...]

Info: Applying configuration version '1641640001'
Notice: Applied catalog in 1.24 seconds
Changes:
Events:
Resources:
            Total: 22
Time:
       Filebucket: 0.00
         Schedule: 0.00
          Package: 0.00
         Firewall: 0.00
             Exec: 0.01
           Augeas: 0.03
             File: 0.06
          Service: 0.12
   Config retrieval: 0.88
   Transaction evaluation: 1.23
   Catalog application: 1.24
         Last run: 1641640003
            Total: 1.25
Version:
           Config: 1641640001
           Puppet: 7.8.0
(undercloud) [stack@tripleo-director ~]$ sudo cat /etc/sysconfig/iptables
# empty ruleset created by deployed-server bootstrap(undercloud) [stack@tripleo-director ~]$


The iptables-save happens in firewall.pp, but firewall.pp isn't included in /var/lib/tripleo-config/puppet_step_config.pp, and we have manage_firewall set to false here:
https://github.com/openstack/tripleo-heat-templates/blob/master/deployment/tripleo-firewall/tripleo-firewall-baremetal-ansible.yaml#L58

So it must be managed by the tripleo_firewall Ansible role:
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_firewall/tasks/main.yml#L61-L70

Looks like this would find all of the rules that are already in memory:
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_firewall/tasks/main.yml#L56-L59

And then determine that no changes are required, so it would never execute that block:
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_firewall/tasks/main.yml#L62-L63

Which we should be able to verify in the logs:
2022-01-08 21:15:29,902 p=71145 u=root n=ansible | 2022-01-08 21:15:29.901446 | 566f14f3-0016-3ca3-7e4e-0000000006c2 |       TASK | Save firewall rules ipv4
2022-01-08 21:15:29,950 p=71145 u=root n=ansible | 2022-01-08 21:15:29.948514 | 566f14f3-0016-3ca3-7e4e-0000000006c2 |    SKIPPED | Save firewall rules ipv4 | tripleo-director


So the problem that needs fixing is the "when" condition that determines whether or not that block needs to be executed.
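
To illustrate the kind of condition being discussed, here is a minimal shell sketch. It is not the tripleo_firewall role's actual logic: the function names and the rules-changed flag are hypothetical. It treats a ruleset file that contains only comments or blank lines (such as the bootstrap placeholder shown above) as empty, and decides to save the in-memory rules whenever the rules changed or the file is empty.

```shell
# ruleset_file_is_empty FILE: succeed if FILE is missing or holds nothing
# but blank lines and "#" comments (e.g. the bootstrap placeholder).
ruleset_file_is_empty() {
    if [ ! -f "$1" ]; then
        return 0
    fi
    # grep -v succeeds when at least one line is neither blank nor a
    # comment, so the result is inverted.
    ! grep -qvE '^[[:space:]]*(#.*)?$' "$1"
}

# should_save_rules RULES_CHANGED FILE: RULES_CHANGED is "true" or "false"
# (a hypothetical flag supplied by the caller).
should_save_rules() {
    [ "$1" = "true" ] || ruleset_file_is_empty "$2"
}

# Hypothetical usage (requires root):
#   if should_save_rules false /etc/sysconfig/iptables; then
#       iptables-save > /etc/sysconfig/iptables
#   fi
```

With a check like this, an unchanged in-memory ruleset no longer causes the save step to be skipped while the file on disk still holds only the placeholder.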

Adding dfg:hardprov as well.

Comment 9 Brendan Shephard 2022-01-08 12:53:46 UTC
I'm sure there is a more elegant solution, but if this is going to become a blocker for anything, then this should fix the issue in the interim:
https://review.opendev.org/c/openstack/tripleo-ansible/+/823893

Would still appreciate some additional feedback from dfg:upgrades and dfg:hardprov

Comment 10 Brendan Shephard 2022-01-08 13:06:46 UTC
Results after that change:

2022-01-08 23:05:30.325631 | 566f14f3-0016-7e76-8d0f-0000000006c0 |       TASK | Manage firewall rules
2022-01-08 23:05:54.119043 | 566f14f3-0016-7e76-8d0f-0000000006c0 |         OK | Manage firewall rules | tripleo-director
2022-01-08 23:05:54.121579 | 566f14f3-0016-7e76-8d0f-0000000006c0 |     TIMING | tripleo_firewall : Manage firewall rules | tripleo-director | 0:01:46.348490 | 23.79s
2022-01-08 23:05:54.159581 | 566f14f3-0016-7e76-8d0f-0000000006c1 |       TASK | Check that /etc/sysconfig/iptables isn't empty
2022-01-08 23:05:54.835567 | 566f14f3-0016-7e76-8d0f-0000000006c1 |    CHANGED | Check that /etc/sysconfig/iptables isn't empty | tripleo-director
2022-01-08 23:05:54.838318 | 566f14f3-0016-7e76-8d0f-0000000006c1 |     TIMING | tripleo_firewall : Check that /etc/sysconfig/iptables isn't empty | tripleo-director | 0:01:47.065235 | 0.68s
2022-01-08 23:05:54.882352 | 566f14f3-0016-7e76-8d0f-0000000006c3 |       TASK | Save firewall rules ipv4
2022-01-08 23:05:55.326388 | 566f14f3-0016-7e76-8d0f-0000000006c3 |    CHANGED | Save firewall rules ipv4 | tripleo-director
2022-01-08 23:05:55.328749 | 566f14f3-0016-7e76-8d0f-0000000006c3 |     TIMING | tripleo_firewall : Save firewall rules ipv4 | tripleo-director | 0:01:47.555666 | 0.44s
2022-01-08 23:05:55.363950 | 566f14f3-0016-7e76-8d0f-0000000006c4 |       TASK | Save firewall rules ipv6
2022-01-08 23:05:55.786483 | 566f14f3-0016-7e76-8d0f-0000000006c4 |    CHANGED | Save firewall rules ipv6 | tripleo-director

Comment 11 Harald Jensås 2022-01-25 23:16:06 UTC
(In reply to Brendan Shephard from comment #9)
> I'm sure there is a more elegant solution. But if this is going to become a
> blocker for anything, than this should fix the issue in the interim:
> https://review.opendev.org/c/openstack/tripleo-ansible/+/823893
> 
> Would still appreciate some additional feedback from dfg:upgrades and
> dfg:hardprov

Undercloud as a router for the overcloud nodes is a bad idea ...
This functionality is there to allow test/dev environments to use the undercloud as a router.
IMO, we should deprecate and remove this functionality instead of spending resources on refactoring it.

With the uncertain(?) role of Ansible in future TripleO, spending resources on re-implementing this in Ansible does not make sense.

If the proposed patch works, let's roll with it. And if we re-factor firewalling in tripleo to not use ansible we can re-visit.

Comment 18 errata-xmlrpc 2022-09-21 12:18:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

