Bug 2233300 - [hackfest] OSP17.1 deployment gets stuck on "Manage firewall rules" step for long time
Summary: [hackfest] OSP17.1 deployment gets stuck on "Manage firewall rules" step for ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.1 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z3
Target Release: 17.1
Assignee: James Slagle
QA Contact: David Rosenfeld
URL:
Whiteboard:
Duplicates: 2247573 2265072
Depends On:
Blocks: 2247573
 
Reported: 2023-08-21 20:17 UTC by Chris Janiszewski
Modified: 2024-05-22 20:35 UTC
CC List: 20 users

Fixed In Version: tripleo-ansible-3.3.1-17.1.20231101230831.el9ost openstack-ansible-core-2.14.2-4.3.el9ost
Doc Type: Known Issue
Doc Text:
Cause: Not using a valid DNS server.
Consequence: In cases where a valid DNS server cannot be reached, Ansible can take a long time adding rules due to iptables attempting to do reverse DNS lookups. This will be apparent in the "Manage firewall rules" Ansible task.
Workaround (if any): Provide or configure a DNS server for the environment.
Result: With a valid DNS server available, the "Manage firewall rules" task should complete in a reasonable time.
Clone Of:
Cloned to: 2247573
Environment:
Last Closed: 2024-05-22 20:35:15 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker OSP-27631 (last updated 2023-08-21 20:25:47 UTC)
Red Hat Knowledge Base (Solution) 7032505 (last updated 2024-02-26 20:57:47 UTC)
Red Hat Product Errata RHSA-2024:2733 (last updated 2024-05-22 20:35:23 UTC)

Description Chris Janiszewski 2023-08-21 20:17:59 UTC
Description of problem:

During the deployment of OSP 17.1, the "Manage firewall rules" task gets stuck with very little to no activity on the overcloud nodes.

2023-08-21 15:14:56,693 p=391324 u=stack n=ansible | 2023-08-21 15:14:56.692877 | fa163e12-50f1-9fd7-a167-0000000034a2 |       TASK | create kill_scripts directory within /var/lib/neutron
2023-08-21 15:14:57,038 p=391324 u=stack n=ansible | 2023-08-21 15:14:57.038161 | fa163e12-50f1-9fd7-a167-0000000034a2 |    CHANGED | create kill_scripts directory within /var/lib/neutron | chrisj-osp171-novacompute-1
2023-08-21 15:14:57,061 p=391324 u=stack n=ansible | 2023-08-21 15:14:57.061257 | fa163e12-50f1-9fd7-a167-0000000034a3 |       TASK | create haproxy kill script
2023-08-21 15:14:57,211 p=391324 u=stack n=ansible | 2023-08-21 15:14:57.210286 | fa163e12-50f1-9fd7-a167-0000000034a3 |    CHANGED | create haproxy kill script | chrisj-osp171-novacompute-0
2023-08-21 15:14:57,853 p=391324 u=stack n=ansible | 2023-08-21 15:14:57.852961 | fa163e12-50f1-9fd7-a167-0000000034a3 |    CHANGED | create haproxy kill script | chrisj-osp171-novacompute-1
2023-08-21 15:58:41,781 p=391324 u=stack n=ansible | 2023-08-21 15:58:41.780553 | fa163e12-50f1-9fd7-a167-000000003395 |    CHANGED | Manage firewall rules | chrisj-osp171-controller-1
2023-08-21 15:58:41,806 p=391324 u=stack n=ansible | 2023-08-21 15:58:41.806298 | fa163e12-50f1-9fd7-a167-000000003397 |       TASK | Save firewall rules ipv4
2023-08-21 15:58:42,142 p=391324 u=stack n=ansible | 2023-08-21 15:58:42.141171 | fa163e12-50f1-9fd7-a167-000000003397 |    CHANGED | Save firewall rules ipv4 | chrisj-osp171-controller-1
2023-08-21 15:58:42,170 p=391324 u=stack n=ansible | 2023-08-21 15:58:42.170276 | fa163e12-50f1-9fd7-a167-000000003398 |       TASK | Save firewall rules ipv6
2023-08-21 15:58:42,493 p=391324 u=stack n=ansible | 2023-08-21 15:58:42.492956 | fa163e12-50f1-9fd7-a167-000000003398 |    CHANGED | Save firewall rules ipv6 | chrisj-osp171-controller-1


The example above shows a 45-minute delay on that step (see the timestamps). We have seen other deployments get stuck for 2 hours on the same step.


Version-Release number of selected component (if applicable):
For some reason the overcloud shows Beta even though I have pulled the GA bits:
[root@chrisj-osp171-controller-2 ~]# cat /etc/rhosp-release 
Red Hat OpenStack Platform release 17.1.0 Beta (Wallaby)


How reproducible:
It seems to happen every time, although the length of time it gets stuck varies.


Steps to Reproduce:
1. Deploy OSP 17.1 and observe the deployment time


Actual results:
The deployment gets stuck on what should be a simple operation.

Expected results:
The deployment gets through creating the firewall rules quickly.

Additional info:
The deployment eventually completes, but with a significant delay.

Comment 1 Brendan Shephard 2023-08-22 00:59:37 UTC
Doesn't seem that we're seeing the same results in CI pipelines:
https://zuul.opendev.org/t/openstack/build/aac5ea93080a4001882e09dd331c56e9/log/logs/undercloud/home/zuul/ansible.log#716

Can you re-run the Ansible execution with -vvv to capture the debug logs? (You can just run the ansible-playbook-command.sh script in the config-download directory with -vvv).
Are you passing in any custom firewall rules?

Comment 2 Chris Janiszewski 2023-08-22 17:51:31 UTC
No custom firewall rules, and it doesn't seem to get stuck on update, so it looks like it only happens on the initial deployment. We are at the OSP 17.1 hackfest; I'll try to get someone who has not yet deployed their overcloud to start the process with --debug in the script and will try to capture that.

Comment 3 Eric Harney 2023-08-22 18:03:05 UTC
I saw this hang for minutes on my deployment too -- it looked like it was waiting for a slow "iptables -t filter -L INPUT" on a compute node. Maybe it was struggling to run reverse DNS lookups? Running "iptables -t filter -L INPUT -n" there returned instantly.
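
(For anyone else debugging this, a quick timing check along the following lines can confirm whether reverse DNS is the culprit. This is only a sketch; it assumes Python 3 and root access on the affected node.)

import subprocess
import time

# Compare the chain listing with and without -n; a large difference points
# at reverse DNS lookups rather than iptables itself.
for cmd in (["iptables", "-t", "filter", "-L", "INPUT"],
            ["iptables", "-t", "filter", "-L", "INPUT", "-n"]):
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    print(" ".join(cmd), "took", round(time.monotonic() - start, 2), "seconds")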

Comment 4 Chris Janiszewski 2023-08-22 18:16:55 UTC
Another observation is that the process gets stuck even longer on more complex deployments. For example, it has been waiting for over 2 hours on a DCN-style deployment with three times as many subnets and IPs.

Comment 7 Chris Janiszewski 2023-08-23 18:51:08 UTC
Brendan,

Here is a debug log from another run. 
http://chrisj.cloud/ansible.log.tar.bz

Comment 8 Brendan Shephard 2023-08-24 02:09:29 UTC
Seems to be fine on the Compute nodes:
2023-08-23 11:54:40,122 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.122193 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-computehci-0 | tripleo_firewall : Manage firewall rules | 27.44s
2023-08-23 11:54:40,122 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.122654 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-computehci-1 | tripleo_firewall : Manage firewall rules | 27.02s
2023-08-23 11:54:40,123 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.123053 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-computehci-2 | tripleo_firewall : Manage firewall rules | 26.62s

It is just the Controller nodes that appear to be taking ages:
2023-08-23 11:54:40,116 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.115993 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-controller-1 | tripleo_firewall : Manage firewall rules | 3335.07s
2023-08-23 11:54:40,116 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.116415 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-controller-0 | tripleo_firewall : Manage firewall rules | 3072.81s
2023-08-23 11:54:40,117 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.116855 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-controller-2 | tripleo_firewall : Manage firewall rules | 2229.57s

The nodes aren't being rebooted or anything, are they? I noticed it also complains about them being UNREACHABLE at some point:
2023-08-23 11:26:18,143 p=89502 u=stack n=ansible | 2023-08-23 11:26:18.143012 | fa163e1d-f3ca-004e-1d59-000000003703 | UNREACHABLE | Manage firewall rules | overcloud-controller-2
2023-08-23 11:40:22,216 p=89502 u=stack n=ansible | 2023-08-23 11:40:22.215375 | fa163e1d-f3ca-004e-1d59-000000003703 | UNREACHABLE | Manage firewall rules | overcloud-controller-0
2023-08-23 11:44:44,994 p=89502 u=stack n=ansible | 2023-08-23 11:44:44.994172 | fa163e1d-f3ca-004e-1d59-000000003703 | UNREACHABLE | Manage firewall rules | overcloud-controller-1
2023-08-23 11:54:40,116 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.115993 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-controller-1 | tripleo_firewall : Manage firewall rules | 3335.07s
2023-08-23 11:54:40,116 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.116415 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-controller-0 | tripleo_firewall : Manage firewall rules | 3072.81s
2023-08-23 11:54:40,117 p=89502 u=stack n=ansible | 2023-08-23 11:54:40.116855 | fa163e1d-f3ca-004e-1d59-000000003703 |    SUMMARY | overcloud-controller-2 | tripleo_firewall : Manage firewall rules | 2229.57s


Maybe we'll need sosreports from those controllers along with the config-download directory from the director node.

Comment 9 Takashi Kajinami 2023-08-24 03:08:33 UTC
Just dumping some context behind this bug.

We changed the tooling used to manage firewall rules between OSP 16 and OSP 17. In OSP 16 and older versions we used Puppet
(the firewall_rule type from puppetlabs-firewall), while in OSP 17 we use Ansible (the built-in iptables module).
The core difference between the Ansible iptables module and the Puppet one is how the tooling checks whether
a requested rule already exists before inserting it. The firewall_rule resource type in puppetlabs-firewall
implements the 'prefetch' interface, which means the provider first gets the list of all existing rules using
a single iptables-save command. The iptables module in Ansible, on the other hand, is not as smart: it
checks for the existence of the chain and rule with `iptables -C` and `iptables -L` on every single insert/delete operation.
This is likely why the task takes longer when more rules are being configured.
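
To make the difference concrete, here is a rough Python sketch of the two strategies (illustrative only; this is neither the puppetlabs-firewall code nor the ansible-core module):

import subprocess

def prefetch_existing_rules():
    # "Prefetch" strategy: a single iptables-save call, after which all
    # existence checks happen in memory; no per-rule commands and no
    # reverse DNS lookups.
    out = subprocess.run(["iptables-save"], capture_output=True, text=True, check=True)
    return {line for line in out.stdout.splitlines() if line.startswith("-A")}

def rule_present_per_call(rule_args):
    # Per-rule strategy: one external iptables -C invocation for every
    # managed rule, plus separate chain listings (-L) that can trigger
    # reverse DNS lookups when no resolver is reachable.
    result = subprocess.run(["iptables", "-C"] + rule_args,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0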

I agree with Eric about what he mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2233300#c3 .
My current guess is that these iptables commands get stuck on DNS resolution because -n is not added to the
list commands.

My initial questions here would be:
 - Do these overcloud nodes have any DNS servers configured?
  - If yes, are they really reachable from these overcloud nodes?
  - If no, is the issue reproduced even when you prepare a DNS server?

If the problem can be reproduced in a deployment WITHOUT a DNS server, then we probably have to modify
the built-in iptables module somehow to avoid that DNS resolution.

Comment 11 Keigo Noha 2023-08-24 23:49:20 UTC
Hi team,

My customer will soon deploy a new OSP 17.1 environment. If we suspect iptables.py in Ansible, they are willing to test a possible fix.
In this discussion, we suspect that Ansible's current method of checking for existing entries is inefficient.
I think the modification points are:

def get_chain_policy(iptables_path, module, params):
    cmd = push_arguments(iptables_path, '-L', params, make_rule=False)

ref. https://github.com/ansible/ansible/blob/devel/lib/ansible/modules/iptables.py#L731-L732

def check_chain_present(iptables_path, module, params):
    cmd = push_arguments(iptables_path, '-L', params, make_rule=False)

ref. https://github.com/ansible/ansible/blob/devel/lib/ansible/modules/iptables.py#L754-L755

Should the argument currently passed, '-L', be '-n -L' instead?

If yes, I'll share the modification to the customer and get the result from them.
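
For discussion, a sketch of what that change could look like is below (untested, the function body is only approximated from the upstream module, and the same tweak would apply to get_chain_policy):

def check_chain_present(iptables_path, module, params):
    cmd = push_arguments(iptables_path, '-L', params, make_rule=False)
    # Hypothetical tweak: list numerically so iptables does not attempt
    # reverse DNS lookups while checking for the chain.
    cmd.append('--numeric')
    rc, _, __ = module.run_command(cmd, check_rc=False)
    return (rc == 0)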

Best regards,
Keigo Noha

Comment 12 Takashi Kajinami 2023-08-25 01:19:30 UTC
Keigo,

I expect that change would work and may be worth trying, but I can't guarantee it is
acceptable from the general Ansible point of view, because DNS resolution is required to support
the use of hostnames instead of IPs.

Also, we first need to answer the point I raised earlier in my comment to understand whether
this is an Ansible bug or an environment problem.

> My initial questions here would be:
>  - Do these overcloud nodes have any DNS servers configured?
>   - If yes, are they really reachable from these overcloud nodes?
>   - If no, is the issue reproduced even when you prepare a DNS server?
>
> If the problem can be reproduced in a deployment WITHOUT a DNS server, then we probably have to modify
> the built-in iptables module somehow to avoid that DNS resolution.

Comment 13 Chris Janiszewski 2023-08-25 14:38:14 UTC
Thanks for looking into this. Generally our labs run in a disconnected fashion with no real DNS server serving them. We have been doing this since OSP 7. Instead, the local /etc/hosts has always been generated by TripleO to satisfy name resolution needs. If that is no longer an acceptable architecture, we will change and probably figure out how to make the undercloud also act as a DNS server (since that's the only host accessible to all the overcloud nodes in our labs).
I am not sure why the computes would not be affected; is it possible /etc/hosts gets generated there sooner?

With that being said, adding the -n parameter to the iptables list commands sounds like a reasonable fix. I don't see a downside to doing this.

Comment 14 Takashi Kajinami 2023-08-25 14:54:04 UTC
It's true that it has no downside in our usage, but as I said, it breaks a different usage.
Because the module causing the problem is not our own module but comes from ansible-core, the change
may not be accepted.

@Chris
Is it possible for you to capture the following items and share them?

1. sosreport from
 - controller nodes
 - one compute node
 - undercloud

2. deployment templates and the actual command line of `openstack overcloud deploy`

3. files under /home/stack/overcloud-deploy

If the environment can be shared, we can connect to the nodes to look into the details next week.

Comment 29 Brendan Shephard 2023-10-12 14:10:27 UTC
So, Takashi has since left Red Hat. But allow me to summarise the conclusions here:

1. Ideally, we need a valid DNS server. I understand there are test scenarios where you don't necessarily have one; in that case, the deployment is either going to be slow, or:

2. We need to fix iptables and/or Ansible to not do DNS lookups. However, both of those fall outside the OpenStack scope, and a BZ would need to be raised with the relevant component team. I would imagine Ansible would be easier to address than iptables.

The recommendation for anyone hitting this issue is to use a valid DNS server, or to configure a valid DNS service that responds to those queries.
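
For reference, one way to do that in a TripleO deployment is to point the overcloud nodes at a reachable resolver through the network environment file. A minimal sketch, assuming the standard DnsServers parameter (the address is only an example and must match your environment):

parameter_defaults:
  # Example only: replace with a DNS server the overcloud nodes can actually
  # reach, e.g. the undercloud if it runs a resolver for the disconnected lab.
  DnsServers: ["192.168.24.1"]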

Comment 30 Keigo Noha 2023-10-18 00:39:15 UTC
Hi Brendan,

Thank you for your comment on this bugzilla.
I don't have figures on how many customers deploy RHOSP in disconnected environments. However, we have seen several customers who deployed RHOSP without DNS servers.
Unfortunately, we don't have a statement in our documentation that says a DNS server is required.

What do you think about opening this issue with Ansible to get it fixed?

Best regards,
Keigo Noha

Comment 31 Brendan Shephard 2023-10-19 11:51:25 UTC
Hey Keigo,

We can open an issue with Ansible and see if they want to improve the iptables handling when DNS servers aren't reachable. 

We should probably also mention in the docs that a valid DNS server is required. Even if you were using Satellite for a disconnected environment, you would still need to resolve the Satellite server's FQDN to pull packages and containers.

Comment 33 Brendan Shephard 2024-02-26 00:56:32 UTC
*** Bug 2265072 has been marked as a duplicate of this bug. ***

Comment 44 James Slagle 2024-04-04 11:05:09 UTC
*** Bug 2247573 has been marked as a duplicate of this bug. ***

Comment 47 James Slagle 2024-04-08 12:01:20 UTC
https://pkgs.devel.redhat.com/cgit/rpms/openstack-ansible-core/commit/?h=rhos-17.1-rhel-9-trunk&id=9e1c21519e98e1dc8b2fba591e09b6953ad36de7

still needs to be backported to rhos-17.1-rhel-9

the tripleo-ansible patch https://code.engineering.redhat.com/gerrit/c/tripleo-ansible/+/451168 has already merged into rhos-17.1-patches

Comment 48 James Slagle 2024-04-08 12:11:19 UTC
Next steps would be:

backport https://pkgs.devel.redhat.com/cgit/rpms/openstack-ansible-core/commit/?h=rhos-17.1-rhel-9-trunk&id=9e1c21519e98e1dc8b2fba591e09b6953ad36de7 to rhos-17.1-rhel-9

build tripleo-ansible with https://code.engineering.redhat.com/gerrit/c/tripleo-ansible/+/451168

@lhh Is release delivery going to take care of those, or is it expected the DFG does it?

Comment 49 James Slagle 2024-04-08 12:45:56 UTC
Reassigning, as this should be DF-owned for dev and QE.

Comment 69 errata-xmlrpc 2024-05-22 20:35:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenStack Platform 17.1 (openstack-ansible-core) security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:2733

