Bug 1783195 - 'neutron-sanity-check' validation hangs on "Run neutron-sanity-check" Task for Controllers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Bernard Cafarelli
QA Contact: nlevinki
URL:
Whiteboard:
Duplicates: 1845938
Depends On: 1862364
Blocks: 1862427
 
Reported: 2019-12-13 09:33 UTC by Gaël Chamoulaud
Modified: 2020-10-28 15:37 UTC (History)

Fixed In Version: openstack-tripleo-validations-11.3.2-1.20200807163416.f8d7882.el8ost
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:36:49 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 744146 0 None MERGED Fix neutron_sanity_check for ML2/OVS overcloud 2021-01-11 10:39:41 UTC
OpenStack gerrit 744857 0 None MERGED Fix neutron_sanity_check for ML2/OVS overcloud 2021-01-11 10:40:19 UTC
Red Hat Product Errata RHEA-2020:4284 0 None None None 2020-10-28 15:37:21 UTC

Description Gaël Chamoulaud 2019-12-13 09:33:20 UTC
This validation runs `neutron-sanity-check` on the controller nodes to find
potential issues with Neutron's configuration. It hangs on the following task:

- name: Run neutron-sanity-check
  command: >
    {% if oc_container_cli is defined %}{{ oc_container_cli }}{% else %}{{ uc_container_cli }}{% endif %}
    exec -u root {{ container_name }}
    /bin/bash -c 'neutron-sanity-check --config-file {{ item }}'
  with_items: "{{ configs }}"
  become: true
  register: nsc_return
  ignore_errors: true
  changed_when: False


https://opendev.org/openstack/tripleo-validations/src/branch/master/roles/neutron-sanity-check/tasks/main.yml#L29-L38

Comment 1 Bernard Cafarelli 2020-01-22 15:20:45 UTC
Checking the validation with Gael, here are some things to fix:

* FWaaS was removed in 15, so /etc/neutron/fwaas_driver.ini config file should not be included in the list for 15, 16 and newer - validation fails as it does not find the file

* for ML2/OVS deployments, neutron-sanity-check command requires more capabilities than neutron_api container has:
2020-01-22 15:12:55.647 124445 ERROR neutron oslo_privsep.daemon.FailedToDropPrivileges: Privsep daemon failed to start
2020-01-22 15:12:55.647 124445 ERROR neutron 
2020-01-22 15:12:55.574 124458 ERROR oslo.privsep.daemon [-] [Errno 1] Operation not permitted

For these deployments, the same command passes on the neutron_ovs_agent container (which is the one used in the undercloud sanity check command). This is most probably the container to use on controllers too

* Currently neutron-sanity-check is run iteratively on all neutron configuration files (server, agent, ...) listed at https://opendev.org/openstack/tripleo-validations/src/branch/master/roles/neutron-sanity-check/defaults/main.yml . I think some of these do not need to be checked (to be confirmed)

* With an OVN deployment, however, there will be no neutron_ovs_agent container running (and some configuration files may be missing if OVN does not need them). Which container should we use in that case?

* Validation should work on both deployment types, and also on other deployments not using ml2/ovs or ovn (Contrail?) - it should still pass even if we cannot test

Apart from the issues in the validation itself, the hang is probably unrelated to neutron-sanity-check and has more to do with the validations framework

Comment 3 Bernard Cafarelli 2020-01-22 16:13:49 UTC
Checking on a 16 deployment with OVN, the only "neutron container" we have there is neutron_api. So for these, we either skip the neutron-sanity-check run (it currently seems focused on ml2/ovs deployments anyway), extend the container permissions so it can run the tests, or maybe run only a subset of checks (not sure this is possible with the current command)
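
The container choice discussed in comments 1 and 3 can be sketched roughly as below. This is only an illustration, not the merged patch: the helper name and the "skip when no suitable container is found" policy are assumptions, and only the container names come from the comments.

```python
def pick_sanity_check_container(running_containers):
    """Return the container to exec neutron-sanity-check in, or None to skip.

    Hypothetical helper: the selection policy is an assumption based on the
    discussion in this bug, not code from tripleo-validations.
    """
    running = set(running_containers)
    if "neutron_ovs_agent" in running:
        # ML2/OVS: the command was reported to pass in this container.
        return "neutron_ovs_agent"
    # ML2/OVN or another backend: only neutron_api (or nothing) is available,
    # and it lacks the capabilities privsep needs, so skip instead of failing.
    return None


for containers in (["neutron_api", "neutron_ovs_agent"], ["neutron_api"]):
    print(containers, "->", pick_sanity_check_container(containers))
```

Returning None rather than raising lets the caller report the validation as skipped, which matches the requirement that it should still pass on deployments it cannot test.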

Comment 7 Terry Wilson 2020-02-27 22:40:25 UTC
(In reply to Bernard Cafarelli from comment #1)
> Checking the validation with Gael, here are some things to fix:
> 

> * Currently neutron-sanity-check is run iteratively on all neutron
> configuration files (server, agent, ...) listed at
> https://opendev.org/openstack/tripleo-validations/src/branch/master/roles/neutron-sanity-check/defaults/main.yml .
> I think it is not needed to run on some of these (to confirm)

neutron-sanity-check was always meant to run with *all* of the config files/config dirs that the binary itself would be run with, not iteratively over individual config files, e.g. neutron-sanity-check --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini --config-dir /etc/neutron/conf.d/networking-ovn-metadata-agent, etc.
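
A minimal sketch of building that single invocation, assuming a hypothetical helper that also drops paths missing on the node (such as the fwaas_driver.ini case from comment 1). The function name and the `exists` parameter are illustrative only; `--config-file` and `--config-dir` are the options shown in the example above.

```python
import os


def build_sanity_check_command(config_files, config_dirs=(), exists=os.path.exists):
    """Build one neutron-sanity-check command covering every config source.

    Hypothetical helper: paths that do not exist are skipped, so a config
    file that is absent on this node does not fail the whole run.
    """
    cmd = ["neutron-sanity-check"]
    for path in config_files:
        if exists(path):
            cmd += ["--config-file", path]
    for path in config_dirs:
        if exists(path):
            cmd += ["--config-dir", path]
    return cmd


# Example with a fake filesystem: fwaas_driver.ini is absent and is dropped.
present = {"/etc/neutron/neutron.conf"}
print(build_sanity_check_command(
    ["/etc/neutron/neutron.conf", "/etc/neutron/fwaas_driver.ini"],
    exists=present.__contains__,
))
```

The `exists` argument is injectable only so the sketch can be exercised without touching the real filesystem.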

 
> * Though if we have OVN deployment, there will not be a neutron_ovs_agent
> container running (also some configuration files may be missing if not
> needed by OVN). Which container to use in that case?
> 
> * Validation should work on both deployment types, and also on other
> deployments not using ml2/ovs or ovn (Contrail?) - it should still pass
> even if we cannot test

The way the tests are *supposed* to be written is that if a config option doesn't apply, then those tests don't run. So if we pass all of the configs that the binary will be passed, and it sees that we are OVN and the test only applies to OVS, then it won't run. If the tests don't have this requirement added, they need to be fixed so that we can run the sanity check for any config and it doesn't run unnecessary tests.

Comment 8 Terry Wilson 2020-02-27 22:49:34 UTC
(In reply to Terry Wilson from comment #7)
> The way the tests are *supposed* to be written is that if a config option
> doesn't apply, then those tests don't run. So if we pass all of the configs
> that the binary will be passed, and it sees that we are OVN and the test
> only applies to OVS, then it won't run. If the tests don't have this
> requirement added, they need to be fixed so that we can run the sanity check
> for any config and it doesn't run unnecessary tests.

It might be that just wrapping a bunch of the things here https://github.com/openstack/neutron/blob/6b9765c991da8731fe39f7e7eed1ed6e2bca231a/neutron/cmd/sanity_check.py#L377 in:

if 'ovs' in cfg.CONF.ml2.mechanism_drivers:
    ...

would help.
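
The gating idea can be sketched without oslo.config as below. Everything here is a stand-in: `mechanism_drivers` is a plain list in place of cfg.CONF.ml2.mechanism_drivers, the check table is hypothetical, and the 'ovs' string is copied from the snippet above (the value in a real deployment may differ).

```python
def select_checks(mechanism_drivers, checks):
    """Return the names of the checks that apply to the configured drivers.

    `checks` maps a check name to the driver it requires, or to None when
    the check applies to every backend. Hypothetical helper for illustration.
    """
    drivers = set(mechanism_drivers)
    return [
        name
        for name, required_driver in checks.items()
        if required_driver is None or required_driver in drivers
    ]


# Hypothetical check table; treating iproute2_vxlan as OVS-only is an assumption.
CHECKS = {
    "iproute2_vxlan": "ovs",
    "generic_check": None,
}

print(select_checks(["ovn"], CHECKS))
print(select_checks(["ovs"], CHECKS))
```

With this shape, an OVN deployment never reaches the OVS-only checks, so they cannot fail for reasons unrelated to the deployed backend.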

Comment 10 Bernard Cafarelli 2020-07-30 17:18:49 UTC
*** Bug 1845938 has been marked as a duplicate of this bug. ***

Comment 11 Bernard Cafarelli 2020-07-30 17:19:45 UTC
As noted in bug 1845938, on a recent 16.1 deployment the 'openstack tripleo validator run --validation neutron-sanity-check' command sometimes hangs (rarely) or fails with errors similar to those described here

Comment 12 Cédric Jeanneret 2020-07-31 06:35:00 UTC
Some more notes:
- plain ansible-playbook call also hangs from time to time
- when it hangs, we can see this tree:

[snip]
  39113 pts/0    S+     0:00  |                   \_ /usr/bin/python /home/heat-admin/.ansible/tmp/ansible-tmp-1596176076.959135-986739-255214116684236/AnsiballZ_command.py                                                                   
  39114 pts/0    Sl+    0:00  |                       \_ podman exec -u root neutron_api /bin/bash -c neutron-sanity-check --config-file /etc/neutron/metadata_agent.ini                                                                       
  39130 pts/0    Z      0:00  |                           \_ [conmon] <defunct>
[/snip]

There is apparently no log for the neutron_api container, at least `podman logs neutron_api` doesn't show anything. There is also nothing relevant in /var/log/containers/neutron, not even in privsep-helper.log (it's empty).

When we run with plain ansible-playbook, we get some more information:
2020-07-31 06:14:32.117 12714 INFO oslo.privsep.daemon [-] Running privsep helper: ['sudo', 'privsep-helper', '--config-file', '/etc/neutron/neutron.conf', '--privsep_context', 'neutron.privileged.default', '--privsep_sock_path', '/tmp/tmpjwj6kg2d/privsep.sock']
2020-07-31 06:14:32.805 12714 WARNING oslo.privsep.daemon [-] privsep log: [Errno 1] Operation not permitted
2020-07-31 06:14:32.883 12714 INFO oslo.privsep.daemon [-] Spawned new privsep daemon via rootwrap
2020-07-31 06:14:32.796 12732 INFO oslo.privsep.daemon [-] privsep daemon starting
2020-07-31 06:14:32.800 12732 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2020-07-31 06:14:32.804 12732 ERROR oslo.privsep.daemon [-] [Errno 1] Operation not permitted
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 557, in helper_main
    Daemon(channel, context).run()
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 367, in run
    self._drop_privs()
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 403, in _drop_privs
    capabilities.drop_all_caps_except(self.caps, self.caps, [])
  File "/usr/lib/python3.6/site-packages/oslo_privsep/capabilities.py", line 156, in drop_all_caps_except
    raise OSError(errno, os.strerror(errno))
PermissionError: [Errno 1] Operation not permitted
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon [-] Error while sending initial PING to privsep: [Errno 32] Broken pipe: BrokenPipeError: [Errno 32] Broken pipe
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon Traceback (most recent call last):
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 183, in exchange_ping
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon     reply = self.send_recv((Message.PING.value,))
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon   File "/usr/lib/python3.6/site-packages/oslo_privsep/comm.py", line 169, in send_recv
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon     self.writer.send((myid, msg))
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon   File "/usr/lib/python3.6/site-packages/oslo_privsep/comm.py", line 56, in send
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon     self.writesock.sendall(buf)
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon BrokenPipeError: [Errno 32] Broken pipe
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon
2020-07-31 06:14:32.887 12714 CRITICAL oslo.privsep.daemon [-] Privsep daemon failed to start
2020-07-31 06:14:32.888 12714 CRITICAL neutron [-] Unhandled error: oslo_privsep.daemon.FailedToDropPrivileges: Privsep daemon failed to start
2020-07-31 06:14:32.888 12714 ERROR neutron Traceback (most recent call last):
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/bin/neutron-sanity-check", line 10, in <module>
2020-07-31 06:14:32.888 12714 ERROR neutron     sys.exit(main())
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 429, in main
2020-07-31 06:14:32.888 12714 ERROR neutron     return 0 if all_tests_passed() else 1
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 416, in all_tests_passed
2020-07-31 06:14:32.888 12714 ERROR neutron     return all(opt.callback() for opt in OPTS if cfg.CONF.get(opt.name))
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 416, in <genexpr>
2020-07-31 06:14:32.888 12714 ERROR neutron     return all(opt.callback() for opt in OPTS if cfg.CONF.get(opt.name))
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 71, in check_iproute2_vxlan
2020-07-31 06:14:32.888 12714 ERROR neutron     result = checks.iproute2_vxlan_supported()
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity/checks.py", line 67, in iproute2_vxlan_supported
2020-07-31 06:14:32.888 12714 ERROR neutron     port = ip.add_vxlan(name, 3000)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 304, in add_vxlan
2020-07-31 06:14:32.888 12714 ERROR neutron     privileged.create_interface(name, self.namespace, "vxlan", **kwargs)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 73, in sync_inner
2020-07-31 06:14:32.888 12714 ERROR neutron     return input_func(*args, **kwargs)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 244, in _wrap
2020-07-31 06:14:32.888 12714 ERROR neutron     self.start()
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 255, in start
2020-07-31 06:14:32.888 12714 ERROR neutron     channel = daemon.RootwrapClientChannel(context=self)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 347, in __init__
2020-07-31 06:14:32.888 12714 ERROR neutron     super(RootwrapClientChannel, self).__init__(sock)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 178, in __init__
2020-07-31 06:14:32.888 12714 ERROR neutron     self.exchange_ping()
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 191, in exchange_ping
2020-07-31 06:14:32.888 12714 ERROR neutron     raise FailedToDropPrivileges(msg)
2020-07-31 06:14:32.888 12714 ERROR neutron oslo_privsep.daemon.FailedToDropPrivileges: Privsep daemon failed to start
2020-07-31 06:14:32.888 12714 ERROR neutron

This is for the privsep-helper thingy.

I don't see any SELinux denial preventing this action, so it's more likely bound to privsep itself. Unfortunately, I have no knowledge of that tool :/.
I suspect the broken pipe is what breaks conmon, but I'm really unsure. Just one of those gut feelings.

@Bernard: I keep my env up in case you want to investigate this issue.

Cheers,

C.

Comment 13 Bernard Cafarelli 2020-07-31 12:21:28 UTC
The linked patch should fix the race condition hangs and make the validator pass. For ML2/OVN, this means the check is currently skipped; that will be tracked in separate bug 1862427

Comment 20 errata-xmlrpc 2020-10-28 15:36:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

