This validation runs `neutron-sanity-check` on the controller nodes to detect potential issues with Neutron's configuration. It hangs on the following task:

    - name: Run neutron-sanity-check
      command: >
        {% if oc_container_cli is defined %}{{ oc_container_cli }}{% else %}{{ uc_container_cli }}{% endif %}
        exec -u root {{ container_name }}
        /bin/bash -c 'neutron-sanity-check --config-file {{ item }}'
      with_items: "{{ configs }}"
      become: true
      register: nsc_return
      ignore_errors: true
      changed_when: False

https://opendev.org/openstack/tripleo-validations/src/branch/master/roles/neutron-sanity-check/tasks/main.yml#L29-L38
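For illustration, the command line the task above renders for each config file can be sketched in Python. This is a minimal stand-in for the Jinja templating, not code from the validation role; `build_sanity_check_cmd` is a hypothetical helper name.

```python
def build_sanity_check_cmd(container_cli, container_name, config_file):
    """Mimic the Ansible task: run neutron-sanity-check inside the
    container, once per config file (hypothetical helper, for
    illustration only)."""
    inner = "neutron-sanity-check --config-file %s" % config_file
    return [container_cli, "exec", "-u", "root", container_name,
            "/bin/bash", "-c", inner]

# One iteration of the with_items loop, assuming podman on the overcloud:
cmd = build_sanity_check_cmd("podman", "neutron_api",
                             "/etc/neutron/neutron.conf")
print(" ".join(cmd))
# → podman exec -u root neutron_api /bin/bash -c neutron-sanity-check --config-file /etc/neutron/neutron.conf
```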
Checking the validation with Gael, here are some things to fix:

* FWaaS was removed in 15, so the /etc/neutron/fwaas_driver.ini config file should not be included in the list for 15, 16 and newer - the validation fails as it does not find the file

* for ML2/OVS deployments, the neutron-sanity-check command requires more capabilities than the neutron_api container has:

    2020-01-22 15:12:55.647 124445 ERROR neutron oslo_privsep.daemon.FailedToDropPrivileges: Privsep daemon failed to start
    2020-01-22 15:12:55.647 124445 ERROR neutron
    2020-01-22 15:12:55.574 124458 ERROR oslo.privsep.daemon [-] [Errno 1] Operation not permitted

  For these deployments, the same command passes fine in the neutron_ovs_agent container (which is the one used by the undercloud sanity check). This is most probably the container to use on controllers too

* Currently neutron-sanity-check is run iteratively on all neutron configuration files (server, agent, ...) listed at https://opendev.org/openstack/tripleo-validations/src/branch/master/roles/neutron-sanity-check/defaults/main.yml . I think it is not needed to run it on some of these (to confirm)

* Though if we have an OVN deployment, there will not be a neutron_ovs_agent container running (and some configuration files may be missing if not needed by OVN). Which container should be used in that case?

* The validation should work on both deployment types, and also on other deployments not using ml2/ovs or ovn (Contrail?) - it should still pass even if we cannot test

Apart from the issues in the validation itself, the hang is probably unrelated to neutron-sanity-check (more about the validations framework)
Checking on a 16 deployment with OVN, the only "neutron container" we have there is neutron_api. So for these deployments, we should either skip the neutron-sanity-check run (it seems to be currently focused on ml2/ovs deployments anyway), extend the container permissions to allow it to run the tests, or maybe only run a subset of checks (not sure if that is possible with the current command)
(In reply to Bernard Cafarelli from comment #1)
> Checking the validation with Gael, here are some things to fix:
>
> * Currently neutron-sanity-check is run iteratively on all neutron
> configuration files (server, agent, ...) listed at
> https://opendev.org/openstack/tripleo-validations/src/branch/master/roles/
> neutron-sanity-check/defaults/main.yml . I think it is not needed to run on
> some of these (to confirm)

neutron-sanity-check was always meant to be run with *all* of the config files/config dirs that the binary itself would be passed - not iteratively over individual config files. e.g.

    neutron-sanity-check --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini --config-dir /etc/neutron/conf.d/networking-ovn-metadata-agent

etc.

> * Though if we have OVN deployment, there will not be a neutron_ovs_agent
> container running (also some configuration files may be missing if not
> needed by OVN). Which container to use in that case?
>
> * Validation should work on both deployment types, and also on other
> deployments not using ml2/ovs or ovn (conntrail?) - it should still pass
> even if we cannot test

The way the tests are *supposed* to be written is that if a config option doesn't apply, those tests don't run. So if we pass all of the configs that the binary would be passed, and it sees that we are OVN and a test only applies to OVS, then that test won't run. If the tests don't implement this, they need to be fixed so that we can run the sanity check with any config and it doesn't run unnecessary tests.
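The single-invocation idea above can be sketched as a small helper that assembles one command line from every config file and config dir, instead of looping over them. This is a hypothetical illustration (`build_single_invocation` is not a real tripleo-validations function).

```python
def build_single_invocation(config_files, config_dirs=()):
    """Build ONE neutron-sanity-check command carrying every
    --config-file/--config-dir, so the binary itself can decide which
    checks apply (hypothetical helper, for illustration)."""
    cmd = ["neutron-sanity-check"]
    for path in config_files:
        cmd += ["--config-file", path]
    for path in config_dirs:
        cmd += ["--config-dir", path]
    return cmd

cmd = build_single_invocation(
    ["/etc/neutron/neutron.conf",
     "/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini"],
    ["/etc/neutron/conf.d/networking-ovn-metadata-agent"])
print(" ".join(cmd))
```

With this shape, the Ansible task would run the command once rather than once per item of `configs`.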
(In reply to Terry Wilson from comment #7)
> The way the tests are *supposed* to be written is that if a config option
> doesn't apply, then those tests don't run. So if we pass all of the configs
> that the binary will be passed, and it sees that we are OVN and the test
> only applies to OVS, then it won't run. If the tests don't have this
> requirement added, they need to be fixed so that we can run the sanity check
> for any config and it doesn't run unnecessary tests.

It might be that just wrapping a bunch of the things here
https://github.com/openstack/neutron/blob/6b9765c991da8731fe39f7e7eed1ed6e2bca231a/neutron/cmd/sanity_check.py#L377
in:

    if 'ovs' in cfg.CONF.ml2.mechanism_drivers:
        ...

would help.
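The gating idea can be sketched as follows. In neutron itself the condition would read `cfg.CONF.ml2.mechanism_drivers` (oslo.config); here a plain list stands in for it so the logic is runnable on its own, and the check names are a hypothetical subset, not the real OPTS table from sanity_check.py.

```python
def checks_to_run(mechanism_drivers, all_checks):
    """Keep only the checks whose mechanism-driver requirement is met.

    all_checks is a list of (check_name, required_driver) pairs, where
    required_driver=None means the check is driver-independent."""
    return [name for name, needs in all_checks
            if needs is None or needs in mechanism_drivers]

# Hypothetical subset of checks, for illustration only:
ALL_CHECKS = [
    ("ovs_vxlan", "openvswitch"),     # only meaningful with ML2/OVS
    ("iproute2_vxlan", None),          # driver-independent
    ("ovn_nb_db", "ovn"),              # only meaningful with ML2/OVN
]

print(checks_to_run(["ovn"], ALL_CHECKS))
# → ['iproute2_vxlan', 'ovn_nb_db']  (OVS-only checks are skipped)
```

On an OVN deployment the OVS-only checks simply never run, so the privileged vxlan probe that crashes in neutron_api would be skipped.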
*** Bug 1845938 has been marked as a duplicate of this bug. ***
As commented in bug #1845938, on a recent 16.1 deployment the 'openstack tripleo validator run --validation neutron-sanity-check' command will sometimes hang (rarely) or fail with errors similar to those described here
Some more notes:

- a plain ansible-playbook call also hangs from time to time
- when it hangs, we can see this process tree:

[snip]
39113 pts/0    S+   0:00  |  \_ /usr/bin/python /home/heat-admin/.ansible/tmp/ansible-tmp-1596176076.959135-986739-255214116684236/AnsiballZ_command.py
39114 pts/0    Sl+  0:00  |      \_ podman exec -u root neutron_api /bin/bash -c neutron-sanity-check --config-file /etc/neutron/metadata_agent.ini
39130 pts/0    Z    0:00  |          \_ [conmon] <defunct>
[/snip]

There is apparently no log for the neutron_api container; at least `podman logs neutron_api` doesn't show anything. There is also nothing relevant in /var/log/containers/neutron, not even in privsep-helper.log (it's empty).

When we run with plain ansible-playbook, we get some more information:

2020-07-31 06:14:32.117 12714 INFO oslo.privsep.daemon [-] Running privsep helper: ['sudo', 'privsep-helper', '--config-file', '/etc/neutron/neutron.conf', '--privsep_context', 'neutron.privileged.default', '--privsep_sock_path', '/tmp/tmpjwj6kg2d/privsep.sock']
2020-07-31 06:14:32.805 12714 WARNING oslo.privsep.daemon [-] privsep log: [Errno 1] Operation not permitted
2020-07-31 06:14:32.883 12714 INFO oslo.privsep.daemon [-] Spawned new privsep daemon via rootwrap
2020-07-31 06:14:32.796 12732 INFO oslo.privsep.daemon [-] privsep daemon starting
2020-07-31 06:14:32.800 12732 INFO oslo.privsep.daemon [-] privsep process running with uid/gid: 0/0
2020-07-31 06:14:32.804 12732 ERROR oslo.privsep.daemon [-] [Errno 1] Operation not permitted
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 557, in helper_main
    Daemon(channel, context).run()
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 367, in run
    self._drop_privs()
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 403, in _drop_privs
    capabilities.drop_all_caps_except(self.caps, self.caps, [])
  File "/usr/lib/python3.6/site-packages/oslo_privsep/capabilities.py", line 156, in drop_all_caps_except
    raise OSError(errno, os.strerror(errno))
PermissionError: [Errno 1] Operation not permitted
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon [-] Error while sending initial PING to privsep: [Errno 32] Broken pipe: BrokenPipeError: [Errno 32] Broken pipe
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon Traceback (most recent call last):
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 183, in exchange_ping
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon     reply = self.send_recv((Message.PING.value,))
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon   File "/usr/lib/python3.6/site-packages/oslo_privsep/comm.py", line 169, in send_recv
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon     self.writer.send((myid, msg))
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon   File "/usr/lib/python3.6/site-packages/oslo_privsep/comm.py", line 56, in send
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon     self.writesock.sendall(buf)
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon BrokenPipeError: [Errno 32] Broken pipe
2020-07-31 06:14:32.885 12714 ERROR oslo.privsep.daemon
2020-07-31 06:14:32.887 12714 CRITICAL oslo.privsep.daemon [-] Privsep daemon failed to start
2020-07-31 06:14:32.888 12714 CRITICAL neutron [-] Unhandled error: oslo_privsep.daemon.FailedToDropPrivileges: Privsep daemon failed to start
2020-07-31 06:14:32.888 12714 ERROR neutron Traceback (most recent call last):
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/bin/neutron-sanity-check", line 10, in <module>
2020-07-31 06:14:32.888 12714 ERROR neutron     sys.exit(main())
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 429, in main
2020-07-31 06:14:32.888 12714 ERROR neutron     return 0 if all_tests_passed() else 1
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 416, in all_tests_passed
2020-07-31 06:14:32.888 12714 ERROR neutron     return all(opt.callback() for opt in OPTS if cfg.CONF.get(opt.name))
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 416, in <genexpr>
2020-07-31 06:14:32.888 12714 ERROR neutron     return all(opt.callback() for opt in OPTS if cfg.CONF.get(opt.name))
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity_check.py", line 71, in check_iproute2_vxlan
2020-07-31 06:14:32.888 12714 ERROR neutron     result = checks.iproute2_vxlan_supported()
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/cmd/sanity/checks.py", line 67, in iproute2_vxlan_supported
2020-07-31 06:14:32.888 12714 ERROR neutron     port = ip.add_vxlan(name, 3000)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 304, in add_vxlan
2020-07-31 06:14:32.888 12714 ERROR neutron     privileged.create_interface(name, self.namespace, "vxlan", **kwargs)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 73, in sync_inner
2020-07-31 06:14:32.888 12714 ERROR neutron     return input_func(*args, **kwargs)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 244, in _wrap
2020-07-31 06:14:32.888 12714 ERROR neutron     self.start()
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 255, in start
2020-07-31 06:14:32.888 12714 ERROR neutron     channel = daemon.RootwrapClientChannel(context=self)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 347, in __init__
2020-07-31 06:14:32.888 12714 ERROR neutron     super(RootwrapClientChannel, self).__init__(sock)
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 178, in __init__
2020-07-31 06:14:32.888 12714 ERROR neutron     self.exchange_ping()
2020-07-31 06:14:32.888 12714 ERROR neutron   File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 191, in exchange_ping
2020-07-31 06:14:32.888 12714 ERROR neutron     raise FailedToDropPrivileges(msg)
2020-07-31 06:14:32.888 12714 ERROR neutron oslo_privsep.daemon.FailedToDropPrivileges: Privsep daemon failed to start

This is the privsep-helper part. I don't see any SELinux denial preventing this action, so it's more likely bound to privsep itself. Unfortunately, I have no knowledge of that tool :/

I suspect the broken pipe is what breaks conmon, but I'm really unsure - just one of those gut feelings.

@Bernard: I keep my env up in case you want to investigate this issue.

Cheers,

C.
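The EPERM from drop_all_caps_except suggests the container is missing capabilities that privsep expects to manipulate. As a diagnostic sketch (an assumption on my part, not part of the validation), one can decode the CapEff bitmask from /proc/<pid>/status to see which capabilities a containerized process actually has:

```python
def has_capability(status_text, cap_bit):
    """Return True if the CapEff mask in /proc/<pid>/status text has
    cap_bit set. Diagnostic sketch, not part of neutron or the
    validation role."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool(mask & (1 << cap_bit))
    return False

# Capability bit numbers from linux/capability.h:
CAP_NET_ADMIN, CAP_SYS_ADMIN = 12, 21

# Example: a typical restricted container capability set (sample mask):
restricted = "CapEff:\t00000000a80425fb"
print(has_capability(restricted, CAP_NET_ADMIN))  # → False
print(has_capability(restricted, CAP_SYS_ADMIN))  # → False
```

Running this against the main process of neutron_api vs. neutron_ovs_agent would show the capability gap that makes the sanity check fail in one container but pass in the other.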
The linked patch should fix the race-condition hangs and makes the validator pass. For ML2/OVN this means the check is currently skipped, but that will be tracked in the separate bug 1862427
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284