Description of problem:
VM with SRIOV port cannot boot on setup with OVN

It looks like the SRIOV NIC Agent RPC Daemon Started but there is an error that Agent out of sync with plugin!:

/var/log/neutron/sriov-nic-agent.log
2017-10-31 15:51:56.281 17158 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] Agent initialized successfully, now running...
2017-10-31 15:51:56.282 17158 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] SRIOV NIC Agent RPC Daemon Started!
2017-10-31 15:51:56.283 17158 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] Agent out of sync with plugin!
2017-10-31 15:51:57.106 17158 INFO oslo_rootwrap.client [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] Spawned new rootwrap daemon process with pid=18224
2017-10-31 18:57:41.765 17158 ERROR oslo.messaging._drivers.impl_rabbit [-] [163a9ea7-017c-4278-a0db-054e201e3985] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: None: error: [Errno 110] Connection timed out
2017-10-31 18:57:48.895 17158 ERROR oslo.messaging._drivers.impl_rabbit [-] [6ab7209f-dbc9-4d36-9f28-6281ddd174d1] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: None: error: [Errno 110] Connection timed out
2017-10-31 18:57:54.694 17158 ERROR oslo.messaging._drivers.impl_rabbit [-] [2274bf5e-bec3-41c3-b34d-514b8427aadd] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out

When trying to boot VM with SRIOV port (direct) it starts with error:

/var/log/containers/nova/nova-conductor.log
017-10-31 15:53:59.834 20 WARNING oslo_config.cfg [req-a76e925d-b0d9-48ac-9e67-0fcfd5d2ad13 - - - - -] Option "rabbit_password" from group "oslo_messaging_rabbit" is deprecated for removal (Replaced by [DEFAULT]/transport_url). Its value may be silently ignored in the future.
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager [req-98048079-d574-4e0a-b1b7-70ae4663d56a 9daa58f5aaa949d8be28ac894ee65340 6f210d906fee4f1b9a7a6c26bcc96b46 - default default] Failed to schedule instances: NoValidHost_Remote: No valid host was found. There are not enough hosts available.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 232, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 149, in select_destinations
    alloc_reqs_by_rp_uuid, provider_summaries)
  File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 109, in select_destinations
    raise exception.NoValidHost(reason=reason)
NoValidHost: No valid host was found. There are not enough hosts available.
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager Traceback (most recent call last):
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1027, in schedule_and_build_instances
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 626, in _schedule_instances
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     request_spec, instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 586, in wrapped
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     return func(*args, **kwargs)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 52, in select_destinations
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     return getattr(self.instance, __name)(*args, **kwargs)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 33, in select_destinations
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 137, in select_destinations
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     return cctxt.call(ctxt, 'select_destinations', **msg_args)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     retry=self.retry)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     timeout=timeout, retry=retry)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     retry=retry)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 569, in _send
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     raise result
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager NoValidHost_Remote: No valid host was found. There are not enough hosts available.
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager Traceback (most recent call last):

SOS-report attached

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1. deploy sriov setup with OVN
2. create environment with SRiov instance
3. check the logs

Actual results:
Error

Expected results:
SRIOV instance should boot with no errors

Additional info:
SOS-report attached
Version-Release number of selected component (if applicable):
[root@compute-0 ~]# rpm -qa | grep neutron
python-neutronclient-6.5.0-0.20170814170137.355983d.el7ost.noarch
python-neutron-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-linuxbridge-11.0.2-0.20171020230401.el7ost.noarch
python-neutron-lib-1.9.1-0.20170821170222.0ef54c3.el7ost.noarch
openstack-neutron-lbaas-11.0.2-0.20170927152439.743c1db.el7ost.noarch
openstack-neutron-metering-agent-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.2-0.20171020230401.el7ost.noarch
python-neutron-lbaas-11.0.2-0.20170927152439.743c1db.el7ost.noarch
openstack-neutron-lbaas-ui-3.0.1-2.el7ost.noarch
openstack-neutron-common-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-ml2-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-openvswitch-11.0.2-0.20171020230401.el7ost.noarch
puppet-neutron-11.3.1-0.20171005205442.83e8ac7.el7ost.noarch
[root@compute-0 ~]# rpm -qa | grep ovn
python-networking-ovn-3.0.1-0.20171005161553.0cde8a5.el7ost.noarch
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openstack-nova-novncproxy-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
novnc-0.6.1-1.el7ost.noarch
puppet-ovn-11.3.1-0.20170825135756.c03c3ed.el7ost.noarch
Created attachment 1346511 [details] nova_conductor
Created attachment 1346512 [details] sosCompute
I'm going to triage this bug.
I think we need sos report from controller to check whether both mechanism drivers are enabled. The one attached is for compute.
Also, please make sure that the rabbitmq server on controller-1.internalapi.localdomain, port 5672, is really reachable. It sounds like the agent can't talk to it.
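A quick way to run that check from the compute node is a plain TCP connect. This is only a sketch: the helper name is_reachable is made up for illustration, and a successful connect only shows that the port accepts connections, not that AMQP authentication works.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper, not part of neutron or oslo.messaging.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For the hosts in the agent log this would be called as is_reachable("controller-0.internalapi.localdomain", 5672) and is_reachable("controller-1.internalapi.localdomain", 5672).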
I will deploy new setup & collect all logs you need.
Putting back needinfo as logs weren't provided yet
It's been a month since we asked for the logs so we're closing this bug. Please feel free to re-open the bug once you have the logs available.
Here you can find logs of Compute & Controller nodes: https://drive.google.com/open?id=13g2bPE4dsuMlSE4ExvNuf9cT8HuOhC3t Enjoy
https://drive.google.com/open?id=1QmZQGB39MbJFCJo7oWD4CWdoeltfPnU8
It seems that SRIOV is not configured at all. The only SRIOV components running are the SRIOV agents: the nova-compute service is missing the pci_passthrough_whitelist parameter entirely, and neutron-server has only mechanism_drivers=ovn, so the sriovnicswitch mech driver is missing. It looks like a tripleo issue, but I have limited knowledge there. Brent, could you please help me find out why sriov hasn't been configured? I'm setting this bug to triaged as the root cause of the VM being in ERROR state was found. The exception with rabbit was just a red herring.
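The two checks described above can be sketched as a small script that works on the text of nova.conf and the ml2 config file. The function names are hypothetical, and the section names are assumptions: pci_passthrough_whitelist is looked up under [DEFAULT], with passthrough_whitelist under [pci] accepted as the newer spelling, and mechanism_drivers under [ml2].

```python
import configparser


def read_option(conf_text, section, option):
    """Return an option value from INI-style text, or None if it is absent."""
    # strict=False tolerates repeated options; interpolation is disabled
    # because values in these files may contain literal '%' characters.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.get(section, option, fallback=None)


def sriov_config_problems(nova_conf_text, ml2_conf_text):
    """List the SR-IOV settings from this report that appear to be missing."""
    problems = []
    # Assumed locations of the whitelist option (old and new spelling).
    whitelist = (read_option(nova_conf_text, "DEFAULT", "pci_passthrough_whitelist")
                 or read_option(nova_conf_text, "pci", "passthrough_whitelist"))
    if not whitelist:
        problems.append("nova: no PCI passthrough whitelist configured")
    drivers = read_option(ml2_conf_text, "ml2", "mechanism_drivers") or ""
    names = [name.strip() for name in drivers.split(",") if name.strip()]
    if "sriovnicswitch" not in names:
        problems.append("neutron: sriovnicswitch not in mechanism_drivers")
    return problems
```

Run against the configuration described in this comment (no whitelist, mechanism_drivers=ovn), the function would return both problems.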
Created attachment 1370036 [details] deployment files
tl;dr: not likely a bug, NeutronMechanismDrivers was not properly set
NeutronMechanismDrivers and other heat variables do not "accumulate". The last definition is what will be used, so I'm guessing that the last environment file that appears in the command line is neutron-ml2-ovn-ha.yaml and that the sriov mechanism driver simply isn't being configured. There is a u/s bug reported that would add this type of functionality by changing how these parameters are handled, but this is what we have for now.
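The "last definition wins" behaviour can be modelled with plain dictionaries. This is a simplified illustration, not Heat's actual merge code, and the contents of the two environments are assumed stand-ins for the files passed with -e.

```python
def merge_parameter_defaults(environments):
    """Merge parameter_defaults in command-line order.

    A later environment replaces an earlier value for the same key;
    values are never combined.
    """
    merged = {}
    for env in environments:
        merged.update(env.get("parameter_defaults", {}))
    return merged


# Hypothetical stand-ins for two environment files, in the order given.
sriov_env = {"parameter_defaults": {"NeutronMechanismDrivers": "openvswitch,sriovnicswitch"}}
ovn_env = {"parameter_defaults": {"NeutronMechanismDrivers": "ovn"}}

effective = merge_parameter_defaults([sriov_env, ovn_env])
print(effective["NeutronMechanismDrivers"])
```

With the OVN environment last, only "ovn" survives, which matches the mechanism_drivers=ovn seen on neutron-server. Under this model, getting both drivers requires a single definition that lists both and comes last.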
So if I add NeutronMechanismDrivers:"ovn,sriovnicswitch" to neutron-ml2-ovn-ha.yaml, it should work? Or do I need to change some other heat variables? ^ Can you share the u/s bug that you mentioned?
Yes, I think that should do it. The u/s bug is https://bugs.launchpad.net/tripleo/+bug/1716391
Did setting the variable work for you?
To clarify: the limitation here is not specific to SR-IOV nor the coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the ability (or lack of) of OVN to support vlan tenant networks.
(In reply to Nir Yechiel from comment #28)
> To clarify: the limitation here is not specific to SR-IOV nor the
> coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the
> ability (or lack of) of OVN to support vlan tenant networks.
s/coextensive/coexistence
(In reply to Nir Yechiel from comment #28)
> To clarify: the limitation here is not specific to SR-IOV nor the
> coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the
> ability (or lack of) of OVN to support vlan tenant networks.
I think there was some sort of support on ovn for vlan tenant networks too per talks with russellb.
@russellb, do you remember that, at some point we even configured a job for vlan tenant networking and it was working?
(In reply to Miguel Angel Ajo from comment #30)
> (In reply to Nir Yechiel from comment #28)
> > To clarify: the limitation here is not specific to SR-IOV nor the
> > coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the
> > ability (or lack of) of OVN to support vlan tenant networks.
>
> I think there was some sort of support on ovn for vlan tenant networks too
> per talks with russellb.
>
> @russellb, do you remember that, at some point we even configured a job for
> vlan tenant networking and it was working?
That's right. We should treat this as a supported feature, and fix any issues that arise during testing.
Hmm, there's something else we need to take care of in this case. If we have instances connected to tenant networks via SRIOV (I don't know how SRIOV is used normally), we will also need to run the dhcp-agent alongside the ovn-controller resolver, because ovn-controller won't be able to capture the traffic sent via sriov directly to the switch. Another option could be making ovn-controller aware of that, and able to capture dhcp requests via the provider network as well and respond to them.
Numan, can you take a look at my last comment? ^
@Miguel - Sure. We need to explore how to support it. If we have this support in ovn-controller, it would also help the undercloud use OVN in tripleo/osp director deployments.
@numans, can we deploy neutron dhcp-agent when we have SRIOV enabled? I suspect that's the only way to serve dhcp for SR-IOV. @dalvarez, would it work with the ovn-metadata agent that uses dhcp ports? I suspect it may confuse the dhcp-agent.
@Miguel - We can. For that we have to remove "OS::TripleO::Services::NeutronDhcpAgent: OS::Heat::None" from the ovn environment file, or add a new env file at the end with the contents "OS::TripleO::Services::NeutronDhcpAgent: docker/services/neutron-dhcp.yaml"
The u/s patch to have SR-IOV instances get DHCP from OVN instead of dhcp agent is here - https://patchwork.ozlabs.org/patch/1003652/ It's still under review.