Bug 1508449 - can't boot SRIOV instance because of lack of OVN support for vlan tenant networks
Summary: can't boot SRIOV instance because of lack of OVN support for vlan tenant networks
Keywords:
Status: CLOSED DUPLICATE of bug 1826364
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Lucas Alvares Gomes
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1561880
Blocks:
Reported: 2017-11-01 12:59 UTC by Eran Kuris
Modified: 2020-11-02 20:39 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
OVN serves DHCP as an OpenFlow controller, with ovn-controller running directly on Compute nodes. However, SR-IOV instances are attached directly to the network through the VF/PF, so they cannot receive DHCP responses from ovn-controller.
Workaround: Change `OS::TripleO::Services::NeutronDhcpAgent` to `OS::TripleO::Services::NeutronDhcpAgent: deployment/neutron/neutron-dhcp-container-puppet.yaml`.
Clone Of:
Environment:
Last Closed: 2020-09-09 10:12:08 UTC
Target Upstream Version:
Embargoed:


Attachments
nova_conductor (20.29 KB, text/plain)
2017-11-01 13:02 UTC, Eran Kuris
sosCompute (15.08 MB, application/x-xz)
2017-11-01 13:03 UTC, Eran Kuris
deployment files (14.39 KB, application/x-gzip)
2017-12-19 13:44 UTC, Toni Freger


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1729343 0 None None None 2017-11-01 13:11:00 UTC
OpenStack gerrit 578766 0 None MERGED OVN: Add env file to deploy SRIOV with OVN. 2021-02-01 20:55:33 UTC

Description Eran Kuris 2017-11-01 12:59:33 UTC
Description of problem:
A VM with an SRIOV port cannot boot on a setup with OVN.
The SRIOV NIC agent log shows "SRIOV NIC Agent RPC Daemon Started!", immediately followed by "Agent out of sync with plugin!":
 /var/log/neutron/sriov-nic-agent.log 
2017-10-31 15:51:56.281 17158 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] Agent initialized successfully, now running...
2017-10-31 15:51:56.282 17158 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] SRIOV NIC Agent RPC Daemon Started!
2017-10-31 15:51:56.283 17158 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] Agent out of sync with plugin!
2017-10-31 15:51:57.106 17158 INFO oslo_rootwrap.client [req-119c03f8-f722-4b7a-a69b-3aa3064fd407 - - - - -] Spawned new rootwrap daemon process with pid=18224
2017-10-31 18:57:41.765 17158 ERROR oslo.messaging._drivers.impl_rabbit [-] [163a9ea7-017c-4278-a0db-054e201e3985] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: None: error: [Errno 110] Connection timed out
2017-10-31 18:57:48.895 17158 ERROR oslo.messaging._drivers.impl_rabbit [-] [6ab7209f-dbc9-4d36-9f28-6281ddd174d1] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: None: error: [Errno 110] Connection timed out
2017-10-31 18:57:54.694 17158 ERROR oslo.messaging._drivers.impl_rabbit [-] [2274bf5e-bec3-41c3-b34d-514b8427aadd] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out


When trying to boot a VM with an SRIOV port (direct), the following error is logged:
 /var/log/containers/nova/nova-conductor.log
017-10-31 15:53:59.834 20 WARNING oslo_config.cfg [req-a76e925d-b0d9-48ac-9e67-0fcfd5d2ad13 - - - - -] Option "rabbit_password" from group "oslo_messaging_rabbit" is deprecated for removal (Replaced by [DEFAULT]/transport_url).  Its value may be silently ignored in the future.
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager [req-98048079-d574-4e0a-b1b7-70ae4663d56a 9daa58f5aaa949d8be28ac894ee65340 6f210d906fee4f1b9a7a6c26bcc96b46 - default default] Failed to schedule instances: NoValidHost_Remote: No valid host was found. There are not enough hosts available.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 232, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 149, in select_destinations
    alloc_reqs_by_rp_uuid, provider_summaries)

  File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 109, in select_destinations
    raise exception.NoValidHost(reason=reason)

NoValidHost: No valid host was found. There are not enough hosts available.
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager Traceback (most recent call last):
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 1027, in schedule_and_build_instances
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 626, in _schedule_instances
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     request_spec, instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 586, in wrapped
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     return func(*args, **kwargs)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 52, in select_destinations
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     return getattr(self.instance, __name)(*args, **kwargs)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 33, in select_destinations
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     instance_uuids)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 137, in select_destinations
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     return cctxt.call(ctxt, 'select_destinations', **msg_args)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     retry=self.retry)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     timeout=timeout, retry=retry)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     retry=retry)
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 569, in _send
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager     raise result
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager NoValidHost_Remote: No valid host was found. There are not enough hosts available.
2017-11-01 07:26:52.210 21 ERROR nova.conductor.manager Traceback (most recent call last):

SOS-report attached


Version-Release number of selected component (if applicable):


How reproducible:
always 

Steps to Reproduce:
1. Deploy an SRIOV setup with OVN
2. Create an environment with an SRIOV instance
3. Check the logs

Actual results:
Error

Expected results:
SRIOV instance should boot with no errors 

Additional info:
SOS-report attached

Comment 1 Eran Kuris 2017-11-01 13:01:35 UTC
Version-Release number of selected component (if applicable):
[root@compute-0 ~]# rpm -qa | grep neutron
python-neutronclient-6.5.0-0.20170814170137.355983d.el7ost.noarch
python-neutron-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-linuxbridge-11.0.2-0.20171020230401.el7ost.noarch
python-neutron-lib-1.9.1-0.20170821170222.0ef54c3.el7ost.noarch
openstack-neutron-lbaas-11.0.2-0.20170927152439.743c1db.el7ost.noarch
openstack-neutron-metering-agent-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.2-0.20171020230401.el7ost.noarch
python-neutron-lbaas-11.0.2-0.20170927152439.743c1db.el7ost.noarch
openstack-neutron-lbaas-ui-3.0.1-2.el7ost.noarch
openstack-neutron-common-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-ml2-11.0.2-0.20171020230401.el7ost.noarch
openstack-neutron-openvswitch-11.0.2-0.20171020230401.el7ost.noarch
puppet-neutron-11.3.1-0.20171005205442.83e8ac7.el7ost.noarch
[root@compute-0 ~]# rpm -qa | grep ovn
python-networking-ovn-3.0.1-0.20171005161553.0cde8a5.el7ost.noarch
openvswitch-ovn-common-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-host-2.7.2-4.git20170719.el7fdp.x86_64
openvswitch-ovn-central-2.7.2-4.git20170719.el7fdp.x86_64
openstack-nova-novncproxy-16.0.2-0.20171023105738.a2e4540.el7ost.noarch
novnc-0.6.1-1.el7ost.noarch
puppet-ovn-11.3.1-0.20170825135756.c03c3ed.el7ost.noarch

Comment 2 Eran Kuris 2017-11-01 13:02:32 UTC
Created attachment 1346511 [details]
nova_conductor

Comment 3 Eran Kuris 2017-11-01 13:03:43 UTC
Created attachment 1346512 [details]
sosCompute

Comment 4 Jakub Libosvar 2017-11-13 14:59:48 UTC
I'm going to triage this bug.

Comment 5 Ihar Hrachyshka 2017-11-13 15:02:17 UTC
I think we need an sos report from the controller to check whether both mechanism drivers are enabled. The one attached is for the compute node.

Comment 6 Jakub Libosvar 2017-11-13 15:04:23 UTC
Also, please make sure that the rabbitmq server on controller-1.internalapi.localdomain, port 5672, is really reachable. It sounds like the agent can't talk to it.
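
One way to run the reachability check requested here is a plain TCP connection attempt from the compute node. The following is a minimal sketch, not a tool used in this bug: the helper name `is_port_reachable` and the 3-second timeout are assumptions, while the host and port come from the agent log above. A successful TCP connect only shows that the port is open, not that AMQP authentication works.

```python
import socket


def is_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests TCP reachability.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, using the host and port from the agent log:
# is_port_reachable("controller-1.internalapi.localdomain", 5672)
```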

Comment 7 Eran Kuris 2017-11-14 08:18:50 UTC
I will deploy a new setup and collect all the logs you need.

Comment 8 Jakub Libosvar 2017-11-22 12:51:07 UTC
Putting back needinfo as logs weren't provided yet

Comment 11 Jakub Libosvar 2017-12-11 14:49:35 UTC
It's been a month since we asked for the logs so we're closing this bug. Please feel free to re-open the bug once you have the logs available.

Comment 17 Eran Kuris 2017-12-16 11:30:16 UTC
Here you can find logs of Compute & Controller nodes: 

https://drive.google.com/open?id=13g2bPE4dsuMlSE4ExvNuf9cT8HuOhC3t

Enjoy

Comment 19 Jakub Libosvar 2017-12-19 11:46:47 UTC
It seems that SRIOV is not configured at all. The only things running are the SRIOV agents: the nova-compute service is missing the pci_passthrough_whitelist parameter entirely, and neutron-server has only mechanism_drivers=ovn, so the sriovnicswitch mech driver is missing.

It looks like a tripleo issue, but I have limited knowledge there. Brent, could you please help me find out why SRIOV hasn't been configured?

I'm setting this bug to triaged, as the root cause of the VM ending up in ERROR state was found. The exception with rabbit was just a red herring.
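
The neutron-server side of this finding can be checked mechanically. The sketch below is an illustration only, assuming the ML2 configuration is available as INI text with `mechanism_drivers` in the `[ml2]` section; the function name and the sample content are hypothetical and not taken from the attached logs.

```python
import configparser


def missing_mechanism_drivers(ml2_conf_text, required=("ovn", "sriovnicswitch")):
    """Return the required mechanism drivers that the [ml2] section does not list.

    Hypothetical helper; ml2_conf_text is the content of an ML2 config file.
    """
    # strict=False tolerates repeated options; interpolation is disabled
    # so that '%' characters in the file are read literally.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ml2_conf_text)
    configured = parser.get("ml2", "mechanism_drivers", fallback="")
    enabled = {name.strip() for name in configured.split(",") if name.strip()}
    return [name for name in required if name not in enabled]


# Sample content mirroring the setting reported in this comment.
sample = "[ml2]\nmechanism_drivers=ovn\n"
print(missing_mechanism_drivers(sample))  # prints ['sriovnicswitch']
```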

Comment 20 Toni Freger 2017-12-19 13:44:55 UTC
Created attachment 1370036 [details]
deployment files

Comment 22 Brent Eagles 2018-01-24 19:40:22 UTC
tl;dr: not likely a bug, NeutronMechanismDrivers was not properly set

NeutronMechanismDrivers and other heat variables do not "accumulate". The last definition is what will be used, so I'm guessing that the last environment file that appears in the command line is neutron-ml2-ovn-ha.yaml and that the sriov mechanism driver simply isn't being configured. An u/s bug has been reported that would add this type of functionality by changing how these parameters are handled, but this is what we have for now.
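
The "last definition wins" behaviour can be modelled in a few lines. This is a simplified sketch, not Heat's actual merge code: each environment file is represented as a plain dict, and the SR-IOV file's value (`openvswitch,sriovnicswitch`) is a hypothetical example rather than the content of the deployment files attached to this bug.

```python
def merge_parameter_defaults(env_files):
    """Return the effective parameter_defaults for env files in command-line order.

    Simplified model for illustration: later files replace earlier values.
    """
    effective = {}
    for env in env_files:
        # dict.update overwrites existing keys, so the last definition wins.
        effective.update(env.get("parameter_defaults", {}))
    return effective


# Hypothetical environment file contents.
sriov_env = {"parameter_defaults": {"NeutronMechanismDrivers": "openvswitch,sriovnicswitch"}}
ovn_env = {"parameter_defaults": {"NeutronMechanismDrivers": "ovn"}}

# OVN environment passed last: the driver list is replaced, not extended.
without_override = merge_parameter_defaults([sriov_env, ovn_env])
print(without_override["NeutronMechanismDrivers"])  # prints ovn

# A final file that names both drivers brings sriovnicswitch back.
override_env = {"parameter_defaults": {"NeutronMechanismDrivers": "ovn,sriovnicswitch"}}
with_override = merge_parameter_defaults([sriov_env, ovn_env, override_env])
print(with_override["NeutronMechanismDrivers"])  # prints ovn,sriovnicswitch
```

In this model the order of the files on the command line is the only thing that decides which value survives, which matches the guess above about neutron-ml2-ovn-ha.yaml being passed last.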

Comment 23 Eran Kuris 2018-01-25 07:19:30 UTC
So if I add NeutronMechanismDrivers: "ovn,sriovnicswitch" under neutron-ml2-ovn-ha.yaml, it should work? Or do I need to change some other heat variables as well?

Can you share the u/s  bug that you mentioned?

Comment 24 Brent Eagles 2018-01-25 12:12:01 UTC
Yes, I think that should do it. The u/s bug is https://bugs.launchpad.net/tripleo/+bug/1716391

Comment 25 Brent Eagles 2018-02-27 15:02:40 UTC
Did setting the variable work for you?

Comment 28 Nir Yechiel 2018-03-27 14:30:06 UTC
To clarify: the limitation here is not specific to SR-IOV nor the coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the ability (or lack of) of OVN to support vlan tenant networks.

Comment 29 Nir Yechiel 2018-03-27 14:30:45 UTC
(In reply to Nir Yechiel from comment #28)
> To clarify: the limitation here is not specific to SR-IOV nor the
> coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the
> ability (or lack of) of OVN to support vlan tenant networks.

s/coextensive/coexistence

Comment 30 Miguel Angel Ajo 2018-03-27 14:37:29 UTC
(In reply to Nir Yechiel from comment #28)
> To clarify: the limitation here is not specific to SR-IOV nor the
> coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the
> ability (or lack of) of OVN to support vlan tenant networks.

I think there was some sort of support on ovn for vlan tenant networks too per talks with russellb.

@russellb, do you remember that, at some point we even configured a job for vlan tenant networking and it was working?

Comment 31 Russell Bryant 2018-03-27 15:48:12 UTC
(In reply to Miguel Angel Ajo from comment #30)
> (In reply to Nir Yechiel from comment #28)
> > To clarify: the limitation here is not specific to SR-IOV nor the
> > coextensive of OVN and sriovnicswitch mechanism drivers. It's more about the
> > ability (or lack of) of OVN to support vlan tenant networks.
> 
> I think there was some sort of support on ovn for vlan tenant networks too
> per talks with russellb.
> 
> @russellb, do you remember that, at some point we even configured a job for
> vlan tenant networking and it was working?

That's right.  We should treat this as a supported feature, and fix any issues that arise during testing.

Comment 32 Miguel Angel Ajo 2018-05-25 13:18:22 UTC
Hmm, there's something else we need to take care of in this case.


If we have instances connected to tenant networks via SRIOV (I don't know how SRIOV is normally used), we will also need to run the dhcp-agent alongside the ovn-controller resolver.

This is because ovn-controller won't be able to capture the traffic sent via SRIOV directly to the switch.

Another option could be making ovn-controller aware of that, so that it can also capture DHCP requests via the provider network and respond to them.

Comment 33 Miguel Angel Ajo 2018-05-25 13:19:25 UTC
Numan, can you take a look at my last comment? ^

Comment 35 Numan Siddique 2018-05-28 06:25:32 UTC
@Miguel - Sure. We need to explore how to support it. If we had this support in ovn-controller, it would also help the undercloud use OVN in tripleo/osp director deployments.

Comment 36 Miguel Angel Ajo 2018-06-10 22:07:23 UTC
@numans, can we deploy the neutron dhcp-agent when SRIOV is enabled? I suspect that's the only way to serve DHCP for SR-IOV.

@dalvarez, would it work with the ovn-metadata agent that uses dhcp ports? I suspect it may confuse the dhcp-agent.

Comment 37 Numan Siddique 2018-06-11 07:52:29 UTC
@Miguel - We can. For that we have to remove "OS::TripleO::Services::NeutronDhcpAgent: OS::Heat::None" from the ovn environment file, or add a new env file at the end with the contents:

"OS::TripleO::Services::NeutronDhcpAgent: docker/services/neutron-dhcp.yaml"

Comment 45 Numan Siddique 2018-11-29 09:03:28 UTC
The u/s patch to have SR-IOV instances get DHCP from OVN instead of the dhcp agent is here - https://patchwork.ozlabs.org/patch/1003652/
It's still under review.

