Bug 1562952 - [Deployment][OSP13][ODL] DPDK VxLAN setup - unable to boot a DPDK instance.
Summary: [Deployment][OSP13][ODL] DPDK VxLAN setup - unable to boot a DPDK instance.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
high
high
Target Milestone: beta
: 13.0 (Queens)
Assignee: Tim Rozet
QA Contact: Maxim Babushkin
URL:
Whiteboard: odl_deployment
Depends On: 1549938
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-02 18:21 UTC by Ziv Greenberg
Modified: 2018-10-18 07:25 UTC (History)
17 users (show)

Fixed In Version: openstack-tripleo-heat-templates-8.0.2-0.20180410170330.a39634a.el7ost puppet-neutron-12.4.1-0.20180410174837.a900c54.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-06-27 13:49:35 UTC
Target Upstream Version:


Attachments (Terms of Use)
OSP13 ODL DPDK VxLAN templates (8.64 KB, application/x-gzip)
2018-04-02 18:21 UTC, Ziv Greenberg
no flags Details
neutron-server log (1.66 MB, text/plain)
2018-04-02 18:24 UTC, Ziv Greenberg
no flags Details
neutron-metadata-agent log (5.72 KB, text/plain)
2018-04-02 18:25 UTC, Ziv Greenberg
no flags Details
nova-compute log (142.20 KB, text/plain)
2018-04-02 18:25 UTC, Ziv Greenberg
no flags Details
openstack-neutron-server-opendaylight log (338 bytes, text/plain)
2018-04-02 18:26 UTC, Ziv Greenberg
no flags Details
openstack-opendaylight log (9.03 MB, text/plain)
2018-04-02 18:27 UTC, Ziv Greenberg
no flags Details


Links
System ID Priority Status Summary Last Updated
Launchpad 1762473 None None None 2018-04-09 16:21:57 UTC
OpenStack gerrit 559986 None stable/queens: MERGED tripleo-heat-templates: Fixes the OvsVhostuserMode heat parameter (I7cd34b30f54766fe35e4a40a57b278166d0197cb) 2018-04-16 12:33:49 UTC
OpenStack gerrit 560023 None stable/queens: MERGED puppet-neutron: Corrects ODL-OVS vhostusermode default and docs (Idda82cae48b11eae1c79c6bbdd85827e3e43e56b) 2018-04-16 12:33:43 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:50:27 UTC

Description Ziv Greenberg 2018-04-02 18:21:30 UTC
Created attachment 1416434 [details]
OSP13 ODL DPDK VxLAN templates

Description of problem:

After a successful deployment, I have found the opendayligt containers in an unhealthy status with errors.
Also, I have failed to boot the DPDK instance.

Error logs for the opendaylight containers, neutron, and nova services are attached.


Version-Release number of selected component (if applicable):
puddle 2018-03-29.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy the overcloud by using the attached OSP13 ODL DPDK VxLAN templates.
3. Check the status of the opendaylight containers.
2. Try to boot a DPDK instance.

Expected results:
opendaylight containers should be running in a healthy mode with the ability to boot a DPDK instance.

Comment 1 Ziv Greenberg 2018-04-02 18:24:31 UTC
Created attachment 1416435 [details]
neutron-server log

Comment 2 Ziv Greenberg 2018-04-02 18:25:20 UTC
Created attachment 1416436 [details]
neutron-metadata-agent log

Comment 3 Ziv Greenberg 2018-04-02 18:25:50 UTC
Created attachment 1416437 [details]
nova-compute log

Comment 4 Ziv Greenberg 2018-04-02 18:26:28 UTC
Created attachment 1416438 [details]
openstack-neutron-server-opendaylight log

Comment 5 Ziv Greenberg 2018-04-02 18:27:31 UTC
Created attachment 1416439 [details]
openstack-opendaylight log

Comment 6 jamo luhrsen 2018-04-03 05:28:34 UTC
looking at the docker logs of the opendaylight container it looks like
the karaf process is being restarted, maybe 34 times when I checked.
I can't figure out what is causing that, however. karaf logs seem
benign, then all of a sudden I see logs showing up that are indicative of
karaf starting up.

example:

2018-04-02T17:09:07,596 | INFO  | nioEventLoopGroup-9-5 | OF13DeviceInitializer            | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-5 | Static node openflow:93252746509277 info: OFPMPPORTDESC collected
2018-04-02T17:09:07,597 | INFO  | nioEventLoopGroup-9-5 | OF13DeviceInitializer            | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-5 | Static node openflow:93252746509277 successfully finished collecting
Apr 02, 2018 5:09:14 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /opt/opendaylight/lock
Apr 02, 2018 5:09:14 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Apr 02, 2018 5:09:14 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
2018-04-02T17:09:16,210 | INFO  | Start Level: Equinox Container: 5b0db9de-24ce-461f-a03b-d462acc16fd1 | fileinstall                      | 6 - org.apache.felix.fileinstall - 3.5.8 | Updating configuration from org.ops4j.pax.logging.cfg
2018-04-02T17:09:17,027 | INFO  | Start Level: Equinox Container: 5b0db9de-24ce-461f-a03b-d462acc16fd1 | core                             | 83 - org.apache.aries.jmx.core - 1.1.7 | Starting JMX OSGi agent


The container itself doesn't appear to be restarting, as it's showing
being up the same amount of time as all the other containers:

.
.
.
6290b931b57e        192.0.20.1:8787/rhosp13/openstack-mariadb:pcmklatest                         "/bin/bash /usr/lo..."   12 hours ago        Up 12 hours                                   galera-bundle-docker-0
965d15c18b1b        192.0.20.1:8787/rhosp13/openstack-memcached:2018-03-29.1                     "/bin/bash -c 'sou..."   12 hours ago        Up 12 hours                                   memcached
acb60212cef4        192.0.20.1:8787/rhosp13/openstack-opendaylight:2018-03-29.1                  "kolla_start"            12 hours ago        Up 12 hours (unhealthy)                       opendaylight_api


something is broken on the docker side that I don't know how to get
around, and I cannot connect to the running container:

[root@controller-0 heat-admin]# docker exec -it acb60212cef4 /bin/bash
rpc error: code = 2 desc = containerd: container not found


I was able to connect to the opendaylight-neutron container, which is
also unhealthy, but the reason that is unhealthy is because it cannot
reach restconf which is provided in the broken opendaylight container.

I think normally the healthcheck for opendaylight is looking for an
index.html file to exist in the opendaylight filesystem, but in this
case, the unhealthy flag is showing up for the same reason the exec
command failed:

[root@controller-0 heat-admin]# docker inspect acb60212cef4 --format='{{json .State.Health}}'
{"Status":"unhealthy","FailingStreak":1446,"Log":[{"Start":"2018-04-03T05:22:54.462778387Z","End":"2018-04-03T05:22:54.473092343Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:23:24.473238026Z","End":"2018-04-03T05:23:24.47521961Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:23:54.475352186Z","End":"2018-04-03T05:23:54.485640472Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:24:24.486559814Z","End":"2018-04-03T05:24:24.488660251Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:24:54.489240777Z","End":"2018-04-03T05:24:54.501849488Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"}]}

I tried restarting and stopping the container, but neither command
seemed to work and it still shows as running for 12 hours.

[root@controller-0 heat-admin]# docker restart acb60212cef4
acb60212cef4
[root@controller-0 heat-admin]# docker stop acb60212cef4
acb60212cef4
[root@controller-0 heat-admin]# docker ps | egrep opend
04c7d263a3d2        192.0.20.1:8787/rhosp13/openstack-neutron-server-opendaylight:2018-03-29.1   "kolla_start"            12 hours ago        Up 12 hours (unhealthy)                       neutron_api
acb60212cef4        192.0.20.1:8787/rhosp13/openstack-opendaylight:2018-03-29.1                  "kolla_start"            12 hours ago        Up 12 hours (unhealthy)                       opendaylight_api


At this point, I'm stuck. I think we need some docker expertise involved,
as I'm not able to figure anything else out.

Comment 7 Ariel Adam 2018-04-03 12:38:35 UTC
Tim, can you take a look?
Do we have someone in the team familiar with the dockers else I will take it back to Frank (NFV DFG) to see if we can get a contact point.

Comment 8 Tim Rozet 2018-04-03 15:37:43 UTC
It looks like all of the fixes for healthcheck have been merged, so the container should show healthy.


I do see this one error in karaf which I have not seen before:
2018-04-02T16:52:15,077 | ERROR | features-1-thread-1 | metatype                         | 9 - org.apache.felix.metatype - 1.1.6 | fromDocuments: Error accessing document bundleentry://191.fwk1239548589/OSGI-INF/metatype/metatype.properties : XML parsing exception while reading metadata: undefined prefix: client_id (position:START_TAG <{null}client_id:client_secret>@6:50 in bundleentry://191.fwk1239548589/OSGI-INF/metatype/metatype.properties)


Typically when the karaf process exits it will kill the container, so this is odd behavior that karaf server is starting and there docker output is different.

Comment 9 Tim Rozet 2018-04-03 15:43:12 UTC
Note in the deploy command used, you should have neutron-opendaylight-dpdk.yaml after neutron-opendaylight.yaml (although that will not hurt anything in this instance), and neutron-opendaylight.yaml should be after docker.yaml.  This is because neutron-opendaylight.yaml overrides services defaulted in docker.yaml in case the service is enabled on your roles.  So for example, NeutronOvsAgent will be overridden to None.

Comment 10 Michael Vorburger 2018-04-03 15:45:00 UTC
> I do see this one error in karaf which I have not seen before:

Ignore that; it could not cause Karaf to exit; see https://jira.opendaylight.org/browse/AAA-166.

Comment 14 Tim Rozet 2018-04-04 14:48:41 UTC
The problem is although docker ps shows the container is running, it really isn't.  The process was killed by oom-killer:

[ 3330.124431] Out of memory: Kill process 184715 (java) score 105 or sacrifice child                                                                            
[ 3330.127395] Killed process 184715 (java) total-vm:5689200kB, anon-rss:839080kB, file-rss:0kB, shmem-rss:0kB

This is because there is only 8GB of RAM on the controller:
[root@controller-0 ~]# free
              total        used        free      shared  buff/cache   available
Mem:        8009540     7094752      219968       66496      694820      352352
Swap:             0           0           0


ODL will use around 2GB at least.  I would recommend bumping to 12GB to be safe and see if that fixes the problem.

Comment 15 Ziv Greenberg 2018-04-04 14:57:00 UTC
Ok, as per your proposal, I'll reconfigure my environment and will re-run the deployment end-to-end ASAP.
I'll keep you posted.

Thanks,
Ziv

Comment 16 Ziv Greenberg 2018-04-08 18:58:02 UTC
As per your suggestion, I have redeployed after adding additional memory to the controller node and reconfigured the overcloud deploy script.
Unfortunately, so far, the opendaylight containers are still in unhealthy status and I'm unable to boot up an instance.

[root@controller-0 ~]# docker ps | grep opendaylight
b3ff08b41f3d        192.0.20.1:8787/rhosp13/openstack-neutron-server-opendaylight:2018-03-29.1   "kolla_start"  3 hours ago   Up 3 hours (unhealthy)  neutron_api
21e907d35ac5        192.0.20.1:8787/rhosp13/openstack-opendaylight:2018-03-29.1                  "kolla_start"  3 hours ago   Up 3 hours (unhealthy)  opendaylight_api

[root@controller-0 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        8.0G        725M         66M        6.8G        6.6G
Swap:            0B          0B          0B

Thanks,
Ziv

Comment 17 Janki 2018-04-09 08:08:17 UTC
You are probably hitting this bug https://bugzilla.redhat.com/show_bug.cgi?id=1549938

There are 4 more errors in ODL. But they are nostly false positives.

You also need to specify "-e /usr/share/openstack-tripleo-heat-templates/environments/ovs-dpdk-permissions.yaml" to deploy command if you are deploying OSP13. (https://review.openstack.org/#/c/531322)

Comment 18 Tim Rozet 2018-04-09 16:15:45 UTC
The unhealthy status is opened here and should not affect the functionality:
https://bugzilla.redhat.com/show_bug.cgi?id=1549218

For unable to boot an instance:
The problem is the port is being created by OVS instead of by QEMU.  That is resulting in the port being created in the wrong directory (/var/run/openvswitch) and not being read correctly by QEMU (looking in /var/lib/vhost_sockets).  This type of port is deprecated and we should be using dpdkvhostuserclient ports (QEMU creates the port).  We need to change the configuration in puppet-neutron, and THT to use 'server' mode for host config.  This seems backwards because we are configuring 'server' mode as OVS configuration, but really OVS is the client here.  Will add some documentation to clarify this in OOO.

Comment 19 Tim Rozet 2018-04-09 16:17:50 UTC
Workaround for now is to set:
OvsVhostuserMode: server

in your THT env file.

We should also fix this in networking-odl as client is the default there for OVS in documentation, and the python tool.  puppet-neutron also needs to be fixed.

Comment 20 Ziv Greenberg 2018-04-10 09:31:34 UTC
(In reply to Tim Rozet from comment #18)
> The unhealthy status is opened here and should not affect the functionality:
> https://bugzilla.redhat.com/show_bug.cgi?id=1549218
> 
> For unable to boot an instance:
> The problem is the port is being created by OVS instead of by QEMU.  That is
> resulting in the port being created in the wrong directory
> (/var/run/openvswitch) and not being read correctly by QEMU (looking in
> /var/lib/vhost_sockets).  This type of port is deprecated and we should be
> using dpdkvhostuserclient ports (QEMU creates the port).  We need to change
> the configuration in puppet-neutron, and THT to use 'server' mode for host
> config.  This seems backwards because we are configuring 'server' mode as
> OVS configuration, but really OVS is the client here.  Will add some
> documentation to clarify this in OOO.

Changing the subject due to existing BZ1549218
Using comment #19 recommendation a dpdk instance has been booted successfully.


Thanks,
Ziv

Comment 28 errata-xmlrpc 2018-06-27 13:49:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086


Note You need to log in before you can comment on or make changes to this bug.