Created attachment 1416434 [details]
OSP13 ODL DPDK VxLAN templates

Description of problem:
After a successful deployment, I have found the opendaylight containers in an unhealthy status with errors. I have also failed to boot a DPDK instance. Error logs for the opendaylight containers and the neutron and nova services are attached.

Version-Release number of selected component (if applicable):
puddle 2018-03-29.1

How reproducible:
Always

Steps to Reproduce:
1. Deploy the overcloud using the attached OSP13 ODL DPDK VxLAN templates.
2. Check the status of the opendaylight containers.
3. Try to boot a DPDK instance.

Expected results:
The opendaylight containers should be running in a healthy state, with the ability to boot a DPDK instance.
Created attachment 1416435 [details] neutron-server log
Created attachment 1416436 [details] neutron-metadata-agent log
Created attachment 1416437 [details] nova-compute log
Created attachment 1416438 [details] openstack-neutron-server-opendaylight log
Created attachment 1416439 [details] openstack-opendaylight log
Looking at the docker logs of the opendaylight container, it looks like the karaf process is being restarted, maybe 34 times when I checked. I can't figure out what is causing that, however. The karaf logs seem benign, then all of a sudden I see log lines indicative of karaf starting up. Example:

2018-04-02T17:09:07,596 | INFO | nioEventLoopGroup-9-5 | OF13DeviceInitializer | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-5 | Static node openflow:93252746509277 info: OFPMPPORTDESC collected
2018-04-02T17:09:07,597 | INFO | nioEventLoopGroup-9-5 | OF13DeviceInitializer | 385 - org.opendaylight.openflowplugin.impl - 0.6.0.redhat-5 | Static node openflow:93252746509277 successfully finished collecting
Apr 02, 2018 5:09:14 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /opt/opendaylight/lock
Apr 02, 2018 5:09:14 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Apr 02, 2018 5:09:14 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
2018-04-02T17:09:16,210 | INFO | Start Level: Equinox Container: 5b0db9de-24ce-461f-a03b-d462acc16fd1 | fileinstall | 6 - org.apache.felix.fileinstall - 3.5.8 | Updating configuration from org.ops4j.pax.logging.cfg
2018-04-02T17:09:17,027 | INFO | Start Level: Equinox Container: 5b0db9de-24ce-461f-a03b-d462acc16fd1 | core | 83 - org.apache.aries.jmx.core - 1.1.7 | Starting JMX OSGi agent

The container itself doesn't appear to be restarting, as it shows the same uptime as all the other containers:

. . .
6290b931b57e 192.0.20.1:8787/rhosp13/openstack-mariadb:pcmklatest "/bin/bash /usr/lo..." 12 hours ago Up 12 hours galera-bundle-docker-0
965d15c18b1b 192.0.20.1:8787/rhosp13/openstack-memcached:2018-03-29.1 "/bin/bash -c 'sou..." 12 hours ago Up 12 hours memcached
acb60212cef4 192.0.20.1:8787/rhosp13/openstack-opendaylight:2018-03-29.1 "kolla_start" 12 hours ago Up 12 hours (unhealthy) opendaylight_api

Something is broken on the docker side that I don't know how to get around, and I cannot connect to the running container:

[root@controller-0 heat-admin]# docker exec -it acb60212cef4 /bin/bash
rpc error: code = 2 desc = containerd: container not found

I was able to connect to the opendaylight-neutron container, which is also unhealthy, but it is unhealthy because it cannot reach restconf, which is provided by the broken opendaylight container. Normally the healthcheck for opendaylight looks for an index.html file in the opendaylight filesystem, but in this case the unhealthy flag is showing up for the same reason the exec command failed:

[root@controller-0 heat-admin]# docker inspect acb60212cef4 --format='{{json .State.Health}}'
{"Status":"unhealthy","FailingStreak":1446,"Log":[{"Start":"2018-04-03T05:22:54.462778387Z","End":"2018-04-03T05:22:54.473092343Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:23:24.473238026Z","End":"2018-04-03T05:23:24.47521961Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:23:54.475352186Z","End":"2018-04-03T05:23:54.485640472Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:24:24.486559814Z","End":"2018-04-03T05:24:24.488660251Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"},{"Start":"2018-04-03T05:24:54.489240777Z","End":"2018-04-03T05:24:54.501849488Z","ExitCode":-1,"Output":"rpc error: code = 2 desc = containerd: container not found"}]}

I tried restarting and stopping the container, but neither command seemed to work and it still shows as running for 12 hours.
[root@controller-0 heat-admin]# docker restart acb60212cef4
acb60212cef4
[root@controller-0 heat-admin]# docker stop acb60212cef4
acb60212cef4
[root@controller-0 heat-admin]# docker ps | egrep opend
04c7d263a3d2 192.0.20.1:8787/rhosp13/openstack-neutron-server-opendaylight:2018-03-29.1 "kolla_start" 12 hours ago Up 12 hours (unhealthy) neutron_api
acb60212cef4 192.0.20.1:8787/rhosp13/openstack-opendaylight:2018-03-29.1 "kolla_start" 12 hours ago Up 12 hours (unhealthy) opendaylight_api

At this point, I'm stuck. I think we need some docker expertise involved, as I'm not able to figure anything else out.
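A quick way to confirm the restart behavior described above is to count lock acquisitions in the container log, since each "Setting startlevel to 100" line marks a fresh karaf start. The snippet below is a sketch that runs against sample lines copied from the log excerpt above; on the live controller you would pipe `docker logs opendaylight_api 2>&1` into the same grep.

```shell
# Sketch: count karaf restarts by counting startlevel lines.
# The sample log below is an assumption built from the excerpt in this
# report; replace it with the real container log on a live system.
cat > /tmp/odl_sample.log <<'EOF'
INFO: Trying to lock /opt/opendaylight/lock
INFO: Lock acquired
INFO: Lock acquired. Setting startlevel to 100
INFO: Trying to lock /opt/opendaylight/lock
INFO: Lock acquired
INFO: Lock acquired. Setting startlevel to 100
EOF
# Each match is one karaf start; a count above 1 means karaf restarted.
grep -c 'Setting startlevel to 100' /tmp/odl_sample.log
```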
Tim, can you take a look? Do we have someone on the team familiar with Docker? Otherwise I will take it back to Franck (NFV DFG) to see if we can get a contact point.
It looks like all of the fixes for the healthcheck have been merged, so the container should show healthy. I do see one error in karaf which I have not seen before:

2018-04-02T16:52:15,077 | ERROR | features-1-thread-1 | metatype | 9 - org.apache.felix.metatype - 1.1.6 | fromDocuments: Error accessing document bundleentry://191.fwk1239548589/OSGI-INF/metatype/metatype.properties : XML parsing exception while reading metadata: undefined prefix: client_id (position:START_TAG <{null}client_id:client_secret>@6:50 in bundleentry://191.fwk1239548589/OSGI-INF/metatype/metatype.properties)

Typically when the karaf process exits it will kill the container, so it is odd behavior that the karaf server is restarting while the docker output says otherwise.
Note that in the deploy command used, you should have neutron-opendaylight-dpdk.yaml after neutron-opendaylight.yaml (although that will not hurt anything in this instance), and neutron-opendaylight.yaml should come after docker.yaml. This is because neutron-opendaylight.yaml overrides services defaulted in docker.yaml in case the service is enabled in your roles. For example, NeutronOvsAgent will be overridden to None.
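For illustration, the ordering described above would look like this in the deploy script. This is a sketch, not the exact command used in this deployment: the paths are the standard THT locations and custom-env.yaml is a hypothetical placeholder for the local environment files; only the relative order of the -e arguments matters, since later files override earlier ones.

```shell
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-opendaylight.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-opendaylight-dpdk.yaml \
  -e custom-env.yaml
```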
> I do see this one error in karaf which I have not seen before: Ignore that; it could not cause Karaf to exit; see https://jira.opendaylight.org/browse/AAA-166.
The problem is that although docker ps shows the container as running, it really isn't. The process was killed by the oom-killer:

[ 3330.124431] Out of memory: Kill process 184715 (java) score 105 or sacrifice child
[ 3330.127395] Killed process 184715 (java) total-vm:5689200kB, anon-rss:839080kB, file-rss:0kB, shmem-rss:0kB

This is because there is only 8GB of RAM on the controller:

[root@controller-0 ~]# free
              total        used        free      shared  buff/cache   available
Mem:        8009540     7094752      219968       66496      694820      352352
Swap:             0           0           0

ODL will use around 2GB at least. I would recommend bumping to 12GB to be safe and seeing if that fixes the problem.
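A way to check for this condition is to scan the kernel log for oom-killer messages. The sketch below greps sample lines copied from the diagnosis above; on a live controller you would grep the output of `dmesg` or /var/log/messages instead.

```shell
# Sketch: detect an OOM kill of the ODL java process.
# The sample file is an assumption built from the kernel log lines quoted
# in this report; on a real host use: dmesg | grep -c 'Out of memory'
cat > /tmp/dmesg_sample.txt <<'EOF'
[ 3330.124431] Out of memory: Kill process 184715 (java) score 105 or sacrifice child
[ 3330.127395] Killed process 184715 (java) total-vm:5689200kB, anon-rss:839080kB, file-rss:0kB, shmem-rss:0kB
EOF
# A non-zero count means the kernel killed a process for lack of memory.
grep -c 'Out of memory' /tmp/dmesg_sample.txt
```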
Ok, as per your proposal, I'll reconfigure my environment and will re-run the deployment end-to-end ASAP. I'll keep you posted. Thanks, Ziv
As per your suggestion, I have redeployed after adding memory to the controller node and reconfiguring the overcloud deploy script. Unfortunately, so far, the opendaylight containers are still in an unhealthy status and I'm unable to boot an instance.

[root@controller-0 ~]# docker ps | grep opendaylight
b3ff08b41f3d 192.0.20.1:8787/rhosp13/openstack-neutron-server-opendaylight:2018-03-29.1 "kolla_start" 3 hours ago Up 3 hours (unhealthy) neutron_api
21e907d35ac5 192.0.20.1:8787/rhosp13/openstack-opendaylight:2018-03-29.1 "kolla_start" 3 hours ago Up 3 hours (unhealthy) opendaylight_api

[root@controller-0 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        8.0G        725M         66M        6.8G        6.6G
Swap:            0B          0B          0B

Thanks,
Ziv
You are probably hitting this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1549938

There are 4 more errors in ODL, but they are mostly false positives.

You also need to add "-e /usr/share/openstack-tripleo-heat-templates/environments/ovs-dpdk-permissions.yaml" to the deploy command if you are deploying OSP13 (https://review.openstack.org/#/c/531322).
The unhealthy status is tracked here and should not affect the functionality: https://bugzilla.redhat.com/show_bug.cgi?id=1549218

For being unable to boot an instance: the problem is that the port is being created by OVS instead of by QEMU. That results in the socket being created in the wrong directory (/var/run/openvswitch) and not being read correctly by QEMU (which looks in /var/lib/vhost_sockets). This type of port is deprecated; we should be using dpdkvhostuserclient ports (QEMU creates the socket). We need to change the configuration in puppet-neutron and THT to use 'server' mode for the host config. This seems backwards because we are configuring 'server' mode as OVS configuration, but really OVS is the client here. Will add some documentation to clarify this in OOO.
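The client/server inversion described above shows up in the OVS port definition. The following is an illustrative sketch (bridge and port names are hypothetical): with type dpdkvhostuserclient, OVS connects as the client to a socket that QEMU creates under /var/lib/vhost_sockets.

```shell
# Sketch only: a dpdkvhostuserclient port, where QEMU owns the socket and
# OVS is the vhost-user client. br-int and vhu0 are illustrative names.
ovs-vsctl add-port br-int vhu0 -- set Interface vhu0 \
    type=dpdkvhostuserclient \
    options:vhost-server-path=/var/lib/vhost_sockets/vhu0
```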
The workaround for now is to set "OvsVhostuserMode: server" in your THT env file. We should also fix this in networking-odl, as 'client' is the default there for OVS in the documentation and the python tool. puppet-neutron also needs to be fixed.
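In THT terms, the workaround above is a one-line environment snippet; a sketch, using the parameter named in this comment:

```yaml
# 'server' here describes the QEMU side of the vhost-user socket: QEMU
# creates the socket and OVS connects to it as the client.
parameter_defaults:
  OvsVhostuserMode: server
```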
(In reply to Tim Rozet from comment #18)
> The unhealthy status is opened here and should not affect the functionality:
> https://bugzilla.redhat.com/show_bug.cgi?id=1549218
>
> For unable to boot an instance:
> The problem is the port is being created by OVS instead of by QEMU. That is
> resulting in the port being created in the wrong directory
> (/var/run/openvswitch) and not being read correctly by QEMU (looking in
> /var/lib/vhost_sockets). This type of port is deprecated and we should be
> using dpdkvhostuserclient ports (QEMU creates the port). We need to change
> the configuration in puppet-neutron, and THT to use 'server' mode for host
> config. This seems backwards because we are configuring 'server' mode as
> OVS configuration, but really OVS is the client here. Will add some
> documentation to clarify this in OOO.

Changing the subject due to the existing BZ 1549218.

Using the recommendation in comment #19, a DPDK instance has been booted successfully.

Thanks,
Ziv
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086