Description of problem:

Using Rally, I start 10 guests in parallel; if this test passes, the automation moves on to starting 20 guests in parallel, and we keep increasing by 10 guests until we determine the maximum concurrent guest launch an OSP environment can sustain. Test cases 10-80 pass reliably (unless we run into a harness issue); however, once we reach the 90-100 guest test cases, we lose connectivity to the Rally server, which runs as a guest inside our cloud (to avoid the need for a ton of floating IPs).

This is a non-HA run. On the Neutron Networker we see the following in ovs-agent.log with DEBUG enabled:

2014-09-02 17:49:14.546 16886 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent rpc_loop - iteration:717 completed. Processed ports statistics: {'ancillary': {'removed': 0, 'added': 0}, 'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.034 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1386
2014-09-02 17:49:15.454 16886 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1405
2014-09-02 17:49:16.513 16886 DEBUG neutron.agent.linux.async_process [-] Halting async process [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']]. stop /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:90

More:

2014-09-03 11:52:34.528 20796 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent rpc_loop - iteration:32471 completed. Processed ports statistics: {'ancillary': {'removed': 0, 'added': 0}, 'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.034 rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1386
2014-09-03 11:52:35.734 20796 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1405
2014-09-03 11:52:36.496 20796 DEBUG neutron.agent.linux.async_process [-] Halting async process [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']]. stop /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:90
2014-09-03 11:52:36.496 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21126', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.511 20796 DEBUG neutron.agent.linux.utils [-] Command: ['ps', '--ppid', '21126', '-o', 'pid='] Exit code: 0 Stdout: '21128\n' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.511 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21128', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.519 20796 DEBUG neutron.agent.linux.utils [-] Command: ['ps', '--ppid', '21128', '-o', 'pid='] Exit code: 0 Stdout: '21133\n' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.520 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['ps', '--ppid', '21133', '-o', 'pid='] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.528 20796 DEBUG neutron.agent.linux.utils [-] Command: ['ps', '--ppid', '21133', '-o', 'pid='] Exit code: 1 Stdout: '' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:36.528 20796 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '21133'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48
2014-09-03 11:52:36.561 20796 DEBUG neutron.agent.linux.utils [-] Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '21133'] Exit code: 0 Stdout: '' Stderr: '' execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:74
2014-09-03 11:52:38.468 9081 INFO neutron.common.config [-] Logging enabled!

Once this happens, the Rally guest loses connectivity, and I have to manually restart the Neutron Networker OSP services; restarting the Rally guest does not restore connectivity. Interesting observation: guests launched after the restart of the services seem to have connectivity to the qdhcp namespace; however, the Rally guest does not.

[root@macbc305bf5f451 neutron]# ip netns exec qdhcp-02fac325-2b01-4418-b305-6ef69a877b63 ping 10.0.0.165
PING 10.0.0.165 (10.0.0.165) 56(84) bytes of data.
64 bytes from 10.0.0.165: icmp_seq=1 ttl=64 time=2.78 ms
64 bytes from 10.0.0.165: icmp_seq=2 ttl=64 time=0.554 ms
^C
--- 10.0.0.165 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.554/1.670/2.787/1.117 ms

[root@macbc305bf5f451 neutron]# ip netns exec qdhcp-02fac325-2b01-4418-b305-6ef69a877b63 ping 10.0.0.8
PING 10.0.0.8 (10.0.0.8) 56(84) bytes of data.
From 10.0.0.3 icmp_seq=1 Destination Host Unreachable
From 10.0.0.3 icmp_seq=2 Destination Host Unreachable
From 10.0.0.3 icmp_seq=3 Destination Host Unreachable
From 10.0.0.3 icmp_seq=4 Destination Host Unreachable
^C
--- 10.0.0.8 ping statistics ---
4 packets
This is a RHEL OSP 5 on RHEL 7 deployment, non-HA, with 5 nodes deployed with Staypuft:

1x Controller
1x Neutron Networker
3x Compute nodes

Rally runs within the OpenStack cloud deployed above, with a floating IP used to make API calls to OpenStack. To be clear, both the floating IP and the internal address of the Rally guest are unreachable.
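A note on the ps/kill sequence in the trace above: that is the agent halting its ovsdb-client monitor helper on SIGTERM by walking the child-process chain and killing the leaf. A minimal Python sketch of that walk, assuming nothing about neutron internals (find_child_pids and kill_leaf_descendant are illustrative names, not the actual neutron functions, and the real agent issues the kill through neutron-rootwrap):

import subprocess

def find_child_pids(pid):
    # Mirrors the "ps --ppid <pid> -o pid=" calls in the log; ps exits 1
    # with empty stdout when the process has no children (the
    # "Exit code: 1" entry above).
    try:
        out = subprocess.check_output(['ps', '--ppid', str(pid), '-o', 'pid='])
    except subprocess.CalledProcessError:
        return []
    return [int(p) for p in out.split()]

def kill_leaf_descendant(pid):
    # Follow the parent->child chain (21126 -> 21128 -> 21133 in the log)
    # and SIGKILL the leaf, which is the actual ovsdb-client process.
    children = find_child_pids(pid)
    while children:
        pid = children[0]
        children = find_child_pids(pid)
    subprocess.check_call(['kill', '-9', str(pid)])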
Created attachment 934192 [details] ovs-agent log-file
Hit this issue again. Here is the output of ovs-vsctl show after hitting the issue and cleaning up all the old guests except for the Rally guest, which is still there:

[root@macbc305bf5f451 neutron]# ovs-vsctl show
8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port int-br-ex
            Interface int-br-ex
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port br-int
            Interface br-int
                type: internal
    Bridge br-tun
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"

Note: the tunnel port for the compute node hosting the Rally guest is missing from br-tun, so the guest has zero connectivity to anything.

After restarting services:

8e923870-f081-4585-ba03-7ec16e55b6cf
    Bridge br-ex
        Port br-ex
            Interface br-ex
                type: internal
        Port phy-br-ex
            Interface phy-br-ex
        Port "qg-8a7abac9-d6"
            Interface "qg-8a7abac9-d6"
                type: internal
        Port "eno2"
            Interface "eno2"
    Bridge br-int
        fail_mode: secure
        Port "qr-f23ab158-ff"
            tag: 1
            Interface "qr-f23ab158-ff"
                type: internal
        Port "tapbfc4341f-c8"
            tag: 2
            Interface "tapbfc4341f-c8"
                type: internal
        Port "tap7418e859-a2"
            tag: 1
            Interface "tap7418e859-a2"
                type: internal
        Port br-int
            Interface br-int
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
    Bridge br-tun
        Port "vxlan-ac1264f1"
            Interface "vxlan-ac1264f1"
                type: vxlan
                options: {in_key=flow, local_ip="172.18.100.240", out_key=flow, remote_ip="172.18.100.241"}
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
    ovs_version: "2.1.3"
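The missing tunnel can also be spotted mechanically. A minimal Python sketch (not shipped tooling; EXPECTED_PEERS is deployment-specific and would list every tunnel peer) that checks br-tun for the agent's vxlan-<hex-ip> ports, matching the naming visible after the restart (172.18.100.241 -> vxlan-ac1264f1):

import subprocess

# Tunnel endpoints br-tun is expected to reach; taken from the output
# above and deployment-specific.
EXPECTED_PEERS = ['172.18.100.241']

def missing_tunnel_peers():
    ports = subprocess.check_output(
        ['ovs-vsctl', 'list-ports', 'br-tun']).decode().split()
    missing = []
    for ip in EXPECTED_PEERS:
        # VXLAN port names embed the hex-encoded remote_ip.
        hex_ip = '%02x%02x%02x%02x' % tuple(int(o) for o in ip.split('.'))
        if not any(hex_ip in p for p in ports):
            missing.append(ip)
    return missing

print(missing_tunnel_peers())  # [] once all tunnel ports are in place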
Disabling L2Population seems to fix the problem of not being able to get above 100 guests launched concurrently. With the latest testing (L2Population disabled, 3 iterations) I averaged 110 guests (90, 130, 120). The reason I am noting this is that I am not seeing the tunnel port removed from br-tun during this test, as I would when L2Population was enabled.
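For reference, disabling L2Population here amounts to the usual ML2 change; a sketch of the relevant settings, assuming stock config paths (the exact driver list varies per deployment):

# /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2]
# was: mechanism_drivers = openvswitch,l2population
mechanism_drivers = openvswitch

# and the matching setting on the OVS agents:
[agent]
l2_population = False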
I can recreate this bug by running:

rally task start rally-launch

I was using a RHEL 7 cloud image. I would run the above command a couple of times until failure; failure is when I lose connectivity to the Rally guest. The rally-launch scenario:

{
    "VMTasks.boot_runcommand_delete": [
        {
            "runner": {
                "type": "constant",
                "times": 100,
                "concurrency": 100
            },
            "args": {
                "username": "root",
                "floating_network": "Public",
                "use_floatingip": false,
                "script": "/opt/rally/true.sh",
                "auto_assign_nic": true,
                "fixed_network": "private",
                "interpreter": "/bin/sh",
                "flavor": {
                    "name": "m1.small"
                },
                "image": {
                    "name": "rhel7"
                },
                "detailed": true
            },
            "context": {
                "users": {
                    "users_per_tenant": 1,
                    "tenants": 1
                },
                "quotas": {
                    "neutron": {
                        "network": -1,
                        "port": -1
                    },
                    "nova": {
                        "instances": -1,
                        "cores": -1,
                        "ram": -1
                    }
                }
            }
        }
    ]
}
This looks like a race condition when multiple port delete and port create/update requests arrive for the OVS agent at the same time. It may turn out that neutron-server incorrectly detects that the whole tunnel is unused and requests FLOODING_ENTRY removal (tearing down the tunnel port) while new ports for that host are still coming in.
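To make the suspected window concrete, a deliberately simplified Python sketch of the check-then-act pattern; this is NOT the actual l2pop driver code, and ports_on_host plus the *_flooding_entry helpers are stand-ins for neutron-server's per-agent port accounting and the FLOODING_ENTRY RPCs:

ports_on_host = {'compute-1': 1}

def add_flooding_entry(host):
    print('FLOODING_ENTRY added -> tunnel port to %s created' % host)

def remove_flooding_entry(host):
    print('FLOODING_ENTRY removed -> tunnel port to %s deleted' % host)

def delete_port(host):
    ports_on_host[host] -= 1
    if ports_on_host[host] == 0:
        # Window: a concurrent create_port() for the same host can run
        # between this check and the removal below.
        remove_flooding_entry(host)

def create_port(host):
    if ports_on_host[host] == 0:
        add_flooding_entry(host)
    ports_on_host[host] += 1

# Interleaving under 90-100 concurrent boots (conceptually): delete_port
# drops the count to 0 and decides to remove the entry; create_port runs
# in that window, re-adds the entry, and bumps the count to 1; then the
# in-flight removal fires anyway. The new port is left with no tunnel,
# which matches the missing vxlan port on br-tun observed above.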
I've added a patch to the external links list that may fix the issues we are experiencing, though more testing is needed to make sure it helps.
Raised the bug upstream (see the external trackers list). Vivek from HP, who worked on DVR, privately told me that he is willing to provide the fix, because the DVR team re-introduced this regression in Juno after it had been fixed there.
*** Bug 1141497 has been marked as a duplicate of this bug. ***
The fix arrived with the 2014.1.4 rebase.
This bug was verified on RHEL 7.1 with a smaller setup (AIO + Compute node) and the Rally "boot_runcommand_delete" scenario, with L2pop enabled.

times: 500, concurrency: 50

openstack-neutron-ml2-2014.1.4-1.el7ost.noarch
openstack-neutron-openvswitch-2014.1.4-1.el7ost.noarch
openstack-neutron-2014.1.4-1.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0829.html