Description of problem:
The migration process from ml2/ovs to ml2/ovn is failing on task "Cleanup neutron datapath resources on controller nodes".

Version-Release number of selected component (if applicable):
OpenStack/14.0-RHEL-7/2019-03-06.1

[stack@undercloud-0 ~]$ rpm -qa | grep -i ovn
python-networking-ovn-migration-tool-5.0.2-0.20181217155313.726820a.el7ost.noarch
puppet-ovn-13.3.1-0.20181013120724.38e2e33.el7ost.noarch
python-networking-ovn-5.0.2-0.20181217155313.726820a.el7ost.noarch

[stack@undercloud-0 ~]$ rpm -qa | grep -i migration
python-networking-ovn-migration-tool-5.0.2-0.20181217155313.726820a.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run the OVN migration job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration/
2.
3.

Actual results:

Expected results:

Additional info:
SOS Report : http://rhos-release.virt.bos.redhat.com/log/bz1694572/ ERROR : failed: [controller-0] (item={'value': {u'netns_regex': u'fip-|snat-|qrouter-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log', u'cleanup_type': u'l3'}, 'key': u'neutron_l3_agent'}) => {"changed": true, "cmd": "if ip netns | egrep -e \"fip-|snat-|qrouter-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_l3_agent \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_l3_agent.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_l3_agent.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log --agent-type l3 --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:12.416200", "end": "2019-03-31 14:33:33.731888", "item": {"key": "neutron_l3_agent", "value": {"cleanup_type": "l3", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent 
--log-file=/var/log/neutron/netns-cleanup-l3.log", "netns_regex": "fip-|snat-|qrouter-"}}, "msg": "non-zero return code", "rc": 125, "start": "2019-03-31 14:33:21.315688", "stderr": "Unable to find image '5672:latest' locally\nTrying to pull repository registry.access.redhat.com/5672 ... \nTrying to pull repository docker.io/library/5672 ... \nTrying to pull repository registry.fedoraproject.org/5672 ... \nPulling repository registry.fedoraproject.org/5672\nTrying to pull repository quay.io/5672 ... \nTrying to pull repository registry.centos.org/5672 ... \nPulling repository registry.centos.org/5672\nTrying to pull repository docker.io/library/5672 ... \n/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Unable to find image '5672:latest' locally", "Trying to pull repository registry.access.redhat.com/5672 ... ", "Trying to pull repository docker.io/library/5672 ... ", "Trying to pull repository registry.fedoraproject.org/5672 ... ", "Pulling repository registry.fedoraproject.org/5672", "Trying to pull repository quay.io/5672 ... ", "Trying to pull repository registry.centos.org/5672 ... ", "Pulling repository registry.centos.org/5672", "Trying to pull repository docker.io/library/5672 ... 
", "/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.", "See '/usr/bin/docker-current run --help'."], "stdout": "qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)\nqrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)\nCleaning up\ndocker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh", "stdout_lines": ["qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)", 
"qrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)", "Cleaning up", "docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh"]}
Eran, can you upload the /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json file to this BZ so they can diagnose the issue with paunch?
Created attachment 1555029 [details] hashed-docker-container-startup-config-step_4
It seems like paunch is not quoting "--health-cmd=/openstack/healthcheck 5672", which makes docker believe 5672 is the image instead of part of the "health-cmd".

[root@controller-0 tmp]# paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \
    --action print-cmd \
    --container neutron_l3_agent \
    --interactive

docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --interactive --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1

JSON FILE:

"neutron_l3_agent": {
    "start_order": 10,
    "ulimit": [
        "nofile=1024"
    ],
    "healthcheck": {
        "test": "/openstack/healthcheck 5672"
    },
    "image": "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1",
    "pid": "host",
    "environment": [
        "KOLLA_CONFIG_STRATEGY=COPY_ALWAYS",
        "TRIPLEO_CONFIG_HASH=55f1e4a78a54dee524d17bb5b3f72103"
    ],
    "volumes": [
        "/etc/hosts:/etc/hosts:ro",
        "/etc/localtime:/etc/localtime:ro",
        "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro",
        "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro",
        "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro",
        "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro",
        "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro",
        "/dev/log:/dev/log",
        "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro",
        "/etc/puppet:/etc/puppet:ro",
        "/var/log/containers/neutron:/var/log/neutron",
        "/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro",
        "/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro",
        "/lib/modules:/lib/modules:ro",
        "/run/openvswitch:/run/openvswitch",
        "/var/lib/neutron:/var/lib/neutron",
        "/run/netns:/run/netns:shared",
        "/var/lib/openstack:/var/lib/openstack",
        "/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro",
        "/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro",
        "/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro"
    ],
    "net": "host",
    "privileged": true,
    "restart": "always"
},
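The quoting bug can be reproduced outside of paunch with a few lines of Python. This is an illustrative sketch, not paunch's actual code; the argument list is abbreviated from the docker command shown above:

```python
import shlex

# Abbreviated argument list from the paunch-generated docker command above.
# The healthcheck test "/openstack/healthcheck 5672" is a single argument.
args = ["docker", "run", "--name", "neutron_l3_agent",
        "--health-cmd=/openstack/healthcheck 5672",
        "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1"]

naive = " ".join(args)                           # what print-cmd effectively emits
quoted = " ".join(shlex.quote(a) for a in args)  # pipes.quote on Python 2

# Re-parsing the naive string splits the healthcheck value in two, so the
# shell hands docker "5672" as a positional argument, i.e. as the image name.
print(shlex.split(naive)[5])    # 5672 -- misparsed as the image
print(shlex.split(quoted)[4])   # the healthcheck argument survives intact
```

This matches the "Unable to find image '5672:latest'" error in the log: docker is trying to pull an image literally named 5672.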
Can't verify this before the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1700302 is available.
Note that we're now hitting https://bugzilla.redhat.com/show_bug.cgi?id=1717144, which might be related to my patch in paunch. I'm investigating.
FYI, I'm reverting the patch, as I tested on OSP14 and the healthchecks work fine now: https://review.opendev.org/#/c/665249/

Regarding comment #10 - we need to investigate how you hit the issue.
Looks like the issue still exists: failed: [controller-0] (item={'value': {u'netns_regex': u'qdhcp-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log', u'cleanup_type': u'dhcp'}, 'key': u'neutron_dhcp'}) => {"changed": true, "cmd": "# avoid cleaning up dhcp namespaces if the neutron dhcp agent is up (SR-IOV use case)\n if [[ \"dhcp\" == \"dhcp\" ]]; then\n docker inspect neutron_dhcp && echo \"Shouldn't clean DHCP namespaces if neutron_dhcp docker is up\" && exit 0\n fi\n\n if ip netns | egrep -e \"qdhcp-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_dhcp \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_dhcp.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_dhcp.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log --agent-type dhcp --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:00.586690", "end": "2019-06-16 16:32:00.968967", "item": {"key": "neutron_dhcp", "value": {"cleanup_type": "dhcp", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir 
/etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log", "netns_regex": "qdhcp-"}}, "msg": "non-zero return code", "rc": 125, "start": "2019-06-16 16:32:00.382277", "stderr": "Error: No such object: neutron_dhcp\n/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Error: No such object: neutron_dhcp", "/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.", "See '/usr/bin/docker-current run --help'."], "stdout": "[]\nqdhcp-102e3fa9-3fba-4

OpenStack/14.0-RHEL-7/2019-06-10.3

[root@controller-0 ~]# rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch
[root@controller-0 ~]# logout
[heat-admin@controller-0 ~]$ logout
Connection to controller-0 closed.
(overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch
I've checked paunch and indeed the cmd output isn't quoting the arguments, so if the healthcheck test has an argument the command will fail, because docker interprets that argument as a docker CLI argument.

I've been digging into where this code is executed and found it:
https://opendev.org/openstack/networking-ovn/src/branch/master/migration/tripleo_environment/playbooks/roles/migration/tasks/cleanup-dataplane.yml#L37-L69

You're running paunch to find out what the docker command looks like, then parsing that cmd, sed'ing it to adjust for your needs, and executing it. At this stage there is no guarantee that the installer will behave correctly later, since you're executing a third-party script in the container; it seems a bit risky to me. I'll look at the paunch cmd output again and see if I can quickly fix it.
So, two things here:

1) You shouldn't hack into containers this way. What I've seen in cleanup-dataplane.yml does things we clearly don't recommend or support at all.

2) The paunch debug command does indeed seem broken. The bug is the following: we use print to show the docker arguments as they come from the builder, while when we actually execute them we use subprocess, which parses them properly; that's why the error shows up only for the printed command. What we need to do is fix the print to quote each arg. This is IMHO low-hanging fruit, but I don't think it's going to make it for this zstream.
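A minimal sketch of what quoting each arg in the print path could look like. The helper name is hypothetical, not paunch's actual API, and on the Python 2 paunch shipped in OSP14 pipes.quote would play the role of shlex.quote:

```python
import shlex  # pipes.quote on Python 2

def render_cmd(args):
    """Hypothetical helper: render an argv list as a shell-safe string,
    quoting every token so multi-word values (like a healthcheck test
    with an argument) survive re-parsing by a shell."""
    return " ".join(shlex.quote(arg) for arg in args)

# "registry.example/..." is a placeholder image reference for illustration.
cmd = render_cmd(["docker", "run",
                  "--health-cmd=/openstack/healthcheck 5672",
                  "registry.example/neutron-l3-agent:tag"])
print(cmd)
# The multi-word healthcheck value is emitted as one quoted token, so
# re-parsing the printed string preserves the original argument boundaries.
```

This keeps the printed command equivalent to what subprocess executes from the unjoined list, which is exactly the discrepancy described above.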