Bug 1694572
Summary: | Migration from ML2/OVS to ML2/OVN fails when a paunch command is executed | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> | ||||
Component: | python-paunch | Assignee: | Luke Short <lshort> | ||||
Status: | CLOSED EOL | QA Contact: | Eran Kuris <ekuris> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | 14.0 (Rocky) | CC: | apevec, aschultz, dalvarez, emacchi, jlibosva, jschluet, lhh, lmartins, lshort, majopela, mburns, njohnston, ramishra, scohen, spower, tfreger, twilson | ||||
Target Milestone: | z5 | Keywords: | AutomationBlocker, TestOnly, Triaged, ZStream | ||||
Target Release: | 14.0 (Rocky) | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | python-paunch-3.2.0-0.20180921003262.6d2ec11.el7ost | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2020-01-24 12:29:55 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1659066, 1667920, 1672782, 1672785, 1672786, 1672790, 1700302, 1715112 | ||||||
Attachments: |
|
Description
Eran Kuris
2019-04-01 07:04:30 UTC
SOS Report : http://rhos-release.virt.bos.redhat.com/log/bz1694572/ ERROR : failed: [controller-0] (item={'value': {u'netns_regex': u'fip-|snat-|qrouter-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log', u'cleanup_type': u'l3'}, 'key': u'neutron_l3_agent'}) => {"changed": true, "cmd": "if ip netns | egrep -e \"fip-|snat-|qrouter-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_l3_agent \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_l3_agent.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_l3_agent.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log --agent-type l3 --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:12.416200", "end": "2019-03-31 14:33:33.731888", "item": {"key": "neutron_l3_agent", "value": {"cleanup_type": "l3", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent 
--log-file=/var/log/neutron/netns-cleanup-l3.log", "netns_regex": "fip-|snat-|qrouter-"}}, "msg": "non-zero return code", "rc": 125, "start": "2019-03-31 14:33:21.315688", "stderr": "Unable to find image '5672:latest' locally\nTrying to pull repository registry.access.redhat.com/5672 ... \nTrying to pull repository docker.io/library/5672 ... \nTrying to pull repository registry.fedoraproject.org/5672 ... \nPulling repository registry.fedoraproject.org/5672\nTrying to pull repository quay.io/5672 ... \nTrying to pull repository registry.centos.org/5672 ... \nPulling repository registry.centos.org/5672\nTrying to pull repository docker.io/library/5672 ... \n/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Unable to find image '5672:latest' locally", "Trying to pull repository registry.access.redhat.com/5672 ... ", "Trying to pull repository docker.io/library/5672 ... ", "Trying to pull repository registry.fedoraproject.org/5672 ... ", "Pulling repository registry.fedoraproject.org/5672", "Trying to pull repository quay.io/5672 ... ", "Trying to pull repository registry.centos.org/5672 ... ", "Pulling repository registry.centos.org/5672", "Trying to pull repository docker.io/library/5672 ... 
", "/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.", "See '/usr/bin/docker-current run --help'."], "stdout": "qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)\nqrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)\nCleaning up\ndocker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh", "stdout_lines": ["qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)", 
"qrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)", "Cleaning up", "docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh"]} Eran, can you upload the /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json file to this BZ so they can diagnose the issue with paunch? Created attachment 1555029 [details]
hashed-docker-container-startup-config-step_4
It seems like paunch is not quoting --health-cmd=/openstack/healthcheck 5672, which makes docker believe 5672 is the image instead of part of the health-cmd.

[root@controller-0 tmp]# paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \
    --action print-cmd \
    --container neutron_l3_agent \
    --interactive
docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --interactive --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1

JSON FILE:

"neutron_l3_agent": {
    "start_order": 10,
    "ulimit": [
        "nofile=1024"
    ],
    "healthcheck": {
        "test": "/openstack/healthcheck 5672"
    },
    "image": "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1",
    "pid": "host",
    "environment": [
        "KOLLA_CONFIG_STRATEGY=COPY_ALWAYS",
        "TRIPLEO_CONFIG_HASH=55f1e4a78a54dee524d17bb5b3f72103"
    ],
    "volumes": [
        "/etc/hosts:/etc/hosts:ro",
        "/etc/localtime:/etc/localtime:ro",
        "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro",
        "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro",
        "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro",
        "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro",
        "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro",
        "/dev/log:/dev/log",
        "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro",
        "/etc/puppet:/etc/puppet:ro",
        "/var/log/containers/neutron:/var/log/neutron",
        "/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro",
        "/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro",
        "/lib/modules:/lib/modules:ro",
        "/run/openvswitch:/run/openvswitch",
        "/var/lib/neutron:/var/lib/neutron",
        "/run/netns:/run/netns:shared",
        "/var/lib/openstack:/var/lib/openstack",
        "/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro",
        "/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro",
        "/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro"
    ],
    "net": "host",
    "privileged": true,
    "restart": "always"
},

Can't verify before getting the fix of https://bugzilla.redhat.com/show_bug.cgi?id=1700302

Note that we're now hitting https://bugzilla.redhat.com/show_bug.cgi?id=1717144, which might be related to my patch in Paunch... I'm investigating.

FYI I'm reverting the patch as I tested on OSP14 and the healthchecks work fine now.
https://review.opendev.org/#/c/665249/

Regarding comment #10 - we need to investigate how you hit the issue.

Looks like the issue still exists:

failed: [controller-0] (item={'value': {u'netns_regex': u'qdhcp-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log', u'cleanup_type': u'dhcp'}, 'key': u'neutron_dhcp'}) => {"changed": true, "cmd": "# avoid cleaning up dhcp namespaces if the neutron dhcp agent is up (SR-IOV use case)\n if [[ \"dhcp\" == \"dhcp\" ]]; then\n docker inspect neutron_dhcp && echo \"Shouldn't clean DHCP namespaces if neutron_dhcp docker is up\" && exit 0\n fi\n\n if ip netns | egrep -e \"qdhcp-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_dhcp \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_dhcp.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_dhcp.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log --agent-type dhcp --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:00.586690", "end": "2019-06-16 16:32:00.968967", "item": {"key": "neutron_dhcp", "value": {"cleanup_type": "dhcp", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log", "netns_regex": "qdhcp-"}}, "msg": "non-zero return code", "rc": 125, "start": "2019-06-16 16:32:00.382277", "stderr": "Error: No such object: neutron_dhcp\n/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Error: No such object: neutron_dhcp", "/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.", "See '/usr/bin/docker-current run --help'."], "stdout": "[]\nqdhcp-102e3fa9-3fba-4

OpenStack/14.0-RHEL-7/2019-06-10.3

[root@controller-0 ~]# rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch
[root@controller-0 ~]# logout
[heat-admin@controller-0 ~]$ logout
Connection to controller-0 closed.
(overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch

I've checked paunch and indeed the cmd output isn't quoting the command, so if there is an argument the command will fail because docker interprets the healthcheck argument as a docker cli argument.

Now, I've been digging into where this code is executed and I found it:
https://opendev.org/openstack/networking-ovn/src/branch/master/migration/tripleo_environment/playbooks/roles/migration/tasks/cleanup-dataplane.yml#L37-L69

You're running paunch to find out what the docker command looks like, then you parse this cmd, sed it to adjust for your needs and execute it. At this stage, there is no guarantee that the installer will behave correctly later since you're executing a third party script in the container; it's a bit risky to me... I'll look at the paunch cmd output again and see if I can quickly fix it.
So 2 things here:

1) You shouldn't hack into containers this way. What I've seen in cleanup-dataplane.yml is doing things we clearly don't recommend nor support at all.

2) The paunch debug command does seem broken. The bug is the following: we use print to show the docker arguments as they come from the builder, whereas when we execute them we go through subprocess, which parses the argument list properly; that's why the error only shows up for the printed command. What we need to do is change the print to quote each argument. This is IMHO low hanging fruit, but I don't think it's going to make it for this zstream.
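The print-vs-subprocess asymmetry described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual paunch patch: cmd stands in for the argument list the builder assembles, and shlex.quote is assumed as the quoting helper (pipes.quote on Python 2).

```python
import shlex
import subprocess

def run_cmd(cmd):
    # Execution path: the argument list goes to subprocess as a list, so no
    # shell word splitting ever happens and nothing needs quoting. This is
    # why running the container works while the printed command does not.
    return subprocess.call(cmd)

def print_cmd(cmd):
    # Print path: quote every argument so that the printed line, when pasted
    # into a shell (or sed-massaged as cleanup-dataplane.yml does),
    # reproduces exactly the same argv -- including arguments containing
    # spaces, such as the healthcheck test command.
    return " ".join(shlex.quote(arg) for arg in cmd)

# Hypothetical argument list, standing in for what the builder assembles:
cmd = [
    "docker", "run",
    "--name", "neutron_l3_agent",
    "--health-cmd=/openstack/healthcheck 5672",
    "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1",
]

printed = print_cmd(cmd)
# Round trip: splitting the printed line recovers the original argv.
assert shlex.split(printed) == cmd
print(printed)
```

The healthcheck argument comes out as '--health-cmd=/openstack/healthcheck 5672' (one quoted token), so docker no longer mistakes 5672 for the image name.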