Bug 1694572 - migration from ml2/ovs to ml2/ovn is failing when paunch command was executed
Summary: migration from ml2/ovs to ml2/ovn is failing when paunch command was executed
Status: MODIFIED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-paunch
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z3
Target Release: 14.0 (Rocky)
Assignee: Luke Short
QA Contact: Eran Kuris
URL:
Whiteboard:
Keywords: Triaged, ZStream
Depends On:
Blocks: 1652084 1672786 1700302 1715112 1667920 1672782 1672785 1672790
 
Reported: 2019-04-01 07:04 UTC by Eran Kuris
Modified: 2019-07-10 13:24 UTC (History)
14 users

Clone Of:
Last Closed:


Attachments (Terms of Use)
hashed-docker-container-startup-config-step_4 (52.44 KB, text/plain)
2019-04-14 08:12 UTC, Eran Kuris


External Trackers
Tracker ID Priority Status Summary Last Updated
OpenStack gerrit 652647 None master: MERGED paunch: docker/compose: quote health-cmd (I8273684716df0dbe6f58608d1915e0c9c17493b3) 2019-04-16 09:44 UTC
OpenStack gerrit 653063 None None None 2019-04-16 17:58 UTC
OpenStack gerrit 666891 None None None 2019-06-21 16:54 UTC
OpenStack gerrit 667923 None None None 2019-06-27 12:56 UTC
Launchpad 1824808 None None None 2019-04-15 12:57 UTC

Description Eran Kuris 2019-04-01 07:04:30 UTC
Description of problem:
The migration process from ml2/ovs to ml2/ovn fails on the task "migration : Cleanup neutron datapath resources" on the controller nodes.

Version-Release number of selected component (if applicable):
OpenStack/14.0-RHEL-7/2019-03-06.1
[stack@undercloud-0 ~]$ rpm -qa | grep -i ovn 
python-networking-ovn-migration-tool-5.0.2-0.20181217155313.726820a.el7ost.noarch
puppet-ovn-13.3.1-0.20181013120724.38e2e33.el7ost.noarch
python-networking-ovn-5.0.2-0.20181217155313.726820a.el7ost.noarch
[stack@undercloud-0 ~]$ rpm -qa | grep -i migration
python-networking-ovn-migration-tool-5.0.2-0.20181217155313.726820a.el7ost.noarch
How reproducible:
100%

Steps to Reproduce:
1. run ovn migration job: 
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration/

Actual results:


Expected results:


Additional info:

Comment 1 Eran Kuris 2019-04-01 07:30:53 UTC
SOS Report : http://rhos-release.virt.bos.redhat.com/log/bz1694572/

ERROR :

failed: [controller-0] (item={'value': {u'netns_regex': u'fip-|snat-|qrouter-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log', u'cleanup_type': u'l3'}, 'key': u'neutron_l3_agent'}) => {"changed": true, "cmd": "if ip netns | egrep -e \"fip-|snat-|qrouter-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_l3_agent \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_l3_agent.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_l3_agent.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log --agent-type l3 --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:12.416200", "end": "2019-03-31 14:33:33.731888", "item": {"key": "neutron_l3_agent", "value": {"cleanup_type": "l3", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log", "netns_regex": "fip-|snat-|qrouter-"}}, "msg": "non-zero return code", 
"rc": 125, "start": "2019-03-31 14:33:21.315688", "stderr": "Unable to find image '5672:latest' locally\nTrying to pull repository registry.access.redhat.com/5672 ... \nTrying to pull repository docker.io/library/5672 ... \nTrying to pull repository registry.fedoraproject.org/5672 ... \nPulling repository registry.fedoraproject.org/5672\nTrying to pull repository quay.io/5672 ... \nTrying to pull repository registry.centos.org/5672 ... \nPulling repository registry.centos.org/5672\nTrying to pull repository docker.io/library/5672 ... \n/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Unable to find image '5672:latest' locally", "Trying to pull repository registry.access.redhat.com/5672 ... ", "Trying to pull repository docker.io/library/5672 ... ", "Trying to pull repository registry.fedoraproject.org/5672 ... ", "Pulling repository registry.fedoraproject.org/5672", "Trying to pull repository quay.io/5672 ... ", "Trying to pull repository registry.centos.org/5672 ... ", "Pulling repository registry.centos.org/5672", "Trying to pull repository docker.io/library/5672 ... 
", "/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.", "See '/usr/bin/docker-current run --help'."], "stdout": "qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)\nqrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)\nCleaning up\ndocker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh", "stdout_lines": ["qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)", 
"qrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)", "Cleaning up", "docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh"]}

Comment 7 Miguel Angel Ajo 2019-04-10 08:23:25 UTC
Eran, can you upload the /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json file to this BZ so they can diagnose the issue with paunch?

Comment 9 Eran Kuris 2019-04-14 08:12 UTC
Created attachment 1555029 [details]
hashed-docker-container-startup-config-step_4

Comment 10 Miguel Angel Ajo 2019-04-15 08:14:38 UTC
It seems like paunch is not quoting --health-cmd=/openstack/healthcheck 5672

This makes docker believe 5672 is the image, instead of part of the "health-cmd".



[root@controller-0 tmp]# paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \
                                     --action print-cmd \
                                     --container neutron_l3_agent \
                                     --interactive

docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --interactive --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1


JSON FILE:


  "neutron_l3_agent": {
    "start_order": 10, 
    "ulimit": [
      "nofile=1024"
    ], 
    "healthcheck": {
      "test": "/openstack/healthcheck 5672"
    }, 
    "image": "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1", 
    "pid": "host", 
    "environment": [
      "KOLLA_CONFIG_STRATEGY=COPY_ALWAYS", 
      "TRIPLEO_CONFIG_HASH=55f1e4a78a54dee524d17bb5b3f72103"
    ], 
    "volumes": [
      "/etc/hosts:/etc/hosts:ro", 
      "/etc/localtime:/etc/localtime:ro", 
      "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro", 
      "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro", 
      "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro", 
      "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro", 
      "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro", 
      "/dev/log:/dev/log", 
      "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro", 
      "/etc/puppet:/etc/puppet:ro", 
      "/var/log/containers/neutron:/var/log/neutron", 
      "/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro", 
      "/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro", 
      "/lib/modules:/lib/modules:ro", 
      "/run/openvswitch:/run/openvswitch", 
      "/var/lib/neutron:/var/lib/neutron", 
      "/run/netns:/run/netns:shared", 
      "/var/lib/openstack:/var/lib/openstack", 
      "/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro", 
      "/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro", 
      "/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro"
    ], 
    "net": "host", 
    "privileged": true, 
    "restart": "always"
  },
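The quoting problem diagnosed above can be sketched outside paunch (a hypothetical illustration, not paunch's actual code): joining the builder's argument list with plain spaces loses the word boundary inside the healthcheck command, so a shell that re-parses the printed line hands docker "5672" as a positional argument, i.e. the image name.

```python
import shlex

# Hypothetical argument list resembling what the command builder produces;
# trimmed down from the docker run line above, not taken from paunch's source.
args = [
    "docker", "run", "--name", "neutron_l3_agent",
    "--health-cmd=/openstack/healthcheck 5672",  # one argument containing a space
    "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1",
]

# Plain join: re-parsing the printed line splits the health-cmd in two,
# so "5672" becomes its own word (which docker then treats as the image).
plain = " ".join(args)
assert shlex.split(plain)[4:6] == ["--health-cmd=/openstack/healthcheck", "5672"]

# Quoting each argument preserves the boundary across the round trip.
quoted = " ".join(shlex.quote(a) for a in args)
assert shlex.split(quoted) == args
```

With the quoted form, the image name stays in its expected position and the healthcheck argument survives intact; this is exactly what the merged gerrit change 652647 ("quote health-cmd") addresses.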

Comment 26 Eran Kuris 2019-06-11 08:33:35 UTC
Can't verify before getting the fix of https://bugzilla.redhat.com/show_bug.cgi?id=1700302

Comment 27 Eran Kuris 2019-06-12 12:56:20 UTC
Can't verify before getting the fix of https://bugzilla.redhat.com/show_bug.cgi?id=1700302

Comment 28 Emilien Macchi 2019-06-13 21:29:40 UTC
Note that we're now hitting https://bugzilla.redhat.com/show_bug.cgi?id=1717144, which might be related to my patch in Paunch... I'm investigating.

Comment 29 Emilien Macchi 2019-06-13 22:04:05 UTC
FYI I'm reverting the patch as I tested on OSP14 and the healthchecks work fine now.

https://review.opendev.org/#/c/665249/

Regarding comment #10 - we need to investigate how you hit the issue.

Comment 30 Eran Kuris 2019-06-17 05:26:27 UTC
Looks like the issue still exists:
failed: [controller-0] (item={'value': {u'netns_regex': u'qdhcp-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log', u'cleanup_type': u'dhcp'}, 'key': u'neutron_dhcp'}) => {"changed": true, "cmd": "# avoid cleaning up dhcp namespaces if the neutron dhcp agent is up (SR-IOV use case)\n if [[ \"dhcp\" == \"dhcp\" ]]; then\n docker inspect neutron_dhcp && echo \"Shouldn't clean DHCP namespaces if neutron_dhcp docker is up\" && exit 0\n fi\n\n if ip netns | egrep -e \"qdhcp-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_dhcp \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_dhcp.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_dhcp.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log --agent-type dhcp --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:00.586690", "end": "2019-06-16 16:32:00.968967", "item": {"key": "neutron_dhcp", "value": {"cleanup_type": "dhcp", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent 
--log-file=/var/log/neutron/netns-cleanup-dhcp.log", "netns_regex": "qdhcp-"}}, "msg": "non-zero return code", "rc": 125, "start": "2019-06-16 16:32:00.382277", "stderr": "Error: No such object: neutron_dhcp\n/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Error: No such object: neutron_dhcp", "/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.", "See '/usr/bin/docker-current run --help'."], "stdout": "[]\nqdhcp-102e3fa9-3fba-4
OpenStack/14.0-RHEL-7/2019-06-10.3

[root@controller-0 ~]# rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch
[root@controller-0 ~]# logout
[heat-admin@controller-0 ~]$ logout
Connection to controller-0 closed.
(overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch

Comment 31 Emilien Macchi 2019-06-20 17:29:41 UTC
I've checked paunch, and indeed the cmd output isn't quoting the command, so if the healthcheck command has an argument, the run will fail because docker interprets that argument as a docker CLI argument.

Now, I've been digging on where this code is executed and I found it: https://opendev.org/openstack/networking-ovn/src/branch/master/migration/tripleo_environment/playbooks/roles/migration/tasks/cleanup-dataplane.yml#L37-L69

You're running paunch to find out what the docker command looks like, then parsing that command, adjusting it with sed for your needs, and executing it.
At this stage, there is no guarantee that the installer will behave correctly later, since you're executing a third-party script in the container; it seems a bit risky to me...

I'll look at the paunch cmd output again and see if I can quickly fix it.

Comment 32 Emilien Macchi 2019-06-20 19:49:07 UTC
So 2 things here:

1) you shouldn't hack into containers this way. What I've seen in cleanup-dataplane.yml does things we clearly don't recommend or support at all.

2) the paunch debug command does seem broken. The bug is the following: we use print to show the docker arguments as they come from the builder, while when we execute them we use subprocess, which parses them properly; that's why the error occurs only for the printed command. What we need to do is make the print quote each argument. This is IMHO low-hanging fruit, but I don't think it's going to make it into this zstream.
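The print-vs-subprocess asymmetry described above can be illustrated with a small sketch (assumed behavior for illustration, not paunch's code): executing the list form passes each element as a single argv entry, so the embedded space survives, while printing a naive space-join and re-parsing it does not.

```python
import shlex
import subprocess

# A command whose last element contains a space, like the healthcheck test.
cmd = ["echo", "--health-cmd=/openstack/healthcheck 5672"]

# List-form execution: subprocess hands each element to the program as one
# argv entry, so the embedded space is preserved and the command works.
out = subprocess.run(cmd, capture_output=True, text=True).stdout
assert out.strip() == cmd[1]

# Printing the naive join and re-parsing it splits the argument in two.
printed = " ".join(cmd)
assert len(shlex.split(printed)) == 3

# Quoting each element during the print restores round-trip fidelity,
# which is the fix described above.
fixed = " ".join(shlex.quote(c) for c in cmd)
assert shlex.split(fixed) == cmd
```

This is why the failure only shows up when the printed command is fed back through a shell, as the migration playbook's cleanup step does.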

