Bug 1694572

Summary: migration from ml2/ovs to ml2/ovn is failing when paunch command was executed
Product: Red Hat OpenStack
Reporter: Eran Kuris <ekuris>
Component: python-paunch
Assignee: Luke Short <lshort>
Status: CLOSED EOL
QA Contact: Eran Kuris <ekuris>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 14.0 (Rocky)
CC: apevec, aschultz, dalvarez, emacchi, jlibosva, jschluet, lhh, lmartins, lshort, majopela, mburns, njohnston, ramishra, scohen, spower, tfreger, twilson
Target Milestone: z5
Keywords: AutomationBlocker, TestOnly, Triaged, ZStream
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-paunch-3.2.0-0.20180921003262.6d2ec11.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-24 12:29:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1659066, 1667920, 1672782, 1672785, 1672786, 1672790, 1700302, 1715112
Attachments: hashed-docker-container-startup-config-step_4

Description Eran Kuris 2019-04-01 07:04:30 UTC
Description of problem:
The migration process from ml2/ovs to ml2/ovn fails on the Ansible task "migration : Cleanup neutron datapath resources" on controller nodes.

Version-Release number of selected component (if applicable):
OpenStack/14.0-RHEL-7/2019-03-06.1
[stack@undercloud-0 ~]$ rpm -qa | grep -i ovn 
python-networking-ovn-migration-tool-5.0.2-0.20181217155313.726820a.el7ost.noarch
puppet-ovn-13.3.1-0.20181013120724.38e2e33.el7ost.noarch
python-networking-ovn-5.0.2-0.20181217155313.726820a.el7ost.noarch
[stack@undercloud-0 ~]$ rpm -qa | grep -i migration
python-networking-ovn-migration-tool-5.0.2-0.20181217155313.726820a.el7ost.noarch
How reproducible:
100%

Steps to Reproduce:
1. run ovn migration job: 
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ml2ovs-to-ovn-migration/

Actual results:


Expected results:


Additional info:

Comment 1 Eran Kuris 2019-04-01 07:30:53 UTC
SOS Report : http://rhos-release.virt.bos.redhat.com/log/bz1694572/

ERROR :

failed: [controller-0] (item={'value': {u'netns_regex': u'fip-|snat-|qrouter-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log', u'cleanup_type': u'l3'}, 'key': u'neutron_l3_agent'}) => {"changed": true, "cmd": "if ip netns | egrep -e \"fip-|snat-|qrouter-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_l3_agent \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_l3_agent.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_l3_agent.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log --agent-type l3 --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:12.416200", "end": "2019-03-31 14:33:33.731888", "item": {"key": "neutron_l3_agent", "value": {"cleanup_type": "l3", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/l3_agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-l3-agent --log-file=/var/log/neutron/netns-cleanup-l3.log", "netns_regex": "fip-|snat-|qrouter-"}}, "msg": "non-zero return code", 
"rc": 125, "start": "2019-03-31 14:33:21.315688", "stderr": "Unable to find image '5672:latest' locally\nTrying to pull repository registry.access.redhat.com/5672 ... \nTrying to pull repository docker.io/library/5672 ... \nTrying to pull repository registry.fedoraproject.org/5672 ... \nPulling repository registry.fedoraproject.org/5672\nTrying to pull repository quay.io/5672 ... \nTrying to pull repository registry.centos.org/5672 ... \nPulling repository registry.centos.org/5672\nTrying to pull repository docker.io/library/5672 ... \n/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Unable to find image '5672:latest' locally", "Trying to pull repository registry.access.redhat.com/5672 ... ", "Trying to pull repository docker.io/library/5672 ... ", "Trying to pull repository registry.fedoraproject.org/5672 ... ", "Pulling repository registry.fedoraproject.org/5672", "Trying to pull repository quay.io/5672 ... ", "Trying to pull repository registry.centos.org/5672 ... ", "Pulling repository registry.centos.org/5672", "Trying to pull repository docker.io/library/5672 ... 
", "/usr/bin/docker-current: repository docker.io/5672 not found: does not exist or no pull access.", "See '/usr/bin/docker-current run --help'."], "stdout": "qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)\nqrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)\nCleaning up\ndocker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh", "stdout_lines": ["qrouter-ccfea4c8-71d9-49e8-a177-75f12ab8d9e3 (id: 3)", 
"qrouter-b3ba9c6d-4108-4918-879e-99edf06d568c (id: 1)", "Cleaning up", "docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --volume=/tmp/cleanup-neutron_l3_agent.sh:/cleanup.sh:ro --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1 /cleanup.sh"]}

Comment 7 Miguel Angel Ajo 2019-04-10 08:23:25 UTC
Eran, can you upload the /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json file to this BZ so they can diagnose the issue with paunch?

Comment 9 Eran Kuris 2019-04-14 08:12:11 UTC
Created attachment 1555029 [details]
hashed-docker-container-startup-config-step_4

Comment 10 Miguel Angel Ajo 2019-04-15 08:14:38 UTC
It seems that paunch is not quoting --health-cmd=/openstack/healthcheck 5672,
which makes docker believe 5672 is the image, instead of part of the "health-cmd".



[root@controller-0 tmp]# paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \
                                     --action print-cmd \
                                     --container neutron_l3_agent \
                                     --interactive

docker run --name neutron_l3_agent --env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS --env=TRIPLEO_CONFIG_HASH=1c97362381162383633f1c95d4277e66 --rm --interactive --tty --net=host --pid=host --ulimit=nofile=1024 --health-cmd=/openstack/healthcheck 5672 --privileged=true --restart=no --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/neutron:/var/log/neutron --volume=/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro --volume=/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro --volume=/lib/modules:/lib/modules:ro --volume=/run/openvswitch:/run/openvswitch --volume=/var/lib/neutron:/var/lib/neutron --volume=/run/netns:/run/netns:shared --volume=/var/lib/openstack:/var/lib/openstack --volume=/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro --volume=/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro --volume=/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro 192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1


JSON FILE:


  "neutron_l3_agent": {
    "start_order": 10, 
    "ulimit": [
      "nofile=1024"
    ], 
    "healthcheck": {
      "test": "/openstack/healthcheck 5672"
    }, 
    "image": "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1", 
    "pid": "host", 
    "environment": [
      "KOLLA_CONFIG_STRATEGY=COPY_ALWAYS", 
      "TRIPLEO_CONFIG_HASH=55f1e4a78a54dee524d17bb5b3f72103"
    ], 
    "volumes": [
      "/etc/hosts:/etc/hosts:ro", 
      "/etc/localtime:/etc/localtime:ro", 
      "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro", 
      "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro", 
      "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro", 
      "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro", 
      "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro", 
      "/dev/log:/dev/log", 
      "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro", 
      "/etc/puppet:/etc/puppet:ro", 
      "/var/log/containers/neutron:/var/log/neutron", 
      "/var/lib/kolla/config_files/neutron_l3_agent.json:/var/lib/kolla/config_files/config.json:ro", 
      "/var/lib/config-data/puppet-generated/neutron/:/var/lib/kolla/config_files/src:ro", 
      "/lib/modules:/lib/modules:ro", 
      "/run/openvswitch:/run/openvswitch", 
      "/var/lib/neutron:/var/lib/neutron", 
      "/run/netns:/run/netns:shared", 
      "/var/lib/openstack:/var/lib/openstack", 
      "/var/lib/neutron/keepalived_wrapper:/usr/local/bin/keepalived:ro", 
      "/var/lib/neutron/l3_haproxy_wrapper:/usr/local/bin/haproxy:ro", 
      "/var/lib/neutron/dibbler_wrapper:/usr/local/bin/dibbler_client:ro"
    ], 
    "net": "host", 
    "privileged": true, 
    "restart": "always"
  },
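The failure mode above can be reproduced in isolation: the builder carries `--health-cmd=/openstack/healthcheck 5672` as one argument, but when the argument list is joined into a flat string without quoting, a shell re-parsing that string sees `5672` as a separate token and docker treats it as the image name. A minimal sketch of the mechanism (the argument list here is shortened and illustrative, not paunch's actual output):

```python
import shlex

# Arguments as a builder might emit them; "--health-cmd=/openstack/healthcheck 5672"
# is a single argument whose value contains a space.
args = [
    "docker", "run", "--name", "neutron_l3_agent",
    "--health-cmd=/openstack/healthcheck 5672",
    "192.168.24.1:8787/rhosp14/openstack-neutron-l3-agent:2019-03-06.1",
]

# Naive join: the space inside --health-cmd becomes indistinguishable from
# an argument separator, so a shell re-parsing the string splits it apart
# and "5672" ends up in the image position.
naive = " ".join(args)

# Quoting each argument preserves the original boundaries across a
# print/re-parse round trip.
quoted = " ".join(shlex.quote(a) for a in args)

print(naive)
print(quoted)
```

Round-tripping `quoted` through `shlex.split()` yields the original list, while `naive` gains an extra `5672` token, which is exactly the "Unable to find image '5672:latest'" symptom in Comment 1.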

Comment 26 Eran Kuris 2019-06-11 08:33:35 UTC
Can't verify before getting the fix of https://bugzilla.redhat.com/show_bug.cgi?id=1700302

Comment 27 Eran Kuris 2019-06-12 12:56:20 UTC
Can't verify before getting the fix of https://bugzilla.redhat.com/show_bug.cgi?id=1700302

Comment 28 Emilien Macchi 2019-06-13 21:29:40 UTC
Note that we're now hitting https://bugzilla.redhat.com/show_bug.cgi?id=1717144 which might be related to my patch in Paunch... I'm investigating.

Comment 29 Emilien Macchi 2019-06-13 22:04:05 UTC
FYI I'm reverting the patch as I tested on OSP14 and the healthchecks work fine now.

https://review.opendev.org/#/c/665249/

Regarding comment #10 - we need to investigate how you hit the issue.

Comment 30 Eran Kuris 2019-06-17 05:26:27 UTC
Looks like the issue still exists:
failed: [controller-0] (item={'value': {u'netns_regex': u'qdhcp-', u'config': u'--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log', u'cleanup_type': u'dhcp'}, 'key': u'neutron_dhcp'}) => {"changed": true, "cmd": "# avoid cleaning up dhcp namespaces if the neutron dhcp agent is up (SR-IOV use case)\n if [[ \"dhcp\" == \"dhcp\" ]]; then\n docker inspect neutron_dhcp && echo \"Shouldn't clean DHCP namespaces if neutron_dhcp docker is up\" && exit 0\n fi\n\n if ip netns | egrep -e \"qdhcp-\"\n then\n echo \"Cleaning up\"\n cmd=\"$(paunch debug --file /var/lib/ovn-migration-backup/tripleo-config/hashed-docker-container-startup-config-step_4.json \\\n--action print-cmd --container neutron_dhcp \\\n--interactive | \\\nsed 's/--interactive /--volume=\\/tmp\\/cleanup-neutron_dhcp.sh:\\/cleanup.sh:ro /g ' )\"\n\n f=\"/tmp/cleanup-neutron_dhcp.sh\"\n\n echo \"#!/bin/sh\" > $f\n echo \"set -x\" >> $f\n echo \"set -e\" >> $f\n echo \"sudo -E kolla_set_configs\" >> $f\n echo \"neutron-netns-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/netns-cleanup-dhcp.log --agent-type dhcp --force\" >> $f\n\n chmod a+x $f\n\n echo $cmd /cleanup.sh\n\n $cmd /cleanup.sh\n\n fi", "delta": "0:00:00.586690", "end": "2019-06-16 16:32:00.968967", "item": {"key": "neutron_dhcp", "value": {"cleanup_type": "dhcp", "config": "--config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent 
--log-file=/var/log/neutron/netns-cleanup-dhcp.log", "netns_regex": "qdhcp-"}}, "msg": "non-zero return code", "rc": 125, "start": "2019-06-16 16:32:00.382277", "stderr": "Error: No such object: neutron_dhcp\n/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["Error: No such object: neutron_dhcp", "/usr/bin/docker-current: Error parsing reference: \"5672\\\"\" is not a valid repository/tag: invalid reference format.", "See '/usr/bin/docker-current run --help'."], "stdout": "[]\nqdhcp-102e3fa9-3fba-4
OpenStack/14.0-RHEL-7/2019-06-10.3

[root@controller-0 ~]# rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch
[root@controller-0 ~]# logout
[heat-admin@controller-0 ~]$ logout
Connection to controller-0 closed.
(overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep -i paunch
python-paunch-3.2.0-0.20180921003260.6d2ec11.el7ost.noarch

Comment 31 Emilien Macchi 2019-06-20 17:29:41 UTC
I've checked paunch and indeed the cmd output isn't quoting the command, so if the healthcheck has an argument the command will fail, because docker interprets that argument as a docker CLI argument.

Now, I've been digging on where this code is executed and I found it: https://opendev.org/openstack/networking-ovn/src/branch/master/migration/tripleo_environment/playbooks/roles/migration/tasks/cleanup-dataplane.yml#L37-L69

You're running paunch to find out what the docker command looks like, then you parse this cmd, sed it to adjust it for your needs, and execute it.
At this stage, there is no guarantee that the installer will behave correctly later, since you're executing a third-party script in the container; it's a bit risky to me...

I'll look at the paunch cmd output again and see if I can quickly fix it.

Comment 32 Emilien Macchi 2019-06-20 19:49:07 UTC
So 2 things here:

1) you shouldn't hack into containers this way. What I've seen in cleanup-dataplane.yml does things we clearly don't recommend or support at all.

2) the paunch debug command does seem broken. The bug is the following: we use print to show the docker arguments as they come from the builder, while when we actually execute them we pass the argument list to subprocess, which handles each argument properly; that's why the error only occurs with print. What we need to do is make the print quote each arg. This is IMHO low-hanging fruit, but I don't think it's going to make it for this zstream.
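The asymmetry described above — execution via an argument list is safe while the printed string is not — can be sketched as follows. This is an illustration of the proposed fix (quote each argument only in the print path), not paunch's actual code; the function names are hypothetical:

```python
import shlex
import subprocess

def run_container(cmd):
    # Execution path: subprocess receives the argument list directly, so an
    # argument containing a space (e.g. a healthcheck with a parameter) is
    # passed to the process intact. No quoting problem can occur here.
    return subprocess.run(cmd, capture_output=True, text=True)

def print_cmd(cmd):
    # Print path: quote each argument so the printed command survives being
    # re-parsed by a shell (or by the sed pipeline in cleanup-dataplane.yml)
    # without losing argument boundaries.
    return " ".join(shlex.quote(arg) for arg in cmd)

cmd = ["echo", "--health-cmd=/openstack/healthcheck 5672"]
print(print_cmd(cmd))
```

With this change, `shlex.split(print_cmd(cmd))` returns the original list, so the `paunch debug --action print-cmd | sed ...` pattern used by the migration playbook would no longer misplace the healthcheck argument.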