Bug 1774911

Summary: [OSP13][OVN][ovs2.11] openstack-neutron-metadata-agent-ovn container is running with old tag after minor update from latest Z-release (13z9)
Product: Red Hat OpenStack Reporter: Roman Safronov <rsafrono>
Component: openstack-tripleo-common   Assignee: Sofer Athlan-Guyot <sathlang>
Status: CLOSED NOTABUG QA Contact: Alexander Chuzhoy <sasha>
Severity: high Docs Contact:
Priority: unspecified    
Version: 13.0 (Queens)   CC: apevec, jlibosva, lhh, majopela, mbultel, mburns, sathlang, scohen, slinaber
Target Milestone: ---   Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:   Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:   Environment:
Last Closed: 2019-12-09 13:59:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1753533    

Description Roman Safronov 2019-11-21 09:42:00 UTC
Description of problem:

Minor update failed at the overcloud update stage during validation of the overcloud docker container images

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/64/

TASK [tripleo-upgrade : validate overcloud docker images/containers] ***********
task path: /home/rhos-ci/jenkins/workspace/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/infrared/plugins/tripleo-upgrade/infrared_plugin/roles/tripleo-upgrade/tasks/update/main.yml:107
Wednesday 20 November 2019  18:21:53 +0000 (0:00:00.085)       2:29:36.728 **** 
fatal: [undercloud-0]: FAILED! => {
    "changed": true, 
    "cmd": "set -o pipefail\n source /home/stack/stackrc\n bash /home/stack/validate_docker_images_versions.sh 2>&1 | awk '{ print strftime(\"%Y-%m-%d %H:%M:%S |\"), $0; fflush(); }' > /home/stack/validate_oc_images_containers.log", 
    "delta": "0:00:30.085161", 
    "end": "2019-11-20 13:22:23.942804", 
    "rc": 2, 
    "start": "2019-11-20 13:21:53.857643"
}

MSG:

non-zero return code

from validate_oc_images_containers.log
2019-11-20 13:22:23 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-10-31.1 not present in /home/stack/composable_roles/docker-images.yaml


Feel free to change the component to a more relevant one.



Version-Release number of selected component (if applicable):
puddle 13.0-RHEL-7/2019-11-15.1
openvswitch2.11-2.11.0-26.el7fdp.x86_64
python-networking-ovn-4.0.3-14.el7ost.noarch
puppet-ovn-12.4.0-3.el7ost.noarch.rpm
python-networking-ovn-metadata-agent-4.0.3-14.el7ost.noarch.rpm   
openstack-tripleo-heat-templates-8.4.1-16.el7ost.noarch.rpm  



How reproducible:
Tried minor update once and the issue occurred.



Steps to Reproduce:
1. Run OVN OSP13 minor update job from z9 to the latest puddle containing openvswitch 2.11
2.
3.

Actual results:
Job failed

Expected results:
Job succeeded, minor update completed successfully

Additional info:

Comment 1 Roman Safronov 2019-11-21 14:21:06 UTC
It seems that the running openstack-neutron-metadata-agent-ovn container had tag 2019-10-31.1 rather than 2019-11-14.1

from /home/stack/composable_roles/docker-images.yaml
...
DockerOvnMetadataImage: 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-14.1
...

Comment 2 Roman Safronov 2019-11-26 14:37:44 UTC
Note: during minor update from some older release (2019-06-28.1) to the same puddle (2019-11-15.1) the issue did not happen: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/65/

I'll retest the minor update from z9 (2019-11-04.1) to 2019-11-15.1 (ovsFD19.G-respin1) once again and will update the BZ

Comment 3 Roman Safronov 2019-11-27 10:16:04 UTC
Retested minor update from z9 (2019-11-04.1) to 2019-11-15.1 (ovsFD19.G-respin1)
Result: the minor update failed at the same place.

When validating container images on the compute node, the expected image openstack-neutron-metadata-agent-ovn:2019-11-14.1 was not found. Instead, openstack-neutron-metadata-agent-ovn:2019-10-31.1 was found.

# cat docker-images.yaml  | grep meta
  DockerOvnMetadataImage: 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-14.1

Note: the validate_docker_images_versions.sh script retrieves all running container images with their tags from all nodes and compares them to the ones in docker-images.yaml.
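
For reference, the check is roughly the following (a simplified sketch, not the actual script; the node list retrieval and ssh user are assumptions, the file paths are the ones used in this job):

  # Simplified sketch (not the real script): for every overcloud node, list the
  # images (with tags) of the running containers and look each one up in
  # docker-images.yaml.
  source /home/stack/stackrc
  for node in $(openstack server list -f value -c Networks | cut -d= -f2); do
      for image in $(ssh heat-admin@$node "sudo docker ps --format '{{.Image}}'" | sort -u); do
          if grep -q "$image" /home/stack/composable_roles/docker-images.yaml; then
              echo "Image $image present in /home/stack/composable_roles/docker-images.yaml"
          else
              echo "Image $image not present in /home/stack/composable_roles/docker-images.yaml"
          fi
      done
  done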

from validate_oc_images_containers.log

2019-11-26 16:10:40 | Validate docker images at host 192.168.24.17
2019-11-26 16:10:40 | ================================================================================
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-ovn-controller:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-cron:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-iscsid:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-10-31.1 not present in /home/stack/composable_roles/docker-images.yaml   <--- BUG

Comment 5 Roman Safronov 2019-12-01 09:32:19 UTC
Retested the minor update on the new puddle 2019-11-28.2 (i.e. z9 to 2019-11-28.2); same issue:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/69/

2019-11-30 18:35:40 | Validate docker images at host 192.168.24.14
2019-11-30 18:35:40 | ================================================================================
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-ovn-controller:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-cron:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-iscsid:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-10-31.1 not present in /home/stack/composable_roles/docker-images.yaml <-- BUG

Comment 7 Roman Safronov 2019-12-04 14:55:06 UTC
Reassigned to DFG:Upgrades

The problem is that, after the update, one of the running containers on the compute node has a wrong (old) tag.
The problem happens only when updating from 13z9; when updating from 13z8 the problem does not occur.

Comment 8 Sofer Athlan-Guyot 2019-12-09 13:59:49 UTC
Hi,

this is a "side car" container.  Basically it's not defined in the template and not managed by update (or deployment) at all.  Those containers are spawned by neutron using a shell script that is has the right tag definition in it.  So to get new images for those sidecar container they need to be restarted (which causes disruption in the force).  One way to do that is to make sure you've rebooted all your nodes after the update.

Here is what we are doing in our own image validation: https://github.com/openstack/tripleo-upgrade/blob/4d189c08d4cbc67312ebb902e61f077713a9905e/templates/validate_docker_images_versions.sh.j2#L44 . See the grep -v part.
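
Roughly, the idea of that exclusion is the following (simplified; the exact pattern is in the linked script and may differ):

  # Simplified idea: drop the neutron side-car containers (the per-network
  # haproxy wrappers named neutron-haproxy-ovnmeta-<uuid>) before comparing the
  # remaining images against docker-images.yaml.
  sudo docker ps --format '{{.Names}} {{.Image}}' | grep -v 'neutron-haproxy-ovnmeta' | awk '{print $2}'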

If you need more information about sidecar containers, check with dfg:networking.

So closing this as not a bug.

Thanks,

Comment 9 Jakub Libosvar 2019-12-09 14:47:01 UTC
(In reply to Sofer Athlan-Guyot from comment #8)
> Hi,
> 
> this is a "side car" container.  Basically it's not defined in the template

openstack-neutron-metadata-agent-ovn is not a side-car, it's a legitimate containerized agent process that spawns other side-car containers. It is defined in the tripleo templates and it is managed by tripleo and the update process. Or at least it should be.

Comment 11 Sofer Athlan-Guyot 2019-12-18 01:17:35 UTC
Hi,

(In reply to Jakub Libosvar from comment #9)
> (In reply to Sofer Athlan-Guyot from comment #8)
> > Hi,
> > 
> > this is a "side car" container.  Basically it's not defined in the template
> 
> openstack-neutron-metadata-agent-ovn is not a side-car, it's a legitimate
> containerized agent process that spawns other side-car containers. It is
> defined in tripleo templates and it is managed by tripleo and update
> process. Or at least it should be.

Well, the issue here is "found a container whose image is not defined
in the docker_images.yaml".  That image is
openstack-neutron-metadata-agent-ovn.  

The thing is that there is no such container as
openstack-neutron-metadata-agent-ovn on the compute nodes.

On the compute, here are the containers that use that image [1]:

4eba7679920b        192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-27.1   "dumb-init --singl..."   39 minutes ago      Up 39 minutes                                      neutron-haproxy-ovnmeta-c52b6b94-5020-45b3-bb6f-02227e2f49d7   0 B (virtual 945 MB)
b0a3be5b84a7        192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-27.1   "dumb-init --singl..."   2 hours ago         Up 56 minutes healthy                            ovn_metadata_agent 

One is ovn_metadata_agent which is indeed defined in the templates, the
other is neutron-haproxy-ovnmeta-c52b6b94-5020-45b3-bb6f-02227e2f49d7
which is a sidecar container.

So the theory of a false positive linked to a side car container still
holds.

Note that I checked containers_allinfo on all computes in
DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/72/
and none of them had an issue with the defined images.

That must mean that the compute nodes were restarted and the sidecar
containers recreated (with the new image).

To avoid this kind of false positive we try to exclude from
validate_docker_images_versions.sh all the side car containers, but we
may have missed some.

In fact it has been disabled by default there
https://github.com/openstack/tripleo-upgrade/commit/72dd2c49e37db7ed014f297e3b7aad8dc22f60fe
as the script was not very well maintained and could lead
to false positives.  Until we get more resources/time to debug it fully
we will leave it disabled by default.

To re-enable it, overcloud_images_validate has to be set to true
again. Note that if you have access to the environment the script
validate_docker_images_versions.sh is still available.
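
For illustration, re-enabling it amounts to setting that variable back to true wherever you pass variables to the tripleo-upgrade role (the exact invocation depends on your tooling, e.g. infrared), and the script can also be run by hand from the undercloud the same way the update job runs it:

  # Illustrative only: the check is gated by the ansible variable
  # overcloud_images_validate (e.g. -e overcloud_images_validate=true);
  # the script itself can still be run manually from the undercloud:
  source /home/stack/stackrc
  bash /home/stack/validate_docker_images_versions.sh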

Now if the problem reproduces - in case you re-activate the
check - (the logs you mentioned where you had the issues are gone),
it's worth checking exactly which container runs with the wrong
image. If it's a sidecar one, it's not an issue.

Sorry if my first response was a bit rushed, I hope I've replied fully
now.

[1] this is from compute-1/var/log/extra/containers/containers_allinfo.log in https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/72/ ; I checked both computes and their tags are correct.
[2] feel free to help here, but the script is convoluted and needs a major overhaul

Thanks,

Comment 12 Jakub Libosvar 2020-01-03 09:33:46 UTC
(In reply to Sofer Athlan-Guyot from comment #11)

Thanks for the detailed explanation. Unfortunately my lack of knowledge of TripleO makes it harder to understand, so please bear with me :)

To simplify things, let's forget about neutron-haproxy-ovnmeta, the side-car container, and let's focus on ovn_metadata_agent, which is a legitimate Neutron agent. If I also forget about the validate_docker_images_versions.sh script and we don't execute it, then I still end up with an image in the docker registry that has not been updated after the update is finished, am I right?

That is a problem. We want to have the latest bits in all our Neutron services. That makes me think there must be something wrong either with the configuration or the processes that update images.

I'm trying to pinpoint where the problem lies; you're saying:
'Well, the issue here is "found a container whose image is not defined
in the docker_images.yaml".  That image is
openstack-neutron-metadata-agent-ovn. '

Who is responsible for defining docker_images.yaml? Is that also used to deploy the pre-upgrade OSP13? It may also be a bug in Infrared that it doesn't configure/update the image there.

Thank you for your help.

Comment 13 Sofer Athlan-Guyot 2020-01-07 10:40:17 UTC
Hi Jakub,

(In reply to Jakub Libosvar from comment #12)
> (In reply to Sofer Athlan-Guyot from comment #11)
> 
> Thanks for detailed explanation. Unfortunately my lack of knowledge of
> TripleO makes it harder to understand so please bear with me :)

np.
 
> To simplify the things, let's forget about neutron-haproxy-ovnmeta, the
> side-car container, and let's focus on ovn_metadata_agent which is a
> legitimate Neutron agent. If I also forget about the
> validate_docker_images_versions.sh script and we won't execute it then I
> still end up with an image in docker registry that has not been updated
> after update is finished, am I right?

Nope.  All the images have been updated, but not all containers are using
the latest.  Namely, the side car container still uses the previous image because
it's not managed by tht.  What has been updated is the script that neutron
uses to create those side car containers.  It now has the latest image defined
in it.  It means that the next time it restarts it will use the correct latest
image.
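
A quick way to confirm that on a compute node (the wrapper location under /var/lib/neutron is an assumption and may differ between releases):

  # Compare the image the running side-car still uses with the image referenced
  # in the wrapper script neutron uses to (re)create it; after the update the
  # wrapper should already point at the new tag.
  sudo docker inspect --format '{{.Config.Image}}' neutron-haproxy-ovnmeta-c52b6b94-5020-45b3-bb6f-02227e2f49d7
  # (the wrapper path below is an assumption)
  sudo grep -r 'openstack-neutron-metadata-agent-ovn' /var/lib/neutron/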

> 
> That is a problem. We want to have latest bits in all our Neutron services.
> That makes me think there must be something wrong either with the
> configuration or the processes that update images.
> 
> I'm trying to pinpoint where the problem lays, you're saying:
> 'Well, the issue here is "found a container whose image is not defined
> in the docker_images.yaml".  That image is
> openstack-neutron-metadata-agent-ovn. '
> 
> Who is responsible for defining docker_images.yaml? Is that also used to
> deploy the pre-upgrade OSP13? It may also be a bug in Infrared that it
> doesn't configure/update the image there.

So docker_images.yaml is updated by infrared, by just dumping the definition
of the container images in that file.  It has *only* the latest images.
The sidecar container is still running with the previous image, which is
not defined there. Hence the false positive returned by the validation script.


> 
> Thank you for your help.

yw, I hope it makes sense.

Comment 14 Jakub Libosvar 2020-01-16 09:34:25 UTC
(In reply to Sofer Athlan-Guyot from comment #13)
> Hi Jakub,
> 
> (In reply to Jakub Libosvar from comment #12)
> > (In reply to Sofer Athlan-Guyot from comment #11)
> > 
> > Thanks for detailed explanation. Unfortunately my lack of knowledge of
> > TripleO makes it harder to understand so please bear with me :)
> 
> np.
>  
> > To simplify the things, let's forget about neutron-haproxy-ovnmeta, the
> > side-car container, and let's focus on ovn_metadata_agent which is a
> > legitimate Neutron agent. If I also forget about the
> > validate_docker_images_versions.sh script and we won't execute it then I
> > still end up with an image in docker registry that has not been updated
> > after update is finished, am I right?
> 
> Nope.  All the images have been updated, but not all containers are using
> the latest.  Namely the side car container still uses the previous image
> because
> it's not managed by tht.  What has been updated is the script that neutron
> uses to create those side car containers.  It now has the latest image
> defined
> in it.  It means that next time it restarts it will use the correct latest
> image.
> 
> > 
> > That is a problem. We want to have latest bits in all our Neutron services.
> > That makes me think there must be something wrong either with the
> > configuration or the processes that update images.
> > 
> > I'm trying to pinpoint where the problem lays, you're saying:
> > 'Well, the issue here is "found a container whose image is not defined
> > in the docker_images.yaml".  That image is
> > openstack-neutron-metadata-agent-ovn. '
> > 
> > Who is responsible for defining docker_images.yaml? Is that also used to
> > deploy the pre-upgrade OSP13? It may also be a bug in Infrared that it
> > doesn't configure/update the image there.
> 
> So docker_images.yaml is updated by infrared, by just dumping the definition
> of the container image in that file.  It has *only* the latest images.
> The sidecar container is still running with the previous image which is is
> not defined there. Hence the false positive returned by the validation
> script.
> 
> 
> > 
> > Thank you for your help.
> 
> yw, I hope it makes sense.

Yes, thanks

To summarize, what I understand is:
1) The agent image has been updated correctly.
2) Side-cars are running with the old version because we don't restart them
   - but the validation script doesn't take side-cars into consideration when running, so the metadata-agent image is actually there twice
      ^ this still sounds like a bug to me; if the update finished fine, we shouldn't fail in validation

3) However, due to lack of resources, the validation script is not maintained.
   - is the validation script supported? Is it documented that customers must not use the validation script? It's going to fail 100% of the time if they have workloads, right?

Comment 15 Sofer Athlan-Guyot 2020-01-17 10:34:39 UTC
(In reply to Jakub Libosvar from comment #14)
> (In reply to Sofer Athlan-Guyot from comment #13)
> > Hi Jakub,
> > 
> > (In reply to Jakub Libosvar from comment #12)
> > > (In reply to Sofer Athlan-Guyot from comment #11)
> > > 
> > > Thanks for detailed explanation. Unfortunately my lack of knowledge of
> > > TripleO makes it harder to understand so please bear with me :)
> > 
> > np.
> >  
> > > To simplify the things, let's forget about neutron-haproxy-ovnmeta, the
> > > side-car container, and let's focus on ovn_metadata_agent which is a
> > > legitimate Neutron agent. If I also forget about the
> > > validate_docker_images_versions.sh script and we won't execute it then I
> > > still end up with an image in docker registry that has not been updated
> > > after update is finished, am I right?
> > 
> > Nope.  All the images have been updated, but not all containers are using
> > the latest.  Namely the side car container still uses the previous image
> > because
> > it's not managed by tht.  What has been updated is the script that neutron
> > uses to create those side car containers.  It now has the latest image
> > defined
> > in it.  It means that next time it restarts it will use the correct latest
> > image.
> > 
> > > 
> > > That is a problem. We want to have latest bits in all our Neutron services.
> > > That makes me think there must be something wrong either with the
> > > configuration or the processes that update images.
> > > 
> > > I'm trying to pinpoint where the problem lays, you're saying:
> > > 'Well, the issue here is "found a container whose image is not defined
> > > in the docker_images.yaml".  That image is
> > > openstack-neutron-metadata-agent-ovn. '
> > > 
> > > Who is responsible for defining docker_images.yaml? Is that also used to
> > > deploy the pre-upgrade OSP13? It may also be a bug in Infrared that it
> > > doesn't configure/update the image there.
> > 
> > So docker_images.yaml is updated by infrared, by just dumping the definition
> > of the container image in that file.  It has *only* the latest images.
> > The sidecar container is still running with the previous image which is is
> > not defined there. Hence the false positive returned by the validation
> > script.
> > 
> > 
> > > 
> > > Thank you for your help.
> > 
> > yw, I hope it makes sense.
> 
> Yes, thanks
> 
> To summarize what I understand is:
> 1) The agent image has been updated correctly.

Correct

> 2) side-cars are running with old version because we don't restart them
>    - but validation script doesn't take into consideration side-cars when
> running, so the metadata-agent image is there actually twice
>       ^ this still sounds like a bug to me, if update finished fine, we
> shouldn't fail in validation
> 
> 3) However, due to lack of resources, validation script is not maintained.
>    - is the validation script supported? is it documented that customers
> must not use the validation script? it's gonna fail 100% of time, if they
> have workloads, right?

Oh, I believe I get it now. That validation was *only* used in CI.  It's not
part of tripleo/OSP.  We have disabled it there[1]. It was useful at some
point to check that everything was fine in *CI*. But as I said, because of
lack of resources, we have disabled it until we can rework it altogether to
make it more robust, as we had too many false positives with it.

I hope it clarifies the situation totally. This is why I closed this promptly:
this was a false positive due to our own testing tooling. The tripleo-upgrade ansible
role validation is *not* used by customers, only in our own CI jobs.

[1] https://review.opendev.org/#/q/change:Ia5a767491c2f297b396c5cc937c1495e4267a4e3