Bug 1774911
Summary: | [OSP13][OVN][ovs2.11] openstack-neutron-metadata-agent-ovn container is running with old tag after minor update from latest Z-release (13z9) | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Roman Safronov <rsafrono>
Component: | openstack-tripleo-common | Assignee: | Sofer Athlan-Guyot <sathlang>
Status: | CLOSED NOTABUG | QA Contact: | Alexander Chuzhoy <sasha>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 13.0 (Queens) | CC: | apevec, jlibosva, lhh, majopela, mbultel, mburns, sathlang, scohen, slinaber
Target Milestone: | --- | Keywords: | Triaged
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-12-09 13:59:49 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1753533 | |
Description
Roman Safronov
2019-11-21 09:42:00 UTC
It seems that the running openstack-neutron-metadata-agent-ovn container had tag 2019-10-31.1 rather than the 2019-11-14.1 defined in /home/stack/composable_roles/docker-images.yaml:

    ...
    DockerOvnMetadataImage: 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-14.1
    ...

Note: during a minor update from an older release (2019-06-28.1) to the same puddle (2019-11-15.1) the issue did not happen:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/65/

I'll retest the minor update from z9 (2019-11-04.1) to 2019-11-15.1 (ovsFD19.G-respin1) once again and will update the BZ.

Retested the minor update from z9 (2019-11-04.1) to 2019-11-15.1 (ovsFD19.G-respin1).

Result: the minor update failed at the same place. When validating container images on the compute node, the expected image openstack-neutron-metadata-agent-ovn:2019-11-14.1 was not found; openstack-neutron-metadata-agent-ovn:2019-10-31.1 was found instead.

    # cat docker-images.yaml | grep meta
    DockerOvnMetadataImage: 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-14.1

Note: the validate_docker_images_versions.sh script retrieves all running container images (with tags) from all nodes and compares them to the ones listed in docker-images.yaml; a rough sketch of this kind of comparison is shown a little further below.

From validate_oc_images_containers.log:

    2019-11-26 16:10:40 | Validate docker images at host 192.168.24.17
    2019-11-26 16:10:40 | ================================================================================
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-ovn-controller:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-cron:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-iscsid:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-14.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-26 16:10:40 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-10-31.1 not present in /home/stack/composable_roles/docker-images.yaml <--- BUG

Link to the retested minor update job from the previous comment:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/68/
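For context on what the check above actually does: the exact implementation lives in the tripleo-upgrade role (linked in a later comment); the snippet below is only a rough sketch of the comparison it performs, assuming the docker CLI is available on each overcloud node and that docker-images.yaml has the format shown above. It is not the actual script.

```bash
#!/bin/bash
# Rough sketch (not the real tripleo-upgrade script): list the image:tag of
# every running container on this node and check whether each one appears in
# the docker-images.yaml used for the update.
IMAGES_FILE=/home/stack/composable_roles/docker-images.yaml

docker ps --format '{{.Image}}' | sort -u | while read -r image; do
    if grep -qF "$image" "$IMAGES_FILE"; then
        echo "Image $image present in $IMAGES_FILE"
    else
        echo "Image $image not present in $IMAGES_FILE    <-- mismatch"
    fi
done
```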
Retested the minor update on a new puddle, 2019-11-28.2 (i.e. z9 to 2019-11-28.2); same issue:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/69/

    2019-11-30 18:35:40 | Validate docker images at host 192.168.24.14
    2019-11-30 18:35:40 | ================================================================================
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-ovn-controller:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-compute:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-cron:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-ceilometer-compute:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-iscsid:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-nova-libvirt:2019-11-27.1 present in /home/stack/composable_roles/docker-images.yaml
    2019-11-30 18:35:41 | Image 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-10-31.1 not present in /home/stack/composable_roles/docker-images.yaml <-- BUG

A minor update from z8 passed without this issue:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/70/

Reassigned to DFG:Upgrades.

The problem is that, after updating, one of the running containers on the compute node has a wrong (old) tag. The problem happens only when updating from 13z9; when updating from 13z8 it does not occur.

Sofer Athlan-Guyot (comment #8)

Hi,

this is a "side car" container. Basically it's not defined in the template and not managed by update (or deployment) at all. Those containers are spawned by neutron using a shell script that has the right tag definition in it. So to get new images for those sidecar containers they need to be restarted (which causes disruption in the force). One way to do that is to make sure you've rebooted all your nodes after the update.

Here is what we are doing in our own image validation:
https://github.com/openstack/tripleo-upgrade/blob/4d189c08d4cbc67312ebb902e61f077713a9905e/templates/validate_docker_images_versions.sh.j2#L44
See the grep -v part (a sketch of that filter follows below). If you need more information about sidecar containers, check with dfg:networking.

So closing this as not a bug.

Thanks,
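The "grep -v part" refers to filtering the neutron-spawned sidecar containers out of the comparison before checking tags. A minimal sketch of the idea (illustrative only; the real filter is in the validate_docker_images_versions.sh.j2 template linked above):

```bash
# Sidecar containers such as neutron-haproxy-ovnmeta-<uuid> are created by
# neutron itself and keep whatever image they were started with until they are
# recreated, so they are excluded before the image/tag comparison.
docker ps --format '{{.Names}} {{.Image}}' \
    | grep -v 'neutron-haproxy-ovnmeta' \
    | awk '{print $2}' | sort -u
```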
Jakub Libosvar (comment #9)

(In reply to Sofer Athlan-Guyot from comment #8)
> this is a "side car" container. Basically it's not defined in the template

openstack-neutron-metadata-agent-ovn is not a side-car, it's a legitimate containerized agent process that spawns other side-car containers. It is defined in the tripleo templates and it is managed by tripleo and the update process. Or at least it should be.

Sofer Athlan-Guyot (comment #11)

Hi,

(In reply to Jakub Libosvar from comment #9)
> openstack-neutron-metadata-agent-ovn is not a side-car, it's a legitimate
> containerized agent process that spawns other side-car containers. It is
> defined in tripleo templates and it is managed by tripleo and update
> process. Or at least it should be.

Well, the issue here is "found a container whose image is not defined in the docker_images.yaml", and that image is openstack-neutron-metadata-agent-ovn. The thing is that there is no container named openstack-neutron-metadata-agent-ovn on the compute nodes. On a compute, these are the containers that use that image [1]:

    4eba7679920b  192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-27.1  "dumb-init --singl..."  39 minutes ago  Up 39 minutes            neutron-haproxy-ovnmeta-c52b6b94-5020-45b3-bb6f-02227e2f49d7  0 B (virtual 945 MB)
    b0a3be5b84a7  192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2019-11-27.1  "dumb-init --singl..."  2 hours ago     Up 56 minutes (healthy)  ovn_metadata_agent

One is ovn_metadata_agent, which is indeed defined in the templates; the other is neutron-haproxy-ovnmeta-c52b6b94-5020-45b3-bb6f-02227e2f49d7, which is a sidecar container. So the theory of a false positive linked to a sidecar container still holds.

Note that I checked the containers_allinfo of all computes in DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/72/ and none of them had an issue with the image defined. That must mean that the compute nodes were restarted and the sidecar containers recreated (with the new image).

To avoid this kind of false positive we try to exclude all the sidecar containers from validate_docker_images_versions.sh, but we may have missed some. In fact, it has been disabled by default there:
https://github.com/openstack/tripleo-upgrade/commit/72dd2c49e37db7ed014f297e3b7aad8dc22f60fe
because the script was not very well maintained and could lead to false positives. Until we get more resources/time to debug it fully, we will leave it disabled by default. To re-enable it, overcloud_images_validate has to be set to true again. Note that if you have access to the environment, the validate_docker_images_versions.sh script is still available [2].

Now, if the problem reproduces (in case you're re-activating the check; the logs you mentioned where you had the issue are gone), it's worth checking which container exactly runs with the wrong image. Unless it's not a sidecar one, it's not an issue.

Sorry if my first response was a bit rushed, I hope I've replied fully now.

[1] This is from compute-1/var/log/extra/containers/containers_allinfo.log in https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/72/ for the two computes, and their tags are correct.
[2] Feel free to help here, but the script is convoluted and needs a major overhaul.

Thanks,
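For anyone wanting to repeat the check behind footnote [1] directly on a compute node, a quick one-liner (assuming the docker CLI; the container names come from the containers_allinfo.log excerpt above) lists which containers use the metadata-agent image and with which tag:

```bash
# ovn_metadata_agent is the template-managed agent container; any
# neutron-haproxy-ovnmeta-* entry is a neutron-spawned sidecar that keeps its
# original image tag until it is recreated (e.g. after a node reboot).
docker ps --format 'table {{.Names}}\t{{.Image}}' \
    | grep 'neutron-metadata-agent-ovn'
```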
Jakub Libosvar (comment #12)

(In reply to Sofer Athlan-Guyot from comment #11)

Thanks for the detailed explanation. Unfortunately my lack of knowledge of TripleO makes it harder to understand, so please bear with me :)

To simplify things, let's forget about neutron-haproxy-ovnmeta, the side-car container, and focus on ovn_metadata_agent, which is a legitimate Neutron agent. If I also forget about the validate_docker_images_versions.sh script and we don't execute it, then I still end up with an image in the docker registry that has not been updated after the update is finished, am I right? That is a problem. We want to have the latest bits in all our Neutron services. That makes me think there must be something wrong either with the configuration or with the processes that update images.

I'm trying to pinpoint where the problem lies. You're saying: 'Well, the issue here is "found a container whose image is not defined in the docker_images.yaml". That image is openstack-neutron-metadata-agent-ovn.'

Who is responsible for defining docker_images.yaml? Is it also used to deploy the pre-upgrade OSP13? It may also be a bug in Infrared that it doesn't configure/update the image there.

Thank you for your help.

Sofer Athlan-Guyot (comment #13)

Hi Jakub,

(In reply to Jakub Libosvar from comment #12)
> If I also forget about the validate_docker_images_versions.sh script and we
> won't execute it then I still end up with an image in docker registry that
> has not been updated after update is finished, am I right?

Nope. All the images have been updated, but not all containers are using the latest one. Namely, the sidecar container still uses the previous image because it's not managed by tht. What has been updated is the script that neutron uses to create those sidecar containers; it now has the latest image defined in it. It means that the next time it restarts, it will use the correct latest image.

> Who is responsible for defining docker_images.yaml? Is that also used to
> deploy the pre-upgrade OSP13? It may also be a bug in Infrared that it
> doesn't configure/update the image there.

So docker_images.yaml is updated by infrared, by just dumping the definition of the container images into that file. It has *only* the latest images. The sidecar container is still running with the previous image, which is not defined there. Hence the false positive returned by the validation script.

yw, I hope it makes sense.
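To make "next time it restarts it will use the correct latest image" concrete: after the compute nodes have been rebooted (the restart that recreates the sidecars), a simple spot check that nothing is still running the pre-update tag could look like the sketch below, assuming the docker CLI and the tags from this report:

```bash
# 2019-10-31.1 is the old tag reported in this BZ; adjust for your own update.
OLD_TAG=2019-10-31.1
if docker ps --format '{{.Names}} {{.Image}}' | grep ":${OLD_TAG}"; then
    echo "some containers still run images tagged ${OLD_TAG}"
else
    echo "no running container uses tag ${OLD_TAG}"
fi
```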
Jakub Libosvar (comment #14)

(In reply to Sofer Athlan-Guyot from comment #13)

Yes, thanks.

To summarize, what I understand is:
1) The agent image has been updated correctly.
2) Side-cars are running with the old version because we don't restart them
   - but the validation script doesn't take side-cars into consideration when running, so the metadata-agent image is actually there twice
   ^ this still sounds like a bug to me; if the update finished fine, we shouldn't fail in validation
3) However, due to lack of resources, the validation script is not maintained.
   - is the validation script supported? Is it documented that customers must not use the validation script? It's gonna fail 100% of the time if they have workloads, right?
Sofer Athlan-Guyot

(In reply to Jakub Libosvar from comment #14)
> 1) The agent image has been updated correctly.

Correct.

> 2) Side-cars are running with the old version because we don't restart them
> - but the validation script doesn't take side-cars into consideration when
> running, so the metadata-agent image is actually there twice
> ^ this still sounds like a bug to me; if the update finished fine, we
> shouldn't fail in validation
>
> 3) However, due to lack of resources, the validation script is not maintained.
> - is the validation script supported? Is it documented that customers must
> not use the validation script? It's gonna fail 100% of the time if they
> have workloads, right?

Oh, I believe I get it now. That validation was *only* used in CI. It's not part of tripleo/OSP. We have disabled it there [1]. It was useful at some point to check that everything was fine in *CI*, but as I said, because of lack of resources we have disabled it until we can rework it altogether to make it more robust, as we had too many false positives with it.

I hope this clarifies the situation completely. This is why I closed this promptly: it was a false positive due to our own testing tooling. The tripleo-upgrade ansible role validation is *not* used by customers, only in our own CI jobs.

[1] https://review.opendev.org/#/q/change:Ia5a767491c2f297b396c5cc937c1495e4267a4e3
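One practical note on the variable mentioned in comment #11: the check is off by default in the tripleo-upgrade role after the change linked in [1], and re-enabling it means setting overcloud_images_validate back to true. How exactly that variable is passed depends on how the role is driven in a given CI job; purely as an illustration, with a plain ansible-playbook invocation (playbook name hypothetical) it would look like this:

```bash
# Hypothetical invocation: re-enable the CI-only image/tag validation by
# overriding the role variable named in comment #11.
ansible-playbook upgrade.yml -e overcloud_images_validate=true
```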