Description of problem:
This is kind of weird. When starting an introspection task, the introspection container name appears to be hard-coded. I ran introspection from 4.0 and from the 4.1 beta, and all three times the introspection pod was named manageiq-img-scan-919647e60d0a. Also, after introspection failed on 4.1 and the manageiq-img-scan-919647e60d0a container was not cleaned up, I could not start an introspection from 4.0, because it tried to create a container with the same name, "manageiq-img-scan-919647e60d0a".

Version-Release number of selected component (if applicable):
5.5.3.4.20160407153134_b3e2a83
5.6.0.5-beta2.4-nightly.20160505105820_07f43f8

How reproducible:

Steps to Reproduce:
1. Run introspection in 5.6 and let it fail.
2. Run introspection in 5.5; it will fail because the container name is the same.

Actual results:
Introspection fails because "pod creation for management-infra/manageiq-img-scan-919647e60d0a failed: HTTP status code 409, pods "manageiq-img-scan-919647e60d0a" already exists".

Expected results:
The pod name should be random (unique per scan) and introspection should work.

Additional info:
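For context, a minimal sketch of why a fixed pod name produces the 409 above. This assumes the kubeclient gem that ManageIQ uses to talk to OpenShift; the URL, token, and container image are placeholders, not values from the affected appliances. The second create_pod call with the same metadata.name in the same namespace is simply rejected by the API server.

  require 'kubeclient'

  # Placeholder connection details, not taken from this report.
  client = Kubeclient::Client.new(
    'https://openshift.example.com:8443/api', 'v1',
    auth_options: { bearer_token: ENV['OPENSHIFT_TOKEN'] }
  )

  pod = Kubeclient::Resource.new(
    metadata: { name: 'manageiq-img-scan-919647e60d0a', namespace: 'management-infra' },
    spec:     { containers: [{ name: 'image-inspector', image: 'openshift/image-inspector' }] }
  )

  client.create_pod(pod)  # first scan: the pod is created
  client.create_pod(pod)  # any later scan reusing the name: HTTP 409 "already exists"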
Mooli, I think a quick fix for this is to append a random number/string (or maybe the initial part of the task id?) at the end, to avoid Pod name collisions. I engineered the SSA container scan to support any name for the scanning Pod (it should be stored in the db with the relevant task), so the fixed name is just a convention to make it easy to identify the scanning Pods and the image each one is scanning.

ldomb, the workaround for now is to manually delete the Pod:

# oc delete pod -n management-infra manageiq-img-scan-919647e60d0a
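A minimal sketch of the naming suggestion above. This is a hypothetical helper, not the actual ManageIQ code; the method name scan_pod_name and the suffix length are assumptions. The idea is just to keep the "manageiq-img-scan-" convention while making each name unique per scan.

  require 'securerandom'

  # Hypothetical helper: keep the "manageiq-img-scan-" prefix, but append part
  # of the task id (or a random string when no task id is available) so two
  # scans never collide on the pod name. Pod names must be valid DNS-1123
  # labels, so the suffix is forced to lowercase alphanumerics.
  def scan_pod_name(task_id = nil)
    suffix = task_id ? task_id.to_s.delete('-').downcase[0, 12] : SecureRandom.hex(6)
    "manageiq-img-scan-#{suffix}"
  end

  scan_pod_name('3f9a1b2c4d5e0000')  # => "manageiq-img-scan-3f9a1b2c4d5e"
  scan_pod_name                      # => e.g. "manageiq-img-scan-a1b2c3d4e5f6"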
Suggested a fix upstream: https://github.com/ManageIQ/manageiq/pull/8747
Federico, I think it makes sense to name pods the way we do today. The reason is that if there is already one job scanning an image, this naming causes a second scan of that image to fail.
Dafna, Tony, Einat:
Do you have any update on what the original exceptions on your appliances were that caused the job to fail the first time?

It is important to understand whether there is a possible issue we are missing.
(In reply to Mooli Tayer from comment #6)
> Dafna, Tony, Einat:
> Do you have any update on what the original exceptions on your appliances
> were that caused the job to fail the first time?
>
> It is important to understand whether there is a possible issue we are missing.

Apparently this is the problem I ran into when I tried reproducing the bug on the fresh appliance. I made sure to delete all pods before executing the scan, yet both the single-image scan and the scheduled scan failed. Still, at least the initial scan should succeed, right?
Right. Any comments on why it initially failed?

Note it might be a timeout, which you should see with the usual grep you use.

[relevant bit from mailing list]
The first time image-inspector is scheduled on a node[1], there is sometimes a timeout for the job while the node is pulling the image, and further scans then fail due to that job's lingering pod.

[1] Usually this happens after we change image-inspector's version; today should provide an opportunity to test that.
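To illustrate the failure mode described above (illustrative only, not the actual ManageIQ job code; it reuses the placeholder client and pod objects from the sketch under the problem description, and the 20-minute cap is an assumption, not the real ManageIQ timeout): if the job waits for the scanning pod with a fixed deadline and the node is still pulling the image-inspector image when that deadline passes, the job errors out, nothing deletes the pod, and the next scan with the same fixed name then hits the 409.

  # Illustrative wait loop with a hard deadline; "client" and "pod" are the
  # placeholder objects from the earlier sketch.
  deadline = Time.now + 20 * 60  # assumed 20-minute cap

  client.create_pod(pod)
  loop do
    phase = client.get_pod(pod.metadata.name, pod.metadata.namespace).status.phase
    break if phase == 'Succeeded'
    # On the first scan on a node, the image-inspector image may still be
    # pulling here; once the deadline passes the job fails, but the pod is
    # never deleted, so it lingers and causes the name collision above.
    raise "timed out waiting for #{pod.metadata.name}" if Time.now > deadline
    sleep 10
  end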
(In reply to Mooli Tayer from comment #8)
> Right. Any comments on why it initially failed?
>
> Note it might be a timeout, which you should see with the usual grep you use.
>
> [relevant bit from mailing list]
> The first time image-inspector is scheduled on a node[1], there is sometimes
> a timeout for the job while the node is pulling the image, and further scans
> then fail due to that job's lingering pod.
>
> [1] Usually this happens after we change image-inspector's version; today
> should provide an opportunity to test that.

Working with Erez to figure out why it failed in the first place. I am not seeing a timeout so far, but I'll investigate further.
(In reply to Mooli Tayer from comment #5)
> Federico, I think it makes sense to name pods the way we do today.
> The reason is that if there is already one job scanning an image, this
> naming causes a second scan of that image to fail.

We are not building the solution to rely on coincidences and side effects. You are also aware that OpenShift environments are shared among multiple ManageIQ instances, and I don't want to keep spending time debugging failures that happened because two people were scanning the same image at the same time, or because a crash (or kill) of ManageIQ left a scanning pod around.
> We are not building the solution to rely on coincidences and side effects.

I thought that was the intention behind the pod naming. Your solution makes sense; I guess there is no immediate danger in allowing multiple concurrent scans of the same image.

https://github.com/ManageIQ/manageiq/pull/8819
The issue was that the manageiq-img-scan container was launched first from the 5.6 appliance. Because the image was non-existent, the scan failed but did not clean up the manageiq-img-scan container. When I then started the scan from the 5.5 appliance, its manageiq-img-scan container could not be created, because it used the EXACT same name as the one created by the 5.6 appliance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1348