Bug 1333924 - Leftover pods should not interfere with future scans on the same image
Summary: Leftover pods should not interfere with future scans on the same image
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: SmartState Analysis
Version: 5.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.6.0
Assignee: Mooli Tayer
QA Contact: Tony
URL:
Whiteboard: container:smartstate
Depends On:
Blocks:
 
Reported: 2016-05-06 16:46 UTC by ldomb
Modified: 2016-10-13 14:34 UTC
CC List: 11 users

Fixed In Version: 5.6.0.8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-29 16:00:09 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments


Links
System ID: Red Hat Product Errata RHBA-2016:1348
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: CFME 5.6.0 bug fixes and enhancement update
Last Updated: 2016-06-29 18:50:04 UTC

Description ldomb 2016-05-06 16:46:06 UTC
Description of problem:

This is kind of weird. When starting an introspection task, the scanning container's name appears to be hard-coded. I ran introspection from 4.0 and from the 4.1 beta, and all three times the name was

manageiq-img-scan-919647e60d0a

Also, when an introspection failed on 4.1 and the container manageiq-img-scan-919647e60d0a was not cleaned up, I could not start an introspection from 4.0, because it tried to create a container with the exact same name, "manageiq-img-scan-919647e60d0a".

Version-Release number of selected component (if applicable):



5.5.3.4.20160407153134_b3e2a83
5.6.0.5-beta2.4-nightly.20160505105820_07f43f8


How reproducible:


Steps to Reproduce:
1. Run introspection in 5.6 and let it fail.
2. Run introspection in 5.5; it will fail because the container name is the same.

Actual results:
introspection fails because "pod creation for management-infra/manageiq-img-scan-919647e60d0a failed: HTTP status code 409, pods "manageiq-img-scan-919647e60d0a" already exists "

Expected results:
The pod name should be unique (e.g. randomized), so that introspection works.

Additional info:

Comment 3 Federico Simoncelli 2016-05-11 23:07:26 UTC
Mooli, I think a quick fix for this is to append a random number/string (or maybe the initial part of the task id?) at the end, to avoid pod name collisions.

I engineered the SSA container scan to support any name for the scanning pod (it should be stored in the db with the relevant task).
So the current name is just a convention that makes it easy to identify the scanning pods and the image each one is scanning.
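
For illustration, a minimal Ruby sketch of that suffixing idea (the helper name and suffix length here are assumptions for the example, not the actual ManageIQ change):

  require 'securerandom'

  # Illustrative only: build the scanning pod name from the image id plus a
  # short suffix taken from the task id (or generated randomly), so two scans
  # of the same image never collide.
  def scan_pod_name(image_id, task_id = nil)
    suffix = task_id ? task_id.delete('-')[0, 8] : SecureRandom.hex(4)
    "manageiq-img-scan-#{image_id[0, 12]}-#{suffix}"
  end

  scan_pod_name('919647e60d0a')                  # e.g. "manageiq-img-scan-919647e60d0a-a1b2c3d4"
  scan_pod_name('919647e60d0a', 'task-42-ab12')  # suffix derived from the task id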

ldomb, the workaround for now is to manually delete the pod:

 # oc delete pod -n management-infra manageiq-img-scan-919647e60d0a

Comment 4 Mooli Tayer 2016-05-16 21:44:43 UTC
Suggested a fix upstream:
https://github.com/ManageIQ/manageiq/pull/8747

Comment 5 Mooli Tayer 2016-05-16 21:55:36 UTC
Federico, I think it makes sense to keep naming pods the way we do today.
The reason is that if one job is already scanning an image, this naming causes a second scan of that image to fail instead of running concurrently.

Comment 6 Mooli Tayer 2016-05-17 08:51:28 UTC
Dafna, Tony, Einat:
Do you have any update on what the original exceptions on your appliances were that caused the job to fail in the first place?

It is important to understand if there is a possible issue we are missing.

Comment 7 Tony 2016-05-19 11:47:11 UTC
(In reply to Mooli Tayer from comment #6)
> Dafna, Tony, Einat:
> Do you have any update on what the original exceptions on your
> appliances were that caused the job to fail in the first place?
> 
> It is important to understand if there is a possible issue we are missing.

Apparently this is the problem I ran into when I tried reproducing the bug on the fresh appliance. I made sure to delete all pods before executing the scan, yet both the single image scan and the scheduled scan failed. Still, at least the initial scan should succeed, right?

Comment 8 Mooli Tayer 2016-05-19 11:52:44 UTC
Right. Any comment on why it initially failed?

Note it might be a timeout, which you should see with the usual grep you use.

[relevant bit from the mailing list]
The first time image-inspector is scheduled on a node[1], the job
sometimes times out while the node is still pulling the image, and
further scans then fail because of that job's lingering pod.

[1] This usually happens after we change image-inspector's version;
today should provide an opportunity to test that.

Comment 9 Tony 2016-05-19 12:08:47 UTC
(In reply to Mooli Tayer from comment #8)
> Right. Any comment on why it initially failed?
> 
> Note it might be a timeout, which you should see with the usual grep you use.
> 
> [relevant bit from the mailing list]
> The first time image-inspector is scheduled on a node[1], the job
> sometimes times out while the node is still pulling the image, and
> further scans then fail because of that job's lingering pod.
> 
> [1] This usually happens after we change image-inspector's version;
> today should provide an opportunity to test that.

Working with Erez to figure out why it failed in the first place. I am not seeing a timeout so far, but I'll investigate further.

Comment 10 Federico Simoncelli 2016-05-19 12:57:18 UTC
(In reply to Mooli Tayer from comment #5)
> Federico I think it makes sense to name pods like we do today.
> The reason is that if there is one job currently scanning an image this
> naming causes it to fail.

We are not building the solution to rely on coincidences and side-effects.

You are also aware that OpenShift environments are shared among multiple ManageIQ instances, and I don't want to keep spending time debugging failures that happened because two people were scanning the same image at the same time, or because a crash (or kill) of ManageIQ left a scanning pod around.

Comment 11 Mooli Tayer 2016-05-19 15:20:07 UTC
> We are not building the solution to rely on coincidences and side-effects.

I thought that was the intention behind the pod naming.

Your solution makes sense. I guess there is no immediate danger in allowing multiple concurrent scans of the same image.

https://github.com/ManageIQ/manageiq/pull/8819

Comment 12 ldomb 2016-06-06 11:34:47 UTC
The issue was that the manageiq-img-scan container was launched first on the 5.6 appliance. As the image was non-existent, the scan failed but did not clean up the manageiq-img-scan container. When I then started the scan from the 5.5 appliance, the manageiq-img-scan container could not be created, because it was given the EXACT same name as the one from the 5.6 appliance.

Comment 14 errata-xmlrpc 2016-06-29 16:00:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348

