Bug 2003795

Summary: When deleting storageSystem cluster-cleanup-job failed to create
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Shay Rozen <srozen>
Component: rook Assignee: Subham Rai <srai>
Status: CLOSED CURRENTRELEASE QA Contact: Neha Berry <nberry>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.9 CC: jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot, sostapov, srai, tnielsen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-06-01 14:33:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Shay Rozen 2021-09-13 18:03:55 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After deleting storageSystem there are the following errors:
openshift-storage                      24s         Warning   FailedCreate                        job/cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.internal                Error creating: Pod "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-rlpxk" is invalid: [metadata.generateName: Invalid value: "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), metadata.name: Invalid value: "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-rlpxk": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]

I believe the problem is the pod name cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-rlpxk, specifically the "--" right after the period.
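
This hypothesis can be checked against the validation regex quoted in the event. The following is a minimal Python sketch, assuming the regex is applied to the whole name. The helper name is_rfc1123_subdomain is made up for illustration, and length limits are not checked:

```python
import re

# Regex copied verbatim from the FailedCreate event message.
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_rfc1123_subdomain(name: str) -> bool:
    """Return True if the whole name matches the quoted regex (hypothetical helper)."""
    return RFC1123_SUBDOMAIN.fullmatch(name) is not None


# Names copied from the event.
job_name = "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.internal"
generate_name = "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-"
pod_name = generate_name + "rlpxk"

for name in (job_name, generate_name, pod_name):
    print(f"{is_rfc1123_subdomain(name)!s:5} {name}")
```

The job name passes, while the generateName and the pod name fail because the label after the last '.' begins with '-'.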



Version of all relevant components (if applicable):
odf 4.9.132-ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
I don't know what the impact is besides the error.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP 4.9 and ODF 4.9
2. Create storageSystem
3. Delete storageSystem
4. Run oc get events -n openshift-storage --sort-by='.metadata.creationTimestamp' | grep cleanup

Actual results:
All cluster-cleanup-job pods fail to start with this error:
2m53s       Warning   FailedCreate                      job/cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.internal                Error creating: Pod "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-r574v" is invalid: [metadata.generateName: Invalid value: "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), metadata.name: Invalid value: "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-r574v": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]

The problem is probably due to the pod name, which contains ".--".


Expected results:
The cluster-cleanup-job pods should start.

Additional info:

Comment 3 Mudit Agarwal 2021-11-09 13:40:18 UTC
Not a 4.9 blocker, moving it out. 

Nitin, PTAL once you have some BW

Comment 4 Nitin Goyal 2021-11-09 14:02:47 UTC
As far as I know, cleanup jobs are handled by rook, not ocs-operator. I will confirm with Jose and work on it or move it accordingly.

Comment 5 Jose A. Rivera 2022-01-18 16:43:57 UTC
This is still not a real blocker, so pushing it out of ODF 4.10.0.

That said, I don't want to leave it hanging, so I'll try to poke at it again and hopefully update with some answers.

Comment 6 Jose A. Rivera 2022-05-31 14:28:25 UTC
Took a quick glance, and yes it looks like rook is the one generating these jobs. Moving accordingly.

Comment 7 Subham Rai 2022-05-31 15:24:38 UTC
I tested with a regex generator and yes, it seems the error is due to `--` after `.`

Rook generates the job name in the format `cluster-cleanup-job-<node-name>`. Can you confirm whether the name of the node is `ip-10-0-188-204.us-east-2.compute.--1-rlpxk`?
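
Comparing the job name with the rejected generateName from the event suggests the answer is no. A small Python sketch of that comparison follows. The strings are copied from the event. The conclusion that the job name was cut short before a suffix was appended is an inference from the strings, not something the event states:

```python
import os

# Strings copied from the FailedCreate event in the description.
job_name = "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.internal"
generate_name = "cluster-cleanup-job-ip-10-0-188-204.us-east-2.compute.--1-"

# Node name implied by the cluster-cleanup-job-<node-name> format.
node_name = job_name[len("cluster-cleanup-job-"):]

# Longest shared prefix of the two names, and what follows it in each.
shared = os.path.commonprefix([job_name, generate_name])
suffix = generate_name[len(shared):]

print("node name:     ", node_name)
print("shared prefix: ", shared, f"({len(shared)} chars)")
print("appended:      ", suffix)
print("dropped:       ", job_name[len(shared):])
```

The shared prefix ends in '.', so the appended '--1-' starts a new DNS label with '-', which the validation rejects. The trailing 'rlpxk' is the random suffix added to the generateName to form the pod name, not part of the node name.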

Comment 8 Travis Nielsen 2022-05-31 17:07:43 UTC
Subham, please look at the TruncateNodeNameForJob() method called here: https://github.com/rook/rook/blob/4ea8cc6224efb0e9c18ffb8a39a6955f32d79a60/pkg/operator/ceph/cluster/cleanup.go#L75
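
One general way to keep such names valid is to avoid embedding a long node name verbatim and to substitute a short hash when the result would be too long. The Python sketch below illustrates that idea only. It is not Rook's TruncateNodeNameForJob(), and the function name, the 52-character budget, and the use of a truncated SHA-256 digest are all assumptions made for the example:

```python
import hashlib

PREFIX = "cluster-cleanup-job-"


def short_job_name(node_name: str, max_len: int = 52) -> str:
    """Hypothetical helper: build a job name no longer than max_len.

    If the full name fits, the node name is used as-is. Otherwise the node
    name is replaced by 32 hex characters of its SHA-256 digest, which
    contain only [0-9a-f] and so cannot end in '.' or '-'.
    """
    name = PREFIX + node_name
    if len(name) <= max_len:
        return name
    digest = hashlib.sha256(node_name.encode("utf-8")).hexdigest()[:32]
    return PREFIX + digest


print(short_job_name("worker-1"))
print(short_job_name("ip-10-0-188-204.us-east-2.compute.internal"))
```

Because the hashed form is shorter and purely alphanumeric after the prefix, any later truncation or suffixing of the job name is less likely to land on a '.' boundary.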

Comment 9 Subham Rai 2022-06-01 04:34:49 UTC
It seems we already have a fix for this, but it is not present in 4.9. We need to backport it to 4.9: https://github.com/rook/rook/pull/9312
Travis to confirm. Thanks

Comment 10 Travis Nielsen 2022-06-01 14:33:21 UTC
The fix Subham mentioned is included in 4.10 and newer, but not 4.9. Given how old and low priority this BZ is, I'm going to assume it's not critical to backport to 4.9 and we can close it as fixed in 4.10 and newer.