Bug 1910158

Summary: Create a multi-arch resource monitor to auto-detect and clean-up leaked clusters
Product: OpenShift Container Platform
Reporter: Jeremy Poulin <jpoulin>
Component: Multi-Arch
Assignee: Basavaraju <bgirriam>
Multi-Arch sub component: IBM P / Z
QA Contact: Deep Mistry <dmistry>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: low
Priority: low
CC: aos-bugs, clnperez, dslavens, mhamzy, rdossant, skuznets, wking
Version: 4.6
Keywords: TestOnly
Target Milestone: ---
Target Release: ---
Hardware: s390x
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1882785
Environment:
Last Closed: 2022-08-30 16:08:45 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1882785
Bug Blocks:

Description Jeremy Poulin 2020-12-22 21:45:38 UTC
See https://coreos.slack.com/archives/CBN38N3MW/p1608140054245400 for full discussion.

The basic idea is that boskos can drive a level-based workflow.

We need to add a 'dirty' state into which used leases go.
A level-driven controller (the level being a resource in the `dirty` state) runs periodically, scans for the leaked cluster, and attempts to clean it up.
If it succeeds, it transitions the lease into the 'free' state.
If it fails, the resource remains in the 'dirty' state, and boskos will trigger a new clean-up task on its next poll.
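
The workflow described above can be sketched in Go as follows. This is only a minimal in-memory illustration, not the boskos API: the `Lease`, `Cleaner`, `Release`, `Reconcile`, and `Demo` names are hypothetical, and the intermediate `cleaning` state that the janitor log later in this bug requests (`dest=cleaning`) is left out for brevity.

```go
package main

import "errors"

// State is the lifecycle state of a leased resource.
type State string

const (
	Free  State = "free"
	Busy  State = "busy"
	Dirty State = "dirty"
)

// Lease is a hypothetical stand-in for a boskos resource.
type Lease struct {
	Name  string
	State State
}

// Cleaner tears down whatever a job left behind on a lease.
type Cleaner func(name string) error

// Release models a job handing a lease back: it becomes dirty, not free.
func Release(l *Lease) { l.State = Dirty }

// Reconcile performs one level-driven pass. Every lease that is currently
// dirty is cleaned; success moves it to free, failure leaves it dirty so
// that the next pass retries it. It returns the names that are still dirty.
func Reconcile(leases []*Lease, clean Cleaner) []string {
	var stillDirty []string
	for _, l := range leases {
		if l.State != Dirty {
			continue
		}
		if err := clean(l.Name); err != nil {
			stillDirty = append(stillDirty, l.Name)
			continue
		}
		l.State = Free
	}
	return stillDirty
}

// Demo releases two leases and runs two passes with a cleaner that fails
// once for one of them, showing that a failed clean-up is retried.
func Demo() (first, second []string, leases []*Lease) {
	leases = []*Lease{
		{Name: "libvirt-ppc64le-0-0", State: Busy},
		{Name: "libvirt-ppc64le-0-1", State: Busy},
	}
	for _, l := range leases {
		Release(l)
	}
	attempts := map[string]int{}
	flaky := func(name string) error {
		attempts[name]++
		if name == "libvirt-ppc64le-0-1" && attempts[name] == 1 {
			return errors.New("simulated clean-up failure")
		}
		return nil
	}
	first = Reconcile(leases, flaky)
	second = Reconcile(leases, flaky)
	return first, second, leases
}
```

Because the controller acts on the current state rather than on a "lease released" event, a pass that crashes or fails loses nothing: the resource is still dirty and is picked up again on the next poll.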

Comment 1 Dan Li 2021-01-04 18:53:46 UTC
As part of bug triage, I'm changing the status to "Assigned" as I see that the bug is currently assigned to Deep.

Comment 2 Dan Li 2021-01-11 16:59:22 UTC
Hi Deep, do you think this bug will be resolved before the end of this Sprint (January 16th)? If not, can we add "UpcomingSprint"?

Comment 3 Dan Li 2021-02-01 14:58:37 UTC
Hi Deep, do you know if this bug will be resolved before the end of this sprint (Feb. 6th)? If not, can we set the "Reviewed-In-Sprint" flag to "+"?

Comment 4 Deep Mistry 2021-02-01 15:24:49 UTC
At the moment we require more input from the test platform team, so this bug will not be resolved in this sprint.

Comment 5 Dan Li 2021-02-22 18:17:38 UTC
Hi Deep, do you think this bug will be resolved by the end of this sprint (Feb 27th)? If not, can we set "Reviewed-in-Sprint"?

Comment 6 Dan Li 2021-03-15 17:59:11 UTC
Hi Deep, do you think this bug will be resolved by the end of this sprint (Mar 20th)? If not, can we set "Reviewed-in-Sprint"?

Comment 7 Dan Li 2021-04-05 18:57:39 UTC
Hi Deep, do you think this bug will be resolved by the end of this sprint (Apr 10th)? If not, can we set "Reviewed-in-Sprint"?

Comment 8 Dan Li 2021-04-26 19:50:10 UTC
Hi Deep, do you think this bug will be resolved by the end of this sprint (May 1st)? If not, can we set "Reviewed-in-Sprint"?

Comment 9 Deep Mistry 2021-06-29 14:07:17 UTC
Some progress has been made after the initial investigation.

@steve kuz Can you provide any info as to how we can test the controller locally?

cc @mhamzy

Comment 10 Deep Mistry 2021-06-29 14:31:07 UTC
cc @skuznets

Comment 12 Deep Mistry 2021-07-28 12:35:05 UTC
More discussion on the progress: https://coreos.slack.com/archives/CBN38N3MW/p1627399369095300

Comment 14 Dan Li 2021-09-20 18:25:06 UTC
Hi Deep, do you think this bug will be resolved before the end of the current sprint (Sep 24th)? If not, can we add "reviewed-in-sprint" flag?

Comment 15 Dan Li 2021-11-22 21:02:41 UTC
Hi Deep, do you think this bug will be resolved before the end of the current sprint (Nov 27th)? If not, can we set the "reviewed-in-sprint" flag?

Comment 16 Dan Li 2022-01-04 17:32:00 UTC
Hi Deep, do you think this bug will be resolved before the end of the current sprint (January 8th)? If not, can we set "reviewed-in-sprint"?

Comment 17 Dan Li 2022-01-06 13:17:31 UTC
Hi Deep, it was mentioned during backlog refinement that the assignee for this bug might change. Can we change the assignee to the person who is actually working on this bug?

Comment 18 Dan Li 2022-01-24 15:57:24 UTC
Hi Basava, do you think this bug would be resolved before the end of the current sprint (January 29th)? If not, can we set the "reviewed-in-Sprint" flag to indicate that we have looked at the bug?

Comment 19 Dan Li 2022-01-28 13:21:38 UTC
Adding reviewed-in-sprint, as it was mentioned during yesterday's sprint planning that Basava will continue to work on this bug.

Comment 20 Dan Li 2022-02-14 18:39:40 UTC
Hi Basava, do you think this bug would be resolved before the end of the current sprint (February 19th)? If not, can we set the "reviewed-in-Sprint" flag to indicate that we have looked at the bug and will continue to work on it?

Comment 21 Dan Li 2022-03-07 15:42:21 UTC
Basava indicated that he will continue to work on this in the next sprint. So setting the flag.

Comment 22 Dan Li 2022-03-28 17:31:52 UTC
Chatted with Basava: this bug will continue in the next sprint. Keeping the "reviewed-in-sprint+" label.

Comment 23 Dan Li 2022-04-18 14:10:01 UTC
Hi Basava, do you know if this bug will be resolved before the end of the current sprint (April 23rd)? If not, can we set the "reviewed-in-sprint" flag?

Comment 24 Dan Li 2022-04-20 11:10:54 UTC
Chatted with Basava and found out that this is in QA testing. Marking the status as ON_QA.

Comment 25 Dan Li 2022-05-11 12:15:06 UTC
Basava's latest results:

In recent testing, the janitor failed to delete the resources due to missing libvirt binaries:

{"component":"janitor","error":"Post \"http://boskos.test-pods.svc.cluster.local./acquire?dest=cleaning\u0026owner=Janitor\u0026state=dirty\u0026type=libvirt-ppc64le-quota-slice\": dial tcp: lookup boskos.test-pods.svc.cluster.local.: no such host","file":"/go/src/app/cmd/janitor/janitor.go:137","func":"main.run","level":"info","msg":"no available resource libvirt-ppc64le-quota-slice","severity":"info","time":"2022-05-06T07:09:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:146","func":"main.run","level":"info","msg":"Acquired resources libvirt-ppc64le-0-2 of type libvirt-ppc64le-quota-slice","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:101","func":"main.janitorClean","level":"info","msg":"executing janitor: /root/libvirt-ppc64le-janitor.sh --slice=libvirt-ppc64le-0-2 --hours=0","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:146","func":"main.run","level":"info","msg":"Acquired resources libvirt-ppc64le-0-0 of type libvirt-ppc64le-quota-slice","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:101","func":"main.janitorClean","level":"info","msg":"executing janitor: /root/libvirt-ppc64le-janitor.sh --slice=libvirt-ppc64le-0-0 --hours=0","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:146","func":"main.run","level":"info","msg":"Acquired resources libvirt-ppc64le-1-0 of type libvirt-ppc64le-quota-slice","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:101","func":"main.janitorClean","level":"info","msg":"executing janitor: /root/libvirt-ppc64le-janitor.sh --slice=libvirt-ppc64le-1-0 --hours=0","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:146","func":"main.run","level":"info","msg":"Acquired resources libvirt-ppc64le-0-1 of type libvirt-ppc64le-quota-slice","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","file":"/go/src/app/cmd/janitor/janitor.go:101","func":"main.janitorClean","level":"info","msg":"executing janitor: /root/libvirt-ppc64le-janitor.sh --slice=libvirt-ppc64le-0-1 --hours=0","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","error":"resources not found","file":"/go/src/app/cmd/janitor/janitor.go:137","func":"main.run","level":"info","msg":"no available resource libvirt-ppc64le-quota-slice","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","error":"exit status 127","file":"/go/src/app/cmd/janitor/janitor.go:105","func":"main.janitorClean","level":"info","msg":"failed to clean up project libvirt-ppc64le-0-1, error info: libvirtcli command not found, installing it.\nlibvirtcli: error while loading shared libraries: libvirt-lxc.so.0: cannot open shared object file: No such file or directory\n","severity":"info","time":"2022-05-06T07:10:55Z"}
{"component":"janitor","error":"exit status 1","file":"/go/src/app/cmd/janitor/janitor.go:105","func":"main.janitorClean","level":"info","msg":"failed to clean up project libvirt-ppc64le-0-0, error info: libvirtcli command not found, installing it.\nmv: cannot stat './libvirtcli': No such file or directory\n","severity":"info","time":"2022-05-06T07:10:56Z"}
{"component":"janitor","error":"exit status 1","file":"/go/src/app/cmd/janitor/janitor.go:105","func":"main.janitorClean","level":"info","msg":"failed to clean up project libvirt-ppc64le-0-2, error info: libvirtcli command not found, installing it.\nmv: cannot stat './libvirtcli': No such file or directory\n","severity":"info","time":"2022-05-06T07:10:56Z"}
{"component":"janitor","error":"exit status 127","file":"/go/src/app/cmd/janitor/janitor.go:105","func":"main.janitorClean","level":"info","msg":"failed to clean up project libvirt-ppc64le-1-0, error info: libvirtcli command not found, installing it.\nlibvirtcli: error while loading shared libraries: libvirt-lxc.so.0: cannot open shared object file: No such file or directory\n","severity":"info","time":"2022-05-06T07:10:56Z"}
Resources are marked dirty:

root@basavarg-boskos-testing:~/dev/test-infra/config/prow/cluster# kubectl get resources -n test-pods
NAME                  TYPE                          STATE   OWNER   LAST-UPDATED
libvirt-ppc64le-0-0   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-0-1   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-0-2   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-0-3   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-1-0   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-1-1   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-1-2   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-2-0   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-2-1   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-2-2   libvirt-ppc64le-quota-slice   dirty           3s
libvirt-ppc64le-2-3   libvirt-ppc64le-quota-slice   dirty           3s
Working on fixing the shared library issue.
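
To see at a glance which slices failed and why, the JSON janitor log lines earlier in this comment can be summarized with a small helper. The sketch below is an illustration only: `LogEntry`, `Failure`, and `CleanupFailures` are hypothetical names, and the parser assumes one JSON object per line with the `error` and `msg` fields seen in the log above. In that log, `exit status 127` accompanies the `libvirt-lxc.so.0` shared-library load error, while `exit status 1` accompanies the failed `mv` of `./libvirtcli`.

```go
package main

import (
	"encoding/json"
	"strings"
)

// LogEntry holds the fields of interest from one janitor log line.
type LogEntry struct {
	Error string `json:"error"`
	Msg   string `json:"msg"`
}

// Failure pairs a slice name with the error reported for its clean-up.
type Failure struct {
	Slice string
	Error string
}

// failPrefix is the message prefix used by the janitor in the log above.
const failPrefix = "failed to clean up project "

// CleanupFailures scans newline-separated log output and returns one
// Failure per "failed to clean up project" message. Lines that are not
// valid JSON objects (shell prompts, blank lines) are skipped.
func CleanupFailures(log string) []Failure {
	var out []Failure
	for _, line := range strings.Split(log, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		var e LogEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue
		}
		if !strings.HasPrefix(e.Msg, failPrefix) {
			continue
		}
		// The slice name runs from the prefix up to the first comma.
		rest := strings.TrimPrefix(e.Msg, failPrefix)
		name, _, _ := strings.Cut(rest, ",")
		out = append(out, Failure{Slice: name, Error: e.Error})
	}
	return out
}
```

Feeding the log above through `CleanupFailures` would yield one entry per failed slice, which makes it easy to separate the shared-library failures from the failed binary installs.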

 

The libvirt client is currently in my repo:

https://github.com/Basavaraju-G/janitor

Comment 29 Douglas Slavens 2022-08-30 16:07:57 UTC
Talked to Deep and Florian: this bug was verified and can be closed.