Bug 1723558
| Summary: | Deleting the namespace under ci-operator prow job is not detected | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Test Infrastructure | Assignee: | Steve Kuznetsov <skuznets> |
| Status: | CLOSED UPSTREAM | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.5 | CC: | agarcial, aos-bugs, aos-cloud, calfonso, dgoodwin, evb, jhou, markmc |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1685225 | Environment: | |
| Last Closed: | 2020-04-22 14:38:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2019-06-24 19:43:57 UTC
We should just set up a watch on the namespace after we resolve it and bind to it, then wire that into the early-exit signal handler.

A different run returned a different output: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/716

```
2019/06/24 19:17:53 Running pod launch-aws
2019/06/24 19:50:26 Container setup in pod launch-aws completed successfully
2019/06/24 20:04:00 Container artifacts in pod launch-aws completed successfully
2019/06/24 20:04:00 Container teardown in pod launch-aws completed successfully
2019/06/24 20:04:00 Container test in pod launch-aws completed successfully
2019/06/24 20:04:00 Pod launch-aws succeeded after 46m7s
2019/06/24 20:04:00 error: unable to gather container logs: could not list pod: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot list pods in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to signal to artifacts container to terminate in pod launch-aws, triggering deletion: could not run remote command: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot create pods/exec in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to retrieve artifacts from pod launch-aws and the pod could not be deleted: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot delete pods in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to retrieve artifacts from pod launch-aws: could not read gzipped artifacts: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot create pods/exec in the namespace "ci-ln-s0j12v2": no RBAC policy matched
ERROR: logging before flag.Parse: E0624 20:04:05.839322      15 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15ab3b1259e2ed48", GenerateName:"", Namespace:"ci-ln-s0j12v2", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-ln-s0j12v2", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-launch-aws for PR https://github.com/openshift/cluster-monitoring-operator/pull/375 in namespace ci-ln-s0j12v2 from author system:serviceaccount:ci:ci-chat-bot", Source:v1.EventSource{Component:"ci-ln-s0j12v2", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3c690d70ba1b48, ext:3055465056210, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3c690d70ba1b48, ext:3055465056210, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events ".15ab3b1259e2ed48" is forbidden: unable to create new content in namespace ci-ln-s0j12v2 because it is being terminated' (will not retry!)
2019/06/24 20:04:06 Ran for 50m56s
error: could not run steps: step launch-aws failed: template pod "launch-aws" failed: could not create watcher for pod: unknown (get pods)
```

which is the expected behavior (although we didn't handle errors as cleanly as I would have liked).

A similar "unable to create new content in namespace" error; it is not clear whether it is the same issue as this bz: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/123

```
2019/06/26 09:30:51 Using namespace ci-op-w1xgkhb0
2019/06/26 09:30:51 Running [release-inputs], [release:latest], [release:initial], [images], e2e-aws-upgrade
2019/06/26 09:30:51 Creating namespace ci-op-w1xgkhb0
2019/06/26 09:30:51 Setting a soft TTL of 1h0m0s for the namespace
2019/06/26 09:30:51 Setting a hard TTL of 12h0m0s for the namespace
2019/06/26 09:30:51 Setting up pipeline imagestream for the test
2019/06/26 09:30:51 Created secret pull-secret
2019/06/26 09:30:51 Created secret e2e-aws-upgrade-cluster-profile
2019/06/26 09:30:51 Tagged shared images from ocp/4.1:${component}, images will be pullable from registry.svc.ci.openshift.org/ci-op-w1xgkhb0/stable:${component}
2019/06/26 09:30:55 Importing release image latest
2019/06/26 09:30:55 Importing release image initial
ERROR: logging before flag.Parse: E0626 09:35:33.353589      14 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15abb5eede5853a8", GenerateName:"", Namespace:"ci-op-w1xgkhb0", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-op-w1xgkhb0", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2 in namespace ci-op-w1xgkhb0", Source:v1.EventSource{Component:"ci-op-w1xgkhb0", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3cecf954eea1a8, ext:283248516305, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3cecf954eea1a8, ext:283248516305, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events ".15abb5eede5853a8" is forbidden: unable to create new content in namespace ci-op-w1xgkhb0 because it is being terminated' (will not retry!)
2019/06/26 09:35:34 Ran for 4m44s
error: could not run steps: some steps failed:
* step [release:latest] failed: unable to find the 'cli' image in the provided release image: the pod ci-op-w1xgkhb0/release-images-latest-cli failed after 3m30s (failed containers: ): ContainerFailed one or more containers exited
* step [release:initial] failed: unable to find the 'cli' image in the provided release image: the pod ci-op-w1xgkhb0/release-images-initial-cli failed after 3m37s (failed containers: ): ContainerFailed one or more containers exited
```

Yes, that's likely the same issue.

This continues to happen in very rare cases; I see five hits in the last two weeks. Making this low priority, as it only affects clusters that get preempted by some abort logic, and it is rare. We track this in https://issues.redhat.com/browse/DPTP-451