When the namespace for a ci-operator launched job is deleted (launch-aws / e2e-aws job), the ci-operator process doesn't detect it. It's possible that the 404 errors from the apiserver during "wait for pod completion" are being ignored / eaten, or that we miss the watch event for the deletion and then wait forever on an empty watch.

Scenario: started job https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/713. The cluster was launched, and the test pod was in "sleep 7200".

oc logs -c test chat-bot-2019-06-24-170748.6302
2019/06/24 17:08:08 Resolved source https://github.com/openshift/cluster-version-operator to release-4.1@b67dd0d0, merging: #209 6ccfb36c @system:serviceaccount:ci:ci-chat-bot
2019/06/24 17:08:08 Resolved openshift/release:golang-1.10 to sha256:50381c1b8ad97d3fe715646fb4eb436ee420262f3e80bc0effe53ec77c5a48a8
2019/06/24 17:08:08 Resolved ocp/4.1:base to sha256:00a670f0ce28dfa5ddb5df74ca2b032e596c78ec8f8319cca48246b60727fc81
2019/06/24 17:08:08 Using namespace ci-ln-pczspwt
2019/06/24 17:08:08 Running [input:root], [input:base], [release-inputs], src, cluster-version-operator, [output:stable:cluster-version-operator], [images], [release:latest], launch-aws
2019/06/24 17:08:08 Creating namespace ci-ln-pczspwt
2019/06/24 17:08:08 Creating rolebinding for user system:serviceaccount:ci:ci-chat-bot in namespace ci-ln-pczspwt
2019/06/24 17:08:08 Setting a soft TTL of 1h0m0s for the namespace
2019/06/24 17:08:08 Setting a hard TTL of 12h0m0s for the namespace
2019/06/24 17:08:08 Setting up pipeline imagestream for the test
2019/06/24 17:08:08 Created secret pull-secret
2019/06/24 17:08:08 Created secret launch-aws-cluster-profile
2019/06/24 17:08:08 Tagging openshift/release:golang-1.10 into pipeline:root
2019/06/24 17:08:08 Tagging ocp/4.1:base into pipeline:base
2019/06/24 17:08:08 Tagged shared images from ocp/4.1:${component}, images will be pullable from registry.svc.ci.openshift.org/ci-ln-pczspwt/stable:${component}
2019/06/24 17:08:08 Building src
2019/06/24 17:08:32 Build src succeeded after 23s
2019/06/24 17:08:32 Building cluster-version-operator
2019/06/24 17:09:08 Build cluster-version-operator succeeded after 36s
2019/06/24 17:09:08 Tagging cluster-version-operator into stable
2019/06/24 17:09:08 Create release image registry.svc.ci.openshift.org/ci-ln-pczspwt/release:latest
2019/06/24 17:09:30 Copied 1.00Mi of artifacts from release-latest to /logs/artifacts/release-latest
2019/06/24 17:09:35 Executing template launch-aws
2019/06/24 17:09:35 Creating or restarting template instance
2019/06/24 17:09:35 Waiting for template instance to be ready
2019/06/24 17:09:37 Running pod launch-aws
2019/06/24 17:38:04 Container setup in pod launch-aws completed successfully

----

At around 17:50 we deleted the namespace. The namespace was completely removed, but ci-operator didn't detect that from within the waitForPodCompletion loop. At 18:51 I sent a SIGTERM to the process, which triggers:

----

{"component":"entrypoint","file":"prow/entrypoint/run.go:163","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","time":"2019-06-24T18:51:26Z"}
2019/06/24 18:51:26 error: Process interrupted with signal interrupt, exiting in 10s ...
2019/06/24 18:51:26 cleanup: Deleting release pod release-latest
2019/06/24 18:51:26 cleanup: Deleting template launch-aws
2019/06/24 18:51:26 error: Could not delete template instance: templateinstances.template.openshift.io "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot delete templateinstances.template.openshift.io in the namespace "ci-ln-pczspwt": no RBAC policy matched
2019/06/24 18:51:26 error: Could not delete release pod: pods "release-latest" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot delete pods in the namespace "ci-ln-pczspwt": no RBAC policy matched
ERROR: logging before flag.Parse: E0624 18:51:26.365480 14 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15ab371b5286a351", GenerateName:"", Namespace:"ci-ln-pczspwt", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-ln-pczspwt", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-launch-aws for PR https://github.com/openshift/cluster-version-operator/pull/209 in namespace ci-ln-pczspwt from author system:serviceaccount:ci:ci-chat-bot", Source:v1.EventSource{Component:"ci-ln-pczspwt", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3c64cb92035751, ext:6197948969139, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3c64cb92035751, ext:6197948969139, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "ci-ln-pczspwt" not found' (will not retry!)

I expected the namespace deletion to cause the job to fail within several minutes (it doesn't have to be immediate).
We should just set up a watch on the namespace after we resolve it and bind to it, then wire that into the early-exit signal handler.
A different run returned a different output: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/716

2019/06/24 19:17:53 Running pod launch-aws
2019/06/24 19:50:26 Container setup in pod launch-aws completed successfully
2019/06/24 20:04:00 Container artifacts in pod launch-aws completed successfully
2019/06/24 20:04:00 Container teardown in pod launch-aws completed successfully
2019/06/24 20:04:00 Container test in pod launch-aws completed successfully
2019/06/24 20:04:00 Pod launch-aws succeeded after 46m7s
2019/06/24 20:04:00 error: unable to gather container logs: could not list pod: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot list pods in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to signal to artifacts container to terminate in pod launch-aws, triggering deletion: could not run remote command: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot create pods/exec in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to retrieve artifacts from pod launch-aws and the pod could not be deleted: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot delete pods in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to retrieve artifacts from pod launch-aws: could not read gzipped artifacts: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot create pods/exec in the namespace "ci-ln-s0j12v2": no RBAC policy matched
ERROR: logging before flag.Parse: E0624 20:04:05.839322 15 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15ab3b1259e2ed48", GenerateName:"", Namespace:"ci-ln-s0j12v2", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-ln-s0j12v2", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-launch-aws for PR https://github.com/openshift/cluster-monitoring-operator/pull/375 in namespace ci-ln-s0j12v2 from author system:serviceaccount:ci:ci-chat-bot", Source:v1.EventSource{Component:"ci-ln-s0j12v2", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3c690d70ba1b48, ext:3055465056210, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3c690d70ba1b48, ext:3055465056210, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events ".15ab3b1259e2ed48" is forbidden: unable to create new content in namespace ci-ln-s0j12v2 because it is being terminated' (will not retry!)
2019/06/24 20:04:06 Ran for 50m56s
error: could not run steps: step launch-aws failed: template pod "launch-aws" failed: could not create watcher for pod: unknown (get pods)

This is the expected behavior (although we didn't handle errors as cleanly as I would have liked).
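For handling those teardown errors more cleanly, one option is to classify apiserver errors that merely mean the namespace is already gone or terminating and report them as expected rather than as hard errors. Substring matching below is a simplification (real code would use the typed helpers in k8s.io/apimachinery), and treating "no RBAC policy matched" as namespace-gone is an assumption based on the rolebindings being deleted along with the namespace.

```go
package main

import (
	"fmt"
	"strings"
)

// namespaceGone reports whether an apiserver error message indicates the
// test namespace is deleted or terminating, so cleanup failures can be
// downgraded to expected noise. Substring checks are a hedged stand-in
// for typed error checks like apierrors.IsNotFound.
func namespaceGone(msg string) bool {
	return strings.Contains(msg, "not found") ||
		strings.Contains(msg, "because it is being terminated") ||
		strings.Contains(msg, "no RBAC policy matched") // assumption: rolebindings die with the namespace
}

func main() {
	errs := []string{
		`namespaces "ci-ln-pczspwt" not found`,
		`unable to create new content in namespace ci-ln-s0j12v2 because it is being terminated`,
		`could not pull image: back-off pulling image`,
	}
	for _, e := range errs {
		fmt.Printf("expected-during-teardown=%v: %s\n", namespaceGone(e), e)
	}
}
```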
Similar "unable to create new content in namespace" error; it's not clear whether it's the same issue as this BZ: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/123

2019/06/26 09:30:51 Using namespace ci-op-w1xgkhb0
2019/06/26 09:30:51 Running [release-inputs], [release:latest], [release:initial], [images], e2e-aws-upgrade
2019/06/26 09:30:51 Creating namespace ci-op-w1xgkhb0
2019/06/26 09:30:51 Setting a soft TTL of 1h0m0s for the namespace
2019/06/26 09:30:51 Setting a hard TTL of 12h0m0s for the namespace
2019/06/26 09:30:51 Setting up pipeline imagestream for the test
2019/06/26 09:30:51 Created secret pull-secret
2019/06/26 09:30:51 Created secret e2e-aws-upgrade-cluster-profile
2019/06/26 09:30:51 Tagged shared images from ocp/4.1:${component}, images will be pullable from registry.svc.ci.openshift.org/ci-op-w1xgkhb0/stable:${component}
2019/06/26 09:30:55 Importing release image latest
2019/06/26 09:30:55 Importing release image initial
ERROR: logging before flag.Parse: E0626 09:35:33.353589 14 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15abb5eede5853a8", GenerateName:"", Namespace:"ci-op-w1xgkhb0", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-op-w1xgkhb0", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2 in namespace ci-op-w1xgkhb0", Source:v1.EventSource{Component:"ci-op-w1xgkhb0", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3cecf954eea1a8, ext:283248516305, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3cecf954eea1a8, ext:283248516305, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events ".15abb5eede5853a8" is forbidden: unable to create new content in namespace ci-op-w1xgkhb0 because it is being terminated' (will not retry!)
2019/06/26 09:35:34 Ran for 4m44s
error: could not run steps: some steps failed:
* step [release:latest] failed: unable to find the 'cli' image in the provided release image: the pod ci-op-w1xgkhb0/release-images-latest-cli failed after 3m30s (failed containers: ): ContainerFailed one or more containers exited
* step [release:initial] failed: unable to find the 'cli' image in the provided release image: the pod ci-op-w1xgkhb0/release-images-initial-cli failed after 3m37s (failed containers: ): ContainerFailed one or more containers exited
Yes. That's likely the same issue. This continues to happen in very rare cases. I see five hits in the last two weeks.
Making this low priority, as it only affects clusters that get preempted by some abort logic, and it is rare.
We track this in https://issues.redhat.com/browse/DPTP-451