Bug 1723558 - Deleting the namespace under ci-operator prow job is not detected
Summary: Deleting the namespace under ci-operator prow job is not detected
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Infrastructure
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Steve Kuznetsov
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-24 19:43 UTC by Clayton Coleman
Modified: 2019-12-11 22:51 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1685225
Environment:
Last Closed:
Target Upstream Version:



Description Clayton Coleman 2019-06-24 19:43:57 UTC
When the namespace for a ci-operator-launched job was deleted (launch-aws / e2e-aws job), the ci-operator process did not detect it.  It's possible that the 404 errors from the apiserver during "wait for pod completion" are being ignored / eaten, or that we miss the watch event for the deletion and then wait on an empty watch forever.

Scenario:

Started job https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/713

The cluster was launched, and the test pod is in "sleep 7200":

oc logs -c test chat-bot-2019-06-24-170748.6302
2019/06/24 17:08:08 Resolved source https://github.com/openshift/cluster-version-operator to release-4.1@b67dd0d0, merging: #209 6ccfb36c @system:serviceaccount:ci:ci-chat-bot
2019/06/24 17:08:08 Resolved openshift/release:golang-1.10 to sha256:50381c1b8ad97d3fe715646fb4eb436ee420262f3e80bc0effe53ec77c5a48a8
2019/06/24 17:08:08 Resolved ocp/4.1:base to sha256:00a670f0ce28dfa5ddb5df74ca2b032e596c78ec8f8319cca48246b60727fc81
2019/06/24 17:08:08 Using namespace ci-ln-pczspwt
2019/06/24 17:08:08 Running [input:root], [input:base], [release-inputs], src, cluster-version-operator, [output:stable:cluster-version-operator], [images], [release:latest], launch-aws
2019/06/24 17:08:08 Creating namespace ci-ln-pczspwt
2019/06/24 17:08:08 Creating rolebinding for user system:serviceaccount:ci:ci-chat-bot in namespace ci-ln-pczspwt
2019/06/24 17:08:08 Setting a soft TTL of 1h0m0s for the namespace
2019/06/24 17:08:08 Setting a hard TTL of 12h0m0s for the namespace
2019/06/24 17:08:08 Setting up pipeline imagestream for the test
2019/06/24 17:08:08 Created secret pull-secret
2019/06/24 17:08:08 Created secret launch-aws-cluster-profile
2019/06/24 17:08:08 Tagging openshift/release:golang-1.10 into pipeline:root
2019/06/24 17:08:08 Tagging ocp/4.1:base into pipeline:base
2019/06/24 17:08:08 Tagged shared images from ocp/4.1:${component}, images will be pullable from registry.svc.ci.openshift.org/ci-ln-pczspwt/stable:${component}
2019/06/24 17:08:08 Building src
2019/06/24 17:08:32 Build src succeeded after 23s
2019/06/24 17:08:32 Building cluster-version-operator
2019/06/24 17:09:08 Build cluster-version-operator succeeded after 36s
2019/06/24 17:09:08 Tagging cluster-version-operator into stable
2019/06/24 17:09:08 Create release image registry.svc.ci.openshift.org/ci-ln-pczspwt/release:latest
2019/06/24 17:09:30 Copied 1.00Mi of artifacts from release-latest to /logs/artifacts/release-latest
2019/06/24 17:09:35 Executing template launch-aws
2019/06/24 17:09:35 Creating or restarting template instance
2019/06/24 17:09:35 Waiting for template instance to be ready
2019/06/24 17:09:37 Running pod launch-aws
2019/06/24 17:38:04 Container setup in pod launch-aws completed successfully

----

At around 17:50 we deleted the namespace.  The namespace was completely removed, but ci-operator didn't detect that from within the waitForPodCompletion loop.

At 18:51 I sent a SIGTERM to the process, which triggered the following:

----

{"component":"entrypoint","file":"prow/entrypoint/run.go:163","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","time":"2019-06-24T18:51:26Z"}
2019/06/24 18:51:26 error: Process interrupted with signal interrupt, exiting in 10s ...
2019/06/24 18:51:26 cleanup: Deleting release pod release-latest
2019/06/24 18:51:26 cleanup: Deleting template launch-aws
2019/06/24 18:51:26 error: Could not delete template instance: templateinstances.template.openshift.io "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot delete templateinstances.template.openshift.io in the namespace "ci-ln-pczspwt": no RBAC policy matched
2019/06/24 18:51:26 error: Could not delete release pod: pods "release-latest" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot delete pods in the namespace "ci-ln-pczspwt": no RBAC policy matched
ERROR: logging before flag.Parse: E0624 18:51:26.365480      14 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15ab371b5286a351", GenerateName:"", Namespace:"ci-ln-pczspwt", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-ln-pczspwt", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-launch-aws for PR https://github.com/openshift/cluster-version-operator/pull/209 in namespace ci-ln-pczspwt from author system:serviceaccount:ci:ci-chat-bot", Source:v1.EventSource{Component:"ci-ln-pczspwt", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3c64cb92035751, ext:6197948969139, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3c64cb92035751, ext:6197948969139, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'namespaces "ci-ln-pczspwt" not found' (will not retry!)

I expected the namespace deletion to cause the job to fail within several minutes (doesn't have to be immediate).

Comment 1 Steve Kuznetsov 2019-06-24 21:21:38 UTC
We should just set up a watch on the namespace after we resolve it and bind to it, then wire that into the early-exit signal handler.

Comment 3 Clayton Coleman 2019-06-24 21:44:53 UTC
A different run returned a different output:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/716

2019/06/24 19:17:53 Running pod launch-aws
2019/06/24 19:50:26 Container setup in pod launch-aws completed successfully
2019/06/24 20:04:00 Container artifacts in pod launch-aws completed successfully
2019/06/24 20:04:00 Container teardown in pod launch-aws completed successfully
2019/06/24 20:04:00 Container test in pod launch-aws completed successfully
2019/06/24 20:04:00 Pod launch-aws succeeded after 46m7s
2019/06/24 20:04:00 error: unable to gather container logs: could not list pod: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot list pods in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to signal to artifacts container to terminate in pod launch-aws, triggering deletion: could not run remote command: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot create pods/exec in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to retrieve artifacts from pod launch-aws and the pod could not be deleted: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot delete pods in the namespace "ci-ln-s0j12v2": no RBAC policy matched
2019/06/24 20:04:00 error: unable to retrieve artifacts from pod launch-aws: could not read gzipped artifacts: pods "launch-aws" is forbidden: User "system:serviceaccount:ci:ci-operator" cannot create pods/exec in the namespace "ci-ln-s0j12v2": no RBAC policy matched
ERROR: logging before flag.Parse: E0624 20:04:05.839322      15 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15ab3b1259e2ed48", GenerateName:"", Namespace:"ci-ln-s0j12v2", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-ln-s0j12v2", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-launch-aws for PR https://github.com/openshift/cluster-monitoring-operator/pull/375 in namespace ci-ln-s0j12v2 from author system:serviceaccount:ci:ci-chat-bot", Source:v1.EventSource{Component:"ci-ln-s0j12v2", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3c690d70ba1b48, ext:3055465056210, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3c690d70ba1b48, ext:3055465056210, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events ".15ab3b1259e2ed48" is forbidden: unable to create new content in namespace ci-ln-s0j12v2 because it is being terminated' (will not retry!)
2019/06/24 20:04:06 Ran for 50m56s
error: could not run steps: step launch-aws failed: template pod "launch-aws" failed: could not create watcher for pod: unknown (get pods)

This is the expected behavior (although we didn't handle the errors as cleanly as I would have liked).

Comment 4 Mark McLoughlin 2019-06-26 11:19:23 UTC
A similar "unable to create new content in namespace" error; it's not clear whether it's the same issue as this BZ:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/123

2019/06/26 09:30:51 Using namespace ci-op-w1xgkhb0
2019/06/26 09:30:51 Running [release-inputs], [release:latest], [release:initial], [images], e2e-aws-upgrade
2019/06/26 09:30:51 Creating namespace ci-op-w1xgkhb0
2019/06/26 09:30:51 Setting a soft TTL of 1h0m0s for the namespace
2019/06/26 09:30:51 Setting a hard TTL of 12h0m0s for the namespace
2019/06/26 09:30:51 Setting up pipeline imagestream for the test
2019/06/26 09:30:51 Created secret pull-secret
2019/06/26 09:30:51 Created secret e2e-aws-upgrade-cluster-profile
2019/06/26 09:30:51 Tagged shared images from ocp/4.1:${component}, images will be pullable from registry.svc.ci.openshift.org/ci-op-w1xgkhb0/stable:${component}
2019/06/26 09:30:55 Importing release image latest
2019/06/26 09:30:55 Importing release image initial
ERROR: logging before flag.Parse: E0626 09:35:33.353589      14 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:".15abb5eede5853a8", GenerateName:"", Namespace:"ci-op-w1xgkhb0", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"", Namespace:"ci-op-w1xgkhb0", Name:"", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"CiJobFailed", Message:"Running job release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2 in namespace ci-op-w1xgkhb0", Source:v1.EventSource{Component:"ci-op-w1xgkhb0", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf3cecf954eea1a8, ext:283248516305, loc:(*time.Location)(0x1e75ca0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf3cecf954eea1a8, ext:283248516305, loc:(*time.Location)(0x1e75ca0)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events ".15abb5eede5853a8" is forbidden: unable to create new content in namespace ci-op-w1xgkhb0 because it is being terminated' (will not retry!)
2019/06/26 09:35:34 Ran for 4m44s
error: could not run steps: some steps failed:
  * step [release:latest] failed: unable to find the 'cli' image in the provided release image: the pod ci-op-w1xgkhb0/release-images-latest-cli failed after 3m30s (failed containers: ): ContainerFailed one or more containers exited
  * step [release:initial] failed: unable to find the 'cli' image in the provided release image: the pod ci-op-w1xgkhb0/release-images-initial-cli failed after 3m37s (failed containers: ): ContainerFailed one or more containers exited

Comment 5 Steve Kuznetsov 2019-08-29 15:28:53 UTC
Yes. That's likely the same issue. This continues to happen in very rare cases. I see five hits in the last two weeks.

Comment 6 Steve Kuznetsov 2019-09-10 21:44:47 UTC
Making this low priority, as it only affects clusters that get preempted by some abort logic, and it is rare.

