Description of problem:
Installation using 4.2.0-0.nightly-2019-07-09-072333 is not finishing because the operator-lifecycle-manager-packageserver cluster operator fails to initialize: the "packageserver" service account is never created.

$ oc get co operator-lifecycle-manager-packageserver
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
operator-lifecycle-manager-packageserver             False       True          False      29m

$ oc get all -n openshift-operator-lifecycle-manager
NAME                                    READY   STATUS             RESTARTS   AGE
pod/catalog-operator-56df78d788-p9mps   1/1     Running            0          36m
pod/olm-operator-7f5d5c6b69-hzl9r       1/1     Running            0          36m
pod/olm-operators-nhzbn                 0/1     CrashLoopBackOff   10         30m

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/catalog-operator-metrics   ClusterIP   172.30.210.23    <none>        8081/TCP    36m
service/olm-operator-metrics       ClusterIP   172.30.27.111    <none>        8081/TCP    36m
service/olm-operators              ClusterIP   172.30.227.185   <none>        50051/TCP   30m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/catalog-operator   1/1     1            1           36m
deployment.apps/olm-operator       1/1     1            1           36m
deployment.apps/packageserver      0/2     0            0           35m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/catalog-operator-56df78d788   1         1         1       36m
replicaset.apps/olm-operator-7f5d5c6b69       1         1         1       36m
replicaset.apps/packageserver-7494f7c665      2         0         0       35m

$ oc describe replicaset.apps/packageserver-7494f7c665 -n openshift-operator-lifecycle-manager
Name:           packageserver-7494f7c665
Namespace:      openshift-operator-lifecycle-manager
Selector:       app=packageserver,pod-template-hash=7494f7c665
Labels:         app=packageserver
                pod-template-hash=7494f7c665
Annotations:    deployment.kubernetes.io/desired-replicas: 2
                deployment.kubernetes.io/max-replicas: 3
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/packageserver
Replicas:       0 current / 2 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=packageserver
                    pod-template-hash=7494f7c665
  Service Account:  packageserver
  Containers:
   packageserver:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3f452fc2d6e82c27668eb0e9a2e6654afa9a2b7c09247642d0bf630f85f3a2cd
    Port:       5443/TCP
    Host Port:  0/TCP
    Command:
      /bin/package-server
      -v=4
      --secure-port
      5443
      --global-namespace
      openshift-marketplace
    Liveness:     http-get https://:5443/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:5443/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type            Status  Reason
  ----            ------  ------
  ReplicaFailure  True    FailedCreate
Events:
  Type     Reason        Age                   From                   Message
  ----     ------        ----                  ----                   -------
  Warning  FailedCreate  35m (x14 over 36m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  30m (x16 over 32m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  29m (x15 over 30m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  28m (x13 over 28m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  27m (x13 over 27m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  5m53s (x21 over 27m)  replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found

Version-Release number of the following components:
$ ./openshift-install version
./openshift-install v4.2.0-201907090219-dirty
built from commit 1cb12b5c5f6429a64306b1d7ec3374aa036ff150
release image registry.svc.ci.openshift.org/ocp/release@sha256:c4b310c59d7abc4799622b565c3a71b2a503a0af28c101fce7ce6a98b1ea1c79

Version: 4.2.0-0.nightly-2019-07-09-072333

How reproducible:

Steps to Reproduce:
1. Install a cluster with this nightly
2. Verify the cluster operator status

Actual results:
The operator-lifecycle-manager-packageserver cluster operator stays Available=False and the installation does not finish.

Expected results:
All cluster operators become available and the installation completes.
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.2/129/artifacts/e2e-aws/pods/openshift-operator-lifecycle-manager_olm-operators-mcnvg_configmap-registry-server.log contains the relevant logs that triggered this issue:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]

runtime stack:
runtime.throw(0x14e0e4a, 0x2a)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:608 +0x72
runtime.sigpanic()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/signal_unix.go:374 +0x2f2

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x115f0a0, 0xc00006de48, 0x1377b00)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/cgocall.go:128 +0x5e fp=0xc00006de18 sp=0xc00006dde0 pc=0x40baae
crypto/internal/boring._Cfunc__goboringcrypto_DLOPEN_OPENSSL(0x0)
	_cgo_gotypes.go:597 +0x4a fp=0xc00006de48 sp=0xc00006de18 pc=0x60687a
crypto/internal/boring.init.0()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/crypto/internal/boring/boring.go:37 +0x47 fp=0xc00006de70 sp=0xc00006de48 pc=0x60c607
crypto/internal/boring.init()
	<autogenerated>:1 +0x12a fp=0xc00006dea0 sp=0xc00006de70 pc=0x617b8a
crypto/ecdsa.init()
	<autogenerated>:1 +0x4b fp=0xc00006ded0 sp=0xc00006dea0 pc=0x63529b
crypto/tls.init()
	<autogenerated>:1 +0x55 fp=0xc00006df10 sp=0xc00006ded0 pc=0x69d335
google.golang.org/grpc/credentials.init()
	<autogenerated>:1 +0x50 fp=0xc00006df50 sp=0xc00006df10 pc=0x6e0790
google.golang.org/grpc.init()
	<autogenerated>:1 +0x64 fp=0xc00006df88 sp=0xc00006df50 pc=0x83cb64
main.init()
	<autogenerated>:1 +0x5c fp=0xc00006df98 sp=0xc00006df88 pc=0x10b953c
runtime.main()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/proc.go:189 +0x1bd fp=0xc00006dfe0 sp=0xc00006df98 pc=0x4350cd
runtime.goexit()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x4604d1

The catalog pod failed because the image being used is bad. These logs are from openshift-operator-lifecycle-manager/olm-operators.

What happened:
1. ART swapped our 1.10 base image to 1.11 a week ago.
2. We started getting emails that our ART nightly for this image was failing (the code was placed in the GOPATH, but the build scripts are configured to use modules if 1.11 is detected).
3. We updated the image to 1.11: https://github.com/operator-framework/operator-registry/commit/dfc65a7287e07ec91b0bb260c6546421c7db5ca7
4. We got this bug report.

We now have https://github.com/operator-framework/operator-registry/pull/55 open to fix this issue.

We also have a related PR that would've prevented this issue from blocking install: https://github.com/operator-framework/operator-lifecycle-manager/pull/942
It would not have prevented the failed image build, but it would've meant that the issue wouldn't block OpenShift installations.
One thing I don't think I made quite clear is why this error manifested as "serviceaccount not found". Our manifests include both the Subscription and the CSV for packageserver. When the Subscription resolves, OLM creates the service account for the operator. Because the catalog pod was crashing, the Subscription never resolved, so the service account was never created.

An additional related PR changes the OCP config to use golang-1.12: https://github.com/openshift/ocp-build-data/pull/163
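For anyone hitting a similar symptom, here is a minimal diagnostic sketch of that chain. The pod name and the configmap-registry-server container name are taken from the output and log linked earlier in this bug; adjust them for your cluster:

# The catalog pod backing the olm-operators CatalogSource is crash-looping:
$ oc get pods -n openshift-operator-lifecycle-manager | grep olm-operators

# Its registry container shows the SIGSEGV from the log linked above:
$ oc logs olm-operators-nhzbn -c configmap-registry-server -n openshift-operator-lifecycle-manager

# Because the catalog never serves, the packageserver Subscription never resolves...
$ oc get subscriptions.operators.coreos.com -n openshift-operator-lifecycle-manager

# ...so the "packageserver" service account referenced by the Deployment is never created:
$ oc get serviceaccount packageserver -n openshift-operator-lifecycle-manager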
Has this been seen since https://gitlab.cee.redhat.com/openshift-art/ocp-build-data/commit/12798ecfba6c3103f5e124a7230ae22b8cd959eb merged?
Tracked down the issue. We were statically linking our Go binary, which worked fine in 1.10 but not in 1.11. The build still passed because the resulting problems showed up only as linker warnings. This PR: https://github.com/operator-framework/operator-registry/pull/56 should fix the issue and get the images working again.
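For context, a hedged sketch of the failure mode (the exact flags and package path below are assumptions for illustration, not taken from the operator-registry build scripts): with the OpenSSL-backed go-toolset, a fully static link only produces warnings at build time, and the missing dlopen support then surfaces as the SIGSEGV in crypto/internal/boring at process start.

# Hypothetical build invocation -- flags and path are illustrative only.
$ CGO_ENABLED=1 go build -ldflags '-extldflags "-static"' -o registry-server ./cmd/registry-server
# ld emits something like:
#   warning: Using 'dlopen' in statically linked applications requires at runtime
#   the shared libraries from the glibc version used for linking
# but still exits 0, so the image build "passes"; the crash only appears when the
# binary starts and crypto/internal/boring tries to dlopen OpenSSL.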
This build contains the fix: https://openshift-release.svc.ci.openshift.org/releasestream/4.2.0-0.ci/release/4.2.0-0.ci-2019-07-10-200538
*** Bug 1728324 has been marked as a duplicate of this bug. ***
*** Bug 1728639 has been marked as a duplicate of this bug. ***
Fixed on 4.2.0-0.nightly-2019-07-11-071248

$ oc get co | grep operator-lifecycle-manager-packageserver
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-07-11-071248   True   False   False   2m51s

$ oc describe co operator-lifecycle-manager-packageserver
Name:         operator-lifecycle-manager-packageserver
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-07-11T10:42:46Z
  Generation:          1
  Resource Version:    11390
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/operator-lifecycle-manager-packageserver
  UID:                 9f08f4f6-a3c8-11e9-b2f3-fa163e6c0a2b
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-07-11T10:42:46Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-07-11T10:54:53Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-07-11T10:54:53Z
    Message:               Deployed version 0.10.1
    Status:                False
    Type:                  Progressing
  Extension:  <nil>
  Related Objects:
    Group:
    Name:       openshift-operator-lifecycle-manager
    Resource:   namespaces
    Group:      operators.coreos.com
    Name:       packageserver.v0.10.1
    Namespace:  openshift-operator-lifecycle-manager
    Resource:   ClusterServiceVersion
  Versions:
    Name:     operator
    Version:  4.2.0-0.nightly-2019-07-11-071248
    Name:     packageserver.v0.10.1
    Version:  0.10.1
Events:  <none>

$ oc get all -n openshift-operator-lifecycle-manager
NAME                                   READY   STATUS    RESTARTS   AGE
pod/catalog-operator-895469bbf-4wlvd   1/1     Running   0          20m
pod/olm-operator-85c49dd55c-27drr      1/1     Running   0          20m
pod/olm-operators-6js48                1/1     Running   0          15m
pod/packageserver-5bbbb9869d-t6gkp     1/1     Running   0          5m22s
pod/packageserver-5bbbb9869d-zzjs6     1/1     Running   0          5m15s

NAME                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/catalog-operator-metrics            ClusterIP   172.30.175.111   <none>        8081/TCP    20m
service/olm-operator-metrics                ClusterIP   172.30.70.142    <none>        8081/TCP    20m
service/olm-operators                       ClusterIP   172.30.119.46    <none>        50051/TCP   15m
service/v1-packages-operators-coreos-com    ClusterIP   172.30.177.152   <none>        443/TCP     5m24s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/catalog-operator   1/1     1            1           20m
deployment.apps/olm-operator       1/1     1            1           20m
deployment.apps/packageserver      2/2     2            2           12m

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/catalog-operator-895469bbf   1         1         1       20m
replicaset.apps/olm-operator-85c49dd55c      1         1         1       20m
replicaset.apps/packageserver-5bbbb9869d     2         2         2       5m24s
replicaset.apps/packageserver-6b44d5f7b4     0         0         0       12m
replicaset.apps/packageserver-6d8f748f9c     0         0         0       5m25s
replicaset.apps/packageserver-77c87749cf     0         0         0       12m
Verified on 4.2.0-0.nightly-2019-07-11-071248
We're seeing this on a 4.1.11 cluster. We are currently trying to upgrade to 4.1.13, but the upgrade is failing with exactly the error above. The cluster is following the stable-4.1 channel.
What would the correct process be to fix this and correctly update a 4.1.11 cluster?
The issue on 4.1 is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1743061 (it's a different issue with a similar symptom). The issue described above is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922