Description of problem:
Installation using 4.2.0-0.nightly-2019-07-09-072333 is not finishing because the operator-lifecycle-manager-packageserver cluster operator fails to initialize: the "packageserver" service account is never created.

$ oc get co operator-lifecycle-manager-packageserver
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
operator-lifecycle-manager-packageserver             False       True          False      29m

$ oc get all -n openshift-operator-lifecycle-manager
NAME                                    READY   STATUS             RESTARTS   AGE
pod/catalog-operator-56df78d788-p9mps   1/1     Running            0          36m
pod/olm-operator-7f5d5c6b69-hzl9r       1/1     Running            0          36m
pod/olm-operators-nhzbn                 0/1     CrashLoopBackOff   10         30m

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/catalog-operator-metrics   ClusterIP   172.30.210.23    <none>        8081/TCP    36m
service/olm-operator-metrics       ClusterIP   172.30.27.111    <none>        8081/TCP    36m
service/olm-operators              ClusterIP   172.30.227.185   <none>        50051/TCP   30m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/catalog-operator   1/1     1            1           36m
deployment.apps/olm-operator       1/1     1            1           36m
deployment.apps/packageserver      0/2     0            0           35m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/catalog-operator-56df78d788   1         1         1       36m
replicaset.apps/olm-operator-7f5d5c6b69       1         1         1       36m
replicaset.apps/packageserver-7494f7c665      2         0         0       35m

$ oc describe replicaset.apps/packageserver-7494f7c665 -n openshift-operator-lifecycle-manager
Name:           packageserver-7494f7c665
Namespace:      openshift-operator-lifecycle-manager
Selector:       app=packageserver,pod-template-hash=7494f7c665
Labels:         app=packageserver
                pod-template-hash=7494f7c665
Annotations:    deployment.kubernetes.io/desired-replicas: 2
                deployment.kubernetes.io/max-replicas: 3
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/packageserver
Replicas:       0 current / 2 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=packageserver
                    pod-template-hash=7494f7c665
  Service Account:  packageserver
  Containers:
   packageserver:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3f452fc2d6e82c27668eb0e9a2e6654afa9a2b7c09247642d0bf630f85f3a2cd
    Port:       5443/TCP
    Host Port:  0/TCP
    Command:
      /bin/package-server
      -v=4
      --secure-port
      5443
      --global-namespace
      openshift-marketplace
    Liveness:     http-get https://:5443/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:5443/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type            Status  Reason
  ----            ------  ------
  ReplicaFailure  True    FailedCreate
Events:
  Type     Reason        Age                   From                   Message
  ----     ------        ----                  ----                   -------
  Warning  FailedCreate  35m (x14 over 36m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  30m (x16 over 32m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  29m (x15 over 30m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  28m (x13 over 28m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  27m (x13 over 27m)    replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found
  Warning  FailedCreate  5m53s (x21 over 27m)  replicaset-controller  Error creating: pods "packageserver-7494f7c665-" is forbidden: error looking up service account openshift-operator-lifecycle-manager/packageserver: serviceaccount "packageserver" not found

Version-Release number of the following components:
$ ./openshift-install version
./openshift-install v4.2.0-201907090219-dirty
built from commit 1cb12b5c5f6429a64306b1d7ec3374aa036ff150
release image registry.svc.ci.openshift.org/ocp/release@sha256:c4b310c59d7abc4799622b565c3a71b2a503a0af28c101fce7ce6a98b1ea1c79

Version: 4.2.0-0.nightly-2019-07-09-072333

How reproducible:

Steps to Reproduce:
1. Install a cluster with this nightly
2. Verify the cluster operator status

Actual results:
The operator-lifecycle-manager-packageserver cluster operator stays Available=False and the installation does not finish.

Expected results:
All cluster operators become available and the installation completes.
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.2/129/artifacts/e2e-aws/pods/openshift-operator-lifecycle-manager_olm-operators-mcnvg_configmap-registry-server.log contains the relevant logs that triggered this issue:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]

runtime stack:
runtime.throw(0x14e0e4a, 0x2a)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:608 +0x72
runtime.sigpanic()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/signal_unix.go:374 +0x2f2

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x115f0a0, 0xc00006de48, 0x1377b00)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/cgocall.go:128 +0x5e fp=0xc00006de18 sp=0xc00006dde0 pc=0x40baae
crypto/internal/boring._Cfunc__goboringcrypto_DLOPEN_OPENSSL(0x0)
	_cgo_gotypes.go:597 +0x4a fp=0xc00006de48 sp=0xc00006de18 pc=0x60687a
crypto/internal/boring.init.0()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/crypto/internal/boring/boring.go:37 +0x47 fp=0xc00006de70 sp=0xc00006de48 pc=0x60c607
crypto/internal/boring.init()
	<autogenerated>:1 +0x12a fp=0xc00006dea0 sp=0xc00006de70 pc=0x617b8a
crypto/ecdsa.init()
	<autogenerated>:1 +0x4b fp=0xc00006ded0 sp=0xc00006dea0 pc=0x63529b
crypto/tls.init()
	<autogenerated>:1 +0x55 fp=0xc00006df10 sp=0xc00006ded0 pc=0x69d335
google.golang.org/grpc/credentials.init()
	<autogenerated>:1 +0x50 fp=0xc00006df50 sp=0xc00006df10 pc=0x6e0790
google.golang.org/grpc.init()
	<autogenerated>:1 +0x64 fp=0xc00006df88 sp=0xc00006df50 pc=0x83cb64
main.init()
	<autogenerated>:1 +0x5c fp=0xc00006df98 sp=0xc00006df88 pc=0x10b953c
runtime.main()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/proc.go:189 +0x1bd fp=0xc00006dfe0 sp=0xc00006df98 pc=0x4350cd
runtime.goexit()
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00006dfe8 sp=0xc00006dfe0 pc=0x4604d1

The catalog pod failed because the image being used is bad. These logs are from openshift-operator-lifecycle-manager/olm-operators.

What happened:
1. ART swapped our 1.10 base image to 1.11 a week ago.
2. We started getting emails that our ART nightly for this image was failing (the code was placed in the GOPATH, but the build scripts are configured to use modules if 1.11 is detected).
3. We updated the image to 1.11: https://github.com/operator-framework/operator-registry/commit/dfc65a7287e07ec91b0bb260c6546421c7db5ca7
4. We got this bug report.

We now have https://github.com/operator-framework/operator-registry/pull/55 open to fix this issue.

We also have a related PR that would've prevented this issue from blocking install: https://github.com/operator-framework/operator-lifecycle-manager/pull/942
It would not have prevented the failed image build, but it would've meant that the issue wouldn't block OpenShift installations.
One thing I don't think I made quite clear is why this error manifested as "serviceaccount not found". Our manifests include both the Subscription and the CSV for packageserver. When the Subscription resolves, OLM creates the service account for the operator. Because the catalog pod was crashing, the Subscription never resolved, so the service account was never created.

An additional related PR changes the OCP config to use golang-1.12: https://github.com/openshift/ocp-build-data/pull/163
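For anyone hitting a similar symptom, here is a minimal diagnostic sketch of that chain. The pod name and the configmap-registry-server container name are taken from the output and log linked earlier in this bug; adjust them for your cluster:

# The catalog pod backing the olm-operators CatalogSource is crash-looping:
$ oc get pods -n openshift-operator-lifecycle-manager | grep olm-operators

# Its registry container shows the SIGSEGV from the log linked above:
$ oc logs olm-operators-nhzbn -c configmap-registry-server -n openshift-operator-lifecycle-manager

# Because the catalog never serves, the packageserver Subscription never resolves...
$ oc get subscriptions.operators.coreos.com -n openshift-operator-lifecycle-manager

# ...so the "packageserver" service account referenced by the Deployment is never created:
$ oc get serviceaccount packageserver -n openshift-operator-lifecycle-manager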
Has this been seen since https://gitlab.cee.redhat.com/openshift-art/ocp-build-data/commit/12798ecfba6c3103f5e124a7230ae22b8cd959eb merged?
Tracked down the issue. We were statically linking our Go binary, which worked fine in 1.10 but not in 1.11. The build still passed because the resulting problems showed up only as linker warnings. This PR: https://github.com/operator-framework/operator-registry/pull/56 should fix the issue and get the images working again.
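For context, a hedged sketch of the failure mode (the exact flags and package path below are assumptions for illustration, not taken from the operator-registry build scripts): with the OpenSSL-backed go-toolset, a fully static link only produces warnings at build time, and the missing dlopen support then surfaces as the SIGSEGV in crypto/internal/boring at process start.

# Hypothetical build invocation -- flags and path are illustrative only.
$ CGO_ENABLED=1 go build -ldflags '-extldflags "-static"' -o registry-server ./cmd/registry-server
# ld emits something like:
#   warning: Using 'dlopen' in statically linked applications requires at runtime
#   the shared libraries from the glibc version used for linking
# but still exits 0, so the image build "passes"; the crash only appears when the
# binary starts and crypto/internal/boring tries to dlopen OpenSSL.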
This build contains the fix: https://openshift-release.svc.ci.openshift.org/releasestream/4.2.0-0.ci/release/4.2.0-0.ci-2019-07-10-200538
*** Bug 1728324 has been marked as a duplicate of this bug. ***
*** Bug 1728639 has been marked as a duplicate of this bug. ***
Fixed on 4.2.0-0.nightly-2019-07-11-071248

$ oc get co | grep operator-lifecycle-manager-packageserver
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-07-11-071248   True   False   False   2m51s

$ oc describe co operator-lifecycle-manager-packageserver
Name:         operator-lifecycle-manager-packageserver
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-07-11T10:42:46Z
  Generation:          1
  Resource Version:    11390
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/operator-lifecycle-manager-packageserver
  UID:                 9f08f4f6-a3c8-11e9-b2f3-fa163e6c0a2b
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-07-11T10:42:46Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-07-11T10:54:53Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-07-11T10:54:53Z
    Message:               Deployed version 0.10.1
    Status:                False
    Type:                  Progressing
  Extension:  <nil>
  Related Objects:
    Group:
    Name:       openshift-operator-lifecycle-manager
    Resource:   namespaces
    Group:      operators.coreos.com
    Name:       packageserver.v0.10.1
    Namespace:  openshift-operator-lifecycle-manager
    Resource:   ClusterServiceVersion
  Versions:
    Name:     operator
    Version:  4.2.0-0.nightly-2019-07-11-071248
    Name:     packageserver.v0.10.1
    Version:  0.10.1
Events:  <none>

$ oc get all -n openshift-operator-lifecycle-manager
NAME                                   READY   STATUS    RESTARTS   AGE
pod/catalog-operator-895469bbf-4wlvd   1/1     Running   0          20m
pod/olm-operator-85c49dd55c-27drr      1/1     Running   0          20m
pod/olm-operators-6js48                1/1     Running   0          15m
pod/packageserver-5bbbb9869d-t6gkp     1/1     Running   0          5m22s
pod/packageserver-5bbbb9869d-zzjs6     1/1     Running   0          5m15s

NAME                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/catalog-operator-metrics            ClusterIP   172.30.175.111   <none>        8081/TCP    20m
service/olm-operator-metrics                ClusterIP   172.30.70.142    <none>        8081/TCP    20m
service/olm-operators                       ClusterIP   172.30.119.46    <none>        50051/TCP   15m
service/v1-packages-operators-coreos-com    ClusterIP   172.30.177.152   <none>        443/TCP     5m24s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/catalog-operator   1/1     1            1           20m
deployment.apps/olm-operator       1/1     1            1           20m
deployment.apps/packageserver      2/2     2            2           12m

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/catalog-operator-895469bbf   1         1         1       20m
replicaset.apps/olm-operator-85c49dd55c      1         1         1       20m
replicaset.apps/packageserver-5bbbb9869d     2         2         2       5m24s
replicaset.apps/packageserver-6b44d5f7b4     0         0         0       12m
replicaset.apps/packageserver-6d8f748f9c     0         0         0       5m25s
replicaset.apps/packageserver-77c87749cf     0         0         0       12m
Verified on 4.2.0-0.nightly-2019-07-11-071248
We're seeing this on a 4.1.11 cluster. We are currently trying to upgrade to 4.1.13, but the upgrade is failing with exactly the error above. The cluster is following the stable-4.1 channel.
What would the correct process be to fix this and correctly update a 4.1.11 cluster?
The issue on 4.1 is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1743061 (it's a different issue with a similar symptom). The issue described above is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922