Bug 2101357

Summary: catalog-operator fatal error: concurrent map writes
Product: OpenShift Container Platform
Reporter: Jian Zhang <jiazha>
Component: OLM
Assignee: Per da Silva <pegoncal>
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: ableisch, abraj, agreene, gakendre, npaez, pegoncal, sburhade
Version: 4.11   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An internal issue caused the reconciler to update a resource's annotations without first making a copy of the resource, which triggered a fatal concurrent map write error.
Consequence: The reconciler process would get killed.
Fix: Ensured that all resource changes are made to a copy of the underlying resource, so the concurrent map write error no longer occurs.
Result: The reconciler is no longer killed by this error.
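The fix described above follows the standard copy-before-mutate pattern for objects that come from a shared cache. The sketch below is illustrative only (the `Pod` type, field names, and annotation keys are stand-ins, not the actual OLM code): writing into the shared object's annotation map can race with concurrent readers and crash the process with "fatal error: concurrent map writes", while writing to a deep copy is safe.

```go
package main

import "fmt"

// Pod stands in for a cached Kubernetes object; the real fix operates on
// objects returned by client-go informer caches (names here are hypothetical).
type Pod struct {
	Annotations map[string]string
}

// DeepCopy returns a copy whose annotation map does not alias the original,
// so writes to the copy cannot race with goroutines reading the cached object.
func (p *Pod) DeepCopy() *Pod {
	out := &Pod{Annotations: make(map[string]string, len(p.Annotations))}
	for k, v := range p.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// cached models an object shared by many goroutines via an informer cache.
var cached = &Pod{Annotations: map[string]string{"example/key": "v1"}}

func main() {
	// Buggy pattern (what this bug describes):
	//   cached.Annotations["new"] = "value"
	// mutates the shared map in place and can be fatal under concurrency.

	// Fixed pattern: mutate a private copy, then submit that copy.
	updated := cached.DeepCopy()
	updated.Annotations["new"] = "value"

	fmt.Println(len(cached.Annotations), len(updated.Annotations)) // 1 2
}
```

The original object stays untouched (one annotation), and only the copy carries the update, which is what makes the subsequent API update safe.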
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-17 19:50:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2117324    

Description Jian Zhang 2022-06-27 09:50:10 UTC
Description of problem:

        message: "her.go:105 +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
          +0x130\n\ngoroutine 5545 [select, 7 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc00240a780,
          0xc003321a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
          +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0xc002efea80?,
          0xc0009cc7a0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
          +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
          +0x30a\n\ngoroutine 5678 [select, 3 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc000b70480,
          0xc0035d4500)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
          +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc002999e90?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
          +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
          +0x30a\n\ngoroutine 5836 [select, 1 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc003b00180,
          0xc003ff8a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
          +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc003a1c8d0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
          +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
          +0x30a\n\ngoroutine 5905 [chan receive, 1 minutes]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel.func1()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:51
          +0x85\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:50
          +0x231\n"
        reason: Error
        startedAt: "2022-06-27T00:00:59Z"


Version-Release number of selected component (if applicable):
mac:~ jianzhang$ oc exec catalog-operator-66cb8fd8c5-j7vkx -- olm --version
OLM version: 0.19.0
git commit: 8c2bd46147a90d58e98de73d34fd79477769f11f
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         10h     Cluster version is 4.11.0-0.nightly-2022-06-25-081133

How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.11
2. Check OLM pods


Actual results:

mac:~ jianzhang$ oc get pods
NAME                                      READY   STATUS      RESTARTS     AGE
catalog-operator-66cb8fd8c5-j7vkx         1/1     Running     2 (8h ago)   10h
collect-profiles-27605340-wgsvf           0/1     Completed   0            42m
collect-profiles-27605355-ffgxd           0/1     Completed   0            27m
collect-profiles-27605370-w7ds7           0/1     Completed   0            12m
olm-operator-6cfd444b8f-r5q4t             1/1     Running     0            10h
package-server-manager-66589d4bf8-csr7j   1/1     Running     0            10h
packageserver-59977db6cf-nkn5w            1/1     Running     0            10h
packageserver-59977db6cf-nxbnx            1/1     Running     0            10h


mac:~ jianzhang$ oc get pods catalog-operator-66cb8fd8c5-j7vkx  -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.26"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.26"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: nonroot-v2
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  creationTimestamp: "2022-06-26T23:12:45Z"
  generateName: catalog-operator-66cb8fd8c5-
  labels:
    app: catalog-operator
    pod-template-hash: 66cb8fd8c5
  name: catalog-operator-66cb8fd8c5-j7vkx
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: catalog-operator-66cb8fd8c5
    uid: bcf173be-97bc-4152-8cec-d45f820e167c
  resourceVersion: "67395"
  uid: bb34c37b-b22d-4412-bbc0-1e57a1b2bd3a
spec:
  containers:
  - args:
    - --namespace
    - openshift-marketplace
    - --configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:033bd57b54bb2e827b0ae95a03cb6f7ceeb65422e85de635185cfed88c17b2b4
    - --opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:033bd57b54bb2e827b0ae95a03cb6f7ceeb65422e85de635185cfed88c17b2b4
    - --util-image
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
    - --writeStatusName
    - operator-lifecycle-manager-catalog
    - --tls-cert
    - /srv-cert/tls.crt
    - --tls-key
    - /srv-cert/tls.key
    - --client-ca
    - /profile-collector-cert/tls.crt
    command:
    - /bin/catalog
    env:
    - name: RELEASE_VERSION
      value: 4.11.0-0.nightly-2022-06-25-081133
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: catalog-operator
    ports:
    - containerPort: 8443
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 80Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /srv-cert
      name: srv-cert
      readOnly: true
    - mountPath: /profile-collector-cert
      name: profile-collector-cert
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-7txv8
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-0-173-128.ap-southeast-1.compute.internal
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seLinuxOptions:
      level: s0:c20,c0
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: olm-operator-serviceaccount
  serviceAccountName: olm-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: srv-cert
    secret:
      defaultMode: 420
      secretName: catalog-operator-serving-cert
  - name: profile-collector-cert
    secret:
      defaultMode: 420
      secretName: pprof-cert
  - name: kube-api-access-7txv8
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-26T23:16:59Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T01:19:09Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T01:19:09Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-06-26T23:16:59Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://f3fce12480556b5ab279236c290acdf15e1bc850426078730dcfd333ecda6795
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
    lastState:
      terminated:
        containerID: cri-o://5e3ce04f1cf538f679af008dcb62f9477131aae1772f527d6207fdc7bb247a55
        exitCode: 2
        finishedAt: "2022-06-27T01:19:08Z"
        message: "her.go:105 +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
          +0x130\n\ngoroutine 5545 [select, 7 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc00240a780,
          0xc003321a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
          +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0xc002efea80?,
          0xc0009cc7a0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
          +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
          +0x30a\n\ngoroutine 5678 [select, 3 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc000b70480,
          0xc0035d4500)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
          +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc002999e90?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
          +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
          +0x30a\n\ngoroutine 5836 [select, 1 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc003b00180,
          0xc003ff8a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
          +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc003a1c8d0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
          +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
          +0x30a\n\ngoroutine 5905 [chan receive, 1 minutes]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel.func1()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:51
          +0x85\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:50
          +0x231\n"
        reason: Error
        startedAt: "2022-06-27T00:00:59Z"
    name: catalog-operator
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2022-06-27T01:19:08Z"
  hostIP: 10.0.173.128
  phase: Running
  podIP: 10.130.0.26
  podIPs:
  - ip: 10.130.0.26
  qosClass: Burstable
  startTime: "2022-06-26T23:16:59Z"


Expected results:
The catalog-operator runs without fatal errors or unexpected restarts.

Additional info:
Operators can be subscribed successfully.
mac:~ jianzhang$ oc get sub -A
NAMESPACE                    NAME                     PACKAGE                  SOURCE            CHANNEL
jian                         learn                    learn                    qe-app-registry   beta
openshift-logging            cluster-logging          cluster-logging          qe-app-registry   stable
openshift-operators-redhat   elasticsearch-operator   elasticsearch-operator   qe-app-registry   stable
mac:~ jianzhang$ 
mac:~ jianzhang$ oc get pods -n jian
NAME                                                              READY   STATUS      RESTARTS   AGE
552b4660850a7fe1e1f142091eb5e4305f18af151727c56f70aa5dffc1dg8cg   0/1     Completed   0          54m
learn-operator-666b687bfb-7qppm                                   1/1     Running     0          54m
qe-app-registry-hbzxg                                             1/1     Running     0          58m
mac:~ jianzhang$ oc get csv -n jian
NAME                            DISPLAY                            VERSION   REPLACES                PHASE
elasticsearch-operator.v5.5.0   OpenShift Elasticsearch Operator   5.5.0                             Succeeded
learn-operator.v0.0.3           Learn Operator                     0.0.3     learn-operator.v0.0.2   Succeeded

Comment 3 Jian Zhang 2022-06-28 02:05:39 UTC
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-132614   True        False         145m    Cluster version is 4.11.0-0.nightly-2022-06-25-132614

mac:~ jianzhang$ oc get pods
NAME                                      READY   STATUS      RESTARTS        AGE
catalog-operator-869fb4bd4d-lbhgj         1/1     Running     3 (9m25s ago)   170m
collect-profiles-27606330-4wg5r           0/1     Completed   0               33m
collect-profiles-27606345-lmk4q           0/1     Completed   0               18m
collect-profiles-27606360-mksv6           0/1     Completed   0               3m17s
olm-operator-5f485d9d5f-wczjc             1/1     Running     0               170m
package-server-manager-6cf996b4cc-79lrw   1/1     Running     2 (156m ago)    170m
packageserver-5f668f98d7-2vjdn            1/1     Running     0               165m
packageserver-5f668f98d7-mb2wc            1/1     Running     0               165m
mac:~ jianzhang$ oc get pods catalog-operator-869fb4bd4d-lbhgj -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.34"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.34"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: nonroot-v2
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  creationTimestamp: "2022-06-27T23:13:12Z"
  generateName: catalog-operator-869fb4bd4d-
  labels:
    app: catalog-operator
    pod-template-hash: 869fb4bd4d
  name: catalog-operator-869fb4bd4d-lbhgj
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: catalog-operator-869fb4bd4d
    uid: 3a1a8cd3-2151-4650-a96b-c1951d461c67
  resourceVersion: "75671"
  uid: 2f06d663-0697-4e88-9e9a-f802c6641efe
spec:
  containers:
  - args:
    - --namespace
    - openshift-marketplace
    - --configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
    - --opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
    - --util-image
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    - --writeStatusName
    - operator-lifecycle-manager-catalog
    - --tls-cert
    - /srv-cert/tls.crt
    - --tls-key
    - /srv-cert/tls.key
    - --client-ca
    - /profile-collector-cert/tls.crt
    command:
    - /bin/catalog
    env:
    - name: RELEASE_VERSION
      value: 4.11.0-0.nightly-2022-06-25-132614
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: catalog-operator
    ports:
    - containerPort: 8443
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 80Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /srv-cert
      name: srv-cert
      readOnly: true
    - mountPath: /profile-collector-cert
      name: profile-collector-cert
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-sdjns
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-0-190-130.ap-south-1.compute.internal
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seLinuxOptions:
      level: s0:c20,c0
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: olm-operator-serviceaccount
  serviceAccountName: olm-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: srv-cert
    secret:
      defaultMode: 420
      secretName: catalog-operator-serving-cert
  - name: profile-collector-cert
    secret:
      defaultMode: 420
      secretName: pprof-cert
  - name: kube-api-access-sdjns
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-06-28T01:53:53Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-06-28T01:53:53Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://2a83d3ba503aa760d27aeef28bf03af84b9ff134ea76c6545251f3253507c22b
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    lastState:
      terminated:
        containerID: cri-o://2766bb3bad38c02f531915aac0321ffb18c7a993a7d84b27e8c7199935f7c858
        exitCode: 2
        finishedAt: "2022-06-28T01:53:52Z"
        message: "izer/streaming/streaming.go:77 +0xa7\nk8s.io/client-go/rest/watch.(*Decoder).Decode(0xc002b12500)\n\t/build/vendor/k8s.io/client-go/rest/watch/decoder.go:49
          +0x4f\nk8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0036178c0)\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105
          +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
          +0x130\n\ngoroutine 3479 [sync.Cond.Wait]:\nsync.runtime_notifyListWait(0xc003a987c8,
          0x8)\n\t/usr/lib/golang/src/runtime/sema.go:513 +0x13d\nsync.(*Cond).Wait(0xc001f9bf10?)\n\t/usr/lib/golang/src/sync/cond.go:56
          +0x8c\ngolang.org/x/net/http2.(*pipe).Read(0xc003a987b0, {0xc0036e4001,
          0xdff, 0xdff})\n\t/build/vendor/golang.org/x/net/http2/pipe.go:76 +0xeb\ngolang.org/x/net/http2.transportResponseBody.Read({0x0?},
          {0xc0036e4001?, 0x2?, 0x203511d?})\n\t/build/vendor/golang.org/x/net/http2/transport.go:2407
          +0x85\nencoding/json.(*Decoder).refill(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:165
          +0x17f\nencoding/json.(*Decoder).readValue(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:140
          +0xbb\nencoding/json.(*Decoder).Decode(0xc002fc0640, {0x1d377c0, 0xc003523098})\n\t/usr/lib/golang/src/encoding/json/stream.go:63
          +0x78\nk8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc003127770,
          {0xc0036dd500, 0x1000, 0x1500})\n\t/build/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152
          +0x19c\nk8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc003502aa0,
          0xc001f9bf10?, {0x2366870, 0xc0044dca80})\n\t/build/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77
          +0xa7\nk8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00059f700)\n\t/build/vendor/k8s.io/client-go/rest/watch/decoder.go:49
          +0x4f\nk8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0044dcd40)\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105
          +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
          +0x130\n"
        reason: Error
        startedAt: "2022-06-28T01:06:59Z"
    name: catalog-operator
    ready: true
    restartCount: 3
    started: true
    state:
      running:
        startedAt: "2022-06-28T01:53:53Z"
  hostIP: 10.0.190.130
  phase: Running
  podIP: 10.130.0.34
  podIPs:
  - ip: 10.130.0.34
  qosClass: Burstable
  startTime: "2022-06-27T23:16:47Z"

Comment 4 Jian Zhang 2022-06-28 02:09:23 UTC
mac:~ jianzhang$ oc get pods package-server-manager-6cf996b4cc-79lrw -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.13"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.13"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: nonroot-v2
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  creationTimestamp: "2022-06-27T23:13:10Z"
  generateName: package-server-manager-6cf996b4cc-
  labels:
    app: package-server-manager
    pod-template-hash: 6cf996b4cc
  name: package-server-manager-6cf996b4cc-79lrw
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: package-server-manager-6cf996b4cc
    uid: 191cb627-a1f3-452b-aed2-336de7871004
  resourceVersion: "19056"
  uid: 63cb7200-32d6-4b0d-bfb2-ba725cda7fbd
spec:
  containers:
  - args:
    - --name
    - $(PACKAGESERVER_NAME)
    - --namespace
    - $(PACKAGESERVER_NAMESPACE)
    command:
    - /bin/psm
    - start
    env:
    - name: PACKAGESERVER_NAME
      value: packageserver
    - name: PACKAGESERVER_IMAGE
      value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    - name: PACKAGESERVER_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: RELEASE_VERSION
      value: 4.11.0-0.nightly-2022-06-25-132614
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: package-server-manager
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-k9hh2
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-0-190-130.ap-south-1.compute.internal
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubern

Comment 5 Jian Zhang 2022-06-28 02:10:02 UTC
preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seLinuxOptions:
      level: s0:c20,c0
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: olm-operator-serviceaccount
  serviceAccountName: olm-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: kube-api-access-k9hh2
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:27:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:27:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://2c56089fabbaa6d7067feb231750ad20422e299dc33dabc0cd19ceca5951ed3a
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    lastState:
      terminated:
        containerID: cri-o://76cc5448de346ec636070a71868432c8f3aa213e16fe211c8fe18a3fee112d23
        exitCode: 1
        finishedAt: "2022-06-27T23:26:36Z"
        message: "d/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\n1.6563723963629913e+09\tERROR\tFailed
          to get API Group-Resources\t{\"error\": \"Get \\\"https://172.30.0.1:443/api?timeout=32s\\\":
          dial tcp 172.30.0.1:443: connect: connection refused\"}\nsigs.k8s.io/controller-runtime/pkg/cluster.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:322\nmain.run\n\t/build/cmd/package-server-manager/main.go:67\ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\n1.6563723963631017e+09\tERROR\tsetup\tfailed
          to setup manager instance\t{\"error\": \"Get \\\"https://172.30.0.1:443/api?timeout=32s\\\":
          dial tcp 172.30.0.1:443: connect: connection refused\"}\ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\nError:
          Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
          connect: connection refused\nencountered an error while executing the binary:
          Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
          connect: connection refused\n"
        reason: Error
        startedAt: "2022-06-27T23:26:11Z"
    name: package-server-manager
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2022-06-27T23:26:54Z"
  hostIP: 10.0.190.130
  phase: Running
  podIP: 10.130.0.13
  podIPs:
  - ip: 10.130.0.13
  qosClass: Burstable
  startTime: "2022-06-27T23:16:47Z"

Comment 6 Jian Zhang 2022-06-29 08:43:51 UTC
Observed the error restarts:

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-28-160049   True        False         5h57m   Cluster version is 4.11.0-0.nightly-2022-06-28-160049

mac:~ jianzhang$ oc get pods
NAME                                     READY   STATUS      RESTARTS      AGE
catalog-operator-7b88dddfbc-rsfhz        1/1     Running     6 (26m ago)   5h51m
collect-profiles-27608160-6m7r6          0/1     Completed   0             37m
collect-profiles-27608175-94n56          0/1     Completed   0             22m
collect-profiles-27608190-nbzcf          0/1     Completed   0             7m55s
olm-operator-5977ffb855-lgfn8            1/1     Running     0             9h
package-server-manager-75db6dcfc-hql4v   1/1     Running     0             9h
packageserver-5955fb79cd-9n56n           1/1     Running     0             9h
packageserver-5955fb79cd-xf6f6           1/1     Running     0             9h

mac:~ jianzhang$ oc get pods catalog-operator-7b88dddfbc-rsfhz -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.121"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.130.0.121"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: nonroot-v2
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  creationTimestamp: "2022-06-29T02:46:23Z"
  generateName: catalog-operator-7b88dddfbc-
  labels:
    app: catalog-operator
    pod-template-hash: 7b88dddfbc
  name: catalog-operator-7b88dddfbc-rsfhz
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: catalog-operator-7b88dddfbc
    uid: 4c796f4d-b2f2-4e0d-b2de-6e8ab5f45ce2
  resourceVersion: "278732"
  uid: 10e8c6e9-8f1c-4484-b934-c5208016c426
spec:
  containers:
  - args:
    - --debug
    - --namespace
    - openshift-marketplace
    - --configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
    - --opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
    - --util-image
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    - --writeStatusName
    - operator-lifecycle-manager-catalog
    - --tls-cert
    - /srv-cert/tls.crt
    - --tls-key
    - /srv-cert/tls.key
    - --client-ca
    - /profile-collector-cert/tls.crt
    command:
    - /bin/catalog
    env:
    - name: RELEASE_VERSION
      value: 4.11.0-0.nightly-2022-06-28-160049
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: catalog-operator
    ports:
    - containerPort: 8443
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 80Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /srv-cert
      name: srv-cert
      readOnly: true
    - mountPath: /profile-collector-cert
      name: profile-collector-cert
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-cv642
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: olm-operator-serviceaccount-dockercfg-f26bf
  nodeName: ip-10-0-130-83.ap-south-1.compute.internal
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seLinuxOptions:
      level: s0:c20,c0
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: olm-operator-serviceaccount
  serviceAccountName: olm-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: srv-cert
    secret:
      defaultMode: 420
      secretName: catalog-operator-serving-cert
  - name: profile-collector-cert
    secret:
      defaultMode: 420
      secretName: pprof-cert
  - name: kube-api-access-cv642
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T02:46:23Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T08:11:56Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T08:11:56Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T02:46:23Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://64a99805cca11e9af66452d739a91259566a35a9e580bfac81dd7205f9272e18
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    lastState:
      terminated:
        containerID: cri-o://938e2eb90465caf8ac99ff4405cfbb9a0f03b16598474f944cca44c3e70af9df
        exitCode: 2
        finishedAt: "2022-06-29T08:11:54Z"
        message: "ub.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128
          +0xc3 fp=0xc003c11180 sp=0xc003c11118 pc=0x1a36603\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc000a2f420,
          {0x237a238, 0xc0004ce4c0}, {0x236a448, 0xc004993ce0})\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/syncer.go:77
          +0x535 fp=0xc003c11648 sp=0xc003c11180 pc=0x1a4f535\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc0001e3ad0,
          {0x237a238, 0xc0004ce4c0}, 0xc000cee360)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287
          +0x57c fp=0xc003c11f70 sp=0xc003c11648 pc=0x1a3ca7c\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x10000c0008fd6e0?,
          {0x237a238, 0xc0004ce4c0}, 0xc0004837b8?)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231
          +0x45 fp=0xc003c11fb0 sp=0xc003c11f70 pc=0x1a3c4a5\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
          +0x32 fp=0xc003c11fe0 sp=0xc003c11fb0 pc=0x1a3c152\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571
          +0x1 fp=0xc003c11fe8 sp=0xc003c11fe0 pc=0x4719c1\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
          +0x557\n"
        reason: Error
        startedAt: "2022-06-29T07:56:16Z"
    name: catalog-operator
    ready: true
    restartCount: 6
    started: true
    state:
      running:
        startedAt: "2022-06-29T08:11:55Z"
  hostIP: 10.0.130.83
  phase: Running
  podIP: 10.130.0.121
  podIPs:
  - ip: 10.130.0.121
  qosClass: Burstable
  startTime: "2022-06-29T02:46:23Z"

Comment 7 Jian Zhang 2022-07-04 03:50:38 UTC
Please ignore comments 4 and 5; they are unrelated to this issue.

Comment 8 Jian Zhang 2022-07-04 06:57:24 UTC
Check the `previous` log.

mac:~ jianzhang$ oc logs catalog-operator-f8ddcb57b-j5rf2 --previous
time="2022-07-03T23:49:00Z" level=info msg="log level info"
...
...
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
fatal error: concurrent map writes
fatal error: concurrent map writes

goroutine 559 [running]:
runtime.throw({0x2048546?, 0x0?})
	/usr/lib/golang/src/runtime/panic.go:992 +0x71 fp=0xc001f9c508 sp=0xc001f9c4d8 pc=0x43e9f1
runtime.mapassign_faststr(0x1d09880, 0xc0031847b0, {0x2079799, 0x2e})
	/usr/lib/golang/src/runtime/map_faststr.go:295 +0x38b fp=0xc001f9c570 sp=0xc001f9c508 pc=0x419b4b
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.Pod(0xc001f4a900, {0x203dd0f, 0xf}, {0xc00132ccc0, 0x38}, {0xc003582d50, 0x13}, 0xc00452c1e0, 0xc0031847b0, 0x5, ...)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/reconciler.go:227 +0xbdf fp=0xc001f9cbb0 sp=0xc001f9c570 pc=0x1a475ff
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*grpcCatalogSourceDecorator).Pod(0xc001f9ce50, {0xc003582d50, 0x13})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:125 +0xf9 fp=0xc001f9cc30 sp=0xc001f9cbb0 pc=0x1a42c99
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).currentPodsWithCorrectImageAndSpec(0xc001f9ce68?, {0xc001f4a900}, {0xc003582d50, 0x13})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:190 +0x198 fp=0xc001f9ce48 sp=0xc001f9cc30 pc=0x1a437b8
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).CheckRegistryServer(0xc000bcbf80?, 0x493b77?)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:453 +0x4c fp=0xc001f9ce88 sp=0xc001f9ce48 pc=0x1a45fcc
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).healthy(0x38ca8453?, 0xc001f4a900)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:196 +0x7e fp=0xc001f9ced0 sp=0xc001f9ce88 pc=0x1a4ae1e
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).health(0x1bc37c0?, 0xc003e7e7e0, 0x8?)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:159 +0x2a fp=0xc001f9cf10 sp=0xc001f9ced0 pc=0x1a4ac8a
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).catalogHealth(0xc000a59a90, {0xc003356a50, 0x11})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:137 +0x387 fp=0xc001f9d040 sp=0xc001f9cf10 pc=0x1a4a827
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).Reconcile(0xc000a59a90, {0x237a238, 0xc0003fe5c0}, {0x7f9f6e5b3900?, 0xc0050f64f0?})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:82 +0x21e fp=0xc001f9d118 sp=0xc001f9d040 pc=0x1a4a1be
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.ReconcilerChain.Reconcile({0xc0008e43c0?, 0x3, 0xc001f9d258?}, {0x237a238, 0xc0003fe5c0}, {0x7f9f6e5b3328?, 0xc0050f6490?})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128 +0xc3 fp=0xc001f9d180 sp=0xc001f9d118 pc=0x1a36603
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc0004dfd50, {0x237a238, 0xc0003fe5c0}, {0x236a448, 0xc003cbb760})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/syncer.go:77 +0x535 fp=0xc001f9d648 sp=0xc001f9d180 pc=0x1a4f535
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc00057a580, {0x237a238, 0xc0003fe5c0}, 0xc000954720)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287 +0x57c fp=0xc001f9df70 sp=0xc001f9d648 pc=0x1a3ca7c
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x0?, {0x237a238, 0xc0003fe5c0}, 0x0?)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231 +0x45 fp=0xc001f9dfb0 sp=0xc001f9df70 pc=0x1a3c4a5
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x32 fp=0xc001f9dfe0 sp=0xc001f9dfb0 pc=0x1a3c152
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc001f9dfe8 sp=0xc001f9dfe0 pc=0x4719c1
created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x557


It seems to have failed at: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227

Comment 9 Alexander Greene 2022-07-05 16:01:22 UTC
As Jian pointed out, the catalog operator is failing due to a concurrent write at https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227.


This is happening because:
- The grpcCatalogSourceDecorator includes a pointer to a catalogSource: https://github.com/operator-framework/operator-lifecycle-manager/blob/0b7970c3d609e1f483a0ba68f599a5ffd2e139b5/pkg/controller/registry/reconciler/grpc.go#L32-L35
- The grpcCatalogSourceDecorator's Annotations function returns the catalogSource's annotations: https://github.com/operator-framework/operator-lifecycle-manager/blob/0b7970c3d609e1f483a0ba68f599a5ffd2e139b5/pkg/controller/registry/reconciler/grpc.go#L66
- The grpcCatalogSourceDecorator's Pod function uses the annotations returned from its Annotations function: https://github.com/operator-framework/operator-lifecycle-manager/blob/0b7970c3d609e1f483a0ba68f599a5ffd2e139b5/pkg/controller/registry/reconciler/grpc.go#L125
- The catalogSource's annotations are updated within the Pod function here: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227

Line 227 of reconciler.go directly mutates the catalogSource's annotations. The grpcCatalogSourceDecorator's Annotations function should return a copy of the annotations, or the decorator should be constructed with a deep copy of the catalogSource, to avoid mutating an object in the lister cache.

This doesn't seem to be a blocker, but we should get a fix in swiftly.

Comment 10 Jian Zhang 2022-07-13 05:02:04 UTC
1, Create a cluster with the fixed PR via the Cluster-bot.
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest   True        False         126m    Cluster version is 4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest

2, Subscribe some operators.
mac:~ jianzhang$ oc get sub -A
NAMESPACE                    NAME                     PACKAGE                  SOURCE                CHANNEL
default                      etcd                     etcd                     community-operators   singlenamespace-alpha
openshift-logging            cluster-logging          cluster-logging          redhat-operators      stable
openshift-operators-redhat   elasticsearch-operator   elasticsearch-operator   redhat-operators      stable

mac:~ jianzhang$ oc get sub -A
NAMESPACE                    NAME                     PACKAGE                  SOURCE                CHANNEL
default                      etcd                     etcd                     community-operators   singlenamespace-alpha
openshift-logging            cluster-logging          cluster-logging          redhat-operators      stable
openshift-operators-redhat   elasticsearch-operator   elasticsearch-operator   redhat-operators      stable
mac:~ jianzhang$ 
mac:~ jianzhang$ 
mac:~ jianzhang$ oc get csv -n openshift-operators-redhat
NAME                           DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.5.4.2   OpenShift Elasticsearch Operator   5.4.2                Succeeded
mac:~ jianzhang$ oc get csv -n openshift-logging 
NAME                           DISPLAY                            VERSION   REPLACES   PHASE
cluster-logging.5.4.2          Red Hat OpenShift Logging          5.4.2                Succeeded
elasticsearch-operator.5.4.2   OpenShift Elasticsearch Operator   5.4.2                Succeeded
mac:~ jianzhang$ oc get csv -n default
NAME                           DISPLAY                            VERSION   REPLACES              PHASE
elasticsearch-operator.5.4.2   OpenShift Elasticsearch Operator   5.4.2                           Succeeded
etcdoperator.v0.9.4            etcd                               0.9.4     etcdoperator.v0.9.2   Succeeded

3, Check OLM catalog-operator pods status.
mac:~ jianzhang$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
catalog-operator-546db7cdf5-7pldg         1/1     Running     0          145m
collect-profiles-27628110-lr2nv           0/1     Completed   0          30m
collect-profiles-27628125-br8b8           0/1     Completed   0          15m
collect-profiles-27628140-m64gp           0/1     Completed   0          38s
olm-operator-754d7f6f56-26qhw             1/1     Running     0          145m
package-server-manager-77d5cbf696-v9w4p   1/1     Running     0          145m
packageserver-6884994d98-2smtw            1/1     Running     0          143m
packageserver-6884994d98-5d7jg            1/1     Running     0          143m


mac:~ jianzhang$ oc logs catalog-operator-546db7cdf5-7pldg  --previous
Error from server (BadRequest): previous terminated container "catalog-operator" in pod "catalog-operator-546db7cdf5-7pldg" not found

No previously terminated container; the catalog-operator works well. Verifying this bug.

Comment 12 Jian Zhang 2022-07-18 07:23:08 UTC
Changed the status to VERIFIED based on comment 10.

Comment 14 Andreas Bleischwitz 2022-11-18 16:31:29 UTC
The issue seems to affect OpenShift 4.10.39 as well:

% oc get logs catalog-operator-6479d65774-znsnk...
time="2022-11-18T14:44:20Z" level=info msg="log level info"
time="2022-11-18T14:44:20Z" level=info msg="TLS keys set, using https for metrics"
time="2022-11-18T14:44:28Z" level=info msg="connection established. cluster-version: v1.23.12+7566c4d"
...
time="2022-11-18T16:02:39Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
fatal error: concurrent map writes

goroutine 746 [running]:
runtime.throw({0x1ee9827, 0xc00e2a85a8})
	/usr/lib/golang/src/runtime/panic.go:1198 +0x71 fp=0xc00e2a8550 sp=0xc00e2a8520 pc=0x43bcd1
runtime.mapassign_faststr(0x1be0e40, 0xc004351200, {0x1f17e43, 0x2e})
	/usr/lib/golang/src/runtime/map_faststr.go:211 +0x39c fp=0xc00e2a85b8 sp=0xc00e2a8550 pc=0x41949c
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.Pod(0xc009b1a480, {0x1edf878, 0xf}, {0xc0172e1e40, 0x1}, {0xc018a283a8, 0x13}, 0xc004351200, 0xc014ebc0f0, 0x5, ...)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/reconciler.go:227 +0xbbf fp=0xc00e2a8bf0 sp=0xc00e2a85b8 pc=0x195e1ff
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*grpcCatalogSourceDecorator).Pod(0xc00e2a8e90, {0xc018a283a8, 0x13})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:125 +0xf9 fp=0xc00e2a8c70 sp=0xc00e2a8bf0 pc=0x1959919
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).currentPodsWithCorrectImageAndSpec(0xc00e2a8ea8, {0xc009b1a480}, {0xc018a283a8, 0x13})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:190 +0x1b1 fp=0xc00e2a8e88 sp=0xc00e2a8c70 pc=0x195a491
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).CheckRegistryServer(0xc00119f980, 0xc0163d6780)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:453 +0x4c fp=0xc00e2a8ec8 sp=0xc00e2a8e88 pc=0x195cb2c
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).healthy(0x3719777f, 0xc009b1a480)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:196 +0x7e fp=0xc00e2a8f10 sp=0xc00e2a8ec8 pc=0x19619be
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).health(0x1ab0320, 0xc00e7890e0, 0x4)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:159 +0x2a fp=0xc00e2a8f50 sp=0xc00e2a8f10 pc=0x196182a
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).catalogHealth(0xc0005f5860, {0xc000467b18, 0xc001c6bad0})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:137 +0x372 fp=0xc00e2a9060 sp=0xc00e2a8f50 pc=0x19614f2
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).Reconcile(0xc0005f5860, {0x21b7b70, 0xc0009c33c0}, {0x7fb4bc391e88, 0xc001c6b940})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:82 +0x20e fp=0xc00e2a9128 sp=0xc00e2a9060 pc=0x1960ece
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.ReconcilerChain.Reconcile({0xc0005ade40, 0x3, 0xc001c6b900}, {0x21b7b70, 0xc0009c33c0}, {0x7fb4bc3918b0, 0xc001c6b900})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128 +0xc3 fp=0xc00e2a9190 sp=0xc00e2a9128 pc=0x194cfe3
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc0007ba620, {0x21b7b70, 0xc0009c33c0}, {0x21a1318, 0xc015a37de0})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/syncer.go:77 +0x532 fp=0xc00e2a9660 sp=0xc00e2a9190 pc=0x1964992
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc000022160, {0x21b7b70, 0xc0009c33c0}, 0xc00121b140)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287 +0x895 fp=0xc00e2a9f70 sp=0xc00e2a9660 pc=0x1953915
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0xc0003c84e0, {0x21b7b70, 0xc0009c33c0}, 0x0)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231 +0x45 fp=0xc00e2a9fb0 sp=0xc00e2a9f70 pc=0x1953025
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start·dwrap·7()
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x32 fp=0xc00e2a9fe0 sp=0xc00e2a9fb0 pc=0x1952cd2
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc00e2a9fe8 sp=0xc00e2a9fe0 pc=0x46f181
created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x55c

Clusterversion: 4.10.39

Are these crashes also blocking updates of already installed operators? We saw that an installed operator wasn't marked as updatable even though there is a newer version in the packagemanifest.
Also, are there any workarounds or mitigations to avoid these crashes?

Comment 25 errata-xmlrpc 2023-01-17 19:50:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

Comment 26 Red Hat Bugzilla 2023-09-18 04:40:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days