Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1953977

Summary: [4.8] packageserver pods restart many times on the SNO cluster
Product: OpenShift Container Platform
Reporter: Jian Zhang <jiazha>
Component: OLM
Assignee: Nick Hale <nhale>
OLM sub component: OLM
QA Contact: Bruno Andrade <bandrade>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: ecordell, kuiwang, nhale, tflannag
Version: 4.8
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:04:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
olm must-gather (flags: none)
packageserver logs (flags: none)
all namespaces logs (flags: none)
packageserver logs after manual delete (flags: none)
all must-gather logs (flags: none)

Description Jian Zhang 2021-04-27 10:44:32 UTC
Description of problem:
Packageserver pods restart many times on the SNO cluster. The OperatorHub page on the web console also crashed multiple times when checking operators.
[jzhang@dhcp-140-36 ~]$ oc get pods
NAME                              READY   STATUS    RESTARTS   AGE
catalog-operator-668df847-s2vmv   1/1     Running   0          26h
olm-operator-54895774c8-zrmr2     1/1     Running   0          26h
packageserver-7d54d47865-jtrxq    1/1     Running   10         26h
packageserver-7d54d47865-q5swj    1/1     Running   10         26h


Version-Release number of selected component (if applicable):
Cluster version is 4.8.0-0.nightly-2021-04-25-231500

How reproducible:
sometimes

Steps to Reproduce:
1. Install a SNO cluster:
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-25-231500   True        False         114m    Cluster version is 4.8.0-0.nightly-2021-04-25-231500

[jzhang@dhcp-140-36 ~]$ oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-66-150.us-east-2.compute.internal   Ready    master,worker   23h   v1.21.0-rc.0+6143dea

2. Subscribe to some operators (a sample Subscription is sketched below).
3. Wait for the cluster to run for a few hours, then check the packageserver pods.
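
For example, a minimal sketch of step 2 (the operator package and channel below are placeholders; this assumes the default redhat-operators catalog source is available):
```
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator              # placeholder name
  namespace: openshift-operators      # default namespace with a cluster-wide OperatorGroup
spec:
  channel: stable                     # placeholder channel
  name: example-operator              # package name in the catalog (placeholder)
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```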

Actual results:
The packageserver pods restart many times, and the OperatorHub page on the web console crashes.

Expected results:
The packageserver pods never restart and keep running stably.

Additional info:
1. Liveness and readiness probe failures (see the commands sketched after the events below).
[jzhang@dhcp-140-36 ~]$ oc get event
LAST SEEN   TYPE      REASON       OBJECT                               MESSAGE
177m        Normal    Pulled       pod/packageserver-7d54d47865-jtrxq   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f5aa58c57ad035675f4fcdc5e1526ce91bbad9816e0e2e98ade865d9c46fcbdc" already present on machine
178m        Warning   ProbeError   pod/packageserver-7d54d47865-jtrxq   Readiness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
178m        Warning   Unhealthy    pod/packageserver-7d54d47865-jtrxq   Readiness probe failed: HTTP probe failed with statuscode: 500
163m        Warning   ProbeError   pod/packageserver-7d54d47865-jtrxq   Liveness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
163m        Warning   Unhealthy    pod/packageserver-7d54d47865-jtrxq   Liveness probe failed: HTTP probe failed with statuscode: 500
163m        Warning   ProbeError   pod/packageserver-7d54d47865-q5swj   Liveness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
163m        Warning   Unhealthy    pod/packageserver-7d54d47865-q5swj   Liveness probe failed: HTTP probe failed with statuscode: 500
178m        Warning   ProbeError   pod/packageserver-7d54d47865-q5swj   Readiness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
178m        Warning   Unhealthy    pod/packageserver-7d54d47865-q5swj   Readiness probe failed: HTTP probe failed with statuscode: 500
176m        Warning   ProbeError   pod/packageserver-7d54d47865-q5swj   Liveness probe error: Get "https://10.128.0.47:5443/healthz": context deadline exceeded
body:
162m        Warning   ComponentUnhealthy   clusterserviceversion/packageserver   APIServices not installed
162m        Normal    InstallSucceeded     clusterserviceversion/packageserver   install strategy completed with no errors
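
A couple of commands that narrow this down (a sketch; the deployment and namespace names are taken from the outputs in this report):
```
# Probe definitions of the packageserver container
oc -n openshift-operator-lifecycle-manager get deployment packageserver \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'

# Only the warning events (ProbeError/Unhealthy) in the OLM namespace
oc -n openshift-operator-lifecycle-manager get events --field-selector type=Warning
```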

Comment 1 Jian Zhang 2021-04-27 11:18:49 UTC
Created attachment 1775913 [details]
olm must-gather

Comment 2 Nick Hale 2021-04-27 13:35:07 UTC
Jian,

To clarify, is this saying that visiting the OperatorHub section of Console causes the packageserver to restart?

Comment 3 Jian Zhang 2021-04-28 10:14:24 UTC
Created attachment 1776494 [details]
packageserver logs

Comment 4 Jian Zhang 2021-04-28 10:16:09 UTC
Hi Nick,

I guess not; even if I don't install any operator, the packageserver still restarts.
I have a fresh cluster that has been running for hours without any subscription operation.
mac:openshift-tests-private jianzhang$ oc get nodes
NAME                                        STATUS   ROLES           AGE     VERSION
ip-10-0-150-70.us-east-2.compute.internal   Ready    master,worker   3h29m   v1.21.0-rc.0+6143dea

mac:openshift-tests-private jianzhang$ oc get pods -n openshift-operator-lifecycle-manager 
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   0          3h29m
olm-operator-7995565994-dpxhk       1/1     Running   0          3h29m
packageserver-687d4b4b8d-8f8jz      1/1     Running   6          3h26m
packageserver-687d4b4b8d-jh7w2      1/1     Running   6          3h26m

mac:openshift-tests-private jianzhang$ oc get sub -A
No resources found

Attaching the packageserver logs.
mac:openshift-tests-private jianzhang$ oc logs deployment/packageserver > log
Found 2 pods, using pod/packageserver-687d4b4b8d-8f8jz

Comment 5 Nick Hale 2021-04-28 14:42:41 UTC
Jian,

Thanks for the info, it helps.

From the pod YAML itself, it looks like there's a short window where the API server is refusing connections when the pod is first started:


```
    lastState:
      terminated:
        containerID: cri-o://25db25ce648d79524b0c64063296b6fd22539dd6338de4a35380b5185d27be8d
        exitCode: 1
        finishedAt: "2021-04-27T07:37:26Z"
        message: "_CHACHA20_POLY1305_SHA256, TLS_RSA_WITH_3DES_EDE_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA,
          TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384.
          \n                                                                Insecure
          values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA,
          TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256,
          TLS_RSA_WITH_RC4_128_SHA.\n      --tls-min-version string                                  Minimum
          TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12,
          VersionTLS13\n      --tls-private-key-file string                             File
          containing the default x509 private key matching --tls-cert-file.\n      --tls-sni-cert-key
          namedCertKey                           A pair of x509 certificate and private
          key file paths, optionally suffixed with a list of domain patterns which
          are fully qualified domain names, possibly with prefixed wildcard segments.
          The domain patterns also allow IP addresses, but IPs should only be used
          if the apiserver has visibility to the IP address requested by a client.
          If no domain patterns are provided, the names of the certificate are extracted.
          Non-wildcard matches trump over wildcard matches, explicit domain patterns
          trump over extracted names. For multiple key/certificate pairs, use the
          --tls-sni-cert-key multiple times. Examples: \"example.crt,example.key\"
          or \"foo.crt,foo.key:*.foo.com,foo.com\". (default [])\n  -v, --v Level
          \                                                number for the log level
          verbosity (default 0)\n      --vmodule moduleSpec                                      comma-separated
          list of pattern=N settings for file-filtered logging\n\ntime=\"2021-04-27T07:37:26Z\"
          level=fatal msg=\"unable to load configmap based request-header-client-ca-file:
          Get \\\"https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s\\\":
          dial tcp 172.30.0.1:443: connect: connection refused\"\n"
        reason: Error
        startedAt: "2021-04-27T07:37:25Z"
    name: packageserver
    ready: true
    restartCount: 10
    started: true
    state:
      running:
        startedAt: "2021-04-27T07:37:51Z"
  hostIP: 10.0.66.150
  phase: Running
  podIP: 10.128.0.47
  podIPs:
  - ip: 10.128.0.47
  qosClass: Burstable
  startTime: "2021-04-26T07:35:11Z"
```

The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and the time that the latest running container started, "07:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.

If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts?

Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?
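
For reference, a sketch of how to pull the restart counts and last-termination timestamps directly (assuming the pods carry the app=packageserver label from the deployment's selector):
```
oc -n openshift-operator-lifecycle-manager get pods -l app=packageserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.startedAt}{"\t"}{.status.containerStatuses[0].lastState.terminated.finishedAt}{"\n"}{end}'

# To retest after a manual delete:
oc -n openshift-operator-lifecycle-manager delete pods -l app=packageserver
```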

Comment 6 Nick Hale 2021-04-28 18:58:06 UTC
> The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and the time that the latest running container started, "07:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.

I misread minutes as seconds here. This should be "stable after the first 2 minutes", which is MUCH worse.

Jian, could you post a full must-gather, including the k8s apiserver pods and logs (they aren't in the original)?

Comment 7 Jian Zhang 2021-04-29 01:11:50 UTC
Created attachment 1776978 [details]
all namespaces logs

Comment 8 Jian Zhang 2021-04-29 01:58:21 UTC
Hi Nick,

> Jian, could you post a full must-gather, including the k8s apiserver pods and logs (they aren't in the original).

Sorry for that, added it.

After the cluster had been running for a day and we subscribed to some operators, the packageserver restarted tens of times.

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-8f8jz      1/1     Running   24         18h
packageserver-687d4b4b8d-jh7w2      1/1     Running   25         18h

> If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts?

No restarts.
[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-755wj      1/1     Running   0          21m
packageserver-687d4b4b8d-p9fkz      1/1     Running   0          21m

But after installing an operator, it restarts.
[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   5          19h
olm-operator-7995565994-dpxhk       1/1     Running   5          19h
packageserver-687d4b4b8d-755wj      1/1     Running   2          37m
packageserver-687d4b4b8d-p9fkz      1/1     Running   2          37m

[jzhang@dhcp-140-36 0428]$ oc logs deployment/packageserver > packageserver-logs
Found 2 pods, using pod/packageserver-687d4b4b8d-p9fkz


> Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?

A screenshot is here: https://user-images.githubusercontent.com/15416633/116493222-cfecd200-a8d0-11eb-8dfc-fa079b2434e2.png

Comment 9 Jian Zhang 2021-04-29 02:00:30 UTC
Created attachment 1776989 [details]
packageserver logs after manual delete

Comment 12 Jian Zhang 2021-05-06 08:21:36 UTC
Created attachment 1780155 [details]
all must-gather logs

Comment 19 Bruno Andrade 2021-05-15 04:08:48 UTC
OLM version: 0.17.0
git commit: a5f4f3bf64cd4fc101f276d9fd0c687c198c90e2
OCP Version: 4.8.0-0.nightly-2021-05-14-075932

oc get nodes                                       
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-134-54.us-east-2.compute.internal   Ready    master,worker   13h   v1.21.0-rc.0+41625cd

After 13 hours of running and installing some operators, packageserver looks healthy:

oc get pods -n openshift-operator-lifecycle-manager                    
NAME                              READY   STATUS    RESTARTS   AGE
catalog-operator-85d769b5-n89pn   1/1     Running   0          13h
olm-operator-56b6cdc6f9-lftfk     1/1     Running   0          13h
packageserver-fbddc774-8fs2s      1/1     Running   0          13h
packageserver-fbddc774-pvjhr      1/1     Running   0          13h

LGTM, marking as verified

Comment 22 errata-xmlrpc 2021-07-27 23:04:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438