Bug 1953977 - [4.8] packageserver pods restart many times on the SNO cluster
Summary: [4.8] packageserver pods restart many times on the SNO cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Nick Hale
QA Contact: Bruno Andrade
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-27 10:44 UTC by Jian Zhang
Modified: 2021-07-27 23:04 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:04:05 UTC
Target Upstream Version:
Embargoed:


Attachments
olm must-gather (1.66 MB, application/gzip) - 2021-04-27 11:18 UTC, Jian Zhang
packageserver logs (1.71 MB, text/plain) - 2021-04-28 10:14 UTC, Jian Zhang
all namespaces logs (10.09 MB, application/gzip) - 2021-04-29 01:11 UTC, Jian Zhang
packageserver logs after manual delete (79.59 KB, text/plain) - 2021-04-29 02:00 UTC, Jian Zhang
all must-gather logs (17.68 MB, application/gzip) - 2021-05-06 08:21 UTC, Jian Zhang


Links
GitHub openshift/operator-framework-olm pull 77 (open) - Bug 1953977: Add tolerant delegating auth config for PackageServer - last updated 2021-05-12 17:06:21 UTC
Red Hat Product Errata RHSA-2021:2438 - last updated 2021-07-27 23:04:22 UTC

Description Jian Zhang 2021-04-27 10:44:32 UTC
Description of problem:
The packageserver pods restart many times on the SNO cluster. The OperatorHub page on the web console also crashed multiple times while checking operators.
[jzhang@dhcp-140-36 ~]$ oc get pods
NAME                              READY   STATUS    RESTARTS   AGE
catalog-operator-668df847-s2vmv   1/1     Running   0          26h
olm-operator-54895774c8-zrmr2     1/1     Running   0          26h
packageserver-7d54d47865-jtrxq    1/1     Running   10         26h
packageserver-7d54d47865-q5swj    1/1     Running   10         26h


Version-Release number of selected component (if applicable):
Cluster version is 4.8.0-0.nightly-2021-04-25-231500

How reproducible:
sometimes

Steps to Reproduce:
1. Install a SNO cluster:
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-25-231500   True        False         114m    Cluster version is 4.8.0-0.nightly-2021-04-25-231500

[jzhang@dhcp-140-36 ~]$ oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-66-150.us-east-2.compute.internal   Ready    master,worker   23h   v1.21.0-rc.0+6143dea

2. Subscribe to some operators.
3. Let the cluster run for several hours, then check the packageserver pods.

Actual results:
The packageserver pods restart many times, and the OperatorHub page on the web console crashes.

Expected results:
The packageserver pods never restart and keep running stably.

Additional info:
1. Liveness and readiness probe failures.
[jzhang@dhcp-140-36 ~]$ oc get event
LAST SEEN   TYPE      REASON       OBJECT                               MESSAGE
177m        Normal    Pulled       pod/packageserver-7d54d47865-jtrxq   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f5aa58c57ad035675f4fcdc5e1526ce91bbad9816e0e2e98ade865d9c46fcbdc" already present on machine
178m        Warning   ProbeError   pod/packageserver-7d54d47865-jtrxq   Readiness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
178m        Warning   Unhealthy    pod/packageserver-7d54d47865-jtrxq   Readiness probe failed: HTTP probe failed with statuscode: 500
163m        Warning   ProbeError   pod/packageserver-7d54d47865-jtrxq   Liveness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
163m        Warning   Unhealthy    pod/packageserver-7d54d47865-jtrxq   Liveness probe failed: HTTP probe failed with statuscode: 500
163m        Warning   ProbeError   pod/packageserver-7d54d47865-q5swj   Liveness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
163m        Warning   Unhealthy    pod/packageserver-7d54d47865-q5swj   Liveness probe failed: HTTP probe failed with statuscode: 500
178m        Warning   ProbeError   pod/packageserver-7d54d47865-q5swj   Readiness probe error: HTTP probe failed with statuscode: 500
body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
178m        Warning   Unhealthy    pod/packageserver-7d54d47865-q5swj   Readiness probe failed: HTTP probe failed with statuscode: 500
176m        Warning   ProbeError   pod/packageserver-7d54d47865-q5swj   Liveness probe error: Get "https://10.128.0.47:5443/healthz": context deadline exceeded
body:
162m        Warning   ComponentUnhealthy   clusterserviceversion/packageserver   APIServices not installed
162m        Normal    InstallSucceeded     clusterserviceversion/packageserver   install strategy completed with no errors
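
Context on the events above: the probe body shows that each /healthz request is checked via a SubjectAccessReview POST to the kube-apiserver at 172.30.0.1:443, so when that connection is refused the probe returns 500 even though the packageserver process itself is up. A minimal Go sketch, assuming the generic k8s.io/apiserver delegating-authorization options rather than the actual packageserver code, of how health paths could be exempted from that round trip:

```
// Sketch only: exempt health endpoints from the SubjectAccessReview
// round trip to the kube-apiserver (assumes k8s.io/apiserver generic options).
package main

import (
	"fmt"

	genericoptions "k8s.io/apiserver/pkg/server/options"
)

func main() {
	authz := genericoptions.NewDelegatingAuthorizationOptions()
	// Without an exemption, each probe of /healthz is authorized via a
	// SubjectAccessReview against the kube-apiserver; if that call fails,
	// the probe returns 500 even though the local server is healthy.
	authz.AlwaysAllowPaths = []string{"/healthz", "/readyz", "/livez"}

	fmt.Printf("always-allow paths: %v\n", authz.AlwaysAllowPaths)
}
```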

Comment 1 Jian Zhang 2021-04-27 11:18:49 UTC
Created attachment 1775913 [details]
olm must-gather

Comment 2 Nick Hale 2021-04-27 13:35:07 UTC
Jian,

To clarify, is this saying that visiting the OperatorHub section of the Console causes the packageserver to restart?

Comment 3 Jian Zhang 2021-04-28 10:14:24 UTC
Created attachment 1776494 [details]
packageserver logs

Comment 4 Jian Zhang 2021-04-28 10:16:09 UTC
Hi Nick,

I guess not; even if I don't install any operator, the packageserver still restarts.
I have a fresh cluster that has been running for hours without any operator subscriptions.
mac:openshift-tests-private jianzhang$ oc get nodes
NAME                                        STATUS   ROLES           AGE     VERSION
ip-10-0-150-70.us-east-2.compute.internal   Ready    master,worker   3h29m   v1.21.0-rc.0+6143dea

mac:openshift-tests-private jianzhang$ oc get pods -n openshift-operator-lifecycle-manager 
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   0          3h29m
olm-operator-7995565994-dpxhk       1/1     Running   0          3h29m
packageserver-687d4b4b8d-8f8jz      1/1     Running   6          3h26m
packageserver-687d4b4b8d-jh7w2      1/1     Running   6          3h26m

mac:openshift-tests-private jianzhang$ oc get sub -A
No resources found

Attaching the packageserver logs.
mac:openshift-tests-private jianzhang$ oc logs deployment/packageserver > log
Found 2 pods, using pod/packageserver-687d4b4b8d-8f8jz

Comment 5 Nick Hale 2021-04-28 14:42:41 UTC
Jian,

Thanks for the info, it helps.

From the pod YAML itself, it looks like there's a short window where the API server is refusing connections when the pod is first started:


```
    lastState:
      terminated:
        containerID: cri-o://25db25ce648d79524b0c64063296b6fd22539dd6338de4a35380b5185d27be8d
        exitCode: 1
        finishedAt: "2021-04-27T07:37:26Z"
        message: "_CHACHA20_POLY1305_SHA256, TLS_RSA_WITH_3DES_EDE_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA,
          TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384.
          \n                                                                Insecure
          values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA,
          TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256,
          TLS_RSA_WITH_RC4_128_SHA.\n      --tls-min-version string                                  Minimum
          TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12,
          VersionTLS13\n      --tls-private-key-file string                             File
          containing the default x509 private key matching --tls-cert-file.\n      --tls-sni-cert-key
          namedCertKey                           A pair of x509 certificate and private
          key file paths, optionally suffixed with a list of domain patterns which
          are fully qualified domain names, possibly with prefixed wildcard segments.
          The domain patterns also allow IP addresses, but IPs should only be used
          if the apiserver has visibility to the IP address requested by a client.
          If no domain patterns are provided, the names of the certificate are extracted.
          Non-wildcard matches trump over wildcard matches, explicit domain patterns
          trump over extracted names. For multiple key/certificate pairs, use the
          --tls-sni-cert-key multiple times. Examples: \"example.crt,example.key\"
          or \"foo.crt,foo.key:*.foo.com,foo.com\". (default [])\n  -v, --v Level
          \                                                number for the log level
          verbosity (default 0)\n      --vmodule moduleSpec                                      comma-separated
          list of pattern=N settings for file-filtered logging\n\ntime=\"2021-04-27T07:37:26Z\"
          level=fatal msg=\"unable to load configmap based request-header-client-ca-file:
          Get \\\"https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s\\\":
          dial tcp 172.30.0.1:443: connect: connection refused\"\n"
        reason: Error
        startedAt: "2021-04-27T07:37:25Z"
    name: packageserver
    ready: true
    restartCount: 10
    started: true
    state:
      running:
        startedAt: "2021-04-27T07:37:51Z"
  hostIP: 10.0.66.150
  phase: Running
  podIP: 10.128.0.47
  podIPs:
  - ip: 10.128.0.47
  qosClass: Burstable
  startTime: "2021-04-26T07:35:11Z"
```

The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and time that the latest running container started, "7:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.

If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts?

Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?
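
The fatal message in the last container state above ("unable to load configmap based request-header-client-ca-file ... connection refused") is the delegated authentication setup failing hard at startup: the request-header client CA is read from the kube-system/extension-apiserver-authentication ConfigMap via the kube-apiserver. A minimal Go sketch, assuming the generic k8s.io/apiserver delegating-authentication options rather than the actual operator-framework-olm change, of the kind of tolerant configuration the linked PR title refers to:

```
// Sketch only: tolerate a failed in-cluster lookup of the
// extension-apiserver-authentication ConfigMap instead of exiting fatally.
package main

import (
	"fmt"

	genericoptions "k8s.io/apiserver/pkg/server/options"
)

func main() {
	authn := genericoptions.NewDelegatingAuthenticationOptions()
	// The request-header client CA lives in the kube-system
	// extension-apiserver-authentication ConfigMap and is fetched from the
	// kube-apiserver at startup; with this option a transient apiserver
	// outage (as seen on SNO) is logged instead of crashing the process.
	authn.TolerateInClusterLookupFailure = true

	fmt.Printf("tolerate in-cluster lookup failure: %v\n", authn.TolerateInClusterLookupFailure)
}
```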

Comment 6 Nick Hale 2021-04-28 18:58:06 UTC
> The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and time that the latest running container started, "7:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.

I misread minutes as seconds here. This should be "stable after the first 2 minutes", which is MUCH worse.

Jian, could you post a full must-gather, including the k8s apiserver pods and logs (they aren't in the original)?

Comment 7 Jian Zhang 2021-04-29 01:11:50 UTC
Created attachment 1776978 [details]
all namespaces logs

Comment 8 Jian Zhang 2021-04-29 01:58:21 UTC
Hi Nick,

> Jian, could you post a full must-gather, including the k8s apiserver pods and logs (they aren't in the original).

Sorry about that; I've added it.

After the cluster had run for a day and we had subscribed to some operators on it, the packageserver restarted tens of times.

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-8f8jz      1/1     Running   24         18h
packageserver-687d4b4b8d-jh7w2      1/1     Running   25         18h

> If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts?

No restarts.
[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-755wj      1/1     Running   0          21m
packageserver-687d4b4b8d-p9fkz      1/1     Running   0          21m

But after installing an operator, the pods restart again.
[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   5          19h
olm-operator-7995565994-dpxhk       1/1     Running   5          19h
packageserver-687d4b4b8d-755wj      1/1     Running   2          37m
packageserver-687d4b4b8d-p9fkz      1/1     Running   2          37m

[jzhang@dhcp-140-36 0428]$ oc logs deployment/packageserver > packageserver-logs
Found 2 pods, using pod/packageserver-687d4b4b8d-p9fkz


> Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?

A screenshot is here: https://user-images.githubusercontent.com/15416633/116493222-cfecd200-a8d0-11eb-8dfc-fa079b2434e2.png

Comment 9 Jian Zhang 2021-04-29 02:00:30 UTC
Created attachment 1776989 [details]
packageserver logs after manual delete

Comment 12 Jian Zhang 2021-05-06 08:21:36 UTC
Created attachment 1780155 [details]
all must-gather logs

Comment 19 Bruno Andrade 2021-05-15 04:08:48 UTC
OLM version: 0.17.0
git commit: a5f4f3bf64cd4fc101f276d9fd0c687c198c90e2
OCP Version: 4.8.0-0.nightly-2021-05-14-075932

oc get nodes                                       
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-134-54.us-east-2.compute.internal   Ready    master,worker   13h   v1.21.0-rc.0+41625cd

After 13 hours of running and installing some operators, the packageserver looks healthy:

oc get pods -n openshift-operator-lifecycle-manager                    
NAME                              READY   STATUS    RESTARTS   AGE
catalog-operator-85d769b5-n89pn   1/1     Running   0          13h
olm-operator-56b6cdc6f9-lftfk     1/1     Running   0          13h
packageserver-fbddc774-8fs2s      1/1     Running   0          13h
packageserver-fbddc774-pvjhr      1/1     Running   0          13h

LGTM, marking as verified

Comment 22 errata-xmlrpc 2021-07-27 23:04:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

