Description of problem:

Packageserver pods restart many times on the SNO cluster. Hit the OperatorHub crash multiple times when checking operators on the web console.

[jzhang@dhcp-140-36 ~]$ oc get pods
NAME                              READY   STATUS    RESTARTS   AGE
catalog-operator-668df847-s2vmv   1/1     Running   0          26h
olm-operator-54895774c8-zrmr2     1/1     Running   0          26h
packageserver-7d54d47865-jtrxq    1/1     Running   10         26h
packageserver-7d54d47865-q5swj    1/1     Running   10         26h

Version-Release number of selected component (if applicable):

Cluster version is 4.8.0-0.nightly-2021-04-25-231500

How reproducible:

Sometimes

Steps to Reproduce:

1. Install an SNO cluster.

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-25-231500   True        False         114m    Cluster version is 4.8.0-0.nightly-2021-04-25-231500

[jzhang@dhcp-140-36 ~]$ oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-66-150.us-east-2.compute.internal   Ready    master,worker   23h   v1.21.0-rc.0+6143dea

2. Subscribe to some operators.
3. Let the cluster run for hours, then check the packageserver pods.

Actual results:

The packageserver pods restart many times, and the OperatorHub page crashes on the web console.

Expected results:

The packageserver pods never restart and keep running stably.

Additional info:

1. Liveness and readiness probe failures:

[jzhang@dhcp-140-36 ~]$ oc get event
LAST SEEN   TYPE      REASON               OBJECT                                MESSAGE
177m        Normal    Pulled               pod/packageserver-7d54d47865-jtrxq    Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f5aa58c57ad035675f4fcdc5e1526ce91bbad9816e0e2e98ade865d9c46fcbdc" already present on machine
178m        Warning   ProbeError           pod/packageserver-7d54d47865-jtrxq    Readiness probe error: HTTP probe failed with statuscode: 500 body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
178m        Warning   Unhealthy            pod/packageserver-7d54d47865-jtrxq    Readiness probe failed: HTTP probe failed with statuscode: 500
163m        Warning   ProbeError           pod/packageserver-7d54d47865-jtrxq    Liveness probe error: HTTP probe failed with statuscode: 500 body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
163m        Warning   Unhealthy            pod/packageserver-7d54d47865-jtrxq    Liveness probe failed: HTTP probe failed with statuscode: 500
163m        Warning   ProbeError           pod/packageserver-7d54d47865-q5swj    Liveness probe error: HTTP probe failed with statuscode: 500 body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
163m        Warning   Unhealthy            pod/packageserver-7d54d47865-q5swj    Liveness probe failed: HTTP probe failed with statuscode: 500
178m        Warning   ProbeError           pod/packageserver-7d54d47865-q5swj    Readiness probe error: HTTP probe failed with statuscode: 500 body: Internal Server Error: "/healthz": Post "https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": dial tcp 172.30.0.1:443: connect: connection refused
178m        Warning   Unhealthy            pod/packageserver-7d54d47865-q5swj    Readiness probe failed: HTTP probe failed with statuscode: 500
176m        Warning   ProbeError           pod/packageserver-7d54d47865-q5swj    Liveness probe error: Get "https://10.128.0.47:5443/healthz": context deadline exceeded body:
162m        Warning   ComponentUnhealthy   clusterserviceversion/packageserver   APIServices not installed
162m        Normal    InstallSucceeded     clusterserviceversion/packageserver   install strategy completed with no errors
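For anyone triaging this later: the event messages show the packageserver's /healthz handler failing its delegated SubjectAccessReview POST to the kube-apiserver, so the probes are reporting an apiserver-connectivity problem rather than a probe misconfiguration. A quick sketch for pulling the probe definitions and the last termination reason out of the cluster; the deployment name is from the output above, but the `app=packageserver` label selector is an assumption about how the deployment labels its pods:

```
NS=openshift-operator-lifecycle-manager

# Probe definitions on the packageserver container.
oc -n "$NS" get deployment packageserver \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'

# Restart counts and why the previous container instance died.
# (Label selector is an assumption.)
oc -n "$NS" get pods -l app=packageserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```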
Created attachment 1775913: olm must-gather
Jian,

To clarify, is this saying that visiting the OperatorHub section of the Console causes the packageserver to restart?
Created attachment 1776494: packageserver logs
Hi Nick,

I guess not; even if I don't install any operator, the packageserver still restarts. I have a fresh cluster that has been running for hours without any operator subscription.

mac:openshift-tests-private jianzhang$ oc get nodes
NAME                                        STATUS   ROLES           AGE     VERSION
ip-10-0-150-70.us-east-2.compute.internal   Ready    master,worker   3h29m   v1.21.0-rc.0+6143dea

mac:openshift-tests-private jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   0          3h29m
olm-operator-7995565994-dpxhk       1/1     Running   0          3h29m
packageserver-687d4b4b8d-8f8jz      1/1     Running   6          3h26m
packageserver-687d4b4b8d-jh7w2      1/1     Running   6          3h26m

mac:openshift-tests-private jianzhang$ oc get sub -A
No resources found

Attaching the packageserver logs.

mac:openshift-tests-private jianzhang$ oc logs deployment/packageserver > log
Found 2 pods, using pod/packageserver-687d4b4b8d-8f8jz
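Note that `oc logs deployment/packageserver` only captures one of the two replicas ("Found 2 pods, using pod/..."), so the attached log covers a single pod. A sketch for grabbing logs from every replica, including the previous (crashed) container where one exists; the `app=packageserver` label selector is again an assumption:

```
NS=openshift-operator-lifecycle-manager
for p in $(oc -n "$NS" get pods -l app=packageserver -o name); do
  oc -n "$NS" logs "$p" > "${p#pod/}.log"                       # current container
  oc -n "$NS" logs "$p" --previous > "${p#pod/}-prev.log" \
    2>/dev/null || true                                         # last crashed container, if any
done
```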
Jian,

Thanks for the info, it helps. From the pod YAML itself, it looks like there's a short window where the API server is refusing connections when the pod is first started:

```
  lastState:
    terminated:
      containerID: cri-o://25db25ce648d79524b0c64063296b6fd22539dd6338de4a35380b5185d27be8d
      exitCode: 1
      finishedAt: "2021-04-27T07:37:26Z"
      message: "_CHACHA20_POLY1305_SHA256, TLS_RSA_WITH_3DES_EDE_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384. \n Insecure values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_RC4_128_SHA.\n --tls-min-version string Minimum TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13\n --tls-private-key-file string File containing the default x509 private key matching --tls-cert-file.\n --tls-sni-cert-key namedCertKey A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: \"example.crt,example.key\" or \"foo.crt,foo.key:*.foo.com,foo.com\". (default [])\n -v, --v Level \ number for the log level verbosity (default 0)\n --vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging\n\ntime=\"2021-04-27T07:37:26Z\" level=fatal msg=\"unable to load configmap based request-header-client-ca-file: Get \\\"https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s\\\": dial tcp 172.30.0.1:443: connect: connection refused\"\n"
      reason: Error
      startedAt: "2021-04-27T07:37:25Z"
  name: packageserver
  ready: true
  restartCount: 10
  started: true
  state:
    running:
      startedAt: "2021-04-27T07:37:51Z"
hostIP: 10.0.66.150
phase: Running
podIP: 10.128.0.47
podIPs:
- ip: 10.128.0.47
qosClass: Burstable
startTime: "2021-04-26T07:35:11Z"
```

The container's fatal log line at the end of that message shows why it exited: it couldn't reach the API server to load the extension-apiserver-authentication configmap. The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and the time that the latest running container started, "07:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.

If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts? Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?
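For the manual test above, something along these lines should do it, and a jsonpath query can confirm the restart window by comparing the pod's `startTime` with the container's `state.running.startedAt`. A sketch; the `app=packageserver` label selector is an assumption:

```
NS=openshift-operator-lifecycle-manager

# Delete the packageserver pods; the deployment recreates them.
oc -n "$NS" delete pods -l app=packageserver

# Watch the replacements for restarts during their first few minutes.
oc -n "$NS" get pods -l app=packageserver -w

# Compare pod start time with the running container's start time.
oc -n "$NS" get pods -l app=packageserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{": podStart="}{.status.startTime}{" containerStart="}{.status.containerStatuses[0].state.running.startedAt}{"\n"}{end}'
```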
> The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and the time that the latest running container started, "07:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.

I misread minutes as seconds here. This should be "stable after the first 2 minutes", which is MUCH worse.

Jian, could you post a full must-gather, including the k8s apiserver pods and logs (they aren't in the original)?
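A sketch of the collection being asked for here. `oc adm must-gather` is the standard collector; the `app=openshift-kube-apiserver` label used to select the apiserver pods is an assumption, and on SNO there is only the single node's apiserver pod anyway:

```
# Full must-gather; writes into ./must-gather.local.* by default.
oc adm must-gather

# Kube-apiserver pods and logs (label selector is an assumption; adjust if needed).
oc -n openshift-kube-apiserver get pods
oc -n openshift-kube-apiserver logs -l app=openshift-kube-apiserver --tail=-1 > kube-apiserver.log
```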
Created attachment 1776978: all namespaces logs
Hi Nick,

> Jian, could you post a full must-gather, including the k8s apiserver pods and logs (they aren't in the original)?

Sorry about that; it's added now. After the cluster ran for a day and we subscribed some operators, the packageserver restarted tens of times.

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-8f8jz      1/1     Running   24         18h
packageserver-687d4b4b8d-jh7w2      1/1     Running   25         18h

> If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts?

No restarts:

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-755wj      1/1     Running   0          21m
packageserver-687d4b4b8d-p9fkz      1/1     Running   0          21m

But after installing an operator, it restarts:

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   5          19h
olm-operator-7995565994-dpxhk       1/1     Running   5          19h
packageserver-687d4b4b8d-755wj      1/1     Running   2          37m
packageserver-687d4b4b8d-p9fkz      1/1     Running   2          37m

[jzhang@dhcp-140-36 0428]$ oc logs deployment/packageserver > packageserver-logs
Found 2 pods, using pod/packageserver-687d4b4b8d-p9fkz

> Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?

A screenshot is here: https://user-images.githubusercontent.com/15416633/116493222-cfecd200-a8d0-11eb-8dfc-fa079b2434e2.png
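For the record, "installing an operator" above means creating an OLM Subscription. A minimal example of the kind of manifest involved; the package name and channel here are illustrative placeholders, not the operators used in this report:

```
# Package name and channel below are placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: example-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# Watch the packageserver pods for restarts while the install proceeds.
oc -n openshift-operator-lifecycle-manager get pods -w
```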
Created attachment 1776989: packageserver logs after manual delete
Created attachment 1780155: all must-gather logs
OLM version: 0.17.0
git commit: a5f4f3bf64cd4fc101f276d9fd0c687c198c90e2
OCP version: 4.8.0-0.nightly-2021-05-14-075932

% oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-134-54.us-east-2.compute.internal   Ready    master,worker   13h   v1.21.0-rc.0+41625cd

After 13 hours of running and installing some operators, the packageserver looks healthy:

% oc get pods -n openshift-operator-lifecycle-manager
NAME                              READY   STATUS    RESTARTS   AGE
catalog-operator-85d769b5-n89pn   1/1     Running   0          13h
olm-operator-56b6cdc6f9-lftfk     1/1     Running   0          13h
packageserver-fbddc774-8fs2s      1/1     Running   0          13h
packageserver-fbddc774-pvjhr      1/1     Running   0          13h

LGTM, marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438