Bug 1953977
| Summary: | [4.8] packageserver pods restart many times on the SNO cluster | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jian Zhang <jiazha> |
| Component: | OLM | Assignee: | Nick Hale <nhale> |
| OLM sub component: | OLM | QA Contact: | Bruno Andrade <bandrade> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | ecordell, kuiwang, nhale, tflannag |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:04:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Jian Zhang
2021-04-27 10:44:32 UTC
Created attachment 1775913 [details]
olm must-gather
Jian, to clarify: is this saying that visiting the OperatorHub section of the Console causes the packageserver to restart?

Created attachment 1776494 [details]
packageserver logs
Hi Nick,

I guess not. Even if I don't do anything with operators, the packageserver still restarts. I have a fresh cluster that has been running for hours without any Subscription operations.

mac:openshift-tests-private jianzhang$ oc get nodes
NAME                                        STATUS   ROLES           AGE     VERSION
ip-10-0-150-70.us-east-2.compute.internal   Ready    master,worker   3h29m   v1.21.0-rc.0+6143dea

mac:openshift-tests-private jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   0          3h29m
olm-operator-7995565994-dpxhk       1/1     Running   0          3h29m
packageserver-687d4b4b8d-8f8jz      1/1     Running   6          3h26m
packageserver-687d4b4b8d-jh7w2      1/1     Running   6          3h26m

mac:openshift-tests-private jianzhang$ oc get sub -A
No resources found

Attaching the packageserver logs.

mac:openshift-tests-private jianzhang$ oc logs deployment/packageserver > log
Found 2 pods, using pod/packageserver-687d4b4b8d-8f8jz

Jian,

Thanks for the info, it helps.
From the pod YAML itself, it looks like there's a short window where the API server is refusing connections when the pod is first started:
```
lastState:
terminated:
containerID: cri-o://25db25ce648d79524b0c64063296b6fd22539dd6338de4a35380b5185d27be8d
exitCode: 1
finishedAt: "2021-04-27T07:37:26Z"
message: "_CHACHA20_POLY1305_SHA256, TLS_RSA_WITH_3DES_EDE_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA,
TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384.
\n Insecure
values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA,
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256,
TLS_RSA_WITH_RC4_128_SHA.\n --tls-min-version string Minimum
TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12,
VersionTLS13\n --tls-private-key-file string File
containing the default x509 private key matching --tls-cert-file.\n --tls-sni-cert-key
namedCertKey A pair of x509 certificate and private
key file paths, optionally suffixed with a list of domain patterns which
are fully qualified domain names, possibly with prefixed wildcard segments.
The domain patterns also allow IP addresses, but IPs should only be used
if the apiserver has visibility to the IP address requested by a client.
If no domain patterns are provided, the names of the certificate are extracted.
Non-wildcard matches trump over wildcard matches, explicit domain patterns
trump over extracted names. For multiple key/certificate pairs, use the
--tls-sni-cert-key multiple times. Examples: \"example.crt,example.key\"
or \"foo.crt,foo.key:*.foo.com,foo.com\". (default [])\n -v, --v Level
\ number for the log level
verbosity (default 0)\n --vmodule moduleSpec comma-separated
list of pattern=N settings for file-filtered logging\n\ntime=\"2021-04-27T07:37:26Z\"
level=fatal msg=\"unable to load configmap based request-header-client-ca-file:
Get \\\"https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication?timeout=10s\\\":
dial tcp 172.30.0.1:443: connect: connection refused\"\n"
reason: Error
startedAt: "2021-04-27T07:37:25Z"
name: packageserver
ready: true
restartCount: 10
started: true
state:
running:
startedAt: "2021-04-27T07:37:51Z"
hostIP: 10.0.66.150
phase: Running
podIP: 10.128.0.47
podIPs:
- ip: 10.128.0.47
qosClass: Burstable
startTime: "2021-04-26T07:35:11Z"
```
The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and the time that the latest running container started, "07:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.
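For anyone retracing this from the status above, a minimal sketch for pulling the same fields with jsonpath; the pod name is just the one from the output above, substitute your own:

```
NS=openshift-operator-lifecycle-manager
POD=packageserver-687d4b4b8d-8f8jz   # example pod name taken from the output above

# Pod start time, current container start time, and restart count
oc get pod "$POD" -n "$NS" -o jsonpath='{.status.startTime}{"\n"}'
oc get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{"\n"}'
oc get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'

# Exit reason of the last terminated container
oc get pod "$POD" -n "$NS" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```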
If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts?
Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?
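A sketch of the pod-deletion check above, assuming the packageserver pods carry an app=packageserver label (an assumption about how the shipped deployment is labeled; deleting the pods by name works regardless):

```
NS=openshift-operator-lifecycle-manager

# Delete the current packageserver pods; the label selector is an assumption
oc delete pods -n "$NS" -l app=packageserver

# Watch the replacement pods and their RESTARTS column
oc get pods -n "$NS" -w

# If a restart does appear, the previous container's log usually holds the fatal error
oc logs -n "$NS" <new-packageserver-pod> --previous
```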
> The timestamps lead me to think that all of the restarts occur between the pod start time "07:35:11" and the time that the latest running container started, "07:37:51" (state.running.startedAt). This means that the packageserver is completely stable after the first 2 seconds.
I misread minutes as seconds here. This should be "stable after the first 2 minutes", which is MUCH worse.
Jian, could you post a full must-gather, including the k8s apiserver pods and logs? They aren't in the original.
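A sketch of collecting that; the kube-apiserver pod name below follows the usual static-pod naming on a single node and may differ on your cluster:

```
# Full must-gather (the default image also collects the openshift-kube-apiserver namespace)
oc adm must-gather --dest-dir=./must-gather

# Kube-apiserver pods and logs directly, for good measure
oc get pods -n openshift-kube-apiserver
oc logs -n openshift-kube-apiserver kube-apiserver-<node-name> -c kube-apiserver > kube-apiserver.log
```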
Created attachment 1776978 [details]
all namespaces logs
Hi Nick,

> Jian, could you post a full must-gather, including the k8s apiserver pods and logs? They aren't in the original.

Sorry for that, added it. After running for a day and subscribing to some operators in it, it has restarted tens of times.

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-8f8jz      1/1     Running   24         18h
packageserver-687d4b4b8d-jh7w2      1/1     Running   25         18h

> If you get the chance, can you manually delete the packageserver-* pods and see if you again see a non-zero number of restarts?

No restarts after deleting them:

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   3          18h
olm-operator-7995565994-dpxhk       1/1     Running   3          18h
packageserver-687d4b4b8d-755wj      1/1     Running   0          21m
packageserver-687d4b4b8d-p9fkz      1/1     Running   0          21m

But after installing an operator, it restarts again:

[jzhang@dhcp-140-36 0428]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6cc4c85fb7-lzxk5   1/1     Running   5          19h
olm-operator-7995565994-dpxhk       1/1     Running   5          19h
packageserver-687d4b4b8d-755wj      1/1     Running   2          37m
packageserver-687d4b4b8d-p9fkz      1/1     Running   2          37m

[jzhang@dhcp-140-36 0428]$ oc logs deployment/packageserver > packageserver-logs
Found 2 pods, using pod/packageserver-687d4b4b8d-p9fkz

> Can you also get a screenshot of the broken OperatorHub web console, and perhaps logs from the backing pod and browser console?

A screenshot is here: https://user-images.githubusercontent.com/15416633/116493222-cfecd200-a8d0-11eb-8dfc-fa079b2434e2.png

Created attachment 1776989 [details]
packageserver logs after manual delete
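For reference, one way to reproduce the "install an operator, then watch for restarts" step; the package name, channel, and catalog source below are placeholders rather than what was actually installed here, and the openshift-operators namespace assumes the default global OperatorGroup:

```
# Example Subscription; swap in a real package/channel from `oc get packagemanifests`
cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  name: example-operator              # package name from the catalog (placeholder)
  channel: stable                     # channel is operator-specific (placeholder)
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# Watch whether packageserver restarts while the install progresses
oc get pods -n openshift-operator-lifecycle-manager -w
```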
Created attachment 1780155 [details]
all must-gather logs
OLM version: 0.17.0
git commit: a5f4f3bf64cd4fc101f276d9fd0c687c198c90e2
OCP Version: 4.8.0-0.nightly-2021-05-14-075932

% oc get nodes
NAME                                        STATUS   ROLES           AGE   VERSION
ip-10-0-134-54.us-east-2.compute.internal   Ready    master,worker   13h   v1.21.0-rc.0+41625cd

After 13 hours of running and installing some operators, packageserver looks healthy:

oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-85d769b5-n89pn     1/1     Running   0          13h
olm-operator-56b6cdc6f9-lftfk       1/1     Running   0          13h
packageserver-fbddc774-8fs2s        1/1     Running   0          13h
packageserver-fbddc774-pvjhr        1/1     Running   0          13h

LGTM, marking as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438