Bug 2001763
Summary: | [OCP on RHV] panic detected in openshift-oauth-apiserver pods | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> |
Component: | Etcd | Assignee: | Nobody <nobody> |
Status: | CLOSED DEFERRED | QA Contact: | ge liu <geliu> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.9 | CC: | aos-bugs, llopezmo, mfojtik, nobody, pnarkhed, slaskawi, sttts, surbania, tjungblu |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-10-10 09:13:20 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Ke Wang
2021-09-07 06:57:01 UTC
It's worth to mention that the API Server recovered from the panic. In case of `openshift-oauth-apiserver/pods/apiserver-d6f9b59c4-4pb5p/oauth-apiserver/oauth-apiserver/logs/current.log`, you can clearly see that it it keeps serving the requests. Nevertheless, all panics happen after the Etcd failure: 2021-09-03T10:38:21.610550225Z I0903 10:38:21.610498 1 healthz.go:244] etcd check failed: healthz 2021-09-03T10:38:21.610550225Z [-]etcd failed: error getting data from etcd: etcdserver: request timed out 2021-09-03T10:38:21.610927014Z E0903 10:38:21.610872 1 timeout.go:128] Header called after Handler finished 2021-09-03T10:38:21.610927014Z goroutine 1524727 [running]: 2021-09-03T10:38:21.610927014Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1(0xc001ad5aa0) ... Looking at this particular 2021-09-03T10:38:21 and Etcd health logs (namespaces/openshift-etcd/pods/etcd-ge2n1-xp525-master-0/etcd-health-monitor/etcd-health-monitor/logs/current.log), I can see that something was wrong: 2021-09-03T10:38:10.136161345Z {"level":"warn","ts":"2021-09-03T10:38:10.136Z","logger":"etcd-client","caller":"v3.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000641c00/#initially=[https://10.37.142.152:2379;https://10.37.142.161:2379;https://10.37.142.228:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} 2021-09-03T10:38:10.136186499Z {"level":"info","ts":"2021-09-03T10:38:10.136Z","caller":"health/log.go:29","msg":"health check","addresses":["https://10.37.142.228:2379"],"check":"QuorumReadSingleTarget","health":false,"start":"2021-09-03T10:38:07.134Z","error":"client timeout exceeded: 1s"} 2021-09-03T10:38:10.383664827Z {"level":"warn","ts":"2021-09-03T10:38:10.383Z","logger":"etcd-client","caller":"v3.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000641c00/#initially=[https://10.37.142.152:2379;https://10.37.142.161:2379;https://10.37.142.228:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} 2021-09-03T10:38:10.383706461Z {"level":"info","ts":"2021-09-03T10:38:10.383Z","caller":"health/log.go:29","msg":"health check","addresses":["https://10.37.142.228:2379"],"check":"SerializedReadSingleTarget","health":false,"start":"2021-09-03T10:38:07.381Z","error":"client timeout exceeded: 1s"} The Etcd logs also seems wrong (namespaces/openshift-etcd/pods/etcd-ge2n1-xp525-master-0/etcd/etcd/logs/current.log): 2021-09-03T10:38:32.492057632Z {"level":"warn","ts":"2021-09-03T10:38:32.483Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"11.297318ms","request":"header:<ID:9968853707752541488 > lease_revoke:<id:0a587ba1446d0bfd>","response":"size:29","error":"lease not found"} 2021-09-03T10:38:32.492355421Z {"level":"warn","ts":"2021-09-03T10:38:32.492Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"35.115µs","request":"header:<ID:9968853707752541493 > lease_revoke:<id:267b7ba1464878a2>","response":"size:29","error":"lease not found"} 2021-09-03T10:38:32.492534359Z {"level":"warn","ts":"2021-09-03T10:38:32.492Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"22.689µs","request":"header:<ID:9968853707752541498 > lease_revoke:<id:0a587ba1446d0b5e>","response":"size:29","error":"lease not found"} 2021-09-03T10:38:32.492668020Z {"level":"warn","ts":"2021-09-03T10:38:32.492Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"45.336µs","request":"header:<ID:9968853707752541500 > lease_revoke:<id:0a587ba1446d0cc4>","response":"size:29","error":"lease not found"} 2021-09-03T10:38:32.493464003Z {"level":"warn","ts":"2021-09-03T10:38:32.493Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"13.585µs","request":"header:<ID:9968853707752541501 > lease_revoke:<id:0a587ba1446d0bfd>","response":"size:29","error":"lease not found"} 2021-09-03T10:38:32.493506491Z {"level":"warn","ts":"2021-09-03T10:38:32.493Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"4.822µs","request":"header:<ID:9968853707752541503 > lease_revoke:<id:267b7ba1464878a2>","response":"size:29","error":"lease not found"} 2021-09-03T10:38:32.494580758Z {"level":"warn","ts":"2021-09-03T10:38:32.494Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"40.204µs","request":"header:<ID:9968853707752541505 > lease_revoke:<id:0a587ba1446d0b5e>","response":"size:29","error":"lease not found"} 2021-09-03T10:38:32.496841162Z {"level":"warn","ts":"2021-09-03T10:38:32.495Z","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"0a587ba1446d0cc4","error":"lease not found"} 2021-09-03T10:38:32.496921944Z {"level":"warn","ts":"2021-09-03T10:38:32.496Z","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"0a587ba1446d0bfd","error":"lease not found"} 2021-09-03T10:38:32.496948975Z {"level":"warn","ts":"2021-09-03T10:38:32.496Z","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"267b7ba1464878a2","error":"lease not found"} 2021-09-03T10:38:32.496997062Z {"level":"info","ts":"2021-09-03T10:38:32.496Z","caller":"traceutil/trace.go:171","msg":"trace[735177488] transaction","detail":"{read_only:false; number_of_response:1; response_revision:878625; }","duration":"3.7077888s","start":"2021-09-03T10:38:28.789Z","end":"2021-09-03T10:38:32.496Z","steps":["trace[735177488] 'process raft request' (duration: 3.705272134s)"],"step_count":1} Sending to the Etcd Team to assess what's going on. I'm also requesting feedback from @sttts to check if we can do something about the panic (but I don't think it's necessary, the output message was quite obvious: "etcd check failed: healthz"). Cloning bug on the following Jira ticket as agreed with Thomas: https://issues.redhat.com/browse/OCPBUGS-2136 This ticket can be closed again. The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days |