Created attachment 1765405 [details]
must-gather from OSD cluster built on GCP with OVN

Description of problem:

4.7.2 OSD on GCP clusters built with `networkType: OVNKubernetes` present with almost constant instability of the authentication cluster operator and occasional/intermittent apiserver unavailability. Must-gather and install-config are attached.

I have built about half a dozen of these clusters and the behavior is consistent. I am happy to build these out for engineering to use to investigate if necessary.

Version-Release number of selected component (if applicable):

4.7.2, cluster nodes on v1.20.0+5fbfd19

How reproducible:

100%

Steps to Reproduce:
1. Ping me (or, if I'm not available, @sre-platform-ic in #sd-sre-platform).
2. If I'm not available, please ask the IC to build a stage OSD cluster on GCP with a custom install-config, following https://github.com/openshift/ops-sop/blob/master/v4/howto/custom-install-config.md and changing `networkType: OpenShiftSDN` to `networkType: OVNKubernetes` in the install-config.
3. We can provide you with cluster-admin permissions using your Red Hat SAML credentials.

Additional info:

Please note that these clusters become completely non-responsive (due to auth issues) within 4-24 hours after build, so please ask us to build them when you have time to investigate (build time is usually ~1 hour). We can extend the lives of these clusters past the default of 36 hours, but we will likely need to rebuild a cluster once it has gone non-responsive.

must-gather attached
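For step 2 above, a minimal sketch of the install-config tweak (the file name/path is an example; adjust it to wherever the custom install-config lives):

# switch the cluster network plugin in the custom install-config before the build
sed -i 's/networkType: OpenShiftSDN/networkType: OVNKubernetes/' install-config.yaml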
Created attachment 1765406 [details] install-config for OSD GCP cluster with OVN
Candace, are the nodes still accessible via SSH at all? We just ran into a case where the authentication-operator was taking 7.8G of RSS, which caused pretty heavy load and swapping on the node and made it unresponsive. Unfortunately, we could not get logs from the container because of that:

top - 21:57:30 up 1 day, 2:06, 1 user, load average: 189.69, 202.56, 211.44
Tasks: 399 total, 1 running, 398 sleeping, 0 stopped, 0 zombie
%Cpu(s): 64.2 us, 27.2 sy, 0.0 ni, 0.0 id, 1.9 wa, 6.0 hi, 0.7 si, 0.0 st
MiB Mem : 15018.3 total, 203.9 free, 13851.2 used, 963.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 750.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15060 root 20 0 1438952 54432 2444 S 40.7 0.4 18:04.36 cluster-kube-ap
38246 root 20 0 1501336 52544 0 S 27.5 0.3 13:57.29 oauth-server
2571 root 20 0 1365636 26328 876 D 8.3 0.2 6:02.63 cluster-kube-co
15694 core 20 0 737852 22588 0 S 8.3 0.1 5:43.89 dns-operator
10978 root 20 0 738888 12824 0 S 7.9 0.1 5:23.75 csi-snapshotter
6010 root 20 0 745072 43676 0 S 7.3 0.3 13:22.86 ovnkube
1854458 root 20 0 745456 45468 0 S 7.0 0.3 10:27.43 ovnkube
79 root 20 0 0 0 0 S 6.3 0.0 5:32.68 kswapd0
10906 root 20 0 739388 37720 0 D 5.0 0.2 7:30.89 csi-resizer
1664763 root 20 0 1437424 17676 0 S 3.0 0.1 6:23.38 cluster-kube-sc
14656 nfsnobo+ 20 0 810564 53568 0 S 2.0 0.3 9:05.36 machine-config-
1665826 root 20 0 1588024 37420 3068 S 1.7 0.2 10:33.38 cluster-kube-ap
1335 openvsw+ 10 -10 441308 73320 36052 S 0.7 0.5 20:55.10 ovs-vswitchd
17742 root 20 0 9734552 7.8g 0 S 0.7 53.5 144:51.38 authentication-
1652054 1000480+ 20 0 1582548 78856 0 S 0.7 0.5 9:23.14 openshift-contr
619 root 0 -20 0 0 0 I 0.3 0.0 0:18.27 kworker/0:1H-kblockd
669 root 0 -20 0 0 0 I 0.3 0.0 0:36.87 kworker/3:1H-kblockd
2336 root 20 0 1238352 248432 204 D 0.3 1.6 42:30.52 kube-controller
2767 root 20 0 711172 12992 0 S 0.3 0.1 3:22.93 apiserver-watch
3612 root 20 0 1363692 27504 1544 S 0.3 0.2 6:19.92 cluster-kube-sc
3876 root 20 0 9269712 284884 0 S 0.3 1.9 221:33.37 etcd
8968 root 20 0 1412756 37116 0 D 0.3 0.2 15:37.09 webhook
10652 root 20 0 740368 18468 0 D 0.3 0.1 16:12.10 csi-provisioner
14337 1000190+ 20 0 1500184 36876 0 S 0.3 0.2 11:48.69 cluster-samples
14787 499 20 0 1418516 22424 0 S 0.3 0.1 11:43.48 cluster-node-tu
14823 10400 20 0 1437244 54836 0 S 0.3 0.4 10:56.98 csi-snapshot-co
14948 1000190+ 20 0 1498648 20156 0 S 0.3 0.1 6:58.90 cluster-samples
15227 root 20 0 1437628 57168 0 S 0.3 0.4 38:08.82 oauth-apiserver
15733 1000080+ 20 0 747252 33948 0 S 0.3 0.2 10:36.10 ingress-operato
25112 nfsnobo+ 20 0 1492108 54632 0 S 0.3 0.4 8:28.58 cluster-baremet
25797 root 20 0 1515524 61584 0 S 0.3 0.4 48:50.51 cluster-kube-co
25828 1000310+ 20 0 1443664 58148 0 S 0.3 0.4 8:59.00 cluster-image-r
28496 root 20 0 749176 41032 0 S 0.3 0.3 10:37.16 kube-scheduler
1665666 root 20 0 1366572 27616 0 S 0.3 0.2 6:38.79 cluster-kube-ap
1853574 nfsnobo+ 20 0 736588 9888 0 S 0.3 0.1 3:51.67 kube-rbac-proxy
1954410 root 20 0 71448 4904 3700 R 0.3 0.0 0:00.29 top
1954427 root 20 0 77612 3236 2760 D 0.3 0.0 0:00.01 curl
1 root 20 0 257816 12596 3496 S 0.0 0.1 19:08.59 systemd

If you are able to access those nodes via SSH, could you check 'ps aux'?
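In case it helps whoever gets onto a node first, a quick sketch for spotting the biggest memory consumers with ps (plain procps options; nothing cluster-specific assumed):

# largest resident-set consumers first
ps aux --sort=-rss | head -n 15

# or just the fields of interest
ps -eo pid,user,rss,vsz,comm --sort=-rss | head -n 15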
Another instance of large auth-operator memory usage (6.9G RSS); this time the CPU usage of various other processes (including the ovnkube master) is also interesting:

top - 13:58:00 up 2 days, 17:59, 1 user, load average: 243.33, 243.22, 243.22
Tasks: 437 total, 3 running, 434 sleeping, 0 stopped, 0 zombie
%Cpu(s): 65.4 us, 27.2 sy, 0.0 ni, 0.0 id, 0.2 wa, 6.8 hi, 0.4 si, 0.0 st
MiB Mem : 15018.3 total, 229.5 free, 13867.4 used, 921.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 729.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2092636 root 20 0 1440304 20892 0 S 99.7 0.1 8:11.73 cluster-kube-ap
5908 root 20 0 745328 44152 0 S 73.2 0.3 13:35.11 ovnkube
3430916 1001 20 0 1786848 163972 0 S 45.4 1.1 12:36.55 olm
4662 root 20 0 1268032 17184 624 D 40.4 0.1 20:46.94 oauth-proxy
3431485 1000460+ 20 0 1645780 86428 0 S 39.4 0.6 10:22.85 service-ca-oper
14078 1000420+ 20 0 736588 12412 0 S 17.9 0.1 1:12.19 kube-rbac-proxy
10451 nfsnobo+ 20 0 736844 13208 528 S 9.6 0.1 4:39.23 kube-rbac-proxy
3431875 root 20 0 1439112 21572 0 S 8.9 0.1 16:36.59 cluster-kube-co
4990 root 20 0 1565724 14560 0 S 6.6 0.1 4:53.88 openshift-tuned
79 root 20 0 0 0 0 R 4.3 0.0 5:53.64 kswapd0
12078 1000420+ 20 0 816968 49564 0 D 3.6 0.3 21:10.57 operator
3431446 root 20 0 1433092 21272 0 S 3.6 0.1 10:26.06 cluster-policy-
3431549 1000480+ 20 0 1444908 28632 0 S 2.6 0.2 14:33.53 openshift-contr
4149733 root 20 0 736716 10444 0 S 2.0 0.1 5:14.59 ovndbchecker
1328 openvsw+ 10 -10 441376 74828 36052 S 0.7 0.5 57:46.44 ovs-vswitchd
5690 nfsnobo+ 20 0 736588 14852 0 S 0.7 0.1 1:26.31 kube-rbac-proxy
13860 1000240+ 20 0 1431604 34496 0 D 0.7 0.2 20:55.24 insights-operat
892509 root 0 -20 71448 4500 3324 R 0.7 0.0 0:00.09 top
3430289 root 20 0 8692492 6.9g 0 S 0.7 47.0 112:56.11 authentication-
3431070 10400 20 0 1436668 43260 0 S 0.7 0.3 9:55.39 csi-snapshot-co
3431593 root 20 0 1512044 51772 0 S 0.7 0.3 39:27.50 cluster-kube-sc
4149672 root 20 0 745456 47108 0 S 0.7 0.3 13:58.09 ovnkube
613 root 0 -20 0 0 0 I 0.3 0.0 1:15.25 kworker/3:1H-kblockd
1245 openvsw+ 10 -10 81972 6436 3528 S 0.3 0.0 3:02.73 ovsdb-server
1715 root 20 0 3236340 80336 64 S 0.3 0.5 53:34.20 crio
1752 root 20 0 3454532 303408 0 S 0.3 2.0 829:37.80 kubelet
2467 root 20 0 711684 12288 0 S 0.3 0.1 6:37.37 apiserver-watch
3125 root 20 0 1363948 29132 0 D 0.3 0.2 8:02.06 cluster-kube-sc
3971 root 20 0 9201356 245544 0 S 0.3 1.6 481:24.89 etcd
5068 root 20 0 746052 20980 0 D 0.3 0.1 7:15.54 machine-config-
5094 nfsnobo+ 20 0 719416 35388 0 D 0.3 0.2 84:26.28 node_exporter
5455 root 20 0 292556 31016 444 D 0.3 0.2 1:39.59 ovn-controller
5525 65532 20 0 737100 15836 0 S 0.3 0.1 1:49.33 kube-rbac-proxy
5542 root 20 0 716400 10796 0 S 0.3 0.1 0:09.33 csi-node-driver
8643 root 20 0 1413632 44196 0 S 0.3 0.3 7:07.94 network-metrics
8759 root 20 0 1413012 35980 0 D 0.3 0.2 11:18.67 webhook
8945 root 20 0 1429344 26792 0 S 0.3 0.2 16:40.20 coredns
10982 root 20 0 1349184 22016 0 S 0.3 0.1 8:57.02 machine-approve
Created attachment 1775202 [details] /var/log/containers/
Node NotReady reproduced on 4.7.0-0.nightly-2021-04-22-105612, UPI on GCP with OVN and etcd encryption. Gathered /var/log/containers/ from the master.
After SSHing into the system, I stopped kubelet via systemctl. I then sent SIGSTOP to the following processes, in order, checking `top` after each STOP. After the SIGSTOPs the load average decreased to <1.

kill -STOP 3430289  # authentication-
kill -STOP 3432428  # oauth-apiserver
kill -STOP 4662     # oauth-proxy
kill -STOP 3431446  # cluster-policy-
kill -STOP 3435335  # kube-apiserver

Stopping kube-apiserver seemed to have the most effect.
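For reference, this debugging state can be undone afterwards by resuming the paused processes with SIGCONT and restarting kubelet; the PIDs below are just the ones from this node and will differ elsewhere:

kill -CONT 3435335  # kube-apiserver
kill -CONT 3431446  # cluster-policy-
kill -CONT 4662     # oauth-proxy
kill -CONT 3432428  # oauth-apiserver
kill -CONT 3430289  # authentication-
systemctl start kubelet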
Re: the logs from comments 6-8, things actually start to go wrong around Apr 25 @ 12:24:47:

Apr 25 12:24:47 rbrattai230024-8tzfq-master-1.c.openshift-qe.internal hyperkube[1752]: I0425 12:24:45.016512 1752 prober.go:117] Liveness probe for "machine-api-controllers-74866d59-hzhxv_openshift-machine-api(95c1756f-2c95-4c43-b0d5-bd94c84b8956):machine-controller" failed (failure): Get "http://10.129.0.24:9440/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 25 12:24:58 rbrattai230024-8tzfq-master-1.c.openshift-qe.internal systemd[3567721]: run-runc-e73ec1167dbae2acf643ad125dc637f7494c05669ca817997c7204ed7c76cce2-runc.Hnzidt.mount: Succeeded.

Shortly after that, the ovnkube NB first sees issues:

2021-04-25T12:25:09.847124078+00:00 stderr F 2021-04-25T12:25:09Z|00039|timeval|WARN|Unreasonably long 3319ms poll interval (0ms user, 2ms system)
2021-04-25T12:25:09.847255458+00:00 stderr F 2021-04-25T12:25:09Z|00040|timeval|WARN|faults: 2 minor, 1 major
2021-04-25T12:25:09.847255458+00:00 stderr F 2021-04-25T12:25:09Z|00041|timeval|WARN|disk: 7280 reads, 0 writes
2021-04-25T12:25:09.847363207+00:00 stderr F 2021-04-25T12:25:09Z|00042|timeval|WARN|context switches: 23 voluntary, 3 involuntary

The OVN SB has similar warnings:

2021-04-25T12:25:16.193634505+00:00 stderr F 2021-04-25T12:25:16Z|00017|stream_ssl|WARN|SSL_accept: system error (Success)
2021-04-25T12:25:16.193741547+00:00 stderr F 2021-04-25T12:25:16Z|00018|jsonrpc|WARN|ssl:10.0.0.6:34130: receive error: Protocol error
2021-04-25T12:25:16.194100205+00:00 stderr F 2021-04-25T12:25:16Z|00019|timeval|WARN|Unreasonably long 4061ms poll interval (3ms user, 0ms system)
2021-04-25T12:25:16.194183313+00:00 stderr F 2021-04-25T12:25:16Z|00020|timeval|WARN|faults: 25 minor, 3 major
2021-04-25T12:25:16.194183313+00:00 stderr F 2021-04-25T12:25:16Z|00021|timeval|WARN|disk: 632 reads, 0 writes
2021-04-25T12:25:16.194212102+00:00 stderr F 2021-04-25T12:25:16Z|00022|timeval|WARN|context switches: 23 voluntary, 0 involuntary

Then kubelet fails to probe many more containers:

Apr 25 12:27:54 rbrattai230024-8tzfq-master-1.c.openshift-qe.internal hyperkube[1752]: I0425 12:24:54.588055 1752 prober.go:117] Readiness probe for "machine-api-controllers-74866d59-hzhxv_openshift-machine-api(95c1756f-2c95-4c43-b0d5-bd94c84b8956):machine-controller" failed (failure): Get "http://10.129.0.24:9440/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 25 12:27:54 rbrattai230024-8tzfq-master-1.c.openshift-qe.internal hyperkube[1752]: I0425 12:24:54.713118 1752 prober.go:117] Readiness probe for "machine-api-controllers-74866d59-hzhxv_openshift-machine-api(95c1756f-2c95-4c43-b0d5-bd94c84b8956):machine-healthcheck-controller" failed (failure): Get "http://10.129.0.24:9442/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 25 12:27:54 rbrattai230024-8tzfq-master-1.c.openshift-qe.internal hyperkube[1752]: I0425 12:24:45.545478 1752 prober.go:117] Liveness probe for "openshift-config-operator-8495d8d645-g5ss2_openshift-config-operator(83117659-58fa-41f3-bbcf-54620d394543):openshift-config-operator" failed (failure): Get "https://10.129.0.30:8443/healthz": context deadline exceeded

It doesn't seem like the OVN SB is under pressure, since it only has a couple thousand logical flows, which is basically nothing:

$ ovn-sbctl --columns _uuid --bare --no-headings --db=unix:/tmp/sbdb.sock find Logical_flow | sort | uniq | wc -l
3330
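If anyone wants to repeat this timeline on another node, a rough sketch of where the data above can be pulled from (the timestamp filter is just an example, and the layout under /var/log/containers/ may differ):

# kubelet probe failures from the journal on the master
journalctl -u kubelet --since "2021-04-25 12:20:00" | grep 'prober.go' | head

# which container logs carry the "Unreasonably long ... poll interval" warnings
grep -rl 'Unreasonably long' /var/log/containers/ | head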
Created attachment 1775728 [details]
non-failing 4.7 cluster, auth-operator memory usage

It takes ~20 hours for openshift-authentication-operator to grow to 8GB and then somehow reset.
Created attachment 1775729 [details]
failed cluster, after recovery: openshift-authentication-operator memory grows again

It seems the metrics are lost during the failure, so we can't see what was happening before the node went NotReady.
Not much luck today. The IC got me a cluster, but it was on a slightly different platform (still with OVN) and the expected issue did not occur. However, I identified one possible point of failure and created a PR to fix it: https://github.com/openshift/cluster-authentication-operator/pull/440

I can't be sure the CAO memory consumption is fixed unless I can get a cluster that exhibits the reported behavior. Mario has been working hard on that; he seems to have lacked the privileges to create such a cluster, but I'm told one is being instantiated. If someone else could attempt to get a cluster-authentication-operator pprof dump from a failing cluster, that would help.
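In case someone gets to a failing cluster before I do, here is a rough sketch of what I mean by a pprof dump, assuming the operator pod exposes the standard Go net/http/pprof endpoints on its serving port (the port, path, and label selector below are assumptions I have not verified against this operator):

POD=$(oc -n openshift-authentication-operator get pods -l app=authentication-operator -o name | head -n 1)
oc -n openshift-authentication-operator port-forward "$POD" 8443:8443 &
curl -sk https://localhost:8443/debug/pprof/heap -o heap.pprof   # grab a heap profile
go tool pprof -top heap.pprof                                    # summarize the biggest allocators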
github.com/openshift/cluster-authentication-operator/pkg/controllers/proxyconfig.(*proxyConfigChecker).getCACerts
cluster-authentication-operator/pkg/controllers/proxyconfig/proxyconfig_controller.go

  Total:     1MB     2.08GB (flat, cum) 90.68%
    152        .          .   // In case this causes performance issues, consider caching the trusted
    153        .          .   // certs pool.
    154        .          .   // At the time of writing this comment, this should only happen once
    155        .          .   // every 5 minutes and the trusted-ca CM contains around 130 certs.
    156        .     2.08GB   if ok := caPool.AppendCertsFromPEM([]byte(caCM.Data["ca-bundle.crt"])); !ok {
    157        .          .       return nil, fmt.Errorf("unable to append system trust ca bundle")
    158        .          .   }
Thanks Ross, I was able to find the authn-operator bug and added the fix to the PR referenced in comment 20. I tested it in your cluster and the issue went away. It only appears in clusters where an HTTP proxy is configured.

I am not sure this was the root cause of the instability, though: we were presented with an OSD cluster reproducing this BZ where the proxy was not configured and the memory consumption of the authentication operator was normal, yet the network still appeared to be broken. I would advise not waiting for the lengthy merge process; instead, attempt to reproduce the issue with a build that includes the provided fix, and if the network still breaks, keep the broken cluster and reassign this back to networking.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.11 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1550
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Who is impacted?
- customers using OpenShift 4.7 in versions that do not include this fix
- a necessary requirement for the problem to appear is to have a cluster-wide HTTP proxy configured as per https://docs.openshift.com/container-platform/4.7/networking/enable-cluster-wide-proxy.html

What is the impact?
- high memory consumption (2GB+) by the authentication-operator, and possibly by the router pods, since the authentication-operator keeps many connections to the oauth-server open

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- the pods involved would have to be restarted often (about once an hour); see the sketch after this comment

Is this a regression?
- Yes, it's a regression in 4.7 until the version that includes the fix (see the above erratum)
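For clusters that cannot update yet, a possible mitigation sketch (the namespace/deployment names below are the usual defaults and should be double-checked before use):

# restart the authentication operator when its memory grows
oc -n openshift-authentication-operator rollout restart deployment/authentication-operator

# restart the default router deployment if it is affected as well
oc -n openshift-ingress rollout restart deployment/router-default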
(In reply to Standa Laznicka from comment #33)
> Yes, it's a regression in 4.7 until the version including the fix (see the
> above erratum)

I'm adding the 'Regression' keyword to reflect this^.
Updating the title to reflect that this BZ is suspected to be the cause of bug 1960787.
We did end up dropping the 4.6 -> 4.7.(<11) update recommendations based on comment 33 of this bug [1,2].

[1]: https://github.com/openshift/cincinnati-graph-data/pull/891
[2]: https://github.com/openshift/cincinnati-graph-data/pull/892