Description of problem:
4.11 installation failed at monitoring and network clusteroperators with error "conmon: option parsing failed: Unknown option --log-global-size-max"

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-16-221335

How reproducible:
Always

Steps to Reproduce:
1. Install 4.11 with latest payload

Actual results:
1. Installation failed. The etcd error is tracked in bug 2097431. The monitoring and network clusteroperators both show error "conmon: option parsing failed: Unknown option --log-global-size-max", which seems unrelated to the etcd error, so it is tracked separately in this bug.

Installation logs:
...
06-17 03:04:45.610 level=error msg=Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
06-17 03:04:45.610 level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
06-17 03:04:45.610 level=info msg=Cluster operator monitoring Available is False with MultipleTasksFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
06-17 03:04:45.610 level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
06-17 03:04:45.610 level=error msg=Cluster operator monitoring Degraded is True with MultipleTasksFailed: Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas
06-17 03:04:45.611 level=error msg=updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas
06-17 03:04:45.611 level=error msg=Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z
06-17 03:04:45.611 level=info msg=Cluster operator network ManagementStateDegraded is False with :
06-17 03:04:45.611 level=info msg=Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-sdn/sdn" is not available (awaiting 6 nodes)
06-17 03:04:45.611 level=error msg=Cluster initialization failed because one or more operators are not functioning properly.

oc get co | grep -v "True .*False .*False"
NAME         VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd         4.11.0-0.nightly-2022-06-16-221335   True        False         True       59m     UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
monitoring                                        False       True          True       40m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network      4.11.0-0.nightly-2022-06-16-221335   True        True          True       62m     DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z

Expected results:
Installation succeeds.

Additional info:
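For anyone triaging similar install failures from a saved `oc get co` dump, the grep filter above can be approximated in a small script. This is only a sketch: the helper names `parse_aligned_table` and `unhealthy_operators` are made up for illustration, and it assumes the output is column-aligned under its header row, as in the listing above (which is what lets it cope with the empty VERSION cell for monitoring).

```python
import re

# Column values that the grep pattern "True .*False .*False" treats as healthy.
HEALTHY = {"AVAILABLE": "True", "PROGRESSING": "False", "DEGRADED": "False"}


def parse_aligned_table(text):
    """Parse column-aligned CLI output into dicts keyed by header name.

    Assumption: every cell starts at the same offset as its header,
    so empty cells (e.g. a missing VERSION) are preserved as "".
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0]
    # Record each header name together with its starting offset.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]
    rows = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            # The last column runs to the end of the line.
            end = cols[i + 1][1] if i + 1 < len(cols) else None
            row[name] = line[start:end].strip()
        rows.append(row)
    return rows


def unhealthy_operators(text):
    """Return rows that are not Available=True, Progressing=False, Degraded=False."""
    return [
        row
        for row in parse_aligned_table(text)
        if any(row.get(col) != want for col, want in HEALTHY.items())
    ]
```

Passing the captured command output as a string to `unhealthy_operators` returns one dict per operator that still needs attention; the `NAME` and `MESSAGE` keys carry the same information as the listing above.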
This is also killing `ovnkube-node` containers on RHEL 8.6 workers.

network   4.11.0-0.ci.test-2022-06-16-162452-ci-ln-106kkgt-latest   True   True   True   6h22m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-06-17T00:36:44Z

LAST SEEN   TYPE      REASON      OBJECT                   MESSAGE
38s         Warning   Unhealthy   pod/ovnkube-node-8n6jb   Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max...
3m30s       Warning   Unhealthy   pod/ovnkube-node-68pzf   Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max...
Continuing comment 0:

$ oc get po -n openshift-sdn
NAME        READY   STATUS    RESTARTS       AGE
sdn-bnc9n   1/2     Running   1 (101m ago)   102m
...
sdn-f7x8t   1/2     Running   3 (101m ago)   102m
sdn-gxd9t   1/2     Running   0              111m
sdn-hqk99   1/2     Running   4 (101m ago)   102m
sdn-mt9bg   1/2     Running   0              111m
sdn-whc2g   1/2     Running   0              111m

$ oc get po -n openshift-monitoring
NAME                  READY   STATUS    RESTARTS   AGE
alertmanager-main-0   5/6     Running   0          92m
alertmanager-main-1   5/6     Running   0          87m
...
prometheus-k8s-0      5/6     Running   0          92m
prometheus-k8s-1      5/6     Running   0          87m

From the above, each of these pods has one container that is not ready. Running oc describe on them shows the kubelet error below in every case, so I am reporting this bug against kubelet; please correct the component if that is wrong, thanks:
kubelet Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max

$ oc describe po sdn-gxd9t -n openshift-sdn
Name: sdn-gxd9t
...
Containers:
  sdn:
    ...
    State: Running
      Started: Fri, 17 Jun 2022 10:15:59 +0800
    Ready: False
...
  Normal Started 111m kubelet Started container kube-rbac-proxy
  Warning Unhealthy 91s (x1406 over 111m) kubelet Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max , stderr: , exit code -1

$ oc describe po prometheus-k8s-0 -n openshift-monitoring
Name: prometheus-k8s-0
...
Containers:
  prometheus:
    ...
    State: Running
      Started: Fri, 17 Jun 2022 10:27:31 +0800
    Ready: False
    ...
    Readiness: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=3
    Startup: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=15s #success=1 #failure=60
...
  Normal Started 71m kubelet Started container kube-rbac-proxy-thanos
  Warning Unhealthy 95s (x281 over 71m) kubelet Startup probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max , stderr: , exit code -1

$ oc describe po alertmanager-main-0 -n openshift-monitoring
Name: alertmanager-main-0
...
Containers:
  alertmanager:
    ...
    State: Running
      Started: Fri, 17 Jun 2022 10:27:21 +0800
    Ready: False
...
  Normal Started 93m kubelet Started container prom-label-proxy
  Warning Unhealthy 3m42s (x537 over 93m) kubelet Startup probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max , stderr: , exit code -1
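To check quickly whether saved `oc describe` or event output shows the same signature, a regex scan along these lines may help. It is a rough sketch rather than an established tool: the function name `conmon_option_errors` is invented here, and the pattern assumes the message wording matches the events quoted in this bug (a probe type followed by "probe errored:" and the conmon "Unknown option" text on the same line).

```python
import re
from collections import Counter

# Matches event lines such as:
#   "... kubelet Startup probe errored: ... conmon: option parsing failed:
#    Unknown option --log-global-size-max ..."
# "." does not cross newlines, so each match stays within one event line.
PATTERN = re.compile(
    r"(?P<probe>Readiness|Startup|Liveness) probe errored:"
    r".*?conmon: option parsing failed: Unknown option (?P<option>--[\w-]+)"
)


def conmon_option_errors(text):
    """Count (probe type, rejected conmon option) pairs found in the text."""
    return Counter((m["probe"], m["option"]) for m in PATTERN.finditer(text))
```

A non-empty result for the `--log-global-size-max` option suggests the exec probes are failing because the installed conmon rejects that flag, not because the workload itself is unhealthy.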
*** Bug 2098151 has been marked as a duplicate of this bug. ***
Launched a cluster with 4.11.0-0.nightly-2022-06-21-040754 successfully today. Leaving it to the default QA Contact to verify conmon-2.1.2-2 further, if needed. Thanks!
Checked with the latest payload and the install was successful.

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-22-015220   True        False         132m    Cluster version is 4.11.0-0.nightly-2022-06-22-015220

% oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-130-40.us-east-2.compute.internal    Ready    worker   146m   v1.24.0+284d62a
ip-10-0-131-143.us-east-2.compute.internal   Ready    master   152m   v1.24.0+284d62a
ip-10-0-162-23.us-east-2.compute.internal    Ready    worker   145m   v1.24.0+284d62a
ip-10-0-183-140.us-east-2.compute.internal   Ready    master   152m   v1.24.0+284d62a
ip-10-0-202-170.us-east-2.compute.internal   Ready    worker   145m   v1.24.0+284d62a
ip-10-0-212-210.us-east-2.compute.internal   Ready    master   152m   v1.24.0+284d62a
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069