Bug 2097954

Summary: 4.11 installation failed at the monitoring and network clusteroperators with the error "conmon: option parsing failed: Unknown option --log-global-size-max", making all jobs fail
Product: OpenShift Container Platform
Reporter: Xingxing Xia <xxia>
Component: Node
Sub Component: CRI-O
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: akaris, pehunt, rbrattai, stbenjam, wking
Version: 4.11
Keywords: Reopened
Target Release: 4.11.0
Fixed In Version: conmon-2.1.2-2
Type: Bug
Last Closed: 2022-08-10 11:18:27 UTC

Description Xingxing Xia 2022-06-17 03:57:22 UTC
Description of problem:
4.11 installation failed at monitoring and network clusteroperators with error "conmon: option parsing failed: Unknown option --log-global-size-max"

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-06-16-221335

How reproducible:
Always

Steps to Reproduce:
1. Install 4.11 with latest payload

Actual results:
1. Installation failed. The etcd error is tracked in bug 2097431. The monitoring and network clusteroperators both show the error "conmon: option parsing failed: Unknown option --log-global-size-max", which seems unrelated to the etcd error, so it is tracked separately in this bug. (A sketch for confirming the conmon mismatch on a node is under Additional info below.)

Installation logs:
...
06-17 03:04:45.610  level=error msg=Cluster operator etcd Degraded is True with UpgradeBackupController_Error: UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
06-17 03:04:45.610  level=info msg=Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required

06-17 03:04:45.610  level=info msg=Cluster operator monitoring Available is False with MultipleTasksFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
06-17 03:04:45.610  level=info msg=Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
06-17 03:04:45.610  level=error msg=Cluster operator monitoring Degraded is True with MultipleTasksFailed: Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas
06-17 03:04:45.611  level=error msg=updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 1 updated replicas

06-17 03:04:45.611  level=error msg=Cluster operator network Degraded is True with RolloutHung: DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z
06-17 03:04:45.611  level=info msg=Cluster operator network ManagementStateDegraded is False with : 
06-17 03:04:45.611  level=info msg=Cluster operator network Progressing is True with Deploying: DaemonSet "/openshift-sdn/sdn" is not available (awaiting 6 nodes)
06-17 03:04:45.611  level=error msg=Cluster initialization failed because one or more operators are not functioning properly.

$ oc get co | grep -v "True .*False .*False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd                                       4.11.0-0.nightly-2022-06-16-221335   True        False         True       59m     UpgradeBackupControllerDegraded: unable to retrieve cluster version, no completed update was found in cluster version status history: [{Partial 2022-06-17 02:13:35 +0000 UTC <nil> 4.11.0-0.nightly-2022-06-16-221335 registry.ci.openshift.org/ocp/release@sha256:7d6c5e2594bd9d89592712c60f0af8f1ec750951c3ded3a16326551f431c8719 false }]
monitoring                                                                      False       True          True       40m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.11.0-0.nightly-2022-06-16-221335   True        True          True       62m     DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2022-06-17T02:25:11Z

Expected results:
Installation succeeds.

Additional info:
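One rough way to confirm the suspected conmon flag mismatch on an affected node (a minimal sketch; the node name is a placeholder, and the assumption is that the installed conmon predates the --log-global-size-max option):

$ oc debug node/<node-name>
# inside the debug shell, switch to the host filesystem
sh-4.4# chroot /host
# check which conmon build the node is running
sh-4.4# rpm -q conmon
# check whether conmon knows the flag CRI-O is passing;
# prints nothing when the option is unsupported
sh-4.4# conmon --help 2>&1 | grep -- --log-global-size-max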

Comment 1 Ross Brattain 2022-06-17 04:09:52 UTC
This is also killing `ovnkube-node` containers on RHEL 8.6 workers.



network                                    4.11.0-0.ci.test-2022-06-16-162452-ci-ln-106kkgt-latest   True        True          True       6h22m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-06-17T00:36:44Z

LAST SEEN   TYPE      REASON      OBJECT                   MESSAGE
38s         Warning   Unhealthy   pod/ovnkube-node-8n6jb   Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max...
3m30s       Warning   Unhealthy   pod/ovnkube-node-68pzf   Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max...

Comment 2 Xingxing Xia 2022-06-17 04:12:08 UTC
Continuing comment 0:
$ oc get po -n openshift-sdn
NAME                   READY   STATUS    RESTARTS       AGE
sdn-bnc9n              1/2     Running   1 (101m ago)   102m
...
sdn-f7x8t              1/2     Running   3 (101m ago)   102m
sdn-gxd9t              1/2     Running   0              111m
sdn-hqk99              1/2     Running   4 (101m ago)   102m
sdn-mt9bg              1/2     Running   0              111m
sdn-whc2g              1/2     Running   0              111m

$ oc get po -n openshift-monitoring
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-main-0                                      5/6     Running   0          92m
alertmanager-main-1                                      5/6     Running   0          87m
...
prometheus-k8s-0                                         5/6     Running   0          92m
prometheus-k8s-1                                         5/6     Running   0          87m

From the above, these pods all have one container that is not ready. Running oc describe on them, all show the kubelet error below, so I am reporting this bug against kubelet; if that is the wrong component, please correct it, thanks:
kubelet            Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max
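For context: exec-based readiness and startup probes are run by the kubelet through the container runtime, and the conmon message surfacing in the probe output indicates CRI-O invokes conmon for the exec, so every exec in every affected container fails the same way. A hedged sketch to reproduce this directly on an affected node (the node name and container ID are placeholders):

$ oc debug node/<node-name>
sh-4.4# chroot /host
# pick a running sdn container and attempt a no-op exec in it
sh-4.4# crictl ps --name sdn -q | head -1
sh-4.4# crictl exec <container-id> /bin/true
# on an affected node this should fail with the same
# "conmon: option parsing failed: Unknown option --log-global-size-max"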

$ oc describe po sdn-gxd9t -n openshift-sdn
Name:                 sdn-gxd9t
...
Containers:
  sdn:
    ...
    State:          Running
      Started:      Fri, 17 Jun 2022 10:15:59 +0800
    Ready:          False
...
  Normal   Started    111m                   kubelet            Started container kube-rbac-proxy
  Warning  Unhealthy  91s (x1406 over 111m)  kubelet            Readiness probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max
, stderr: , exit code -1

$ oc describe po prometheus-k8s-0 -n openshift-monitoring
Name:                 prometheus-k8s-0
...
Containers:
  prometheus:
    ...
    State:          Running
      Started:      Fri, 17 Jun 2022 10:27:31 +0800
    Ready:          False
    ...
    Readiness:    exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=3
    Startup:      exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=15s #success=1 #failure=60
...
  Normal   Started         71m                  kubelet            Started container kube-rbac-proxy-thanos
  Warning  Unhealthy       95s (x281 over 71m)  kubelet            Startup probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max
, stderr: , exit code -1


$ oc describe po alertmanager-main-0 -n openshift-monitoring
Name:                 alertmanager-main-0
...
Containers:
  alertmanager:
    ...
    State:          Running
      Started:      Fri, 17 Jun 2022 10:27:21 +0800
    Ready:          False
...
  Normal   Started         93m                    kubelet            Started container prom-label-proxy
  Warning  Unhealthy       3m42s (x537 over 93m)  kubelet            Startup probe errored: rpc error: code = Unknown desc = command error: EOF, stdout: conmon: option parsing failed: Unknown option --log-global-size-max
, stderr: , exit code -1
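To enumerate every pod hit by the same failure across the cluster, something like the following should work (illustrative; reason=Unhealthy is the standard event reason for failed probes):

$ oc get events -A --field-selector reason=Unhealthy | grep -- --log-global-size-max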

Comment 4 Xingxing Xia 2022-06-18 02:00:49 UTC
*** Bug 2098151 has been marked as a duplicate of this bug. ***

Comment 5 Xingxing Xia 2022-06-21 09:12:04 UTC
Today I launched a 4.11.0-0.nightly-2022-06-21-040754 cluster successfully. But I am leaving it to the default QA Contact to further verify conmon-2.1.2-2, if applicable. Thanks!
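One way to verify the fixed package landed on the nodes (a sketch; the node name is a placeholder, and the expected version is taken from the Fixed In Version field above):

$ oc debug node/<node-name> -- chroot /host rpm -q conmon
# should report conmon-2.1.2-2 or later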

Comment 6 Sunil Choudhary 2022-06-22 16:12:06 UTC
Checked with the latest payload; the install was successful.

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-22-015220   True        False         132m    Cluster version is 4.11.0-0.nightly-2022-06-22-015220

% oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-130-40.us-east-2.compute.internal    Ready    worker   146m   v1.24.0+284d62a
ip-10-0-131-143.us-east-2.compute.internal   Ready    master   152m   v1.24.0+284d62a
ip-10-0-162-23.us-east-2.compute.internal    Ready    worker   145m   v1.24.0+284d62a
ip-10-0-183-140.us-east-2.compute.internal   Ready    master   152m   v1.24.0+284d62a
ip-10-0-202-170.us-east-2.compute.internal   Ready    worker   145m   v1.24.0+284d62a
ip-10-0-212-210.us-east-2.compute.internal   Ready    master   152m   v1.24.0+284d62a

Comment 8 errata-xmlrpc 2022-08-10 11:18:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069