Bug 1719967

Summary: During upgrade, node-tuning operator status rapidly alternates between new and old version
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Node Tuning OperatorAssignee: Jiří Mencák <jmencak>
Status: CLOSED ERRATA QA Contact: Simon <skordas>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.zCC: lmeyer, mifiedle, sejug, skordas, sponnaga, vlaad, wsun
Target Milestone: ---Keywords: Reopened
Target Release: 4.1.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: 4.1.6
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The node-tuning-operator did not implement leader election. Consequence: Two operators can run for a brief period of time causing resource contention. Fix: Implemented "leader election for life". Result: Node-tuning operators can co-exist within the same namespace without causing resource contention.
Story Points: ---
Clone Of:
: 1724274 1729273 (view as bug list) Environment:
Last Closed: 2019-07-11 18:10:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1724274, 1729273    

Description Clayton Coleman 2019-06-12 20:33:57 UTC
In a ci upgrade test on 4.1.z, the node tuning operator is observed to alternate between reporting the new and old versions.  This could be because the operator doesn't have leader election, or because it doesn't use the recreate strategy, or because it's just broken:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2513

From the event logs during upgrade

Jun 12 19:26:48.688 I ns/openshift-cluster-node-tuning-operator deployment/cluster-node-tuning-operator Scaled up replica set cluster-node-tuning-operator-55864b7ff7 to 1
Jun 12 19:26:48.710 I ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-55864b7ff7-k5z8p node/ created
Jun 12 19:26:48.720 I ns/openshift-cluster-node-tuning-operator replicaset/cluster-node-tuning-operator-55864b7ff7 Created pod: cluster-node-tuning-operator-55864b7ff7-k5z8p
Jun 12 19:26:48.725 I ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-55864b7ff7-k5z8p Successfully assigned openshift-cluster-node-tuning-operator/cluster-node-tuning-operator-55864b7ff7-k5z8p to ip-10-0-150-12.ec2.internal
Jun 12 19:26:56.353 I ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-55864b7ff7-k5z8p Pulling image "registry.svc.ci.openshift.org/ocp/4.1-2019-06-12-184448@sha256:f49e2ab1c819585bbb33152a322070bd5eb994ef20a74840115dabb8d17b7a45"
Jun 12 19:26:57.694 I ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-55864b7ff7-k5z8p Successfully pulled image "registry.svc.ci.openshift.org/ocp/4.1-2019-06-12-184448@sha256:f49e2ab1c819585bbb33152a322070bd5eb994ef20a74840115dabb8d17b7a45"
Jun 12 19:26:57.925 I ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-55864b7ff7-k5z8p Created container cluster-node-tuning-operator
Jun 12 19:26:57.972 I ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-55864b7ff7-k5z8p Started container cluster-node-tuning-operator
Jun 12 19:26:58.395 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:26:58.915 W ns/openshift-cluster-node-tuning-operator pod/tuned-5xr4h node/ip-10-0-150-12.ec2.internal graceful deletion within 30s
Jun 12 19:26:58.917 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:26:58.930 I ns/openshift-cluster-node-tuning-operator daemonset/tuned Deleted pod: tuned-5xr4h
Jun 12 19:26:58.930 I ns/openshift-cluster-node-tuning-operator pod/tuned-5xr4h Stopping container tuned
Jun 12 19:26:59.031 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:26:59.062 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:26:59.062 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:26:59.184 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:26:59.184 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:26:59.284 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:26:59.284 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:26:59.315 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:26:59.315 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:26:59.399 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:26:59.446 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:26:59.501 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:26:59.501 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:26:59.617 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:26:59.617 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:26:59.707 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:26:59.793 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:00.101 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:01.132 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:01.317 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:01.317 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:01.786 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:02.339 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:02.513 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:02.513 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:02.524 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:02.921 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:03.046 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:27:03.046 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:03.147 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:27:03.393 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:03.618 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:03.770 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:03.910 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:04.116 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:04.523 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:27:04.523 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:04.554 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:04.554 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:04.944 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:27:04.944 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:04.972 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:04.972 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:05.314 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:27:05.314 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:05.709 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:05.709 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:06.031 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:27:06.031 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:06.112 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:06.112 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:27:06.429 W clusteroperator/node-tuning changed Progressing to True: Working towards "4.1.0-0.ci-2019-06-12-055003"
Jun 12 19:27:06.429 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:27:06.508 W clusteroperator/node-tuning changed Progressing to False: Cluster version is "4.1.0-0.ci-2019-06-12-184448"
Jun 12 19:27:06.508 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448

... several minutes ...

Jun 12 19:28:07.636 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:28:07.706 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:28:08.032 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-184448 -> 4.1.0-0.ci-2019-06-12-055003
Jun 12 19:28:08.106 I clusteroperator/node-tuning versions: operator 4.1.0-0.ci-2019-06-12-055003 -> 4.1.0-0.ci-2019-06-12-184448
Jun 12 19:28:08.386 I ns/openshift-cluster-node-tuning-operator deployment/cluster-node-tuning-operator Scaled down replica set cluster-node-tuning-operator-8485877487 to 0
Jun 12 19:28:08.391 W ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-8485877487-t2wr8 node/ip-10-0-150-12.ec2.internal graceful deletion within 30s
Jun 12 19:28:08.397 I ns/openshift-cluster-node-tuning-operator replicaset/cluster-node-tuning-operator-8485877487 Deleted pod: cluster-node-tuning-operator-8485877487-t2wr8
Jun 12 19:28:08.401 I ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-8485877487-t2wr8 Stopping container cluster-node-tuning-operator
Jun 12 19:28:10.769 E ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-8485877487-t2wr8 node/ip-10-0-150-12.ec2.internal container=cluster-node-tuning-operator container exited with code 255 (Error):  Reconciling Tuned openshift-cluster-node-tuning-operator/default\nI0612 19:28:07.215008
Jun 12 19:28:10.978 W ns/openshift-cluster-node-tuning-operator pod/cluster-node-tuning-operator-8485877487-t2wr8 node/ip-10-0-150-12.ec2.internal deleted
Jun 12 19:28:13.576 W ns/openshift-cluster-node-tuning-operator pod/tuned-xtrf5 node/ip-10-0-150-12.ec2.internal deleted
Jun 12 19:28:13.601 I ns/openshift-cluster-node-tuning-operator pod/tuned-z4nwc node/ created
Jun 12 19:28:13.609 I ns/openshift-cluster-node-tuning-operator daemonset/tuned Created pod: tuned-z4nwc
Jun 12 19:28:13.615 I ns/openshift-cluster-node-tuning-operator pod/tuned-z4nwc Successfully assigned openshift-cluster-node-tuning-operator/tuned-z4nwc to ip-10-0-150-12.ec2.internal
Jun 12 19:28:18.095 I ns/openshift-cluster-node-tuning-operator pod/tuned-z4nwc Pulling image "registry.svc.ci.openshift.org/ocp/4.1-2019-06-12-184448@sha256:b10e53d9095a7da59e1d1e31baca0ca2723854cf971f92050cecf1ec25eb3d22"
Jun 12 19:28:19.310 I ns/openshift-cluster-node-tuning-operator pod/tuned-z4nwc Successfully pulled image "registry.svc.ci.openshift.org/ocp/4.1-2019-06-12-184448@sha256:b10e53d9095a7da59e1d1e31baca0ca2723854cf971f92050cecf1ec25eb3d22"
Jun 12 19:28:19.523 I ns/openshift-cluster-node-tuning-operator pod/tuned-z4nwc Created container tuned
Jun 12 19:28:19.555 I ns/openshift-cluster-node-tuning-operator pod/tuned-z4nwc Started container tuned

Comment 2 Clayton Coleman 2019-06-20 17:57:29 UTC
Suspect this may be the cause of reported secret leaks in the node-tuning namespace (if two version run during an upgrade, they could be conflicting).

Comment 3 Jiří Mencák 2019-06-21 13:51:41 UTC
No, that's not it.  You were right in your original comment though, leader election is not yet implemented by the node-tuning operator.
(In reply to Clayton Coleman from comment #2)
> Suspect this may be the cause of reported secret leaks in the node-tuning
> namespace (if two version run during an upgrade, they could be conflicting).
No, that's not it.  You were right in your original comment though, leader election
is not (yet) implemented by the node-tuning operator.

Comment 4 Jiří Mencák 2019-06-24 15:46:29 UTC
Can no longer reproduce with upstream PR: https://github.com/openshift/cluster-node-tuning-operator/pull/67

Comment 9 Mike Fiedler 2019-07-01 13:24:29 UTC
The 4.1.x PR has not merged.   Moving back to POST.   https://github.com/openshift/cluster-node-tuning-operator/pull/68

Comment 13 Mike Fiedler 2019-07-02 11:37:41 UTC
The build currently attached to the 4.1.4 errata is cluster-node-tuning-operator-container-v4.1.4-201906271212 which does not contain this fix.    This will be picked up in 4.1.5 or if 4.1.4 is rebuilt

Comment 15 Simon 2019-07-02 18:08:08 UTC
Verification with builds:
[1] 4.1.0-0.nightly-2019-06-29-054428
[2] 4.1.0-0.nightly-2019-06-30-214248
[3] 4.1.0-0.nightly-2019-07-02-050624
[4] 4.1.0-0.nightly-2019-07-02-101812

Build [3] and [4] with fix.

I have tried different combinations for update and downgrade.

Last one:

# oc get clusterversions.config.openshift.io 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-07-02-050624   True        False         10m     Cluster version is 4.1.0-0.nightly-2019-07-02-050624

# oc get configmaps -n openshift-cluster-node-tuning-operator
NAME                        DATA   AGE
node-tuning-operator-lock   0      21m
tuned-profiles              1      4h16m
tuned-recommend             1      4h16m

# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812 --force
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812

 # oc get configmaps -n openshift-cluster-node-tuning-operator
NAME                        DATA   AGE
node-tuning-operator-lock   0      42m
tuned-profiles              1      5h5m
tuned-recommend             1      5h5m

# oc get clusterversions.config.openshift.io 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-07-02-101812   True        False         31m     Cluster version is 4.1.0-0.nightly-2019-07-02-101812

Can't reproduce issue any more!

Comment 17 errata-xmlrpc 2019-07-04 09:01:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1635