Bug 1622336

Summary: etcd migrate playbook fail if controllerLeaseTTL has 0s in master-config.yaml
Product: OpenShift Container Platform Reporter: Kenjiro Nakayama <knakayam>
Component: Cluster Version Operator Assignee: Scott Dodson <sdodson>
Status: CLOSED ERRATA QA Contact: Gaoyun Pei <gpei>
Severity: high Docs Contact:
Priority: high    
Version: 3.7.0 CC: aos-bugs, deads, jokerman, mmccomas, sdodson
Target Milestone: ---   
Target Release: 3.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The etcd v2 to v3 migration playbooks improperly attempted to assign a TTL of 0 seconds to certain migrated keys when the environment was previously configured for a 0 second TTL.
Consequence: The v2 to v3 migration would fail.
Fix: The migration playbooks now assign a 1 second TTL to migrated keys when those keys had a 0 second TTL configured.
Result: TTLs are migrated and those keys expire after 1 second. This is effectively a 0 second TTL because the migration happens while the API is offline, so the keys expire before the API comes back online.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-21 11:56:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Kenjiro Nakayama 2018-08-26 12:15:42 UTC
Description of problem:

- When controllerLeaseTTL is set to 0 in master-config.yaml, the task "TASK [etcd : Re-introduce leases (as a replacement for key TTLs)]" fails because "oc adm migrate etcd-ttl --lease-duration 0" is invalid.

Version-Release number of the following components:

  # rpm -qa |grep ansible
  openshift-ansible-playbooks-3.7.61-1.git.0.36791ef.el7.noarch
  ansible-2.4.2.0-2.el7.noarch
  openshift-ansible-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-filter-plugins-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-callback-plugins-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-roles-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-docs-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-lookup-plugins-3.7.61-1.git.0.36791ef.el7.noarch

How reproducible: 100%

Steps to Reproduce:
1. Configure controllerLeaseTTL: 0 in master-config.yaml.

  /etc/origin/master/master-config.yaml
  ```
  controllerLeaseTTL: 0
  ```
2. Run /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml
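
Before running the migration, the reproducing condition can be checked with a short script. The following is only a sketch: `find_controller_lease_ttl` is a hypothetical helper, not part of openshift-ansible, and it assumes the value is written as a plain unquoted integer on a single line.

```python
import re

# Hypothetical helper, not part of openshift-ansible.
# Assumes the setting appears as "controllerLeaseTTL: <integer>" on one line.
_TTL_LINE = re.compile(
    r"^[ \t]*controllerLeaseTTL:[ \t]*(-?\d+)[ \t]*(?:#.*)?$",
    re.MULTILINE,
)


def find_controller_lease_ttl(config_text):
    """Return the configured controllerLeaseTTL as an int, or None if omitted."""
    match = _TTL_LINE.search(config_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "apiVersion: v1\nkind: MasterConfig\ncontrollerLeaseTTL: 0\n"
    ttl = find_controller_lease_ttl(sample)
    if ttl == 0:
        # This is the case reported in this bug.
        print("controllerLeaseTTL is 0: the etcd-ttl task is expected to fail")
```

To check a real host, pass the contents of /etc/origin/master/master-config.yaml instead of the sample string.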

Actual results:

Failed with the following error:

  failed: [xxx.example.com] (item={u'keys': u'/openshift.io/leases/controllers', u'ttl': u'0s'}) => {"changed": true, "cmd": ["oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/mast$
  r.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://<etcdip>:2379", "--ttl-keys-prefix", "<built-in$
  , "method", "keys", "of", "dict", "object", "at", "0x7faff2b984b0>", "--lease-duration", "0s"], "delta": "0:00:00.230322", "end": "2018-08-26 14:43:21.794366", "item": {"keys": "/openshift.io/leases/control$
  ers", "ttl": "0s"}, "msg": "non-zero return code", "rc": 1, "start": "2018-08-26 14:43:21.564044", "stderr": "error: --lease-duration must be at least one second", "stderr_lines": ["error: --lease-duration $
  ust be at least one second"], "stdout": "", "stdout_lines": []}
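
The stderr above shows that the tool rejects a lease duration below one second. As a rough illustration only (this is not the tool's actual implementation), the check can be sketched as follows. `check_lease_duration` is a hypothetical name, and the sketch assumes durations are written in the `<N>s` form seen in the log.

```python
import re


def check_lease_duration(value):
    """Parse a '<N>s' duration and reject anything shorter than one second.

    Hypothetical sketch that mirrors the error message in the log above;
    only whole-second values such as '0s', '1s' or '30s' are handled.
    """
    match = re.fullmatch(r"(\d+)s", value)
    if match is None:
        raise ValueError("unsupported duration format: %r" % value)
    seconds = int(match.group(1))
    if seconds < 1:
        raise ValueError("--lease-duration must be at least one second")
    return seconds


if __name__ == "__main__":
    for candidate in ("30s", "1s", "0s"):
        try:
            print(candidate, "->", check_lease_duration(candidate))
        except ValueError as err:
            print(candidate, "-> error:", err)
```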

Expected results:

The Ansible playbook completes successfully.

Additional info:
- Full logs attached in private

Comment 2 Kenjiro Nakayama 2018-08-26 12:39:49 UTC
Although we skipped the failed task and completed the playbook manually, the customer is eager to know whether we should set the TTL of "/openshift.io/leases/controllers" to 30s (the default in the playbook) or leave it unset.

Could you please advise whether we should run "oadm migrate etcd-ttl --lease-duration 30" for "/openshift.io/leases/controllers" (like [1]), or whether that is unnecessary when controllerLeaseTTL: 0 was set in master-config.yaml?

[1] https://github.com/openshift/openshift-ansible/blob/release-3.6/roles/etcd_migrate/tasks/add_ttls.yml#L11-L33

Comment 3 Kenjiro Nakayama 2018-08-27 02:38:00 UTC
Proposed fix: https://github.com/openshift/openshift-ansible/pull/9768

Comment 4 Scott Dodson 2018-08-27 12:43:04 UTC
David,

Can you comment on the validity of setting the controller lease TTL to 0s?

Comment 5 Kenjiro Nakayama 2018-08-28 00:45:03 UTC
David, Jordan

If we had "controllerLeaseTTL: 0" in master-config.yaml for etcd v2, what is the best TTL for "/openshift.io/leases/controllers" during the migration (etcd v2 -> v3)? Should some default value (30s) be set, or nothing?

Comment 6 David Eads 2018-08-29 15:04:30 UTC
The controllerLeaseTTL is about leader election for a controller (https://github.com/openshift/origin/blob/release-3.7/pkg/cmd/server/api/v1/types.go#L216-L223)  `oc adm migrate etcd-ttl --lease-duration 0` is about migrating from etcd2 to etcd3.  I don't think they're related.  Why does the controllerLeaseTTL affect the `oc adm migrate` call?

Comment 7 Scott Dodson 2018-08-29 18:34:16 UTC
David, 

We were given a list of keys for which we should re-attach TTLs and told to attach TTLs that were in line with the master configuration values that would have affected their original TTLs.

To me, the question is: when we face a TTL configured for 0s, what do we do? Do we skip migrating TTLs for that key? Do we set a TTL of 1s because we need to ensure that some TTL exists to clean up the key?

Comment 8 Scott Dodson 2018-08-30 13:43:32 UTC
Discussed with David,

If we find a configured value of 0s, we should migrate with a 1s TTL so that we can be sure the key expires. Without a TTL being re-attached, there's a chance the key may persist forever when it otherwise should have been removed.

Kenjiro, can you make that so?

Comment 9 Kenjiro Nakayama 2018-08-30 15:03:39 UTC
Thank you. Sure, I can. But please let me confirm one more thing.

Current controllerLeaseTTL could be set "0", omitted or "-1"[1]. "controllerLeaseTTL: -1" should also be migrated to "1", correct?

[1] https://github.com/openshift/origin/blob/release-3.7/pkg/cmd/server/api/v1/types.go#L216-L223
"This value defaults off (0, or omitted) and controller election can be disabled with -1."

Comment 10 Kenjiro Nakayama 2018-09-19 08:22:24 UTC
Scott, David, is there any update for the fix?

Comment 11 Kenjiro Nakayama 2018-09-28 10:08:33 UTC
I'm sorry again, but can we get any update for this?

Comment 12 Scott Dodson 2018-09-28 12:48:17 UTC
(In reply to Kenjiro Nakayama from comment #9)
> Thank you. Sure, I can. But please let me confirm one more thing.
> 
> Current controllerLeaseTTL could be set "0", omitted or "-1"[1].
> "controllerLeaseTTL: -1" should also be migrated to "1", correct?
> 
> [1]
> https://github.com/openshift/origin/blob/release-3.7/pkg/cmd/server/api/v1/
> types.go#L216-L223
> "This value defaults off (0, or omitted) and controller election can be
> disabled with -1."

Yes, if controllerLeaseTTL > 0 migrate with ttl set to controllerLeaseTTL, else migrate with ttl set to 1s.
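
The rule above can be written out as a small function. This is a sketch of the stated rule only, not the code merged in the pull request. `migrated_lease_ttl` is a hypothetical name, and the case where controllerLeaseTTL is omitted from the configuration is deliberately left out.

```python
def migrated_lease_ttl(controller_lease_ttl):
    """Return the TTL string to use when re-attaching the lease.

    Hypothetical sketch of the rule from this comment: a positive
    controllerLeaseTTL is kept as-is; 0 or a negative value (such as -1)
    becomes 1s so that the migrated key is still guaranteed to expire.
    """
    if controller_lease_ttl > 0:
        return "%ds" % controller_lease_ttl
    return "1s"


if __name__ == "__main__":
    for configured in (30, 0, -1):
        print(configured, "->", migrated_lease_ttl(configured))
```

Clamping to 1s rather than skipping the key matches the reasoning in comment 8: some TTL must be attached, otherwise the key could persist indefinitely.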

Comment 13 Kenjiro Nakayama 2018-09-28 14:02:55 UTC
Okay, then is there any reason why https://github.com/openshift/openshift-ansible/pull/9768 is not merged?

Comment 14 Scott Dodson 2018-09-28 16:22:07 UTC
No, merged it.

Comment 16 Gaoyun Pei 2018-10-09 08:25:02 UTC
QE could reproduce this issue with openshift-ansible-3.7.61-1.git.0.36791ef.el7.noarch.rpm

When the master has "controllerLeaseTTL: 0" configured in master-config.yaml, running playbooks/byo/openshift-etcd/migrate.yml to migrate etcd v2 data fails as below:
failed: [ec2-34-229-247-224.compute-1.amazonaws.com] (item={u'keys': u'/openshift.io/leases/controllers', u'ttl': u'0s'}) => {"changed": true, "cmd": ["oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/master.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://172.18.8.214:2379", "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x7ffaafe545c8>", "--lease-duration", "0s"], "delta": "0:00:00.549652", "end": "2018-10-09 03:41:07.641047", "failed": true, "item": {"keys": "/openshift.io/leases/controllers", "ttl": "0s"}, "msg": "non-zero return code", "rc": 1, "start": "2018-10-09 03:41:07.091395", "stderr": "error: --lease-duration must be at least one second", "stderr_lines": ["error: --lease-duration must be at least one second"], "stdout": "", "stdout_lines": []}



Verified this bug with openshift-ansible-3.7.65-1.git.0.de90d64.el7.noarch.rpm: the migration playbook runs without this error and uses a TTL of "1s" for the controller lease key instead.

changed: [ec2-54-157-46-140.compute-1.amazonaws.com] => (item={u'keys': u'/openshift.io/leases/controllers', u'ttl': u'1s'}) => {"changed": true, "cmd": ["oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/master.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://172.18.0.129:2379", "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x7f141881fb40>", "--lease-duration", "1s"], "delta": "0:00:00.502297", "end": "2018-10-09 04:16:05.808228", "failed": false, "item": {"keys": "/openshift.io/leases/controllers", "ttl": "1s"}, "rc": 0, "start": "2018-10-09 04:16:05.305931", "stderr": "", "stderr_lines": [], "stdout": "info: Lease #8195819424583871782 with TTL 4 created\ninfo: Attaching lease to 0 entries", "stdout_lines": ["info: Lease #8195819424583871782 with TTL 4 created", "info: Attaching lease to 0 entries"]}

Comment 18 errata-xmlrpc 2018-11-21 11:56:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2906

Comment 19 Red Hat Bugzilla 2023-09-15 00:11:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days