Bug 1622336 - etcd migrate playbook fail if controllerLeaseTTL has 0s in master-config.yaml
Summary: etcd migrate playbook fail if controllerLeaseTTL has 0s in master-config.yaml
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: Scott Dodson
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-08-26 12:15 UTC by Kenjiro Nakayama
Modified: 2023-09-15 00:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The etcd v2 to v3 migration playbooks attempted to assign a TTL of 0 seconds to certain migrated keys when the environment had been configured with a 0 second TTL.
Consequence: The v2 to v3 migration failed.
Fix: The migration playbooks now assign a 1 second TTL to migrated keys whose configured TTL was 0 seconds.
Result: TTLs are migrated and those keys expire after 1 second. This is effectively a 0 second TTL, because the migration runs while the API is offline and the keys expire before the API comes back online.
Clone Of:
Environment:
Last Closed: 2018-11-21 11:56:23 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1475351 | 0 | unspecified | CLOSED | [3.6] API server results inconsistent after migration to etcdv3 | 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) | 3587251 | 0 | None | None | None | 2018-08-27 06:35:35 UTC
Red Hat Product Errata | RHSA-2018:2906 | 0 | None | None | None | 2018-11-21 11:56:49 UTC

Internal Links: 1475351

Description Kenjiro Nakayama 2018-08-26 12:15:42 UTC
Description of problem:

- When controllerLeaseTTL is set to 0 in master-config.yaml, "TASK [etcd : Re-introduce leases (as a replacement for key TTLs)]" fails because "oc adm migrate etcd-ttl --lease-duration 0" is invalid.

Version-Release number of the following components:

  # rpm -qa |grep ansible
  openshift-ansible-playbooks-3.7.61-1.git.0.36791ef.el7.noarch
  ansible-2.4.2.0-2.el7.noarch
  openshift-ansible-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-filter-plugins-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-callback-plugins-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-roles-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-docs-3.7.61-1.git.0.36791ef.el7.noarch
  openshift-ansible-lookup-plugins-3.7.61-1.git.0.36791ef.el7.noarch

How reproducible: 100%

Steps to Reproduce:
1. Configure controllerLeaseTTL: 0 in master-config.yaml.

  /etc/origin/master/master-config.yaml
  ```
  controllerLeaseTTL: 0
  ```
2. Run /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml
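
Whether a given master-config.yaml would hit this condition can be checked before running the playbook. The following is a minimal sketch and not part of openshift-ansible: the function names are hypothetical, and it assumes controllerLeaseTTL is written as a plain "key: integer" line, as in step 1. It also assumes that an omitted key is harmless because the playbook falls back to its own default.

```python
import re

# Hypothetical pre-flight helper, not part of openshift-ansible.
# Assumes controllerLeaseTTL is written as a simple "key: integer" line.
_TTL_LINE = re.compile(
    r"^[ \t]*controllerLeaseTTL:[ \t]*(-?\d+)[ \t]*(?:#.*)?$",
    re.MULTILINE,
)


def read_controller_lease_ttl(config_text):
    """Return the configured controllerLeaseTTL as an int, or None if omitted."""
    match = _TTL_LINE.search(config_text)
    return int(match.group(1)) if match else None


def would_request_subsecond_lease(config_text):
    """True when the configured value is present and below one second."""
    ttl = read_controller_lease_ttl(config_text)
    return ttl is not None and ttl < 1
```

Reading /etc/origin/master/master-config.yaml into a string and passing it to would_request_subsecond_lease() flags the configuration from step 1.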

Actual results:

Failed with the following error:

  failed: [xxx.example.com] (item={u'keys': u'/openshift.io/leases/controllers', u'ttl': u'0s'}) => {"changed": true, "cmd": ["oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/mast$
  r.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://<etcdip>:2379", "--ttl-keys-prefix", "<built-in$
  , "method", "keys", "of", "dict", "object", "at", "0x7faff2b984b0>", "--lease-duration", "0s"], "delta": "0:00:00.230322", "end": "2018-08-26 14:43:21.794366", "item": {"keys": "/openshift.io/leases/control$
  ers", "ttl": "0s"}, "msg": "non-zero return code", "rc": 1, "start": "2018-08-26 14:43:21.564044", "stderr": "error: --lease-duration must be at least one second", "stderr_lines": ["error: --lease-duration $
  ust be at least one second"], "stdout": "", "stdout_lines": []}
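
The stderr text in this log indicates that the command rejects any lease duration under one second. As a rough illustration of that kind of check (this is not the actual oadm implementation, and the parser is an assumption that only handles the "<integer>s" form seen in the log):

```python
def parse_seconds(duration):
    """Parse strings such as '0s' or '30s' into whole seconds (illustrative only)."""
    number = duration[:-1]
    if not duration.endswith("s") or not number.lstrip("-").isdigit():
        raise ValueError("unsupported duration: %r" % duration)
    return int(number)


def check_lease_duration(duration):
    """Return the duration unchanged, or raise with the message seen in the log."""
    if parse_seconds(duration) < 1:
        raise ValueError("--lease-duration must be at least one second")
    return duration
```

With this sketch, check_lease_duration("0s") raises, which corresponds to the rc 1 result above.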

Expected results:

Completed ansible playbook.

Additional info:
- Full logs attached in private

Comment 2 Kenjiro Nakayama 2018-08-26 12:39:49 UTC
Although we skipped the failed task and completed the playbook manually, the customer is eager to know whether we should set the TTL of "/openshift.io/leases/controllers" to 30s (the playbook default) or leave it unset.

Could you please advise whether we should run "oadm migrate etcd-ttl --lease-duration 30" for "/openshift.io/leases/controllers" (as in [1]), or whether that is unnecessary when controllerLeaseTTL: 0 was set in master-config.yaml?

[1] https://github.com/openshift/openshift-ansible/blob/release-3.6/roles/etcd_migrate/tasks/add_ttls.yml#L11-L33

Comment 3 Kenjiro Nakayama 2018-08-27 02:38:00 UTC
Proposed fix: https://github.com/openshift/openshift-ansible/pull/9768

Comment 4 Scott Dodson 2018-08-27 12:43:04 UTC
David,

Can you comment on the validity of setting the controller lease TTL to 0s?

Comment 5 Kenjiro Nakayama 2018-08-28 00:45:03 UTC
David, Jordan

If we had "controllerLeaseTTL: 0" in master-config.yaml for etcdv2, what is the best TTL for "/openshift.io/leases/controllers" during the migration (etcdv2 -> v3): a default value (30s), or should nothing be set?

Comment 6 David Eads 2018-08-29 15:04:30 UTC
The controllerLeaseTTL is about leader election for a controller (https://github.com/openshift/origin/blob/release-3.7/pkg/cmd/server/api/v1/types.go#L216-L223)  `oc adm migrate etcd-ttl --lease-duration 0` is about migrating from etcd2 to etcd3.  I don't think they're related.  Why does the controllerLeaseTTL affect the `oc adm migrate` call?

Comment 7 Scott Dodson 2018-08-29 18:34:16 UTC
David, 

We were given a list of keys for which we should re-attach TTLs, and we were told to attach TTLs in line with the master configuration values that would have affected their original TTLs.

To me, the question is when we face a TTL configured for 0s what do we do? Do we skip migrating TTLs for that key? Do we set a TTL of 1s because we need to ensure that some TTL exists to clean up the key?

Comment 8 Scott Dodson 2018-08-30 13:43:32 UTC
Discussed with David,

If we find a configured value of 0s we should migrate with a 1s TTL so that we can be sure the key expires. Without a TTL being re-attached there's a chance the key may persist forever when it otherwise should've been removed.

Kenjiro, can you make that so?

Comment 9 Kenjiro Nakayama 2018-08-30 15:03:39 UTC
Thank you. Sure, I can. But please let me confirm one more thing.

Current controllerLeaseTTL could be set "0", omitted or "-1"[1]. "controllerLeaseTTL: -1" should also be migrated to "1", correct?

[1] https://github.com/openshift/origin/blob/release-3.7/pkg/cmd/server/api/v1/types.go#L216-L223
"This value defaults off (0, or omitted) and controller election can be disabled with -1."

Comment 10 Kenjiro Nakayama 2018-09-19 08:22:24 UTC
Scott, David, is there any update for the fix?

Comment 11 Kenjiro Nakayama 2018-09-28 10:08:33 UTC
I'm sorry again, but can we get any update for this?

Comment 12 Scott Dodson 2018-09-28 12:48:17 UTC
(In reply to Kenjiro Nakayama from comment #9)
> Thank you. Sure, I can. But please let me confirm one more thing.
> 
> Current controllerLeaseTTL could be set "0", omitted or "-1"[1].
> "controllerLeaseTTL: -1" should also be migrated to "1", correct?
> 
> [1]
> https://github.com/openshift/origin/blob/release-3.7/pkg/cmd/server/api/v1/
> types.go#L216-L223
> "This value defaults off (0, or omitted) and controller election can be
> disabled with -1."

Yes, if controllerLeaseTTL > 0 migrate with ttl set to controllerLeaseTTL, else migrate with ttl set to 1s.
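
The rule stated in this comment can be sketched as follows. This is only an illustration of the decision, not the change merged in the pull request, and the function names are hypothetical:

```python
def migration_ttl_seconds(controller_lease_ttl):
    """Positive values pass through; 0 and negative values become 1 second."""
    return controller_lease_ttl if controller_lease_ttl > 0 else 1


def lease_duration_arg(controller_lease_ttl):
    """Format the TTL in the '<seconds>s' form used for --lease-duration in the logs."""
    return "%ds" % migration_ttl_seconds(controller_lease_ttl)
```

Under this sketch, lease_duration_arg(0) and lease_duration_arg(-1) both return "1s", while lease_duration_arg(30) returns "30s".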

Comment 13 Kenjiro Nakayama 2018-09-28 14:02:55 UTC
Okay, then is there any reason why https://github.com/openshift/openshift-ansible/pull/9768 is not merged?

Comment 14 Scott Dodson 2018-09-28 16:22:07 UTC
No, merged it.

Comment 16 Gaoyun Pei 2018-10-09 08:25:02 UTC
QE could reproduce this issue with openshift-ansible-3.7.61-1.git.0.36791ef.el7.noarch.rpm

When the master has "controllerLeaseTTL: 0" configured in master-config.yaml, running playbooks/byo/openshift-etcd/migrate.yml to migrate etcd v2 data fails as below:
failed: [ec2-34-229-247-224.compute-1.amazonaws.com] (item={u'keys': u'/openshift.io/leases/controllers', u'ttl': u'0s'}) => {"changed": true, "cmd": ["oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/master.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://172.18.8.214:2379", "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x7ffaafe545c8>", "--lease-duration", "0s"], "delta": "0:00:00.549652", "end": "2018-10-09 03:41:07.641047", "failed": true, "item": {"keys": "/openshift.io/leases/controllers", "ttl": "0s"}, "msg": "non-zero return code", "rc": 1, "start": "2018-10-09 03:41:07.091395", "stderr": "error: --lease-duration must be at least one second", "stderr_lines": ["error: --lease-duration must be at least one second"], "stdout": "", "stdout_lines": []}



Verified this bug with openshift-ansible-3.7.65-1.git.0.de90d64.el7.noarch.rpm: the migration playbook runs without the error and uses a controllerLeaseTTL of "1" instead.

changed: [ec2-54-157-46-140.compute-1.amazonaws.com] => (item={u'keys': u'/openshift.io/leases/controllers', u'ttl': u'1s'}) => {"changed": true, "cmd": ["oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/master.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://172.18.0.129:2379", "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x7f141881fb40>", "--lease-duration", "1s"], "delta": "0:00:00.502297", "end": "2018-10-09 04:16:05.808228", "failed": false, "item": {"keys": "/openshift.io/leases/controllers", "ttl": "1s"}, "rc": 0, "start": "2018-10-09 04:16:05.305931", "stderr": "", "stderr_lines": [], "stdout": "info: Lease #8195819424583871782 with TTL 4 created\ninfo: Attaching lease to 0 entries", "stdout_lines": ["info: Lease #8195819424583871782 with TTL 4 created", "info: Attaching lease to 0 entries"]}

Comment 18 errata-xmlrpc 2018-11-21 11:56:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2906

Comment 19 Red Hat Bugzilla 2023-09-15 00:11:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

