Description of problem:
After a migration to etcdv3, one OpenShift API server in the HA configuration seems to return old data (the other two API servers return current data when queried). The results of this can vary from a user being unable to access their project sporadically to builds randomly being unable to push to the docker registry.

Version-Release number of selected component (if applicable):
OCP v3.6.170
etcd 3.1.9

How reproducible:
100% (4 attempts to date)

Steps to Reproduce:
1. Migrate etcdv2 to v3 on an HA cluster (3 masters/etcd). We used the openshift-ansible playbook: playbooks/byo/openshift-etcd/migrate.yml
2. Create a new project X as a non-admin user.
3. Run oc new-app to create a new application in the project (if you can).

Actual results:
Usually, after creating the project, the non-admin user will get permission errors if they try to list resources within the newly created project. Run 'oc get all' several times as the non-admin user:
$ oc get all -n X
Depending on which master is out of date, the "get all" will return "Forbidden" for at least one of the invocations. The root cause appears to be one server in the API server cluster returning "old" data. e.g. it may not have a record of the user's permissions created for the project.

Expected results:
Consistent results from all API servers.

Additional info:
It has been observed that restarting atomic-openshift-master-api temporarily corrects this condition. However, even after doing so, inconsistent results began to crop up.

Examples of inconsistency captured in current state of free-int:

[root@free-int-master-3c664 ~]# oc project
Using project "scd-1" on server "https://internal.api.free-int.openshift.com:443".

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM          STATUS                               STARTED        DURATION
test15-3   Source    Git@855ab2d   Failed (PushImageToRegistryFailed)   17 hours ago   34s
test15-4   Source    Git@855ab2d   Complete                             16 hours ago   1m15s
test15-5   Source    Git@855ab2d   Complete                             16 hours ago   52s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   14 hours ago   13h42m11s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   14 hours ago   13h42m11s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM          STATUS                               STARTED        DURATION
test15-3   Source    Git@855ab2d   Failed (PushImageToRegistryFailed)   17 hours ago   34s
test15-4   Source    Git@855ab2d   Complete                             16 hours ago   1m15s
test15-5   Source    Git@855ab2d   Complete                             16 hours ago   52s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   14 hours ago   13h42m11s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   14 hours ago   13h42m11s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   14 hours ago   13h42m11s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   14 hours ago   13h42m11s

[root@free-int-master-3c664 ~]# oc get builds
NAME       TYPE      FROM          STATUS                               STARTED        DURATION
test15-3   Source    Git@855ab2d   Failed (PushImageToRegistryFailed)   17 hours ago   34s
test15-4   Source    Git@855ab2d   Complete                             16 hours ago   1m15s
test15-5   Source    Git@855ab2d   Complete                             16 hours ago   52s
In this particular case, yesterday we ran `oc get pods` against each individual API server. Two returned the expected results, while the third returned the same list of pods but with every pod in a 'pending' state rather than running / error as the other two API servers were reporting. At this point we restarted the etcd service on the leader and the API server started returning valid results.
https://github.com/coreos/etcd/issues/8305 upstream issue clayton reported
Created attachment 1304820 [details] api server log Log from affected api server, etcd on the leader was restarted at Tue 2017-07-25 21:31:45 UTC
When we are in this state, what do we see running this from each API server?

etcdctl -w table endpoint --cluster status
--cluster does not appear to be a valid option. Running without on each:

[root@free-int-master-3c664 ~]# etcdctl3 -w table endpoint status
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                  ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://ip-172-31-50-177.ec2.internal:2379 | f8647d77edbb333b | 3.1.9   | 105 MB  | false     |       715 | 1043370179 |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

[root@free-int-master-5470f ~]# etcdctl3 -w table endpoint status
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                  ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://ip-172-31-56-130.ec2.internal:2379 | 46c194b7a9bde0fd | 3.1.9   | 742 MB  | false     |       715 | 1043370500 |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

[root@free-int-master-de987 ~]# etcdctl3 -w table endpoint status
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                  ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://ip-172-31-60-182.ec2.internal:2379 | 6bd52a956766015a | 3.1.9   | 105 MB  | true      |       715 | 1043373103 |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
[root@free-int-master-5470f ~]# etcdctl3 -w table endpoint status --endpoints=ip-172-31-50-177.ec2.internal:2379,ip-172-31-56-130.ec2.internal:2379,ip-172-31-60-182.ec2.internal:2379
2017-07-26 16:33:51.690146 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                  ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://ip-172-31-56-130.ec2.internal:2379 | 46c194b7a9bde0fd | 3.1.9   | 754 MB  | false     |       715 | 1043389842 |
| ip-172-31-50-177.ec2.internal:2379         | f8647d77edbb333b | 3.1.9   | 105 MB  | false     |       715 | 1043389842 |
| ip-172-31-56-130.ec2.internal:2379         | 46c194b7a9bde0fd | 3.1.9   | 754 MB  | false     |       715 | 1043389842 |
| ip-172-31-60-182.ec2.internal:2379         | 6bd52a956766015a | 3.1.9   | 105 MB  | true      |       715 | 1043389842 |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
It looks like the etcd node returning inconsistent data (https://172.31.50.177:2380 in the data below) is returning 'Error: grpc: the client connection is closing' when directly queried with etcdctl3

[root@free-int-master-3c664 ~]# etcdctl3 member list
2017-07-26 19:21:33.618440 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
46c194b7a9bde0fd, started, ip-172-31-56-130.ec2.internal, https://172.31.56.130:2380, https://172.31.56.130:2379
6bd52a956766015a, started, ip-172-31-60-182.ec2.internal, https://172.31.60.182:2380, https://172.31.60.182:2379
f8647d77edbb333b, started, ip-172-31-50-177.ec2.internal, https://172.31.50.177:2380, https://172.31.50.177:2379

[root@free-int-master-3c664 ~]# etcdctl3 get /openshift.io/builds/scd-1 --prefix --keys-only --endpoints=https://172.31.56.130:2380
2017-07-26 19:21:55.719966 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
/openshift.io/builds/scd-1/test15-3
/openshift.io/builds/scd-1/test15-4
/openshift.io/builds/scd-1/test15-5

[root@free-int-master-3c664 ~]# etcdctl3 get /openshift.io/builds/scd-1 --prefix --keys-only --endpoints=https://172.31.60.182:2380
2017-07-26 19:22:13.598223 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
/openshift.io/builds/scd-1/test15-3
/openshift.io/builds/scd-1/test15-4
/openshift.io/builds/scd-1/test15-5

[root@free-int-master-3c664 ~]# etcdctl3 get /openshift.io/builds/scd-1 --prefix --keys-only --endpoints=https://172.31.50.177:2380
2017-07-26 19:22:21.980899 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
Error: grpc: the client connection is closing

[root@free-int-master-3c664 ~]# oc get builds --server=https://172.31.60.182
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   20 hours ago   19h42m11s

[root@free-int-master-3c664 ~]# oc get builds --server=https://172.31.56.130
NAME       TYPE      FROM         STATUS                        STARTED        DURATION
test15-5   Source    Git@master   Failed (GenericBuildFailed)   20 hours ago   19h42m11s

[root@free-int-master-3c664 ~]# oc get builds --server=https://172.31.50.177
NAME       TYPE      FROM          STATUS                               STARTED        DURATION
test15-3   Source    Git@855ab2d   Failed (PushImageToRegistryFailed)   23 hours ago   34s
test15-4   Source    Git@855ab2d   Complete                             22 hours ago   1m15s
test15-5   Source    Git@855ab2d   Complete                             22 hours ago   52s
Observations:
- Database size increasing on *only* one server over time
- Leader staying consistent
- All nodes reporting healthy
- Raft index increasing in sync

[root@free-int-master-de987 ~]# etcdctl3 endpoint status -w table --endpoints=https://ip-172-31-60-182.ec2.internal:2379,https://ip-172-31-56-130.ec2.internal:2379,https://ip-172-31-50-177.ec2.internal:2379
2017-07-26 20:13:59.841880 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                  ENDPOINT                  |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://ip-172-31-60-182.ec2.internal:2379 | 6bd52a956766015a | 3.1.9   | 105 MB  | true      |       715 | 1043474524 |
| https://ip-172-31-56-130.ec2.internal:2379 | 46c194b7a9bde0fd | 3.1.9   | 935 MB  | false     |       715 | 1043474524 |
| https://ip-172-31-50-177.ec2.internal:2379 | f8647d77edbb333b | 3.1.9   | 105 MB  | false     |       715 | 1043474524 |
+--------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
Another correlation: The instance with the growing DB is the instance that returns the different results:

[root@free-int-master-5470f ~]# etcdctl3 get /openshift.io/builds/scd-1 --prefix --keys-only --endpoints=https://ip-172-31-56-130.ec2.internal:2379
2017-07-26 20:18:42.292502 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
/openshift.io/builds/scd-1/test15-5

[root@free-int-master-de987 ~]# etcdctl3 get /openshift.io/builds/scd-1 --prefix --keys-only --endpoints=https://ip-172-31-60-182.ec2.internal:2379
2017-07-26 20:18:49.607159 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
/openshift.io/builds/scd-1/test15-3
/openshift.io/builds/scd-1/test15-4
/openshift.io/builds/scd-1/test15-5

[root@free-int-master-3c664 ~]# etcdctl3 get /openshift.io/builds/scd-1 --prefix --keys-only --endpoints=https://ip-172-31-50-177.ec2.internal:2379
2017-07-26 20:18:35.608056 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
/openshift.io/builds/scd-1/test15-3
/openshift.io/builds/scd-1/test15-4
/openshift.io/builds/scd-1/test15-5

Using --consistency=s does not change this output.
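For anyone repeating this comparison on a larger prefix, here is a minimal sketch of the same check in Python. It assumes the key lists have already been collected per endpoint (for example by saving the `get --prefix --keys-only` output shown above); the function name `find_divergent_members` and the short host labels are made up for illustration.

```python
from collections import Counter


def find_divergent_members(keys_by_endpoint):
    """Return {endpoint: keys that differ from the majority view}.

    The "majority view" is simply the key set reported by the most
    endpoints; with an even split the first one seen wins, so treat the
    result as a hint, not a verdict.
    """
    as_sets = {ep: frozenset(keys) for ep, keys in keys_by_endpoint.items()}
    majority, _ = Counter(as_sets.values()).most_common(1)[0]
    return {
        ep: sorted(majority ^ keys)  # symmetric difference: missing or extra keys
        for ep, keys in as_sets.items()
        if keys != majority
    }


# Key listings copied from the three etcdctl3 invocations above.
observed = {
    "ip-172-31-56-130": ["/openshift.io/builds/scd-1/test15-5"],
    "ip-172-31-60-182": [
        "/openshift.io/builds/scd-1/test15-3",
        "/openshift.io/builds/scd-1/test15-4",
        "/openshift.io/builds/scd-1/test15-5",
    ],
    "ip-172-31-50-177": [
        "/openshift.io/builds/scd-1/test15-3",
        "/openshift.io/builds/scd-1/test15-4",
        "/openshift.io/builds/scd-1/test15-5",
    ],
}

divergent = find_divergent_members(observed)
print(divergent)
```

With the listings above, only ip-172-31-56-130 is flagged, which is also the member whose DB size keeps growing.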
Experiments:
- Shut down all openshift components (api, controllers, node) on all openshift masters.
  Observed: still bad results / still inconsistent database sizes
- Restarted etcd on ip-172-31-56-130.
  Observed: still bad results / still inconsistent database sizes
- Stopped etcd on all masters and then started them up again.
  Observed: leader changed to ip-172-31-56-130 / still bad results / still inconsistent database sizes
- Created a key from each etcd node, specifying a particular endpoint and a unique key name. ip-172-31-56-130 was the leader at the time.
  Observed: gets using each respective endpoint returned all expected keys.
- Restarted etcd on ip-172-31-56-130 to force a leader change. ip-172-31-50-177 is now the leader.
  Observed: gets still returned expected keys created in previous steps.
- Created three new keys with ip-172-31-50-177 as leader, using each respective endpoint for the put operation.
  Observed: puts succeeded and gets from all endpoints returned all expected keys.
related? https://github.com/coreos/etcd/issues/8214
In the free-int environment, we restored v2 data from backup and re-ran the migration, capturing the following:
* v2 keys from all members prior to migrate
* v3 keys from all members post migrate
* hashes of all v3 data from all members post migrate

All members returned identical keys to each other prior to migrate.
All members returned identical keys to each other post migrate.
The hash of all data from each member matched post migrate.

Next step is to see if using the free-int environment post-migration triggers the same condition.
Presently unable to build in free-int environment. Builds stay in "Pending" due to "Error syncing pod".
Our process for migrating an HA etcd cluster did not work when there was actively expiring TTL data in the store.

Despite stopping all writers (controllers/apiservers), and ensuring all etcd members were at the same raft index, TTL data (events, tokens, leases, etc) continues expiring in the etcd v2 stores. That means when we shut down the etcd members, their data stores are likely to be inconsistent.

Our migration process ran migrate on each etcd member's data store, moving that inconsistent data into the mvcc store.

We then started up the etcd members, and ran a tool to re-establish TTL leases on the TTL keys. That would query one of the etcd members for keys, and run a transaction to assign each one a TTL lease:

    txnResp, err := client.KV.Txn(ctx).If(
        clientv3.Compare(clientv3.ModRevision(string(kv.Key)), "=", kv.ModRevision),
    ).Then(
        clientv3.OpPut(string(kv.Key), string(kv.Value), clientv3.WithLease(lease.ID)),
    ).Commit()

If this transaction was accepted by the leader, it had the potential to be applied to the mvcc store on some members and not others. If the key was present on one member, it would be applied, increasing the mvcc revision. If it was missing on another member, it would not be applied. In this way, the mvcc revision could get drastically out of sync for the same raft index among the cluster members. On a cluster with thousands of events, we saw mvcc revisions off by 1000 or more.

Because the mvcc revision is used as the resourceVersion by kubernetes, mismatches between cluster members break things like watch... you could list from one etcd member and get a resourceVersion of 1000, then ask to watch from another etcd member who thought the store was still at 500.
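To make the drift concrete, here is a toy model in Python. It is not etcd code: the `Member` class, the starting revision of 1, and the key names are all invented for illustration, and real mod revisions after a migrate would not all be equal. It only mimics the one property that matters here: every member applies the same replicated transaction against its own local store, and a failed compare writes nothing and does not advance the revision.

```python
class Member:
    """Toy stand-in for one etcd member's mvcc store (hypothetical model)."""

    def __init__(self, name, keys):
        self.name = name
        self.revision = 1                      # store-wide mvcc revision
        self.mod_rev = {k: 1 for k in keys}    # key -> ModRevision

    def apply_lease_txn(self, key, expected_mod_rev):
        """If ModRevision(key) == expected_mod_rev, then Put(key)."""
        # A missing key compares as ModRevision 0, so the guard fails.
        if self.mod_rev.get(key, 0) != expected_mod_rev:
            return False                       # nothing written, revision unchanged
        self.revision += 1
        self.mod_rev[key] = self.revision
        return True


events = [f"/kubernetes.io/events/e{i}" for i in range(5)]

# etcd3 was stopped slightly later, so three TTL keys had already expired
# in its v2 store before the per-member migrate ran.
etcd1 = Member("etcd1", events)
etcd2 = Member("etcd2", events)
etcd3 = Member("etcd3", events[:2])
cluster = [etcd1, etcd2, etcd3]

# The lease tool lists keys from ONE member, then each transaction is
# replicated through raft and applied by every member independently.
for key, mod_rev in list(etcd1.mod_rev.items()):
    for member in cluster:
        member.apply_lease_txn(key, mod_rev)

revisions = {m.name: m.revision for m in cluster}
print(revisions)  # {'etcd1': 6, 'etcd2': 6, 'etcd3': 3}
```

All three members processed the same five raft entries, yet etcd3 ends up three revisions behind, which is the "same raft index, different mvcc revision" state described above.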
We intend to amend our migration process so that we:
1. Stop api/controllers
2. Back up etcd1, etcd2, etcd3
3. Stop etcd1, etcd2, etcd3
4. Migrate etcd1
5. Purge /var/lib/etcd/member on etcd2 and etcd3
6. Start etcd1 as a new cluster
7. member add etcd2
8. Start etcd2
9. Verify cluster is healthy
10. member add etcd3
11. Start etcd3
12. Verify cluster is healthy
13. Perform TTL migration
14. Remove v2 keys
15. Reconfig masters
16. Start api/controllers
Prior to performing the TTL migration, we should ensure the v3 stores are consistent by running this on each member (the hash, revision, and totalKey fields should all match):

ETCDCTL_API=3 etcdctl snapshot status -w json /path/to/snap/db

And this on each endpoint (the Status.raftIndex and Status.header.revision from each endpoint should match):

ETCDCTL_API=3 etcdctl endpoint status -w json
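The comparison itself is easy to script. Below is a minimal Python sketch that assumes the JSON from the two commands above has already been captured per member; the sample values are invented, and only the field names mentioned in this comment (hash, revision, totalKey, Status.raftIndex, Status.header.revision) are relied on. The helper names `dig` and `mismatched_fields` are hypothetical.

```python
import json


def dig(obj, dotted_path):
    """Follow a dotted path such as 'Status.header.revision' into nested dicts."""
    for part in dotted_path.split("."):
        obj = obj[part]
    return obj


def mismatched_fields(reports, fields):
    """Return {field: {member: value}} for every field that differs between members."""
    mismatches = {}
    for field in fields:
        values = {member: dig(report, field) for member, report in reports.items()}
        if len(set(values.values())) > 1:
            mismatches[field] = values
    return mismatches


# Hypothetical captured output of `snapshot status -w json`, one per member.
snapshots = {
    "etcd1": json.loads('{"hash": 3735928559, "revision": 1500, "totalKey": 9000}'),
    "etcd2": json.loads('{"hash": 3735928559, "revision": 1500, "totalKey": 9000}'),
    "etcd3": json.loads('{"hash": 3735928559, "revision": 1500, "totalKey": 9000}'),
}

# Hypothetical captured output of `endpoint status -w json`, reduced to one
# entry per member; etcd3 is deliberately behind on revision.
endpoints = {
    "etcd1": json.loads('{"Status": {"header": {"revision": 1500}, "raftIndex": 42}}'),
    "etcd2": json.loads('{"Status": {"header": {"revision": 1500}, "raftIndex": 42}}'),
    "etcd3": json.loads('{"Status": {"header": {"revision": 500}, "raftIndex": 42}}'),
}

snapshot_problems = mismatched_fields(snapshots, ["hash", "revision", "totalKey"])
endpoint_problems = mismatched_fields(
    endpoints, ["Status.raftIndex", "Status.header.revision"]
)
print(snapshot_problems)
print(endpoint_problems)
```

An empty result for both calls is the precondition for starting the TTL migration; any non-empty result names the field and shows each member's value.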
The etcd migrate is also not re-establishing TTLs on all keys that had TTLs in v2.

Additional prefixes that use leases:
/openshift.io/leases/controllers set to controllerLeaseTTL
/openshift.io/oauth/accesstokens set to oauthConfig.tokenConfig.accessTokenMaxAgeSeconds
/openshift.io/oauth/authorizetokens set to oauthConfig.tokenConfig.authorizeTokenMaxAgeSeconds

/kubernetes.io/masterleases should not be set to a 1 hour TTL... it is fixed to a 10 second TTL currently.
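As a rough illustration of the resulting prefix-to-lease table, here is a Python sketch. The function name `lease_table` and the shape of the `master_config` dict are assumptions; the fallback values (86400, 500, 30) and the 1h events duration are the ones that show up in the migrate log later in this bug, not values verified against the product defaults.

```python
def lease_table(master_config):
    """Build (key prefix, lease duration) pairs for the TTL migration.

    master_config is assumed to be the parsed master config as a dict;
    missing settings fall back to the values seen in the migrate log.
    """
    token_config = master_config.get("oauthConfig", {}).get("tokenConfig", {})
    access_ttl = token_config.get("accessTokenMaxAgeSeconds", 86400)
    authorize_ttl = token_config.get("authorizeTokenMaxAgeSeconds", 500)
    controller_ttl = master_config.get("controllerLeaseTTL", 30)
    return [
        ("/kubernetes.io/events", "1h"),
        ("/kubernetes.io/masterleases", "10s"),  # fixed, not 1h
        ("/openshift.io/oauth/accesstokens", f"{access_ttl}s"),
        ("/openshift.io/oauth/authorizetokens", f"{authorize_ttl}s"),
        ("/openshift.io/leases/controllers", f"{controller_ttl}s"),
    ]


defaults = dict(lease_table({}))
custom = dict(
    lease_table({"oauthConfig": {"tokenConfig": {"accessTokenMaxAgeSeconds": 3600}}})
)
print(defaults)
print(custom["/openshift.io/oauth/accesstokens"])
```

Keeping the table in one place makes it obvious when a prefix is missing or carries the wrong duration.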
Reassigning to Jordan since he is the active dev on this issue. Trying to take pressure off Derek's ever growing list. Feel free to adjust the severity if this situation has evolved.
cause is known, remaining work is in the ansible migration task
What are the correct steps to migrate a clustered etcd? Should we re-establish TTL leases on only one etcd member and wait until the data is synced across the cluster?
The process outlined in https://bugzilla.redhat.com/show_bug.cgi?id=1475351#c17 will work.

The important change is to only migrate one member, then rejoin the other two as if they were new members and let them obtain the current data from the one migrated member.
Jordan, as you mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1475351#c19, three TTLs are absent and /kubernetes.io/masterleases should not be set to 1 hour. What is the default TTL for the different keys? Is there a dictionary?
Anping, Here's my pull request that implements comment 17. I'd like to clean this up but this works at least for 3 node clusters. Please feel free to test it. https://github.com/openshift/openshift-ansible/pull/4980 It should be attaching leases per comment 19 now, it will pull the values from config files if they're set or it will use the defaults.
PR updated with the latest implementation just waiting for review.
1. migrate works on clustered rpm etcd env
2. migrate failed on clustered containerized etcd Env

TASK [nickhammond.logrotate : nickhammond.logrotate | Setup logrotate.d scripts] ***

RUNNING HANDLER [etcd : restart etcd] ******************************************
skipping: [qe-anlioloh-master-etcd-zone2-1.fixed-001.qe.rhcloud.com] => {
    "changed": false,
    "skip_reason": "Conditional check failed",
    "skipped": true
}

TASK [Verify cluster is stable] ************************************************
fatal: [qe-anlioloh-master-etcd-zone2-1.fixed-001.qe.rhcloud.com]: FAILED! => {
    "changed": true,
    "cmd": [
        "/usr/bin/etcdctl",
        "--cert-file",
        "/etc/etcd/peer.crt",
        "--key-file",
        "/etc/etcd/peer.key",
        "--ca-file",
        "/etc/etcd/ca.crt",
        "-C",
        "https://qe-anlioloh-master-etcd-zone1-1:2379",
        "cluster-health"
    ],
    "delta": "0:00:00.051522",
    "end": "2017-08-25 07:39:54.817599",
    "failed": true,
    "rc": 5,
    "start": "2017-08-25 07:39:54.766077",
    "warnings": []
}

STDOUT:

member 939e401bfb987941 is unhealthy: got unhealthy result from https://10.240.0.24:2379
member de5cd42de39d68f4 is unreachable: no available published client urls
cluster is unhealthy

STDERR:

2017-08-25 07:39:54.785019 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-08-25 07:39:54.785697 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated

journal log on the failure members

Aug 25 05:37:49 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:37:49.142335 I | raft: raft.node: 6a51b64f562ca241 elected leader 16db1390acd1368b at term 2
Aug 25 05:37:49 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:37:49.151729 I | etcdserver: published {Name:qe-anlioloh-master-etcd-zone2-1 ClientURLs:[https:
Aug 25 05:37:49 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:37:49.151791 I | embed: ready to serve client requests
Aug 25 05:37:49 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:37:49.152138 I | embed: serving client requests on 10.240.0.25:2379
Aug 25 05:37:49 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:37:49.168541 N | etcdserver/membership: set the initial cluster version to 3.1
Aug 25 05:37:49 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:37:49.168599 I | etcdserver/api: enabled capabilities for version 3.1
Aug 25 05:46:44 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:44.401059 W | rafthttp: lost the TCP streaming connection with peer 939e401bfb987941 (stream
Aug 25 05:46:44 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:44.401128 W | rafthttp: lost the TCP streaming connection with peer 939e401bfb987941 (stream
Aug 25 05:46:44 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:44.514002 E | rafthttp: failed to dial 939e401bfb987941 on stream Message (dial tcp 10.240.0
Aug 25 05:46:44 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:44.514021 I | rafthttp: peer 939e401bfb987941 became inactive
Aug 25 05:46:47 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:47.332197 I | rafthttp: peer 939e401bfb987941 became active
Aug 25 05:46:47 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:47.332237 W | rafthttp: closed an existing TCP streaming connection with peer 939e401bfb9879
Aug 25 05:46:47 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:47.332247 I | rafthttp: established a TCP streaming connection with peer 939e401bfb987941 (s
Aug 25 05:46:47 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:47.341355 W | rafthttp: closed an existing TCP streaming connection with peer 939e401bfb9879
Aug 25 05:46:47 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:47.341379 I | rafthttp: established a TCP streaming connection with peer 939e401bfb987941 (s
Aug 25 05:46:47 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:47.360488 I | rafthttp: established a TCP streaming connection with peer 939e401bfb987941 (s
Aug 25 05:46:47 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:46:47.361072 I | rafthttp: established a TCP streaming connection with peer 939e401bfb987941 (s
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal systemd[1]: Stopping The Etcd Server container...
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.161750 N | pkg/osutil: received terminated signal, shutting down...
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.162283 I | etcdserver: skipped leadership transfer for stopping non-leader member
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.162379 I | rafthttp: stopping peer 16db1390acd1368b...
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.162711 I | rafthttp: closed the TCP streaming connection with peer 16db1390acd1368b (stre
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.162720 I | rafthttp: stopped streaming with peer 16db1390acd1368b (writer)
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164008 I | rafthttp: closed the TCP streaming connection with peer 16db1390acd1368b (stre
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164019 I | rafthttp: stopped streaming with peer 16db1390acd1368b (writer)
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164067 I | rafthttp: stopped HTTP pipelining with peer 16db1390acd1368b
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164163 W | rafthttp: lost the TCP streaming connection with peer 16db1390acd1368b (stream
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164179 E | rafthttp: failed to read 16db1390acd1368b on stream MsgApp v2 (net/http: reque
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164184 I | rafthttp: peer 16db1390acd1368b became inactive
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164191 I | rafthttp: stopped streaming with peer 16db1390acd1368b (stream MsgApp v2 reade
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164296 W | rafthttp: lost the TCP streaming connection with peer 16db1390acd1368b (stream
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164344 I | rafthttp: stopped streaming with peer 16db1390acd1368b (stream Message reader)
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164361 I | rafthttp: stopped peer 16db1390acd1368b
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.164373 I | rafthttp: stopping peer 939e401bfb987941...
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165024 I | rafthttp: closed the TCP streaming connection with peer 939e401bfb987941 (stre
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165046 I | rafthttp: stopped streaming with peer 939e401bfb987941 (writer)
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165380 I | rafthttp: closed the TCP streaming connection with peer 939e401bfb987941 (stre
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165396 I | rafthttp: stopped streaming with peer 939e401bfb987941 (writer)
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165417 I | rafthttp: stopped HTTP pipelining with peer 939e401bfb987941
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165496 W | rafthttp: lost the TCP streaming connection with peer 939e401bfb987941 (stream
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165508 E | rafthttp: failed to read 939e401bfb987941 on stream MsgApp v2 (net/http: reque
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165512 I | rafthttp: peer 939e401bfb987941 became inactive
Aug 25 05:49:02 qe-anlioloh-master-etcd-zone2-1.c.openshift-gce-devel.internal etcd_container[14941]: 2017-08-25 09:49:02.165520 I | rafthttp: stopped streaming with peer 939e401bfb987941 (stream MsgApp v2 reade
Created attachment 1318151 [details] Part of ansible logs

Will provide more detailed logs when I reproduce it.
Created attachment 1318263 [details] migrate log and inventory file
https://github.com/openshift/openshift-ansible/pull/5229 proposed fix

I need to test this some more, and then I'll backport to release-3.6 and produce new builds tonight.
openshift-ansible-3.6.173.0.19-2.git.0.eb719a4.el7 should fix that
The etcd scaleup still failed on the second etcd member. I think we shouldn't add “quotation marks” for ETCD_INITIAL_CLUSTER and ETCD_DEBUG.

etcdctl3 member list
2017-08-28 07:01:48.242808 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
5e28099eb201, started, qe-anlimotr-master-etcd-zone1-1, https://10.240.0.7:2380, https://10.240.0.7:2379
aaa0eb2ebbef33de, unstarted, , https://10.240.0.8:2380,

ETCD_NAME=qe-anlimotr-master-etcd-zone2-1
ETCD_LISTEN_PEER_URLS=https://10.240.0.8:2380
ETCD_DATA_DIR=/var/lib/etcd/
ETCD_HEARTBEAT_INTERVAL=500
ETCD_ELECTION_TIMEOUT=2500
ETCD_LISTEN_CLIENT_URLS=https://10.240.0.8:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.240.0.8:2380
ETCD_INITIAL_CLUSTER="qe-anlimotr-master-etcd-zone1-1=https://10.240.0.7:2380,qe-anlimotr-master-etcd-zone2-1=https://10.240.0.8:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
ETCD_ADVERTISE_CLIENT_URLS=https://10.240.0.8:2379
ETCD_CA_FILE=/etc/etcd/ca.crt
ETCD_CERT_FILE=/etc/etcd/server.crt
ETCD_KEY_FILE=/etc/etcd/server.key
ETCD_PEER_CA_FILE=/etc/etcd/ca.crt
ETCD_PEER_CERT_FILE=/etc/etcd/peer.crt
ETCD_PEER_KEY_FILE=/etc/etcd/peer.key
ETCD_DEBUG="False"
Created attachment 1319057 [details] The inventory file, migrate log and etcd journal file
(In reply to Anping Li from comment #33)
> The etcd scaleup still failed on the second etcd member. I think we
> shouldn't add “quotation marks” for ETCD_INITIAL_CLUSTER and ETCD_DEBUG.

Yeah, that's what my PR fixed. Can you confirm which version of openshift-ansible you used?
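For context on why the quotes can matter, here is a small Python sketch contrasting two ways an environment file can be read. Which of the two the containerized etcd unit actually uses is an assumption on my part and is not stated in this bug; the point is only that a reader that takes everything after `=` literally will hand etcd a value that still contains the quote characters, while a shell-style reader strips them. Both parser functions are hypothetical helpers.

```python
import shlex


def parse_env_literal(text):
    """KEY=VALUE per line; the value is kept exactly as written, quotes included."""
    env = {}
    for line in text.splitlines():
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key] = value
    return env


def parse_env_shell_like(text):
    """KEY=VALUE per line, with shell-style quote removal applied to the value."""
    env = {}
    for line in text.splitlines():
        for word in shlex.split(line):
            key, _, value = word.partition("=")
            env[key] = value
    return env


# Two of the settings from the etcd.conf posted above.
etcd_conf = (
    'ETCD_INITIAL_CLUSTER_STATE=existing\n'
    'ETCD_DEBUG="False"\n'
)

literal = parse_env_literal(etcd_conf)
shell_like = parse_env_shell_like(etcd_conf)
print(repr(literal["ETCD_DEBUG"]))     # '"False"'
print(repr(shell_like["ETCD_DEBUG"]))  # 'False'
```

Unquoted values such as ETCD_INITIAL_CLUSTER_STATE come out the same either way, which matches the observation that only the quoted settings are suspect.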
Tests pass with openshift3/ose-ansible:v3.6.173.0.21

1) Single RPM master/etcd                                      Pass
2) Single containerized master/etcd                            Pass
3) Clustered RPM master/etcd                                   Pass
4) Clustered containerized master/etcd                         Pass
5) Clustered Atomic master/etcd                                Pass
6) Single master with external clustered containerized etcd    Fail
7) Single master with external clustered RPM etcd              ongoing

Note: master/etcd = master and etcd are located on the same host.
6) Single master with external clustered containerized etcd: Fail

### Inventory file ###
[masters]
qe-anliayjy-master-1.0829-brg.qe.rhcloud.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=qe-anliayjy-master-1.0829-brg.qe.rhcloud.com openshift_hostname=qe-anliayjy-master-1

[nodes]
qe-anliayjy-master-1.0829-brg.qe.rhcloud.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=qe-anliayjy-master-1.0829-brg.qe.rhcloud.com openshift_hostname=qe-anliayjy-master-1 openshift_node_labels="{'role': 'node'}" openshift_schedulable=true
qe-anliayjy-node-registry-router-1.0829-brg.qe.rhcloud.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=qe-anliayjy-node-registry-router-1.0829-brg.qe.rhcloud.com openshift_hostname=qe-anliayjy-node-registry-router-1 openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}"

[etcd]
qe-anliayjy-etcd-1.0829-brg.qe.rhcloud.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=qe-anliayjy-etcd-1.0829-brg.qe.rhcloud.com openshift_hostname=qe-anliayjy-etcd-1
qe-anliayjy-etcd-2.0829-brg.qe.rhcloud.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=qe-anliayjy-etcd-2.0829-brg.qe.rhcloud.com openshift_hostname=qe-anliayjy-etcd-2
qe-anliayjy-etcd-3.0829-brg.qe.rhcloud.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=qe-anliayjy-etcd-3.0829-brg.qe.rhcloud.com openshift_hostname=qe-anliayjy-etcd-3

########## migrate log ##########
TASK [etcd_migrate : set_fact] *************************************************
ok: [qe-anliayjy-master-1.0829-brg.qe.rhcloud.com] => {
    "ansible_facts": {
        "accessTokenMaxAgeSeconds": "86400",
        "authroizeTokenMaxAgeSeconds": "500",
        "controllerLeaseTTL": "30"
    },
    "changed": false
}

TASK [etcd_migrate : Re-introduce leases (as a replacement for key TTLs)] ******
failed: [qe-anliayjy-master-1.0829-brg.qe.rhcloud.com] (item={u'keys': u'/kubernetes.io/events', u'ttl': u'1h'}) => {
    "changed": true,
    "cmd": [
        "oadm", "migrate", "etcd-ttl",
        "--cert", "/etc/origin/master/master.etcd-client.crt",
        "--key", "/etc/origin/master/master.etcd-client.key",
        "--cacert", "/etc/origin/master/master.etcd-ca.crt",
        "--etcd-address", "https://10.240.0.41:2379",
        "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x3cab9d0>",
        "--lease-duration", "1h"
    ],
    "delta": "0:00:00.177166",
    "end": "2017-08-29 04:25:45.610543",
    "failed": true,
    "item": {
        "keys": "/kubernetes.io/events",
        "ttl": "1h"
    },
    "rc": 1,
    "start": "2017-08-29 04:25:45.433377",
    "warnings": []
}

STDERR:

Error: unknown flag: --cert

Usage:
  oadm migrate [options]

Available Commands:
  image-references Update embedded Docker image references
  storage          Update the stored version of API objects

Use "oadm <command> --help" for more information about a given command.
Use "oadm options" for a list of global command-line options (applies to all commands).

failed: [qe-anliayjy-master-1.0829-brg.qe.rhcloud.com] (item={u'keys': u'/kubernetes.io/masterleases', u'ttl': u'10s'}) => {
    "changed": true,
    "cmd": [
        "oadm", "migrate", "etcd-ttl",
        "--cert", "/etc/origin/master/master.etcd-client.crt",
        "--key", "/etc/origin/master/master.etcd-client.key",
        "--cacert", "/etc/origin/master/master.etcd-ca.crt",
        "--etcd-address", "https://10.240.0.41:2379",
        "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x3cb5fa0>",
        "--lease-duration", "10s"
    ],
    "delta": "0:00:00.189811",
    "end": "2017-08-29 04:25:46.859481",
    "failed": true,
    "item": {
        "keys": "/kubernetes.io/masterleases",
        "ttl": "10s"
    },
    "rc": 1,
    "start": "2017-08-29 04:25:46.669670",
    "warnings": []
}

STDERR:

Error: unknown flag: --cert

Usage:
  oadm migrate [options]

Available Commands:
  image-references Update embedded Docker image references
  storage          Update the stored version of API objects

Use "oadm <command> --help" for more information about a given command.
Use "oadm options" for a list of global command-line options (applies to all commands).

failed: [qe-anliayjy-master-1.0829-brg.qe.rhcloud.com] (item={u'keys': u'/openshift.io/oauth/accesstokens', u'ttl': u'86400s'}) => {
    "changed": true,
    "cmd": [
        "oadm", "migrate", "etcd-ttl",
        "--cert", "/etc/origin/master/master.etcd-client.crt",
        "--key", "/etc/origin/master/master.etcd-client.key",
        "--cacert", "/etc/origin/master/master.etcd-ca.crt",
        "--etcd-address", "https://10.240.0.41:2379",
        "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x391fd60>",
        "--lease-duration", "86400s"
    ],
    "delta": "0:00:00.191488",
    "end": "2017-08-29 04:25:48.115620",
    "failed": true,
    "item": {
        "keys": "/openshift.io/oauth/accesstokens",
        "ttl": "86400s"
    },
    "rc": 1,
    "start": "2017-08-29 04:25:47.924132",
    "warnings": []
}

STDERR:

Error: unknown flag: --cert

Usage:
  oadm migrate [options]

Available Commands:
  image-references Update embedded Docker image references
  storage          Update the stored version of API objects

Use "oadm <command> --help" for more information about a given command.
Use "oadm options" for a list of global command-line options (applies to all commands).
failed: [qe-anliayjy-master-1.0829-brg.qe.rhcloud.com] (item={u'keys': u'/openshift.io/oauth/authorizetokens', u'ttl': u'500s'}) => { "changed": true, "cmd": [ "oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/master.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://10.240.0.41:2379", "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x3ee3af0>", "--lease-duration", "500s" ], "delta": "0:00:00.166911", "end": "2017-08-29 04:25:49.356639", "failed": true, "item": { "keys": "/openshift.io/oauth/authorizetokens", "ttl": "500s" }, "rc": 1, "start": "2017-08-29 04:25:49.189728", "warnings": [] } STDERR: Error: unknown flag: --cert Usage: oadm migrate [options] Available Commands: image-references Update embedded Docker image references storage Update the stored version of API objects Use "oadm <command> --help" for more information about a given command. Use "oadm options" for a list of global command-line options (applies to all commands). 
failed: [qe-anliayjy-master-1.0829-brg.qe.rhcloud.com] (item={u'keys': u'/openshift.io/leases/controllers', u'ttl': u'30s'}) => { "changed": true, "cmd": [ "oadm", "migrate", "etcd-ttl", "--cert", "/etc/origin/master/master.etcd-client.crt", "--key", "/etc/origin/master/master.etcd-client.key", "--cacert", "/etc/origin/master/master.etcd-ca.crt", "--etcd-address", "https://10.240.0.41:2379", "--ttl-keys-prefix", "<built-in", "method", "keys", "of", "dict", "object", "at", "0x1df1c10>", "--lease-duration", "30s" ], "delta": "0:00:00.177770", "end": "2017-08-29 04:25:50.593262", "failed": true, "item": { "keys": "/openshift.io/leases/controllers", "ttl": "30s" }, "rc": 1, "start": "2017-08-29 04:25:50.415492", "warnings": [] } STDERR: Error: unknown flag: --cert Usage: oadm migrate [options] Available Commands: image-references Update embedded Docker image references storage Update the stored version of API objects Use "oadm <command> --help" for more information about a given command. Use "oadm options" for a list of global command-line options (applies to all commands).
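A side note on the log above: in every failed item, the value passed to --ttl-keys-prefix is the text "<built-in method keys of dict object at 0x...>" split into separate arguments, not the etcd key prefix from the item. The thread does not explain this. One plausible cause, stated here as an assumption and not confirmed against the playbook source, is a template lookup written as item.keys: attribute-first lookup (the order Jinja2 uses for dotted access) finds the dict method keys() before the dictionary entry named "keys". The Python sketch below reproduces the effect; attr_first_lookup is a hypothetical helper that only imitates that lookup order.

```python
# Item shape copied from the log; the helper below is illustrative only.
item = {"keys": "/kubernetes.io/events", "ttl": "1h"}


def attr_first_lookup(obj, name):
    """Imitate dotted template access: try the attribute first, then the key."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


# "keys" is also the name of a dict method, so the attribute wins.
rendered = str(attr_first_lookup(item, "keys"))
print(rendered.split())      # whitespace-separated tokens, as in the logged cmd array

# "ttl" is not a dict attribute, so the lookup falls back to the key.
print(attr_first_lookup(item, "ttl"))

# Subscript access always returns the stored value.
print(item["keys"])
```

If this is what happened, subscript access (item['keys']) in the task would avoid the name clash. It is separate from the "unknown flag: --cert" error, which is what makes each item fail in this log.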
Anping,
That looks like the oadm command wasn't updated prior to running the migration. Can you verify that OpenShift was upgraded to 3.6 before running the migration? Can you also gather the output of these commands?
which oadm oc openshift
oadm version
oc version
openshift version
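The checks requested above can be collected in one pass. A minimal sketch in Python, assuming the three binaries are looked up on PATH and each accepts a "version" subcommand as written in the comment; gather_versions, its timeout parameter, and the report layout are hypothetical names used only for illustration.

```python
import shutil
import subprocess


def gather_versions(binaries=("oadm", "oc", "openshift"), timeout=30):
    """Return {name: {"path": ..., "version": ...}} for each binary."""
    report = {}
    for name in binaries:
        path = shutil.which(name)  # answers the same question as `which NAME`
        if path is None:
            report[name] = {"path": None, "version": None}
            continue
        try:
            result = subprocess.run(
                [path, "version"],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep error text alongside normal output
                universal_newlines=True,
                timeout=timeout,
            )
            version = result.stdout.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            version = "failed to run: {}".format(exc)
        report[name] = {"path": path, "version": version}
    return report


if __name__ == "__main__":
    for name, info in gather_versions().items():
        print("== {} ==".format(name))
        print("path:", info["path"] or "not found on PATH")
        print(info["version"] or "(no version output)")
```

Run on the master host, the output shows which binary each name resolves to and what version it reports, which is the information needed to tell whether oadm predates the 3.6 upgrade.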
Scott, I will retest 6) and 7) and update this comment once finished.
6) Single master with external clustered containerized etcd
7) Single master with external clustered RPM etcd
The migration works for 6) and 7).
6) Single master with external clustered containerized etcd
7) Single master with external clustered RPM etcd
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2639