Description of problem:
===
- When we added a new etcd node by following the article [1], "etcdctl cluster-health" reported the new member as unhealthy:

[root@master01 ~]# etcdctl -C https://master01.example.com:2379 --ca-file=/etc/etcd/ca.crt --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key cluster-health
member 28505b16361f7280 is healthy: got healthy result from https://192.178.27.11:2379
member 9275175d84c8d0f9 is healthy: got healthy result from https://192.178.27.12:2379
member c0852674e1287c28 is healthy: got healthy result from https://192.178.27.13:2379
member f79f565dfa729030 is unhealthy: got unhealthy result from https://192.178.27.14:2379
cluster is healthy

Version-Release number of selected component (if applicable):
- RHEL 7.2
- OCP 3.3
- etcd 2.3.7

How reproducible:
*NOTE* This is not 100% reproducible. Most of the time it works fine.
===

Steps to Reproduce:
1. Add a new etcd node by following [1].
   - While the new node is being added, the other existing etcd services keep running.

Actual results:
===
- cluster-health reports the new member as unhealthy, as shown above.
- The etcd service log on master04 shows:

Nov 21 10:51:17 master04.example.com etcd[23570]: publish error: etcdserver: request timed out, possibly due to previous leader failure

Expected results:
===
- No error

Additional info:
===
- In [1], the steps for manually adding a new etcd member include these backup and restore steps:

  # etcdctl backup --keep-cluster-id --node-id ${NODE_ID} --data-dir /var/lib/etcd --backup-dir /var/lib/etcd/$NEW_ETCD-backup
  # tar -cvf $NEW_ETCD-backup.tar.gz -C /var/lib/etcd/$NEW_ETCD-backup/ .
  # scp $NEW_ETCD-backup.tar.gz $NEW_ETCD:/var/lib/etcd/
  # tar -xf /etc/etcd/etcd-1.openshift.com.tgz -C /etc/etcd/ --overwrite

- The official doc [2] says "If etcd is running on more than one host, stop it on each host:", but [1] does not say this clearly.

[1] https://access.redhat.com/articles/2650151
[2] https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#cluster-backup
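The cluster-health output in the description ends with "cluster is healthy" even though member f79f565dfa729030 is reported as unhealthy, so the summary line alone does not reveal the problem. A minimal sketch of a per-member check follows. The function name unhealthy_members is hypothetical, and the matching relies only on the line format shown in this report ("member <id> is unhealthy: ..."); other etcdctl versions may word these lines differently.

```shell
#!/usr/bin/env bash
# Sketch: list member IDs that `etcdctl cluster-health` marked as unhealthy.
# Assumption: lines look like "member <id> is unhealthy: ..." as in this report.

# Reads cluster-health output on stdin, prints one unhealthy member ID per line.
unhealthy_members() {
    awk '$1 == "member" && $3 == "is" && $4 == "unhealthy:" { print $2 }'
}

# Example usage on a master (not run automatically):
#   etcdctl -C https://master01.example.com:2379 \
#       --ca-file=/etc/etcd/ca.crt --cert-file=/etc/etcd/peer.crt \
#       --key-file=/etc/etcd/peer.key cluster-health | unhealthy_members
```

An empty result means no line matched the "is unhealthy" pattern. It does not by itself prove that every member is reachable.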
Sure Derek.
Kenjiro, can you also attach the leader logs when the issue happens?
Kenjiro,

After reading the two articles, it seems what you said is true:

1. The first article says: "Take backup of etcd and transfer contents to "NEW_ETCD" Skip this step if version is lower than etcd-2.3.7-4 and etcd database size is smaller than 700mb."
2. The second article says: "If etcd is running on more than one host, stop it on each host:"

So for the backup step, it is recommended to stop the etcd services on the other hosts. Could you please verify what the database size was when the issue happened?
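For the database size question, one rough way to compare an etcd host against the 700mb figure quoted from the article is sketched below. This is an assumption-laden sketch: it treats the on-disk size of the data directory (/var/lib/etcd, as used in the backup command in the description) as the "database size", which may not be how the article measures it. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: approximates "database size" by the on-disk size of the etcd
# data directory. That equivalence is an assumption, not taken from the article.

# Print the size of a directory in whole megabytes (uses `du -sm`).
data_dir_size_mb() {
    du -sm "$1" | awk '{ print $1 }'
}

# Succeed (exit status 0) when size_mb ($1) is greater than threshold_mb ($2).
exceeds_threshold_mb() {
    [ "$1" -gt "$2" ]
}

# Example usage on an etcd host (not run automatically):
#   size=$(data_dir_size_mb /var/lib/etcd)
#   if exceeds_threshold_mb "$size" 700; then
#       echo "data dir is ${size} MB, above 700 MB"
#   else
#       echo "data dir is ${size} MB, at or below 700 MB"
#   fi
```

The comparison is kept separate from the measurement so that a different size source can be substituted if the data directory turns out to be the wrong proxy.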
Based on the logs, etcd appears to be colocated with a node. In that case this issue could happen, and it seems related to the bug https://bugzilla.redhat.com/show_bug.cgi?id=1389736. https://bugzilla.redhat.com/show_bug.cgi?id=1389736#c39 has more details.
The error in the logs here:

Nov 21 10:46:31 master04 etcd: publish error: etcdserver: request timed out

seems similar to the ones in https://bugzilla.redhat.com/show_bug.cgi?id=1389736.
So far, my understanding based on the logs is that etcd is colocated with a node, which is not recommended according to https://bugzilla.redhat.com/show_bug.cgi?id=1389736, and that this colocation could trigger the issue. Also, since it happened only once at the customer site and is not 100% reproducible, it does not seem like a blocker.
Hi Avesh, I understand that the issue probably hit bug #1389736. However, does adding new nodes (with a database larger than 700mb) still require stopping the other existing etcd services? The article says "Do not start the etcd service" in the first section, but that seems to mean "do not start the *new* etcd service yet". Could you please confirm? [1] https://access.redhat.com/articles/2650151
Kenjiro, based on the article, for the backup step, if the db size is more than 700mb and the version is 2.3.7-4, it says to stop the etcd service on the other hosts. And yes, the new etcd service should be started only after the steps up to step 7 are done.
Hi Avesh, Thank you for your clarification. I think we can close this ticket now. Thanks.