Bug 1398083

Summary: Manually adding a new etcd node sometimes fails when other existing etcd services are running
Product: OpenShift Container Platform Reporter: Kenjiro Nakayama <knakayam>
Component: Node Assignee: Avesh Agarwal <avagarwa>
Status: CLOSED NOTABUG QA Contact: DeShuai Ma <dma>
Severity: low Docs Contact:
Priority: medium    
Version: 3.3.0 CC: aos-bugs, avagarwa, decarr, jokerman, mmccomas
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2016-12-14 21:21:12 UTC Type: Bug

Description Kenjiro Nakayama 2016-11-24 05:30:56 UTC
Description of problem:
===
- When we added a new etcd node by following the article [1], "etcdctl cluster-health" reported the new member as unhealthy:

   [root@master01 ~]# etcdctl -C https://master01.example.com:2379 --ca-file=/etc/etcd/ca.crt     --cert-file=/etc/etcd/peer.crt     --key-file=/etc/etcd/peer.key cluster-health
   member 28505b16361f7280 is healthy: got healthy result from https://192.178.27.11:2379
   member 9275175d84c8d0f9 is healthy: got healthy result from https://192.178.27.12:2379
   member c0852674e1287c28 is healthy: got healthy result from https://192.178.27.13:2379
   member f79f565dfa729030 is unhealthy: got unhealthy result from https://192.178.27.14:2379
   cluster is healthy
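
Note that the last line still reads "cluster is healthy" even though member f79f565dfa729030 is reported as unhealthy, so the per-member lines have to be checked as well. A minimal sketch of such a check, assuming the output format shown above (the function name unhealthy_members is hypothetical and not part of etcdctl):

```shell
# Print the IDs of members that cluster-health reports as unhealthy.
# Reads cluster-health output on stdin; exits non-zero if any are found.
# Assumption: lines look like "member <id> is unhealthy: ..." as in the output above.
unhealthy_members() {
    awk '$1 == "member" && $4 == "unhealthy:" { print $2; found = 1 }
         END { exit found ? 1 : 0 }'
}
```

It could be used by piping the cluster-health command above into it, for example "etcdctl ... cluster-health | unhealthy_members".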

Version-Release number of selected component (if applicable):

- RHEL 7.2
- OCP 3.3
- etcd 2.3.7

How reproducible: *NOTE* It is not 100% reproducible. Most of the time it works fine.
===
Steps to Reproduce:
1. Add a new etcd node by following [1].
- While the new node is being added, the other existing etcd services keep running.

Actual results:
===
- "etcdctl cluster-health" reports the new member as unhealthy, as shown above.
- The etcd service log on master04 shows:

  Nov 21 10:51:17 master04.example.com etcd[23570]: publish error: etcdserver: request timed out, possibly due to previous leader failure

Expected results:
===
- No error

Additional info:
===
- The steps in [1] combine manually adding a new etcd member with backup and restore steps:

  # etcdctl backup --keep-cluster-id --node-id ${NODE_ID} --data-dir /var/lib/etcd --backup-dir /var/lib/etcd/$NEW_ETCD-backup
  # tar -cvf $NEW_ETCD-backup.tar.gz -C /var/lib/etcd/$NEW_ETCD-backup/ .
  # scp $NEW_ETCD-backup.tar.gz $NEW_ETCD:/var/lib/etcd/
  # tar -xf /etc/etcd/etcd-1.openshift.com.tgz -C /etc/etcd/ --overwrite

- Official doc[2] says "If etcd is running on more than one host, stop it on each host:", but [1] doesn't say it clearly. 

[1] https://access.redhat.com/articles/2650151
[2] https://docs.openshift.com/container-platform/3.3/admin_guide/backup_restore.html#cluster-backup

Comment 3 Avesh Agarwal 2016-11-28 23:22:35 UTC
Sure Derek.

Comment 5 Avesh Agarwal 2016-11-29 10:28:45 UTC
Kenjiro, can you also attach the leader logs when the issue happens?

Comment 6 Avesh Agarwal 2016-11-29 13:57:49 UTC
Kenjiro,

After reading the 2 articles, it seems what you said is true:

1. From the first article, it says:

"Take backup of etcd and transfer contents to "NEW_ETCD"
Skip this step if version is lower than etcd-2.3.7-4 and etcd database size is smaller than 700mb."

2. From the 2nd article, it says:

"If etcd is running on more than one host, stop it on each host:"

So it seems that, for the backup step, stopping the etcd services on the other hosts is recommended.

Could you please check what the database size was when the issue happened?
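
One way to approximate this is to measure the etcd data directory. A minimal sketch, assuming the data directory is /var/lib/etcd (as in the backup commands in the description), that disk usage of that directory is a good enough proxy for the database size, and that "700mb" means 700 MiB; the function names are hypothetical:

```shell
# Print the disk usage of an etcd data directory in MiB.
# Assumption: defaults to /var/lib/etcd when no directory is given.
etcd_data_size_mb() {
    du -sm "${1:-/var/lib/etcd}" | awk '{ print $1 }'
}

# Return 0 if the directory is at or above the threshold in MiB (default 700).
etcd_db_is_large() {
    [ "$(etcd_data_size_mb "$1")" -ge "${2:-700}" ]
}
```

Running "etcd_data_size_mb" on an existing etcd host prints a single number that can be compared against the 700mb limit quoted from the article.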

Comment 7 Avesh Agarwal 2016-11-29 14:04:49 UTC
Based on the logs, it seems etcd is colocated with a node. In that case, this issue could happen, and it seems related to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1389736. https://bugzilla.redhat.com/show_bug.cgi?id=1389736#c39 has more details.

Comment 8 Avesh Agarwal 2016-11-29 14:07:34 UTC
The error in the logs here:
Nov 21 10:46:31 master04 etcd: publish error: etcdserver: request timed out

seems similar to the ones in https://bugzilla.redhat.com/show_bug.cgi?id=1389736.

Comment 9 Avesh Agarwal 2016-11-29 15:10:11 UTC
My understanding so far, based on the logs:

etcd is colocated with a node, which is not recommended according to https://bugzilla.redhat.com/show_bug.cgi?id=1389736, and this colocation could trigger this issue.

Also, since it happened once at the customer site and is not 100% reproducible, it does not seem like a blocker.

Comment 11 Kenjiro Nakayama 2016-11-30 07:29:18 UTC
Hi Avesh,

I understand that we probably hit bug #1389736.
However, does adding new nodes (with a database larger than 700mb) still require stopping the other existing etcd services?
The article mentions "Do not start the etcd service" in the first section, but that seems to mean "don't start the *new* etcd service yet".

Could you please confirm it?

[1] https://access.redhat.com/articles/2650151

Comment 12 Avesh Agarwal 2016-11-30 12:17:53 UTC
Kenjiro,

Based on the article, for the backup step, if the db size is more than 700mb and the version is 2.3.7-4, it says to stop the etcd service on the other hosts.

Yes, the new etcd service should be started after steps up to 7 are done.

Comment 13 Kenjiro Nakayama 2016-12-14 11:40:25 UTC
Hi Avesh,

Thank you for your clarification. I think we can close this ticket now. Thanks.