Bug 1504604

Summary: Original OCP does not work after migrating an embedded etcd to a fresh host
Product: OpenShift Container Platform Reporter: liujia <jiajliu>
Component: Cluster Version Operator Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA QA Contact: liujia <jiajliu>
Severity: high Docs Contact:
Priority: high    
Version: 3.7.0 CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---   
Target Release: 3.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-11-28 22:18:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description liujia 2017-10-20 10:09:27 UTC
Description of problem:
Ran the embedded etcd migration against an RPM-based, non-HA v3.6 OCP cluster with
embedded etcd. The migrate playbook ran successfully, but after the migration OCP does not work.
For example:
1) atomic-openshift-master.service restarts in a loop
# systemctl status atomic-openshift-master.service 
● atomic-openshift-master.service - Atomic OpenShift Master
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2017-10-20 05:57:35 EDT; 3s ago
     Docs: https://github.com/openshift/origin
  Process: 49089 ExecStart=/usr/bin/openshift start master --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
 Main PID: 49089 (code=exited, status=255)

Oct 20 05:57:35 x-embed-master-nfs-1 systemd[1]: atomic-openshift-master.service: main process exited, code=exited, status=255/n/a
Oct 20 05:57:35 x-embed-master-nfs-1 systemd[1]: Failed to start Atomic OpenShift Master.
Oct 20 05:57:35 x-embed-master-nfs-1 systemd[1]: Unit atomic-openshift-master.service entered failed state.
Oct 20 05:57:35 x-embed-master-nfs-1 systemd[1]: atomic-openshift-master.service failed.

2) "oc get" cannot get any data
# oc get node
The connection to the server x-embed-master-nfs-1:8443 was refused - did you specify the right host or port?

===============================
The master log shows that the master tries to connect to itself (10.240.0.49) rather than to the new etcd host (10.240.0.56):
getsockopt: connection refused"; Reconnecting to {10.240.0.49:2379 <nil>}


# cat /etc/etcd/etcd.conf | grep LISTEN
ETCD_LISTEN_PEER_URLS=https://10.240.0.56:2380
ETCD_LISTEN_CLIENT_URLS=https://10.240.0.56:2379
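
The mismatch above can be checked with a small script. The following is only a sketch under stated assumptions: the function names are hypothetical, the master's etcd endpoints are assumed to be listed under etcdClientInfo.urls in master-config.yaml, and ETCD_LISTEN_CLIENT_URLS is assumed to hold a single unquoted URL as in the output above. It is not part of openshift-ansible.

```shell
# Print the value of ETCD_LISTEN_CLIENT_URLS from an etcd.conf-style file.
# Assumption: a single, unquoted URL, as in the output shown in this report.
etcd_listen_client_url() {
    sed -n 's/^ETCD_LISTEN_CLIENT_URLS=//p' "$1"
}

# Print the URLs listed under the top-level etcdClientInfo key of a
# master-config.yaml file (assumed layout: a "urls:" list of "- https://..." items).
master_etcd_urls() {
    awk '
        /^etcdClientInfo:/       { in_block = 1; next }
        in_block && /^[^ \t#-]/  { in_block = 0 }   # next top-level key ends the block
        in_block && /^[ \t]*-[ \t]*https?:\/\// {
            sub(/^[ \t]*-[ \t]*/, "")               # strip the list marker
            print
        }
    ' "$1"
}

# Hypothetical helper: succeed only if the master is configured with the
# client URL that etcd actually listens on.
check_etcd_endpoint() {
    master_cfg=$1
    etcd_conf=$2
    want=$(etcd_listen_client_url "$etcd_conf")
    if master_etcd_urls "$master_cfg" | grep -qxF -- "$want"; then
        echo "OK: master points at $want"
    else
        echo "MISMATCH: etcd listens on '$want', master is configured with:"
        master_etcd_urls "$master_cfg"
        return 1
    fi
}
```

Because etcd.conf lives on the new etcd host and master-config.yaml on the master, one of the two files has to be copied over first. Then, for example, run `check_etcd_endpoint /etc/origin/master/master-config.yaml ./etcd.conf` (the master config path is the usual default and may differ). With the addresses in this report it would report a mismatch.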


Version-Release number of the following components:
openshift-ansible-3.7.0-0.167.0.git.0.0e34535.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install v3.6 OCP with embedded etcd
2. Prepare repos on a new host (just install docker on it)
3. Edit the hosts file to add an etcd group
[OSEv3:children]
...
etcd
...
[etcd]
hostname...
//Specify a new host for etcd.
4. Run the etcd migration
# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/embedded2external.yml

Actual results:
OCP does not work after the migration

Expected results:
OCP should work after the migration

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 4 Jan Chaloupka 2017-10-23 13:07:55 UTC
I am able to reproduce it, I know what is wrong, and I have a fix for it. I will open a PR shortly.

Comment 5 Jan Chaloupka 2017-10-23 13:32:15 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/5843

Comment 7 liujia 2017-10-26 09:40:08 UTC
Version:
openshift-ansible-3.7.0-0.179.0.git.0.a2641b6.el7.noarch

Steps:
1. Install v3.6 OCP with embedded etcd
2. Prepare repos on a new host (just install docker on it)
3. Edit the hosts file to add an etcd group
[OSEv3:children]
...
etcd
...
[etcd]
hostname...
//Specify a new host for etcd.
4. Run the etcd migration
# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/embedded2external.yml

After migrating to external etcd, OCP now works.

Comment 10 errata-xmlrpc 2017-11-28 22:18:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188