Bug 1466638 - ETCD data lost after migration from containerized etcd to system container etcd
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Assigned To: Giuseppe Scrivano
QA Contact: Gaoyun Pei
Reported: 2017-06-30 02:37 EDT by Gaoyun Pei
Modified: 2017-08-16 15 EDT
CC List: 5 users

Last Closed: 2017-08-10 01:28:56 EDT
Type: Bug


External Trackers
Red Hat Product Errata RHEA-2017:1716 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.6 RPM Release Advisory, last updated 2017-08-10 05:02:50 EDT

Description Gaoyun Pei 2017-06-30 02:37:06 EDT
Description of problem:
When migrating etcd from a previous containerized installation to a system container, the installer failed at the "Wait for Node Registration" task because the nodes were not found.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.126.4-1.git.0.d25d828.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a containerized OCP 3.6 environment and make sure it is working well:
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get node
NAME                                       STATUS                     AGE       VERSION
qe-gpei-etcd-sc-2-master-1                 Ready,SchedulingDisabled   13m       v1.6.1+5115d708d7
qe-gpei-etcd-sc-2-node-registry-router-1   Ready                      10m       v1.6.1+5115d708d7
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get pod
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-3-3w0x4    1/1       Running   0          7m
registry-console-1-m22cg   1/1       Running   0          8m
router-1-c3s21             1/1       Running   0          10m
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get project
NAME               DISPLAY NAME   STATUS
default                           Active
install-test                      Active
kube-public                       Active
kube-system                       Active
logging                           Active
management-infra                  Active
openshift                         Active
openshift-infra                   Active
test111                           Active

2. Add openshift_use_etcd_system_container=true to the Ansible inventory file, then re-run the byo/config.yml playbook (see the sketch below).
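A minimal sketch of the inventory change and the re-run command. The [OSEv3:vars] group name and the /etc/ansible/hosts inventory path are assumptions based on a default openshift-ansible layout; the playbook path matches the retry hint in the output below.

# in the Ansible inventory file (assumed /etc/ansible/hosts), under the existing vars group:
[OSEv3:vars]
openshift_use_etcd_system_container=true

# then re-run the config playbook against the same inventory:
ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml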


Actual results:
TASK [openshift_manage_node : Wait for Node Registration] **********************
...
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (1 retries left).
fatal: [qe-gpei-etcd-sc-2-node-registry-router-1.0630-1gj.qe.rhcloud.com -> qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com]: FAILED! => {
    "attempts": 50, 
    "changed": false, 
    "failed": true, 
    "results": {
        "cmd": "/usr/local/bin/oc get node qe-gpei-etcd-sc-2-node-registry-router-1 -o json -n default", 
        "results": [
            {}
        ], 
        "returncode": 0, 
        "stderr": "Error from server (NotFound): nodes \"qe-gpei-etcd-sc-2-node-registry-router-1\" not found\n", 
        "stdout": ""
    }, 
    "state": "list"
}
fatal: [qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com -> qe-gpei-etcd-sc-2-master-1.0630-1gj.qe.rhcloud.com]: FAILED! => {
    "attempts": 50, 
    "changed": false, 
    "failed": true, 
    "results": {
        "cmd": "/usr/local/bin/oc get node qe-gpei-etcd-sc-2-master-1 -o json -n default", 
        "results": [
            {}
        ], 
        "returncode": 0, 
        "stderr": "Error from server (NotFound): nodes \"qe-gpei-etcd-sc-2-master-1\" not found\n", 
        "stdout": ""
    }, 
    "state": "list"
}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/config.retry


On master host:
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get node
No resources found.
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get pod
No resources found.
[root@qe-gpei-etcd-sc-2-master-1 ~]# oc get project
NAME               DISPLAY NAME   STATUS
default                           Active
kube-public                       Active
kube-system                       Active
management-infra                  Active
openshift                         Active
openshift-infra                   Active


On etcd host:
[root@qe-gpei-etcd-sc-2-etcd-1 ~]# ls -R /var/lib/etcd/member/
/var/lib/etcd/member/:
snap  wal

/var/lib/etcd/member/snap:
db

/var/lib/etcd/member/wal:
0000000000000000-0000000000000000.wal

[root@qe-gpei-etcd-sc-2-etcd-1 ~]# 
[root@qe-gpei-etcd-sc-2-etcd-1 ~]# ls -R /var/lib/etcd/etcd.etcd/etcd.etcd/member/
/var/lib/etcd/etcd.etcd/etcd.etcd/member/:
snap  wal

/var/lib/etcd/etcd.etcd/etcd.etcd/member/snap:
db

/var/lib/etcd/etcd.etcd/etcd.etcd/member/wal:
0000000000000000-0000000000000000.wal  0.tmp
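The listings above show two data directories on the etcd host; comparing the directory the running etcd is actually configured to use against where the pre-migration data lives can help confirm what was lost. A diagnostic sketch; the config file path and unit name are assumptions from a default install and may differ:

# data directory configured for etcd (assumed default config location)
grep ETCD_DATA_DIR /etc/etcd/etcd.conf
# unit definition of the etcd service, including any mounts (assumed unit name "etcd")
systemctl cat etcd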


Expected results:
The playbook completes successfully and the existing cluster data (projects, nodes, pods) in etcd is preserved after migrating to the system container.


Additional info:
Comment 1 Giuseppe Scrivano 2017-07-03 15:52:56 EDT
patch proposed here:

https://github.com/openshift/openshift-ansible/pull/4668
Comment 3 Gaoyun Pei 2017-07-06 04:52:36 EDT
Verified this bug with openshift-ansible-3.6.135-1.git.0.5533fe3.el7.noarch:

1. Set up a containerized ocp-3.6 environment

2. Add openshift_use_etcd_system_container=true to the Ansible inventory file, then re-run the byo/config.yml playbook

The playbook finished successfully, and after the re-run the cluster is working well (example checks below):
- previous project data still exists
- the node is available and the old pods are running
- the etcd service is running as a system container
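For reference, checks along these lines confirm the state described above. A sketch; the systemd unit name and the use of the atomic CLI for the system container are assumptions about how the service was installed:

oc get project
oc get node
oc get pod
# on the etcd host, etcd should now run as a system container
systemctl status etcd
atomic containers list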
Comment 5 errata-xmlrpc 2017-08-10 01:28:56 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
