Bug 1578934 - etcd certificate re-deploy during Red Hat OpenShift Container Platform 3.6 upgrade is failing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Vadim Rutkovsky
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks: 1579319 1579320 1579321
Reported: 2018-05-16 16:03 UTC by Simon Reber
Modified: 2018-10-08 11:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1579319 1579320 1579321
Environment:
Last Closed: 2018-10-08 11:46:04 UTC
Target Upstream Version:
Embargoed:



Description Simon Reber 2018-05-16 16:03:40 UTC
Description of problem:

While upgrading Red Hat OpenShift Container Platform to version 3.6 (following https://docs.openshift.com/container-platform/3.6/install_config/upgrading/automated_upgrades.html#upgrading-control-plane-nodes-separate-phases) we hit an issue because etcd certificates were re-deployed. The etcd re-deploy is related to https://github.com/openshift/openshift-ansible/commit/6af1fb203c9efc12bebd4455f8ba96736a0a73b2#diff-6e1f944c172a66b8294fa8cc2b081a97R59.

When running the upgrade, the tasks at https://github.com/openshift/openshift-ansible/blob/release-3.6/roles/etcd_client_certificates/tasks/main.yml#L75 and https://github.com/openshift/openshift-ansible/blob/release-3.6/roles/etcd_server_certificates/tasks/main.yml#L100 failed because a file already existed at the destination, so the hard link could not be created.

After adding `force: yes` to these tasks, the upgrade started working again.
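
For illustration, the workaround amounts to something like the task below. This is only a sketch, not the role's actual task: the task name is hypothetical, and the literal paths are copied from the failing task output reported for this bug, whereas the real role builds them from templated variables.

```yaml
# Sketch only; the task name is hypothetical and the paths are example values.
- name: Hard-link the etcd CA certificate into the generated certs directory
  file:
    src: /etc/etcd/ca/ca.crt
    dest: /etc/etcd/generated_certs/etcd-atom0011.example.com/ca.crt
    state: hard
    # Replace a regular file that already exists at dest instead of failing.
    force: yes
```

Without `force`, the `file` module leaves an existing regular file at `dest` untouched and reports a failure, which matches the behavior seen during the upgrade.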

Version-Release number of the following components:

 - ansible-2.4.2.0-2.el7.noarch
 - openshift-ansible-3.6.173.0.113-1.git.13.f3b3b1d.el7

How reproducible:

 - Always

Steps to Reproduce:
1. Run a containerized upgrade from Red Hat OpenShift Container Platform 3.5 to 3.6 following https://docs.openshift.com/container-platform/3.6/install_config/upgrading/automated_upgrades.html#upgrading-control-plane-nodes-separate-phases. To reproduce the behavior, the installation should be an old one (originally installed as Red Hat OpenShift Container Platform 3.3 or similar).

Actual results:

The tasks mentioned in the description failed because the destination file was already in place.

Expected results:

Not to fail and either have `force: yes` in place or another method to prevent the failure from happening.

Additional info:

Comment 1 Scott Dodson 2018-05-16 19:15:32 UTC
Reported error is same as https://bugzilla.redhat.com/show_bug.cgi?id=1507123

TASK [etcd_server_certificates : file] *****************************************
fatal: [atom0011.example.com -> atom0010.example.com] FAILED! => {
    "changed": false,
    "dest": "/etc/etcd/generated_certs/etcd-atom0011.example.com/ca.crt",
    "failed": true,
    "gid": 0,
    "group": "root",
    "mode": "0644",
    "owner": "root",
    "secontext": "unconfined_u:object_r:etc_t:s0",
    "size": 1895,
    "src": "/etc/etcd/ca/ca.crt",
    "state": "file",
    "uid": 0
}

So either we or the customer placed a real file where we expect either no file or a hard link.

Since we're starting to see more occurrences of this, and using the force option fixes it, we should address this. The problem still exists in 3.10 too, though the code has been relocated to roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml

We should probably look for other places where we're creating a symlink or hard link and see if we need to fix them as well.

Need to clone the bug back to each release as we backport the fix.

Comment 4 Vadim Rutkovsky 2018-05-17 11:30:07 UTC
Created https://github.com/openshift/openshift-ansible/pull/8405 for 3.6 and cloned bugs to other releases

Comment 5 Vadim Rutkovsky 2018-05-23 07:32:18 UTC
Fix is available in openshift-ansible-3.6.173.0.120-1

Comment 6 Gaoyun Pei 2018-06-15 08:56:12 UTC
Verified this bug with openshift-ansible-3.6.173.0.124-1.git.0.5f3f028.el7.noarch

Ran the 3.5 to 3.6 upgrade. The etcd certificate redeploy playbook was called because hostnames were missing from the etcd serving certificate SANs. During cert redeployment, both related "file" steps passed.

TASK [etcd_server_certificates : file] **************************************************************************************************************************************
ok: [qe-gpei-rpm35-etcd-1.0615-3gs.qe.rhcloud.com -> qe-gpei-rpm35-etcd-1.0615-3gs.qe.rhcloud.com] => {
    "changed": false,
    "dest": "/etc/etcd/generated_certs/etcd-qe-gpei-rpm35-etcd-1/ca.crt",
    "failed": false,
    "gid": 0,
    "group": "root",
    "mode": "0644",
    "owner": "root",
    "secontext": "unconfined_u:object_r:etc_t:s0",
    "size": 1895,
    "src": "/etc/etcd/ca/ca.crt",
    "state": "hard",
    "uid": 0
}


TASK [etcd_client_certificates : file] **************************************************************************************************************************************
ok: [qe-gpei-rpm35-master-1.0615-3gs.qe.rhcloud.com -> qe-gpei-rpm35-etcd-1.0615-3gs.qe.rhcloud.com] => {
    "changed": false,
    "dest": "/etc/etcd/generated_certs/openshift-master-qe-gpei-rpm35-master-1/master.etcd-ca.crt",
    "failed": false,
    "gid": 0,
    "group": "root",
    "mode": "0644",
    "owner": "root",
    "secontext": "unconfined_u:object_r:etc_t:s0",
    "size": 1895,
    "src": "/etc/etcd/ca/ca.crt",
    "state": "hard",
    "uid": 0
}

