Bug 1656526

Summary: 3.9 to 3.10 upgrade results in /etc/etcd/peer.crt: permission denied when certs get redeployed.
Product: OpenShift Container Platform Reporter: Ryan Howe <rhowe>
Component: Cluster Version Operator Assignee: Patrick Dillon <padillon>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.10.0 CC: aos-bugs, fhirtz, jkaur, jokerman, mmccomas, padillon
Target Milestone: ---   
Target Release: 3.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: the openshift-ansible (oa) code uses the wrong type (a string where bytes are required) when checking for SANs in certs with the pyOpenSSL library.
Consequence: oa does not find the SANs in the certs, although they are present, and creates new certs. When upgrading from 3.9 to 3.10 this can cause a failure: 3.10 certs are created with root ownership, but they are recreated before the upgrade is complete, and the still-running 3.9 etcd cannot access the root-owned certs.
Fix: the missing SANs are a false positive. With the comparison switched to type bytes, oa finds that the certs have the SANs.
Result: the upgrade can reuse the existing certs and completes as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-20 10:11:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
upgrade log for failed verification none

Description Ryan Howe 2018-12-05 17:10:14 UTC
Description of problem:

During a 3.9 to 3.10 upgrade, if the etcd certs get redeployed, the ownership of the files under /etc/etcd changes to root:root. Etcd is then restarted and fails with the error:

        connection from "" (error "open /etc/etcd/peer.crt: permission denied", ServerName "")

Version-Release number of the following components:
openshift-ansible v3.10

How reproducible:
100%

Steps to Reproduce:
1. Upgrade the control plane of a 3.9 host whose etcd serving certificate SANs are missing hostnames, so that the etcd certificates are redeployed.
  - Certs are generated and owned by root under /etc/etcd 
2. A restart of etcd happens.
3. etcd is still running as a systemd unit under user etcd; the upgrade has not yet set up etcd as a static pod.
4. Upgrade fails due to permission issues with etcd certificates. 


Additional info:

During the upgrade, if the SANs of the etcd certs are not correct, we redeploy the etcd certs:
https://github.com/openshift/openshift-ansible/blob/release-3.10/playbooks/openshift-etcd/private/upgrade_main.yml#L10-L23
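The redeploy decision described above can be sketched roughly as follows. This is only an illustration of the idea, not the playbook logic: the function names and the hostnames are hypothetical, and the SAN entries are assumed to be already extracted from the certificate as plain strings.

```python
def missing_sans(cert_sans, required_names):
    """Return the required names that are absent from the cert's SAN entries."""
    return sorted(set(required_names) - set(cert_sans))


def needs_redeploy(cert_sans, required_names):
    """Redeploy only when at least one required name is missing."""
    return bool(missing_sans(cert_sans, required_names))


# Hypothetical values, for illustration only.
cert_sans = ["master-0.example.com", "10.0.0.10"]
required = ["master-0.example.com", "10.0.0.10"]

print(needs_redeploy(cert_sans, required))
print(missing_sans(cert_sans, required + ["master-1.example.com"]))
```

If the SAN entries are read incorrectly and come back empty, every required name looks missing and a redeploy is triggered even though the certificate is fine.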

When the certs are recreated with the 3.10 playbooks they get root:root ownership, because in 3.10 etcd runs in a static pod as root. This task is hit before we move etcd to a static pod, though, so when etcd is restarted it fails because the files under /etc/etcd are now root-owned.

  https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/etcd/tasks/certificates/deploy_ca.yml#L12-L23


In 3.10+, etcd runs as a static pod:

# oc rsh master-etcd-master-0.310test.com 
sh-4.2# whoami
root
sh-4.2# ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  2.4  2.1 5889000 352588 ?      Ssl  05:44  14:37 etcd
root      36783  1.0  0.0  11816  1660 ?        Ss   15:52   0:00 /bin/sh
root      36799  1.0  0.0  51740  1728 ?        R+   15:52   0:00 ps aux

# ls -la /etc/etcd/peer.crt
-rw-------. 1 root root 6054 Dec  4 05:27 /etc/etcd/peer.crt

In 3.9-, etcd runs as a systemd unit:

etcd      20932  1.7  1.5 5833988 251320 ?      Ssl  Dec02  65:19 /usr/bin/etcd --name=master-0.sharedocp39.lab.rdu2.cee.redhat.com --data-dir=/var/lib/etcd/ --listen-client-urls=https://10.10.94.76:2

# ls -la /etc/etcd/peer.crt
-rw-------.   1 etcd etcd 6052 Dec  2 19:44 peer.crt
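The two listings above differ only in the owner of peer.crt, and with mode -rw------- that is enough to lock out the etcd user. A simplified model of the POSIX read check shows the three cases. It deliberately ignores ACLs, capabilities and SELinux, and the numeric uid/gid 997 for the etcd user is a made-up value.

```python
import stat


def can_read(mode, file_uid, file_gid, proc_uid, proc_gids):
    """Simplified POSIX read check (ignores ACLs, capabilities, SELinux)."""
    if proc_uid == 0:
        return True  # root bypasses the mode bits
    if proc_uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in proc_gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


ROOT_ID, ETCD_ID = 0, 997  # 997 is a hypothetical uid/gid for user etcd
MODE = 0o600               # -rw------- as in both listings

# 3.9: cert owned by etcd:etcd, read by the etcd systemd service
print(can_read(MODE, ETCD_ID, ETCD_ID, ETCD_ID, {ETCD_ID}))
# Mid-upgrade: cert recreated as root:root, etcd still runs as user etcd
print(can_read(MODE, ROOT_ID, ROOT_ID, ETCD_ID, {ETCD_ID}))
# 3.10: cert owned by root:root, etcd runs as root in a static pod
print(can_read(MODE, ROOT_ID, ROOT_ID, ROOT_ID, {ROOT_ID}))
```

Only the middle case is denied, which matches the "permission denied" error seen when etcd is restarted before it has been moved to a static pod.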

Comment 1 Frank Hirtz 2018-12-05 18:26:29 UTC
Upstream report:

https://github.com/openshift/openshift-ansible/issues/10361

Comment 2 Patrick Dillon 2018-12-18 20:30:05 UTC
The permissions for these certs are actually set here: https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml#L181-L211

The owner/group settings are removed in the 3.10 branch. I have created a PR that should check whether this is an upgrade where the certs are redeployed and, in that case, change the owner to etcd:

PR: https://github.com/openshift/openshift-ansible/pull/10905

Comment 3 Patrick Dillon 2019-01-02 18:58:12 UTC
There is a bug in how SANs are parsed in certs. The PR below fixes the bug so that SANs can be found in the existing certs. I am closing the previous PR, which is a workaround.

PR: https://github.com/openshift/openshift-ansible/pull/10940
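The kind of type mismatch described in this comment can be illustrated with a small sketch. It is not the openshift-ansible code or the content of the PR. It relies on two things: pyOpenSSL's X509Extension.get_short_name() returns bytes, and under Python 3 semantics (an assumption about the interpreter) bytes and str never compare equal. To stay self-contained, the extension names are modelled as plain bytes values and the function names are hypothetical.

```python
def has_san_buggy(short_names):
    """Compare against a str: never matches a bytes value on Python 3."""
    return any(name == "subjectAltName" for name in short_names)


def has_san_fixed(short_names):
    """Compare against bytes, the type the library actually returns."""
    return any(name == b"subjectAltName" for name in short_names)


# Hypothetical extension names of a cert that does carry a SAN extension.
extensions = [b"keyUsage", b"extendedKeyUsage", b"subjectAltName"]

print(has_san_buggy(extensions))
print(has_san_fixed(extensions))
```

The buggy check reports that no SAN is present even though the extension is there, which is the false positive that led to the unnecessary cert redeploy.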

Comment 4 Patrick Dillon 2019-01-04 14:28:28 UTC
Release 3.10 PR: https://github.com/openshift/openshift-ansible/pull/10943

Comment 5 Scott Dodson 2019-01-10 19:14:37 UTC
In openshift-ansible-3.10.97-1

Comment 8 ge liu 2019-01-28 03:51:43 UTC
Created attachment 1524148
upgrade log for failed verification

Comment 12 ge liu 2019-01-30 06:47:25 UTC
OK, we may lower the priority of this issue, since 3.10 etcd runs as root in a static pod, so it can open the certs regardless of owner. Thanks.

Comment 13 ge liu 2019-01-30 06:48:35 UTC
Verified with openshift-ansible-3.10.104-1.git.0.79f87f7.el7.noarch.rpm.

Comment 15 errata-xmlrpc 2019-02-20 10:11:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0328