Description of problem:
During a 3.9 to 3.10 upgrade, if the etcd certs get redeployed, the ownership of the files under /etc/etcd changes to root:root. Etcd is then restarted and fails with the error:

    connection from "" (error "open /etc/etcd/peer.crt: permission denied", ServerName "")

Version-Release number of the following components:
openshift-ansible v3.10

How reproducible:
100%

Steps to Reproduce:
1. Upgrade the control plane of a 3.9 host where hostnames are missing from the etcd serving certificate SANs.
   - Certs are regenerated and owned by root under /etc/etcd.
2. A restart of etcd happens.
3. etcd is still running as a systemd unit under user etcd; the upgrade has not yet set up etcd as a static pod.
4. The upgrade fails due to permission issues with the etcd certificates.

Additional info:
During the upgrade, if the SANs of the etcd certs are not correct, we redeploy the etcd certs:
https://github.com/openshift/openshift-ansible/blob/release-3.10/playbooks/openshift-etcd/private/upgrade_main.yml#L10-L23

When the certs are recreated with the 3.10 playbooks they get root:root ownership, because in 3.10 etcd runs in a static pod as root. This task is hit before we move etcd to a static pod, though, so when etcd is restarted it fails because /etc/etcd is now root-owned.
https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/etcd/tasks/certificates/deploy_ca.yml#L12-L23

In 3.10+ (etcd static pod):

    # oc rsh master-etcd-master-0.310test.com
    sh-4.2# whoami
    root
    sh-4.2# ps aux
    USER       PID %CPU %MEM     VSZ    RSS TTY STAT START  TIME COMMAND
    root         1  2.4  2.1 5889000 352588 ?   Ssl  05:44 14:37 etcd
    root     36783  1.0  0.0   11816   1660 ?   Ss   15:52  0:00 /bin/sh
    root     36799  1.0  0.0   51740   1728 ?   R+   15:52  0:00 ps aux

    # ls -la /etc/etcd/peer.crt
    -rw-------. 1 root root 6054 Dec 4 05:27 /etc/etcd/peer.crt

In 3.9- (etcd systemd unit):

    etcd     20932  1.7  1.5 5833988 251320 ?   Ssl  Dec02 65:19 /usr/bin/etcd --name=master-0.sharedocp39.lab.rdu2.cee.redhat.com --data-dir=/var/lib/etcd/ --listen-client-urls=https://10.10.94.76:2

    # ls -la /etc/etcd/peer.crt
    -rw-------. 1 etcd etcd 6052 Dec 2 19:44 peer.crt
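A quick way to spot the condition described above, before etcd is restarted, is to look for files under /etc/etcd that are not owned by the user etcd runs as. The following is only a minimal sketch; the function names `list_not_owned_by` and `all_owned_by` are made up for illustration and are not part of openshift-ansible.

```shell
#!/bin/sh
# Hypothetical helpers (not part of openshift-ansible).

# List every file under directory $1 that is NOT owned by user $2.
list_not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Succeed only when everything under $1 is owned by user $2.
all_owned_by() {
    [ -z "$(list_not_owned_by "$1" "$2")" ]
}
```

On a 3.9 master where etcd still runs as a systemd unit, a check could look like `all_owned_by /etc/etcd etcd || list_not_owned_by /etc/etcd etcd`, which prints the files (for example peer.crt) that the etcd user may no longer be able to open.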
Upstream report: https://github.com/openshift/openshift-ansible/issues/10361
The permissions for these certs are actually set here: https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml#L181-L211 The owner/group settings were removed in the 3.10 branch. I have created a PR that should check whether this is an upgrade in which the certs are redeployed and, in that case, change the owner to etcd. PR: https://github.com/openshift/openshift-ansible/pull/10905
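For a host that has already hit the failure, a manual equivalent of that idea would be to hand the directory back to the etcd user and restart the service. This is a rough sketch under assumptions, not what the PR itself does: the helper names `run` and `restore_etcd_cert_owner`, the `DRY_RUN` switch, and the systemd unit name `etcd` are all assumed for illustration.

```shell
#!/bin/sh
# Hypothetical manual workaround (assumed, not taken from the PR).

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Give the etcd config directory (default /etc/etcd) back to etcd:etcd,
# then restart the systemd unit, assumed here to be named "etcd".
restore_etcd_cert_owner() {
    conf_dir="${1:-/etc/etcd}"
    run chown -R etcd:etcd "$conf_dir"
    run systemctl restart etcd
}
```

Running `DRY_RUN=1 restore_etcd_cert_owner` first prints the two commands without touching the host. This only makes sense while etcd still runs as a systemd unit; once etcd is a static pod running as root, file ownership no longer blocks it.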
There is a bug in how SANs are parsed in certs. The PR below fixes the bug so that SANs can be found in the existing certs. I am closing the previous PR, which is a workaround. PR: https://github.com/openshift/openshift-ansible/pull/10940
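To see by hand whether a hostname is present in an existing cert's SANs, the text output of `openssl x509` can be parsed. The sketch below is an illustration only and is not the parsing logic used by openshift-ansible; `extract_sans` and `has_san` are hypothetical names, and the code assumes the usual `openssl x509 -noout -text` layout, where the SAN values are on the line following the "X509v3 Subject Alternative Name" header.

```shell
#!/bin/sh
# Hypothetical helpers (not the openshift-ansible implementation).

# Read `openssl x509 -noout -text` output on stdin and print one SAN
# entry per line, e.g. "DNS:host.example.com" or "IP Address:10.0.0.1".
extract_sans() {
    awk '/X509v3 Subject Alternative Name/ { getline; print; exit }' |
        tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Succeed if the SAN list on stdin contains the exact entry given in $1.
has_san() {
    extract_sans | grep -Fxq -- "$1"
}
```

For example, `openssl x509 -in /etc/etcd/peer.crt -noout -text | has_san "DNS:$(hostname -f)"` returns success only when the host's fully qualified name is listed. The match is exact, so a wildcard SAN entry would not be treated as covering the hostname.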
Release 3.10 PR: https://github.com/openshift/openshift-ansible/pull/10943
In openshift-ansible-3.10.97-1
Created attachment 1524148: upgrade log for failed verification
OK, we may lower the priority of this issue, since in 3.10 etcd runs as root in a static pod and can therefore open the certs regardless of their owner. Thanks.
Verified with openshift-ansible-3.10.104-1.git.0.79f87f7.el7.noarch.rpm.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0328