Description of problem:
During a 3.9 to 3.10 upgrade, if the etcd certs get redeployed, the ownership of the files under /etc/etcd changes to root:root. etcd is then restarted and fails with the error:
connection from "" (error "open /etc/etcd/peer.crt: permission denied", ServerName "")
Version-Release number of the following components:
Steps to Reproduce:
1. Upgrade the control plane of a 3.9 cluster in which hostnames are missing from the SANs of the etcd serving certificates.
- Certs are regenerated and owned by root under /etc/etcd
2. A restart of etcd happens.
3. etcd is still running as a systemd unit under user etcd; the upgrade has not yet set up etcd as a static pod.
4. Upgrade fails due to permission issues with etcd certificates.
During the upgrade, if the SANs of the etcd certs are not correct, we redeploy the etcd certs.
When the certs are recreated, they get root:root ownership with the 3.10 playbooks because in 3.10 etcd runs in a static pod as root. This task is hit before we move etcd to a static pod, though, so when etcd is restarted it fails because the files under /etc/etcd are now owned by root.
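The ownership mismatch described above can be checked before etcd is restarted. The following is only a minimal sketch, not part of openshift-ansible: the function name `unreadable_files` is hypothetical, and the check looks only at owner, group, and mode bits (it ignores root's override, supplementary groups, ACLs, SELinux, and directory traversal permissions).

```python
import os
import stat


def unreadable_files(directory, uid, gid):
    """Return sorted names of regular files in `directory` that the given
    uid/gid cannot read, judging only by file owner and mode bits.

    Simplified on purpose: no ACLs, no supplementary groups, no SELinux.
    """
    blocked = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        if st.st_uid == uid:
            readable = bool(st.st_mode & stat.S_IRUSR)
        elif st.st_gid == gid:
            readable = bool(st.st_mode & stat.S_IRGRP)
        else:
            readable = bool(st.st_mode & stat.S_IROTH)
        if not readable:
            blocked.append(entry.name)
    return sorted(blocked)


# Example (on a master, assuming a local "etcd" user exists):
#   import pwd
#   etcd = pwd.getpwnam("etcd")
#   print(unreadable_files("/etc/etcd", etcd.pw_uid, etcd.pw_gid))
```

With a peer.crt that is mode 0600 and owned by root:root, a check for the etcd user would be expected to list that file, which matches the "permission denied" error etcd logs on restart.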
In 3.10+, etcd runs in a static pod:
# oc rsh master-etcd-master-0.310test.com
sh-4.2# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 2.4 2.1 5889000 352588 ? Ssl 05:44 14:37 etcd
root 36783 1.0 0.0 11816 1660 ? Ss 15:52 0:00 /bin/sh
root 36799 1.0 0.0 51740 1728 ? R+ 15:52 0:00 ps aux
# ls -la /etc/etcd/peer.crt
-rw-------. 1 root root 6054 Dec 4 05:27 /etc/etcd/peer.crt
In 3.9 and earlier, etcd runs as a systemd unit:
etcd 20932 1.7 1.5 5833988 251320 ? Ssl Dec02 65:19 /usr/bin/etcd --name=master-0.sharedocp39.lab.rdu2.cee.redhat.com --data-dir=/var/lib/etcd/ --listen-client-urls=https://10.10.94.76:2
# ls -la /etc/etcd/peer.crt
-rw-------. 1 etcd etcd 6052 Dec 2 19:44 peer.crt
The permissions for these certs are actually set here: https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml#L181-L211
The owner/group settings were removed in the 3.10 branch. I have created a PR that checks whether this is an upgrade where the certs are redeployed and changes the owner to etcd in that case:
There is a bug in how SANs are parsed in certs. The PR below fixes the bug so that SANs can be found in the existing certs. I am closing the previous PR, which is a workaround.
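To illustrate what "finding SANs in the existing certs" involves, here is a rough sketch that pulls SAN entries out of `openssl x509 -noout -text` style output and reports which expected hostnames are missing. It is not the parsing code from the PR; the function names and the sample hostnames are hypothetical, and only the `DNS:` and `IP Address:` entry types are handled.

```python
def parse_sans(cert_text):
    """Extract DNS and IP SAN values from `openssl x509 -noout -text` output.

    Assumes the SAN entries are on the line right after the
    "X509v3 Subject Alternative Name" header, separated by commas.
    """
    lines = cert_text.splitlines()
    for i, line in enumerate(lines):
        if "X509v3 Subject Alternative Name" in line and i + 1 < len(lines):
            sans = []
            for item in lines[i + 1].split(","):
                # Split on the first colon only, so IPv6 values stay intact.
                kind, sep, value = item.strip().partition(":")
                if sep and kind in ("DNS", "IP Address"):
                    sans.append(value.strip())
            return sans
    return []


def missing_hostnames(cert_text, expected):
    """Return the expected names that are absent from the cert's SANs."""
    found = set(parse_sans(cert_text))
    return [name for name in expected if name not in found]


# Hypothetical sample resembling the relevant part of the openssl output.
SAMPLE = """\
        X509v3 extensions:
            X509v3 Subject Alternative Name:
                DNS:master-0.example.com, IP Address:192.0.2.10
"""
```

If `missing_hostnames` returns an empty list for every expected name, the existing certs already cover the hosts and there would be no reason to redeploy them during the upgrade.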
Release 3.10 PR: https://github.com/openshift/openshift-ansible/pull/10943
Created attachment 1524148
upgrade log for failed verification
OK, we may treat this issue as low priority, since in 3.10 etcd runs as root in a static pod and can open the certs regardless of owner. Thanks.
Verified with openshift-ansible-3.10.104-1.git.0.79f87f7.el7.noarch.rpm.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.