Description of problem:
Sometimes the master cannot be started when external etcd is installed on the master. After restarting the etcd service, the master comes back.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.4.16-1.git.0.c846018.el7.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Install OCP 3.3 with external etcd on the masters.
2. Upgrade to OCP 3.4:
   ansible-playbook /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade_control_plane.yml
3. Check the etcd and master services.

Actual results:
MSG:
Unable to restart service atomic-openshift-master: Job for atomic-openshift-master.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade_control_plane.retry

PLAY RECAP *********************************************************************
localhost                              : ok=15   changed=10   unreachable=0    failed=0
openshift-221.lab.eng.nay.redhat.com   : ok=40   changed=1    unreachable=0    failed=0
openshift-225.lab.eng.nay.redhat.com   : ok=122  changed=8    unreachable=0    failed=1
openshift-226.lab.eng.nay.redhat.com   : ok=40   changed=1    unreachable=0    failed=0

[root@rpm-ose33-1 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-11-04 08:25:39 EDT; 1min 25s ago
 Main PID: 20958 (etcd)
   Memory: 71.8M
   CGroup: /system.slice/etcd.service
           └─20958 /usr/bin/etcd --name=default --data-dir=/var/lib/etcd/ --listen-client-urls=https://192.168.1.112:2379

Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: serving client requests on 192.168.1.112:2379
Nov 04 08:25:39 rpm-ose33-1.novalocal systemd[1]: Started Etcd Server.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: updated the cluster version from 2.3 to 3.0
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: enabled capabilities for version 3.0
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
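For what it's worth, a quick way to tell whether the "bad certificate" dial errors above reflect a real client/server trust problem is to hit the etcd client port directly with the master's etcd client certificate. The commands below are only a sketch: the certificate paths are the usual OCP 3.x defaults for an external-etcd-on-master install (/etc/origin/master/master.etcd-* on the master, /etc/etcd/* on the etcd host) and the endpoint is the one from the log; adjust to your environment.

# Verify the master's etcd client cert is accepted by etcd:
curl --cacert /etc/origin/master/master.etcd-ca.crt \
     --cert /etc/origin/master/master.etcd-client.crt \
     --key /etc/origin/master/master.etcd-client.key \
     https://192.168.1.112:2379/health

# Inspect the server certificate etcd is actually presenting:
echo | openssl s_client -connect 192.168.1.112:2379 \
     -CAfile /etc/etcd/ca.crt -cert /etc/etcd/peer.crt -key /etc/etcd/peer.key 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates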
[root@rpm-ose33-1 ~]# systemctl restart atomic-openshift-master
[root@rpm-ose33-1 ~]# systemctl status atomic-openshift-master
● atomic-openshift-master.service - Atomic OpenShift Master
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-11-04 08:27:16 EDT; 11s ago
     Docs: https://github.com/openshift/origin
 Main PID: 21793 (openshift)
   Memory: 154.9M
   CGroup: /system.slice/atomic-openshift-master.service
           └─21793 /usr/bin/openshift start master --config=/etc/origin/master/master-config.yaml --loglevel=5

Nov 04 08:27:22 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:22.447112   21793 attach_detach_controller.go:520] processVolumesInUse for node "openshift-225.lab.eng.nay.redhat.com"
Nov 04 08:27:23 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:23.506399   21793 logs.go:41] skydns: received DNS Request for "logging-es-ops.logging.svc.cluster.local." from "1... with type 1
Nov 04 08:27:23 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:23.506468   21793 serviceresolver.go:90] Answering query logging-es-ops.logging.svc.cluster.local.:false
Nov 04 08:27:26 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:26.892710   21793 controller.go:105] Found 0 scheduledjobs
Nov 04 08:27:26 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:26.898628   21793 controller.go:113] Found 0 jobs
Nov 04 08:27:26 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:26.898646   21793 controller.go:116] Found 0 groups
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:27.259011   21793 nodecontroller.go:816] Node openshift-225.lab.eng.nay.redhat.com ReadyCondition updated. Updatin...l>} s:0 Form
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: b.eng.nay.redhat.com:5000/openshift3/ose-deployer:v3.3.1.4] SizeBytes:599431864} {Names:[virt-openshift-05.lab.eng.nay.redhat.....1.4] SizeBy
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: d:{Dec:<nil>} s:4 Format:DecimalSI} memory:{i:{value:8203067392 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Phase: Conditions:...isk Message:
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: 00a0b3d170f23474fadb7ed94497d5556e48691f931bb619] SizeBytes:491041535} {Names:[virt-openshift-05.lab.eng.nay.redhat.com:5000/o...0549a0cfc1b2

Expected results:


Additional info:
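For reference, a rough sketch of the manual recovery used here, not an official procedure: restart etcd first, then the master, then resume the upgrade from the retry file the failed run printed (path taken from the output above; adjust if yours differs).

# On the affected master:
systemctl restart etcd
systemctl restart atomic-openshift-master
systemctl status atomic-openshift-master --no-pager

# Resume the control plane upgrade from the retry file:
ansible-playbook /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade_control_plane.yml \
    --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade_control_plane.retry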
I believe this is fixed by https://github.com/openshift/openshift-ansible/pull/2751
I believe this was fixed by https://github.com/openshift/openshift-ansible/pull/2773, so marking as a dupe of bug 1393187.

*** This bug has been marked as a duplicate of bug 1393187 ***
The bug this was marked as a duplicate of is closed due to an errata that includes openshift-ansible.noarch 3.4.44-1.git.0.efa61c6.el7; however, the messages in that bug seem different from those reported here. I have a customer who hit this issue and they already had openshift-ansible.noarch 3.4.44-1.git.0.efa61c6.el7:

> Feb 21 22:29:57 l3imas-id1-01 systemd[1]: Started Etcd Server.
> Feb 21 22:29:57 l3imas-id1-01 etcd[78334]: Failed to dial 72.163.48.14:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
> Feb 21 22:29:57 l3imas-id1-01 etcd[78334]: Failed to dial 72.163.48.14:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
> Feb 21 22:29:57 l3imas-id1-01 etcd[78334]: Failed to dial 72.163.48.14:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
. . .

The bug this is a dupe of does not mention bad certificates. Apologies for reopening, but this appears to still be an issue.
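To judge whether the dial warnings actually indicate a broken cluster, a minimal health check can be run against the endpoint from the customer's log. This is only a sketch: it assumes the default /etc/etcd certificate paths and the etcd v2 etcdctl flags shipped with OCP 3.x; adjust the paths and endpoint as needed.

# Check overall etcd cluster health with the peer certificates (paths assumed):
etcdctl --ca-file /etc/etcd/ca.crt \
        --cert-file /etc/etcd/peer.crt \
        --key-file /etc/etcd/peer.key \
        --endpoints https://72.163.48.14:2379 cluster-health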
These warning messages do not affect operation as far as we know. Closing as not a bug.