Bug 1391935 - Master can't be started when the external etcd was installed on master
Summary: Master can't be started when the external etcd was installed on master
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Steve Milner
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-11-04 12:42 UTC by Anping Li
Modified: 2020-04-15 14:48 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-09 15:12:11 UTC
Target Upstream Version:
Embargoed:
sdodson: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0066 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4 RPM Release Advisory 2017-01-18 17:23:26 UTC

Description Anping Li 2016-11-04 12:42:25 UTC
Description of problem:
Sometimes the master can't be started when the external etcd is installed on the master. After restarting the etcd service, the master comes back up.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.4.16-1.git.0.c846018.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. install OCP-3.3 with external etcd on masters
2. upgrade to OCP-3.4
  ansible-playbook /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade_control_plane.yml
3. check the etcd and master service

Actual results:
MSG:

Unable to restart service atomic-openshift-master: Job for atomic-openshift-master.service failed because a timeout was exceeded. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.

        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_4/upgrade_control_plane.retry

PLAY RECAP *********************************************************************
localhost                  : ok=15   changed=10   unreachable=0    failed=0
openshift-221.lab.eng.nay.redhat.com : ok=40   changed=1    unreachable=0    failed=0
openshift-225.lab.eng.nay.redhat.com : ok=122  changed=8    unreachable=0    failed=1
openshift-226.lab.eng.nay.redhat.com : ok=40   changed=1    unreachable=0    failed=0



[root@rpm-ose33-1 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-11-04 08:25:39 EDT; 1min 25s ago
 Main PID: 20958 (etcd)
   Memory: 71.8M
   CGroup: /system.slice/etcd.service
           └─20958 /usr/bin/etcd --name=default --data-dir=/var/lib/etcd/ --listen-client-urls=https://192.168.1.112:2379

Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: serving client requests on 192.168.1.112:2379
Nov 04 08:25:39 rpm-ose33-1.novalocal systemd[1]: Started Etcd Server.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: updated the cluster version from 2.3 to 3.0
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: enabled capabilities for version 3.0
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
Nov 04 08:25:39 rpm-ose33-1.novalocal etcd[20958]: Failed to dial 192.168.1.112:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
[root@rpm-ose33-1 ~]# systemctl restart atomic-openshift-master
[root@rpm-ose33-1 ~]# systemctl status atomic-openshift-master
● atomic-openshift-master.service - Atomic OpenShift Master
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-11-04 08:27:16 EDT; 11s ago
     Docs: https://github.com/openshift/origin
 Main PID: 21793 (openshift)
   Memory: 154.9M
   CGroup: /system.slice/atomic-openshift-master.service
           └─21793 /usr/bin/openshift start master --config=/etc/origin/master/master-config.yaml --loglevel=5

Nov 04 08:27:22 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:22.447112   21793 attach_detach_controller.go:520] processVolumesInUse for node "openshift-225.lab.eng.nay.redhat.com"
Nov 04 08:27:23 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:23.506399   21793 logs.go:41] skydns: received DNS Request for "logging-es-ops.logging.svc.cluster.local." from "1... with type 1
Nov 04 08:27:23 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:23.506468   21793 serviceresolver.go:90] Answering query logging-es-ops.logging.svc.cluster.local.:false
Nov 04 08:27:26 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:26.892710   21793 controller.go:105] Found 0 scheduledjobs
Nov 04 08:27:26 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:26.898628   21793 controller.go:113] Found 0 jobs
Nov 04 08:27:26 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:26.898646   21793 controller.go:116] Found 0 groups
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: I1104 08:27:27.259011   21793 nodecontroller.go:816] Node openshift-225.lab.eng.nay.redhat.com ReadyCondition updated. Updatin...l>} s:0 Form
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: b.eng.nay.redhat.com:5000/openshift3/ose-deployer:v3.3.1.4] SizeBytes:599431864} {Names:[virt-openshift-05.lab.eng.nay.redhat.....1.4] SizeBy
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: d:{Dec:<nil>} s:4 Format:DecimalSI} memory:{i:{value:8203067392 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Phase: Conditions:...isk Message:
Nov 04 08:27:27 rpm-ose33-1.novalocal atomic-openshift-master[21793]: 00a0b3d170f23474fadb7ed94497d5556e48691f931bb619] SizeBytes:491041535} {Names:[virt-openshift-05.lab.eng.nay.redhat.com:5000/o...0549a0cfc1b2
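
The repeated "bad certificate" dial failures in the etcd status output above can be tallied per endpoint to see how often they occur. The following is only a minimal sketch and not part of openshift-ansible: the function name count_bad_certificate_dials is hypothetical, and it assumes the journal text has the same message format as the lines quoted in this report.

```python
import re
from collections import Counter

# Matches the etcd journal message quoted in this report, for example:
#   Failed to dial 192.168.1.112:2379: connection error:
#   desc = "transport: remote error: bad certificate"; please retry.
DIAL_ERROR = re.compile(
    r'Failed to dial (?P<endpoint>\S+?): connection error: '
    r'desc = "transport: remote error: bad certificate"'
)


def count_bad_certificate_dials(journal_text):
    """Return a Counter mapping endpoint -> number of bad-certificate dial failures.

    journal_text is plain journal output (hypothetical helper, assumed format).
    """
    counts = Counter()
    for line in journal_text.splitlines():
        match = DIAL_ERROR.search(line)
        if match:
            counts[match.group("endpoint")] += 1
    return counts
```

The input could be the text captured from "journalctl -u etcd --no-pager" on the affected master. A non-zero count only confirms that the messages are present; it does not show whether they prevent the master from starting.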


Expected results:


Additional info:

Comment 2 Scott Dodson 2016-11-08 16:13:42 UTC
I believe this is fixed by https://github.com/openshift/openshift-ansible/pull/2751

Comment 5 Scott Dodson 2016-11-10 01:05:34 UTC
I believe this was fixed by https://github.com/openshift/openshift-ansible/pull/2773 so marking a dupe of 1393187

*** This bug has been marked as a duplicate of bug 1393187 ***

Comment 6 Steven Walter 2017-02-22 21:26:22 UTC
The bug this was marked as a duplicate of was closed by an erratum that includes
openshift-ansible.noarch 3.4.44-1.git.0.efa61c6.el7; however, the messages in that bug seem different from those reported here.

I have a customer who hit this issue and they already had openshift-ansible.noarch 3.4.44-1.git.0.efa61c6.el7

> Feb 21 22:29:57 l3imas-id1-01 systemd[1]: Started Etcd Server.
> Feb 21 22:29:57 l3imas-id1-01 etcd[78334]: Failed to dial 72.163.48.14:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
> Feb 21 22:29:57 l3imas-id1-01 etcd[78334]: Failed to dial 72.163.48.14:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
> Feb 21 22:29:57 l3imas-id1-01 etcd[78334]: Failed to dial 72.163.48.14:2379: connection error: desc = "transport: remote error: bad certificate"; please retry.
. . .

The bug this is a dupe of does not mention bad certificates. Apologies for reopening but this appears to still be an issue.

Comment 14 Scott Dodson 2017-03-09 15:12:11 UTC
These warning messages do not affect operation as far as we know. Closing not a bug.

