Bug 1623145
Summary: | upgrade failed at TASK [etcd : Verify cluster is healthy pre-upgrade] | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Weihua Meng <wmeng> |
Component: | Cluster Version Operator | Assignee: | Scott Dodson <sdodson> |
Status: | CLOSED ERRATA | QA Contact: | Weihua Meng <wmeng> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 3.11.0 | CC: | aos-bugs, dmace, jiajliu, jialiu, jokerman, mmccomas, wmeng, wsun |
Target Milestone: | --- | ||
Target Release: | 3.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-11-26 15:51:28 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1624448 | ||
Bug Blocks: |
Description
Weihua Meng
2018-08-28 14:53:54 UTC
This is the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1616840. I've connected to the masters, downgraded the node, run the failing command, and the etcd service has been restored. We need to get the fix from that bug in and test again. That should be in the next build, or you can clone the master branch from GitHub.

Fixed.
openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch
Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Still hit it on openshift-ansible-3.11.0-0.25.0.git.0.7497e69.el7.noarch. Re-opening to keep track of it. A rough workaround is to restart the dnsmasq service while this etcd cluster verification task is running.

Hit this on openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch, and the reproduction rate is very high.

After re-running the upgrade job against the last failed job, [etcd : Verify cluster is healthy pre-upgrade] passed, but the upgrade failed at the following task:

TASK [openshift_node : Approve node certificates when bootstrapping] ***********
<--snip-->
FAILED - RETRYING: Approve node certificates when bootstrapping (1 retries left).
fatal: [qe-jialiu310-master-etcd-1.0906-byy.qe.rhcloud.com -> qe-jialiu310-master-etcd-1.0906-byy.qe.rhcloud.com]: FAILED! => {"attempts": 30, "changed": false, "msg": "The connection to the server qe-jialiu310-master-etcd-1:8443 was refused - did you specify the right host or port?\n", "state": "unknown"}

The master API static pod restarts again and again, with the following log:

I0906 09:26:12.355155 1 storage_factory.go:285] storing { apiServerIPInfo} in v1, reading as __internal from storagebackend.Config{Type:"etcd3", Prefix:"kubernetes.io", ServerList:[]string{"https://qe-jialiu310-master-etcd-1:2379"}, KeyFile:"/etc/origin/master/master.etcd-client.key", CertFile:"/etc/origin/master/master.etcd-client.crt", CAFile:"/etc/origin/master/master.etcd-ca.crt", Quorum:true, Paging:true, DeserializationCacheSize:0, Codec:runtime.Codec(nil), Transformer:value.Transformer(nil), CompactionInterval:300000000000, CountMetricPollPeriod:60000000000}
F0906 09:26:22.356339 1 start_api.go:68] context deadline exceeded

That makes my whole upgrade harder.

CI tests are failing on #9922 for some reason. I'll look into it today.

I've added a retry loop around our etcd health check so that it retries every 6 seconds for 180 seconds. https://github.com/openshift/openshift-ansible/pull/10026

If the problem still persists after that and we see signs that it's tied to DNS resolution, we should track that as part of https://bugzilla.redhat.com/show_bug.cgi?id=1624448

PR 10026 has been merged into openshift-ansible-3.11.3-1; please check the bug.

Blocked by bug 1628730.

Fixed.
openshift-ansible-3.11.5-1.git.0.5a01a3c.el7_5.noarch
Kernel Version: 3.10.0-862.11.6.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)

Re-opening as I hit this issue again.
openshift-ansible-3.11.9-1.git.0.63f7970.el7_5.noarch
After restarting the dnsmasq service, the cluster works.

Weihua, I think that's https://bugzilla.redhat.com/show_bug.cgi?id=1624448, which has a fix merged but it hasn't been built yet.

Fixed.
openshift-ansible-3.11.11-1.git.0.5d4f9d4.el7_5.noarch
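
For readers unfamiliar with the retry approach mentioned in this thread (an etcd health check retried every 6 seconds for 180 seconds), the following is a minimal Python sketch of that pattern. It is not the code from the openshift-ansible pull request, which is an Ansible change. The function name `retry_until_healthy`, its parameters, and the caller-supplied `check` callable are all hypothetical and exist only for illustration.

```python
import time
from typing import Callable


def retry_until_healthy(
    check: Callable[[], bool],
    delay_s: float = 6.0,
    timeout_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` repeatedly until it returns True or the time budget is spent.

    `check` is a hypothetical caller-supplied health probe (for example, a
    wrapper around whatever command verifies the etcd cluster). The defaults
    mirror the numbers quoted in the thread: one attempt every 6 seconds,
    for at most 180 seconds, which gives 30 attempts.
    """
    attempts = int(timeout_s // delay_s)
    for attempt in range(1, attempts + 1):
        if check():
            return True
        # Wait before the next attempt, but not after the final one.
        if attempt < attempts:
            sleep(delay_s)
    return False
```

A transient failure, such as a name lookup that only succeeds after dnsmasq is restarted, would then be absorbed as long as the probe starts passing within the 180-second window. A persistent failure still surfaces as `False` once all attempts are used.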