Created attachment 1453782 [details]
Ansible output, inventory, journal from failed node

Description of problem:
While running upgrade_control_plane.yaml to upgrade a 3.9.27 cluster to 3.10, the upgrade failed when one of the node services on a master failed to restart.

The journal on the failed master showed:

Jun 22 16:26:16 ip-172-31-20-147.us-west-2.compute.internal atomic-openshift-node[36060]: I0622 16:26:16.367655   36060 logs.go:49] http: TLS handshake error from 172.31.52.228:46510: no serving certificate available for the kubelet
Jun 22 16:27:09 ip-172-31-20-147.us-west-2.compute.internal atomic-openshift-node[36060]: I0622 16:27:09.755333   36060 logs.go:49] http: TLS handshake error from 172.31.52.228:46686: no serving certificate available for the kubelet
Jun 22 16:27:16 ip-172-31-20-147.us-west-2.compute.internal atomic-openshift-node[36060]: I0622 16:27:16.367490   36060 logs.go:49] http: TLS handshake error from 172.31.52.228:46714: no serving certificate available for the kubelet

oc get csr:

NAME                                                   AGE   REQUESTOR                                                  CONDITION
csr-2bgqr                                              9m    system:node:ip-172-31-20-147.us-west-2.compute.internal   Pending
csr-r4627                                              12m   system:node:ip-172-31-49-131.us-west-2.compute.internal   Approved,Issued
csr-zk6gc                                              9m    system:node:ip-172-31-20-147.us-west-2.compute.internal   Pending
node-csr-TfmRLapr36P-wcLn29hvgg_LBG_tzVUoBeeYx8yagUE   12m   system:admin                                               Approved,Issued
node-csr-X-oaimgu1HO_gMM-Vg7e19MA35GUN4PuglhpjEPFrKw   9m    system:admin                                               Approved,Issued

Version-Release number of the following components:

root@ip-172-31-49-131: ~ # rpm -q openshift-ansible
openshift-ansible-3.10.5-1.git.204.3a9cb75.el7.noarch

root@ip-172-31-49-131: ~ # rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch

root@ip-172-31-49-131: ~ # ansible --version
ansible 2.4.3.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Apr 19 2018, 05:40:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:
Unknown. First time I've hit this one.

Steps to Reproduce:
1. upgrade_control_plane.yaml from 3.9.27 to 3.10 with the inventory attached

Cluster is on AWS - 1 lb, 3 masters, 1 infra, 1 compute

Ansible output, inventory, journal from failed node attached.
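For reference, a manual workaround (a sketch of what one could do by hand, not what the playbook does; the CSR names are the pending ones from the output above) would be to inspect and approve the pending requests from a master:

# oc describe csr csr-2bgqr
# oc adm certificate approve csr-2bgqr csr-zk6gc

Once the serving-certificate requests are approved, the "no serving certificate available for the kubelet" handshake errors should stop and the node service restart should be able to proceed.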
Moving this to the pod team to diagnose. We probably need api and controller logs from all masters in order to trace this down. Can those be gathered?
Created attachment 1453821 [details]
Ansible log, inventory, failed node journal, master api/controller logs

Updated log tarball with master controller/api logs from all masters
Hit a similar issue. On some nodes the CSRs get approved, while on the other nodes all CSRs stay pending.

# oc rsh docker-registry-2-hv95j
Error from server: error dialing backend: remote error: tls: internal error

[root@preserve-upgrader6-master-etcd-1 ~]# oc get csr
NAME                                                   AGE   REQUESTOR                                      CONDITION
csr-2lj52                                              43m   system:node:preserve-upgrader6-master-etcd-3   Approved,Issued
csr-7c646                                              49m   system:node:preserve-upgrader6-master-etcd-1   Approved,Issued
csr-87qkj                                              4m    system:node:preserve-upgrader6-nrr-1           Pending
csr-87xqw                                              46m   system:node:preserve-upgrader6-master-etcd-2   Approved,Issued
csr-8xbh4                                              9m    system:node:preserve-upgrader6-nrr-3           Pending
csr-96kzg                                              26m   system:node:preserve-upgrader6-nrr-2           Pending
csr-clb8z                                              22m   system:node:preserve-upgrader6-nrr-3           Approved,Issued
csr-dvp9f                                              43m   system:node:preserve-upgrader6-master-etcd-3   Approved,Issued
csr-fcshk                                              22m   system:node:preserve-upgrader6-nrr-3           Pending
csr-gdk6h                                              46m   system:node:preserve-upgrader6-master-etcd-2   Approved,Issued
csr-j5d6t                                              30m   system:node:preserve-upgrader6-nrr-1           Approved,Issued
csr-mx7mr                                              29m   system:node:preserve-upgrader6-nrr-1           Pending
csr-pqzd7                                              13m   system:node:preserve-upgrader6-nrr-2           Pending
csr-rn7hz                                              26m   system:node:preserve-upgrader6-nrr-2           Approved,Issued
csr-w74tj                                              51s   system:node:preserve-upgrader6-nrr-2           Pending
csr-wv2vk                                              17m   system:node:preserve-upgrader6-nrr-1           Pending
node-csr-1Wz4ai_URlmdHQvILu7ZesfCJebMfUlgXsex1pmBAO8   30m   system:node:preserve-upgrader6-nrr-1           Approved,Issued
node-csr-5JUb8p2gGT-frUjwR-IVixyNucfAbyT_-wTZlGO3OYI   46m   system:admin                                   Approved,Issued
node-csr-DrFGFkv4WoH22L16AH2b0uw4uFCUO_gXuZHGYmHTw5I   22m   system:node:preserve-upgrader6-nrr-3           Approved,Issued
node-csr-HNWeRNYkQBVzq17-BouZ5-rs9oE6FJfS13hmrnMoui4   26m   system:node:preserve-upgrader6-nrr-2           Approved,Issued
node-csr-LjDwoNrSIcB3aCDgDYSppY7s9G7hWBW2lFMCqswrj-g   49m   system:admin                                   Approved,Issued
node-csr-gILSmc29yn0goNQAzm-e5HcoJzYQXuK5rme4BWntB2Y   43m   system:admin                                   Approved,Issued
Upgrade from v3.9.31 to v3.10.5 with openshift-ansible:v3.10.5

atomic-openshift version: v3.9.31
Operation System: Red Hat Enterprise Linux Atomic Host release 7.4
Cluster Install Method: system container
Docker Version: docker-1.13.1-53.git774336d.el7.x86_64
Docker Storage Driver: overlay2
OpenvSwitch Version: openvswitch-2.9.0-19.el7fdp.x86_64
etcd Version: etcd-3.2.18-1.el7.x86_64
Network Plugin: redhat/openshift-ovs-subnet
Auth Method: LDAP_IPA
Registry Deployment Method: deploymentconfig
Secure Registry: True
Registry Backend Storage: swift
Load Balancer: Haproxy
Docker System Container: False
CRI-O Enable: False
Firewall Service: iptables
Kernel Version: 3.10.0-693.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Atomic Host 7.4.5
ansible-2.4.4.0-1.el7ae.noarch
Hit the issue on openshift-ansible-3.10.7-1.git.220.50204c4.el7.noarch

Upgrade against containerized ocp on rhel failed at task:

TASK [openshift_node : Check status of node service] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:55
Monday 25 June 2018  02:43:49 +0000 (0:00:02.933)       0:42:19.065 ***********
FAILED - RETRYING: Check status of node service (30 retries left).
...
fatal: [x]: FAILED! => {"ansible_job_id": "763491179835.69476", "attempts": 26, "changed": false, "failed": true, "finished": 1, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because a timeout was exceeded. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

[root@qe-jliu-autolcys-master-etcd-1 ~]# oc get csr
NAME                                                   AGE   REQUESTOR                                    CONDITION
csr-jq6hh                                              41m   system:node:qe-jliu-autolcys-master-etcd-1   Approved,Issued
node-csr-ATD8BXOHPFp03uZfy0eIi9g_icXpss_DGVJKQjU5hr8   13m   system:node:qe-jliu-autolcys-nrr-1           Pending
node-csr-Ji00SP_mLBhEv-XS5DrvFu5MYXzjS05mU-h7ot4TBDo   41m   system:admin                                 Approved,Issued

[root@qe-jliu-autolcys-master-etcd-1 ~]# oc get node
NAME                             STATUS                        ROLES     AGE   VERSION
qe-jliu-autolcys-master-etcd-1   Ready                         master    1h    v1.10.0+b81c8f8
qe-jliu-autolcys-nrr-1           NotReady,SchedulingDisabled   compute   1h    v1.9.1+a0ce1bc657
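To see what the node is actually blocked on while that ansible task retries (standard diagnostics only; nothing below is specific to this cluster beyond the unit name), check the service and journal on the failed node and watch the CSRs from a master:

# systemctl status atomic-openshift-node.service
# journalctl -u atomic-openshift-node.service --since "30 minutes ago" | grep -i certificate
# oc get csr --watch          (on a master)

In the output above the only CSR for qe-jliu-autolcys-nrr-1 is still Pending, which is consistent with the service start timing out.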
The original issue for this was HA.

comment 3 instance is HA
comment 5 instance is non-HA
comment 7 instance is HA - this cluster is still available in this state through today

Not sure if it is more prevalent on HA - that is the only kind of upgrade I am doing, personally.
Same/related to bug 1571515?
You can figure out which node the containers are running on, use docker ps (looking at the name and command) to determine which container belongs to which pod, and use docker exec to examine them even when oc exec doesn't work, roughly as in the example below.
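For example (a rough sketch; the container id is a placeholder and docker-registry is just the pod from the earlier comment), on the node where the pod is scheduled:

# docker ps --format '{{.ID}}  {{.Names}}' | grep docker-registry
# docker exec -it <container-id> /bin/sh

The kubelet names containers as k8s_<container>_<pod>_<namespace>_..., so the pod a container belongs to can be read straight out of the docker name.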
This isn't going to be resolved by mid-day today for 3.10. We need to get to the bottom of why the database data is missing, especially in the case of prod instances.

Can you deploy 3.9, provision your apb pairs normally, verify on the prod instances that the storage is mounted in the pod and the data files exist, and take note of which pods are bound to which PVCs and which PVCs are bound to which PVs (oc get pvc --all-namespaces and oc get pv)?

Then run oc get pvc --all-namespaces and oc get pv again after the upgrade to confirm nothing has flip-flopped or changed for some reason, and similarly, especially in the case of prod instances, note which ones aren't working, whether the PVCs are mounted, and whether the data is available.

I'll try to run through another upgrade with multiple prod instances, do the same, and see if I can reproduce it today.
Jason, I think you might have been commenting on the wrong bug.
Seth, I was. Please ignore my comments.
I have a theory on what is happening here: I think that WaitForCertificate() does not expect an already approved CSR to be used. It uses a cache.ListWatchUntil() waiting for a change on the CSR that will never happen if the CSR is already approved.

rotateCerts()
  csr.RequestCertificate()
    csr.WaitForCertificate()

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/util/certificate/certificate_manager.go#L286-L310
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/util/certificate/csr/csr.go#L85-L118
Caught it in the act:

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: goroutine 1 [select]:
Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/watch.Until(0x34630b8a000, 0x8bffee0, 0xc421246b10, 0xc42121aaf0, 0x1, 0x1, 0x0, 0x0, 0x0)
Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: /builddir/build/BUILD/atomic-openshift-git-0.2f94e36/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/watch/until.go:65 +0x1d0
Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache.ListWatchUntil(0x34630b8a000, 0x8c02f60, 0xc42122fa20, 0xc42121aaf0, 0x1, 0x1, 0x0, 0x0, 0x0)
Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: /builddir/build/BUILD/atomic-openshift-git-0.2f94e36/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/listwatch.go:182 +0x602
Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/client-go/util/certificate/csr.WaitForCertificate(0x8c45ae0, 0xc420e9b250, 0xc4200a1dc0, 0x34630b8a000, 0x1c0, 0xc420eb23c0, 0x34, 0xc420eb0cf0, 0x3)
Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: /builddir/build/BUILD/atomic-openshift-git-0.2f94e36/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/util/certificate/csr/csr.go:124 +0x297
Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/client-go/util/certificate/csr.RequestNodeCertificate(0x8c45ae0, 0xc420e9b250, 0xc420930600, 0xe3, 0x2e3, 0x7ffe6c568769, 0x16, 0x0, 0x0, 0x0, ...)
Man, I'm dense. The CSR is pending, that is why it is blocked there. So everything I said was true... and working as expected. The real problem is that there is nothing approving the CSRs. The autoapprover is not being deployed.
Learning as I go here. The autoapprover is not installed by default and apparently the ansible playbook is supposed to approve the CSRs. Which seems brittle. There might be something going wrong there.
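As a stopgap, everything pending can be approved by hand (roughly what the approval step is meant to accomplish, not the actual oc_adm_csr module logic):

# oc get csr -o name | xargs oc adm certificate approve

Already-approved requests should simply be re-approved, so this can be run repeatedly while the playbook problem is chased down.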
Sending back to Upgrade. It is the role of the playbooks to approve the CSRs and it is not doing that for some reason.
Logs from comment 19 show that the node service was started at

Wednesday 27 June 2018  08:16:37 +0000 (0:00:00.032)       0:59:14.337 ********
changed: [qe-jliu-autowvyc-nrr-1.0627-d04.qe.rhcloud.com] => {"ansible_job_id": "989379698424.52969", "changed": true, "failed": false, "failed_when_result": false, "finished": 0, "results_file": "/root/.ansible_async/989379698424.52969", "started": 1}

And the call to oc_adm_csr shows that both service and client CSRs are marked as approved at 08:16:51.

TASK [openshift_node : Approve the node] ***************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:48
Wednesday 27 June 2018  08:16:51 +0000 (0:00:13.486)       0:59:27.823 ********
ok: [qe-jliu-autowvyc-nrr-1.0627-d04.qe.rhcloud.com -> qe-jliu-autowvyc-master-etcd-1.0627-d04.qe.rhcloud.com] => {"changed": false, "failed": false, "finished": true, "nodes": [{"client_accepted": true, "csrs": {}, "denied": false, "name": "qe-jliu-autowvyc-nrr-1", "server_accepted": true}], "results": [], "state": "approve", "timeout": false}

But the CSR for this node wasn't created until some time later.

# oc describe csr node-csr-fO8ao4Vf2OM49r9NXG6UKdLebzwLvl4P9FtWH30ju88 | grep CreationTimestamp
CreationTimestamp:  Wed, 27 Jun 2018 08:17:04 +0000

Jun 27 08:16:54 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[52980]: I0627 08:16:54.216017   52989 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Jun 27 08:16:54 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[52980]: I0627 08:16:54.217911   52989 bootstrap.go:86] No valid private key and/or certificate found, reusing existing private key or creating a new o

Then at

Jun 27 08:18:21 qe-jliu-autowvyc-nrr-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.

For some reason oc_adm_csr said all CSRs were signed before they were actually created.
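One way to line the CSR creation times up against the ansible task timestamps (a sketch using jsonpath, run on a master):

# oc get csr -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{"\t"}{.metadata.name}{"\n"}{end}' | sort

which should make it easy to confirm that every CSR for this node was created after the 08:16:51 "Approve the node" task.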
oc_adm_csr was considering any host that already had a node object as having its CSRs approved, when the CSRs may not have even existed yet.

https://github.com/openshift/openshift-ansible/pull/9020 fixes this to only consider a node to have completed if the node is also marked as ready. This leaves a gap where oc_adm_csr likely fails to renew certs that have hit 80% of their lifespan and have created new CSRs.
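For context on what "marked as ready" means there (a sketch; <nodename> is a placeholder), the readiness the fix keys on is the node's Ready condition, which can be checked by hand with:

# oc get node <nodename> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

A node still reporting False or Unknown here is not treated as complete, so its CSRs continue to be waited on and approved rather than skipped.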
Should be blocked by new regression bz1597219
I intend to mark this verified unless anyone is still hitting this. On openshift-ansible 3.10.15 this is no longer seen across many upgrades. @liujia let me know if you disagree or have seen this recently.
Hi Mike, I think you can verify the bug. I did not hit it on v3.10.15.
Marking verified on 3.10.15 - csr pending and failures to restart atomic-openshift-node on masters are no longer encountered.
Moving back to assigned per comment 39 and comment 40
I only hit this error once recently. Removing the testblocker keyword.
https://github.com/openshift/openshift-ansible/pull/9238 should resolve this. We found that control plane nodes get restarted during the join process and if they're NotReady then certificate approval fails.
Fix is available in openshift-ansible-3.10.21-1
Verified on openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch
Haven't seen this since PR #9238 was merged.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2263