Bug 1594341 - HA upgrade from 3.9 to 3.10 fails - control plane node fails to restart with pending csr
Summary: HA upgrade from 3.9 to 3.10 fails - control plane node fails to restart with pending csr
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Scott Dodson
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-22 16:43 UTC by Mike Fiedler
Modified: 2018-07-30 20:22 UTC (History)
14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 20:22:32 UTC
Target Upstream Version:
Embargoed:


Attachments
Ansible output, inventory, journal from failed node (540.02 KB, application/x-gzip)
2018-06-22 16:43 UTC, Mike Fiedler
Ansible log, inventory, failed node journal, master api/controller logs (1.62 MB, application/x-gzip)
2018-06-22 17:43 UTC, Mike Fiedler


Links
Red Hat Product Errata RHBA-2018:2263 (last updated 2018-07-30 20:22:54 UTC)

Description Mike Fiedler 2018-06-22 16:43:08 UTC
Created attachment 1453782 [details]
Ansible output, inventory, journal from failed node

Description of problem:

While running upgrade_control_plane.yaml to upgrade a 3.9.27 cluster to 3.10, the upgrade failed when the node service on one of the masters failed to restart. The journal on the failed master showed:

Jun 22 16:26:16 ip-172-31-20-147.us-west-2.compute.internal atomic-openshift-node[36060]: I0622 16:26:16.367655   36060 logs.go:49] http: TLS handshake error from 172.31.52.228:46510: no serving certificate available for the kubelet                                                 
Jun 22 16:27:09 ip-172-31-20-147.us-west-2.compute.internal atomic-openshift-node[36060]: I0622 16:27:09.755333   36060 logs.go:49] http: TLS handshake error from 172.31.52.228:46686: no serving certificate available for the kubelet                                                 
Jun 22 16:27:16 ip-172-31-20-147.us-west-2.compute.internal atomic-openshift-node[36060]: I0622 16:27:16.367490   36060 logs.go:49] http: TLS handshake error from 172.31.52.228:46714: no serving certificate available for the kubelet

oc get csr:

NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-2bgqr                                              9m        system:node:ip-172-31-20-147.us-west-2.compute.internal   Pending
csr-r4627                                              12m       system:node:ip-172-31-49-131.us-west-2.compute.internal   Approved,Issued
csr-zk6gc                                              9m        system:node:ip-172-31-20-147.us-west-2.compute.internal   Pending
node-csr-TfmRLapr36P-wcLn29hvgg_LBG_tzVUoBeeYx8yagUE   12m       system:admin                                              Approved,Issued
node-csr-X-oaimgu1HO_gMM-Vg7e19MA35GUN4PuglhpjEPFrKw   9m        system:admin                                              Approved,Issued
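
For reference (a manual workaround sketch, not the flow the playbook uses), the two Pending serving CSRs from the failed master above could be inspected and approved by hand:

oc describe csr csr-2bgqr
oc adm certificate approve csr-2bgqr csr-zk6gc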

Version-Release number of the following components:
root@ip-172-31-49-131: ~ # rpm -q openshift-ansible
openshift-ansible-3.10.5-1.git.204.3a9cb75.el7.noarch
root@ip-172-31-49-131: ~ # rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch
root@ip-172-31-49-131: ~ # ansible --version
ansible 2.4.3.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Apr 19 2018, 05:40:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible: Unknown.  First time I've hit this one.

Steps to Reproduce:
1. upgrade_control_plane.yaml from 3.9.27 to 3.10 with the inventory attached

Cluster is on AWS - 1 lb, 3 masters, 1 infra, 1 compute


Ansible output, inventory, journal from failed node attached.

Comment 1 Scott Dodson 2018-06-22 17:03:04 UTC
Moving this to the pod team to diagnose; we probably need API and controller logs from all masters in order to trace this down. Can those be gathered?

Comment 2 Mike Fiedler 2018-06-22 17:43:13 UTC
Created attachment 1453821 [details]
Ansible log, inventory, failed node journal, master api/controller logs

Updated log tarball with master controller/api logs from all masters

Comment 3 Anping Li 2018-06-25 02:23:51 UTC
Hit a similar issue. On some nodes the CSRs get approved, while on other nodes all CSRs stay Pending.


# oc rsh docker-registry-2-hv95j
Error from server: error dialing backend: remote error: tls: internal error


[root@preserve-upgrader6-master-etcd-1 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                      CONDITION
csr-2lj52                                              43m       system:node:preserve-upgrader6-master-etcd-3   Approved,Issued
csr-7c646                                              49m       system:node:preserve-upgrader6-master-etcd-1   Approved,Issued
csr-87qkj                                              4m        system:node:preserve-upgrader6-nrr-1           Pending
csr-87xqw                                              46m       system:node:preserve-upgrader6-master-etcd-2   Approved,Issued
csr-8xbh4                                              9m        system:node:preserve-upgrader6-nrr-3           Pending
csr-96kzg                                              26m       system:node:preserve-upgrader6-nrr-2           Pending
csr-clb8z                                              22m       system:node:preserve-upgrader6-nrr-3           Approved,Issued
csr-dvp9f                                              43m       system:node:preserve-upgrader6-master-etcd-3   Approved,Issued
csr-fcshk                                              22m       system:node:preserve-upgrader6-nrr-3           Pending
csr-gdk6h                                              46m       system:node:preserve-upgrader6-master-etcd-2   Approved,Issued
csr-j5d6t                                              30m       system:node:preserve-upgrader6-nrr-1           Approved,Issued
csr-mx7mr                                              29m       system:node:preserve-upgrader6-nrr-1           Pending
csr-pqzd7                                              13m       system:node:preserve-upgrader6-nrr-2           Pending
csr-rn7hz                                              26m       system:node:preserve-upgrader6-nrr-2           Approved,Issued
csr-w74tj                                              51s       system:node:preserve-upgrader6-nrr-2           Pending
csr-wv2vk                                              17m       system:node:preserve-upgrader6-nrr-1           Pending
node-csr-1Wz4ai_URlmdHQvILu7ZesfCJebMfUlgXsex1pmBAO8   30m       system:node:preserve-upgrader6-nrr-1           Approved,Issued
node-csr-5JUb8p2gGT-frUjwR-IVixyNucfAbyT_-wTZlGO3OYI   46m       system:admin                                   Approved,Issued
node-csr-DrFGFkv4WoH22L16AH2b0uw4uFCUO_gXuZHGYmHTw5I   22m       system:node:preserve-upgrader6-nrr-3           Approved,Issued
node-csr-HNWeRNYkQBVzq17-BouZ5-rs9oE6FJfS13hmrnMoui4   26m       system:node:preserve-upgrader6-nrr-2           Approved,Issued
node-csr-LjDwoNrSIcB3aCDgDYSppY7s9G7hWBW2lFMCqswrj-g   49m       system:admin                                   Approved,Issued
node-csr-gILSmc29yn0goNQAzm-e5HcoJzYQXuK5rme4BWntB2Y   43m       system:admin                                   Approved,Issued
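
For reference, a rough way to bulk-approve only the Pending entries in a listing like the one above (a sketch that assumes the default oc get csr column layout shown here):

oc get csr | awk '/Pending/ {print $1}' | xargs -r oc adm certificate approve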

Comment 4 Anping Li 2018-06-25 02:25:44 UTC
Upgrade from v3.9.31 to v3.10.5 with openshift-ansible:v3.10.5

       "atomic-openshift version: v3.9.31", 
        "Operation System: Red Hat Enterprise Linux Atomic Host release 7.4", 
        "Cluster Install Method: system container", 
        "Docker Version: docker-1.13.1-53.git774336d.el7.x86_64", 
        "Docker Storage Driver:  overlay2", 
        "OpenvSwitch Version: openvswitch-2.9.0-19.el7fdp.x86_64", 
        "etcd Version: etcd-3.2.18-1.el7.x86_64", 
        "Network Plugin: redhat/openshift-ovs-subnet", 
        "Auth Method: LDAP_IPA", 
        "Registry Deployment Method: deploymentconfig", 
        "Secure Registry: True", 
        "Registry Backend Storage: swift", 
        "Load Balancer: Haproxy", 
        "Docker System Container: False", 
        "CRI-O Enable: False", 
        "Firewall Service: iptables"
Kernel Version: 3.10.0-693.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Atomic Host 7.4.5
ansible-2.4.4.0-1.el7ae.noarch

Comment 5 liujia 2018-06-25 03:03:05 UTC
Hit the issue on openshift-ansible-3.10.7-1.git.220.50204c4.el7.noarch

Upgrade against containerized OCP on RHEL failed at TASK [openshift_node : Check status of node service] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:55
Monday 25 June 2018  02:43:49 +0000 (0:00:02.933)       0:42:19.065 *********** 
FAILED - RETRYING: Check status of node service (30 retries left).
...
fatal: [x]: FAILED! => {"ansible_job_id": "763491179835.69476", "attempts": 26, "changed": false, "failed": true, "finished": 1, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because a timeout was exceeded. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

[root@qe-jliu-autolcys-master-etcd-1 ~]# oc get csr
NAME                                                   AGE       REQUESTOR                                    CONDITION
csr-jq6hh                                              41m       system:node:qe-jliu-autolcys-master-etcd-1   Approved,Issued
node-csr-ATD8BXOHPFp03uZfy0eIi9g_icXpss_DGVJKQjU5hr8   13m       system:node:qe-jliu-autolcys-nrr-1           Pending
node-csr-Ji00SP_mLBhEv-XS5DrvFu5MYXzjS05mU-h7ot4TBDo   41m       system:admin                                 Approved,Issued

[root@qe-jliu-autolcys-master-etcd-1 ~]# oc get node
NAME                             STATUS                        ROLES     AGE       VERSION
qe-jliu-autolcys-master-etcd-1   Ready                         master    1h        v1.10.0+b81c8f8
qe-jliu-autolcys-nrr-1           NotReady,SchedulingDisabled   compute   1h        v1.9.1+a0ce1bc657
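
As the error message suggests, the usual next step on the failed node is to look at the service status and journal (illustrative commands only):

systemctl status atomic-openshift-node.service
journalctl -u atomic-openshift-node.service --no-pager | tail -n 200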

Comment 10 Mike Fiedler 2018-06-25 17:51:32 UTC
The original issue reported here was on an HA cluster.
The comment 3 instance is HA.
The comment 5 instance is non-HA.
The comment 7 instance is HA - that cluster is still available in this state as of today.

Not sure if it is more prevalent on HA - that is the only kind of upgrade I am doing, personally.

Comment 18 Aleksandar Kostadinov 2018-06-27 08:14:30 UTC
Same/related to bug 1571515?

Comment 20 Jason Montleon 2018-06-27 13:37:54 UTC
You can figure out which node the containers are running on, use docker ps to look at the name and command to determine which container relates to which pod, and use docker exec to examine them even if you can't with oc exec, etc.

Comment 22 Jason Montleon 2018-06-27 13:54:39 UTC
This isn't going to be resolved by mid-day today for 3.10.

We need to get to the bottom of why the database data is missing, especially in the case of prod instances.

Can you deploy 3.9, provision your APB pairs normally, verify on the prod instances that the storage is mounted in the pod and data files exist, and take note of which pods are bound to which PVCs and which PVs are bound to which claims (oc get pvc --all-namespaces and oc get pv)?

Then run oc get pvc --all-namespaces and oc get pv again after the upgrade to confirm nothing has flip-flopped or changed for some reason. Similarly, especially in the case of prod instances, note which ones aren't working, whether the PVCs are mounted, and whether the data is available.

I'll try to run through another upgrade with multiple prod instances and do the same and see if I can reproduce it today.

Comment 23 Seth Jennings 2018-06-27 14:13:03 UTC
Jason, I think you might have been commenting on the wrong bug.

Comment 24 Jason Montleon 2018-06-27 14:56:09 UTC
Seth, I was. Please ignore my comments.

Comment 27 Seth Jennings 2018-06-27 19:26:22 UTC
I have a theory on what is happening here:

I think that WaitForCertificate() does not expect an already approved CSR to be used.  It uses a cache.ListWatchUntil() waiting for a change on the CSR that will never happen if the CSR is already approved.

rotateCerts()
  csr.RequestCertificate()
  csr.WaitForCertificate()

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/util/certificate/certificate_manager.go#L286-L310

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/util/certificate/csr/csr.go#L85-L118

Comment 28 Seth Jennings 2018-06-27 19:35:09 UTC
Caught it in the act:

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: goroutine 1 [select]:

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/watch.Until(0x34630b8a000, 0x8bffee0, 0xc421246b10, 0xc42121aaf0, 0x1, 0x1, 0x0, 0x0, 0x0)

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: /builddir/build/BUILD/atomic-openshift-git-0.2f94e36/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/watch/until.go:65 +0x1d0

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache.ListWatchUntil(0x34630b8a000, 0x8c02f60, 0xc42122fa20, 0xc42121aaf0, 0x1, 0x1, 0x0, 0x0, 0x0)

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: /builddir/build/BUILD/atomic-openshift-git-0.2f94e36/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/listwatch.go:182 +0x602

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/client-go/util/certificate/csr.WaitForCertificate(0x8c45ae0, 0xc420e9b250, 0xc4200a1dc0, 0x34630b8a000, 0x1c0, 0xc420eb23c0, 0x34, 0xc420eb0cf0, 0x3)

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: /builddir/build/BUILD/atomic-openshift-git-0.2f94e36/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/util/certificate/csr/csr.go:124 +0x297

Jun 27 19:32:20 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[109282]: github.com/openshift/origin/vendor/k8s.io/client-go/util/certificate/csr.RequestNodeCertificate(0x8c45ae0, 0xc420e9b250, 0xc420930600, 0xe3, 0x2e3, 0x7ffe6c568769, 0x16, 0x0, 0x0, 0x0, ...)

Comment 29 Seth Jennings 2018-06-27 19:54:15 UTC
Man, I'm dense.  The CSR is pending, that is why it is blocked there.  So everything I said was true... and working as expected.

The real problem is that there is nothing approving the CSRs.  The autoapprover is not being deployed.

Comment 30 Seth Jennings 2018-06-27 20:07:45 UTC
Learning as I go here.

The autoapprover is not installed by default and apparently the ansible playbook is supposed to approve the CSRs.  Which seems brittle.  There might be something going wrong there.

Comment 31 Seth Jennings 2018-06-27 20:28:48 UTC
Sending back to Upgrade. It is the role of the playbooks to approve the CSRs and they are not doing that for some reason.

Comment 32 Scott Dodson 2018-06-27 22:05:25 UTC
Logs from comment 19 show that the node service was started at 

Wednesday 27 June 2018  08:16:37 +0000 (0:00:00.032)       0:59:14.337 ******** 
changed: [qe-jliu-autowvyc-nrr-1.0627-d04.qe.rhcloud.com] => {"ansible_job_id": "989379698424.52969", "changed": true, "failed": false, "failed_when_result": false, "finished": 0, "results_file": "/root/.ansible_async/989379698424.52969", "started": 1}

And the call to oc_adm_csr shows that both server and client CSRs are marked as approved at 08:16:51.

TASK [openshift_node : Approve the node] ***************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:48
Wednesday 27 June 2018  08:16:51 +0000 (0:00:13.486)       0:59:27.823 ******** 
ok: [qe-jliu-autowvyc-nrr-1.0627-d04.qe.rhcloud.com -> qe-jliu-autowvyc-master-etcd-1.0627-d04.qe.rhcloud.com] => {"changed": false, "failed": false, "finished": true, "nodes": [{"client_accepted": true, "csrs": {}, "denied": false, "name": "qe-jliu-autowvyc-nrr-1", "server_accepted": true}], "results": [], "state": "approve", "timeout": false}

But the CSR for this node wasn't created until some time later. 

# oc describe csr node-csr-fO8ao4Vf2OM49r9NXG6UKdLebzwLvl4P9FtWH30ju88 | grep CreationTimestamp
CreationTimestamp:  Wed, 27 Jun 2018 08:17:04 +0000

Jun 27 08:16:54 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[52980]: I0627 08:16:54.216017   52989 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Jun 27 08:16:54 qe-jliu-autowvyc-nrr-1 atomic-openshift-node[52980]: I0627 08:16:54.217911   52989 bootstrap.go:86] No valid private key and/or certificate found, reusing existing private key or creating a new o

Then at Jun 27 08:18:21 qe-jliu-autowvyc-nrr-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.

For some reason oc_adm_csr said all CSRs were signed before they were actually created.
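
For reference, the CSR creation timestamps can be pulled directly and compared against the Ansible task times, e.g. (illustrative jsonpath/custom-columns usage):

oc get csr node-csr-fO8ao4Vf2OM49r9NXG6UKdLebzwLvl4P9FtWH30ju88 -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
oc get csr -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp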

Comment 33 Scott Dodson 2018-06-28 19:07:30 UTC
oc_adm_csr was treating any host that already had a node object as having its CSRs approved, even when the CSRs may not have existed yet.

https://github.com/openshift/openshift-ansible/pull/9020 fixes this so that a node is only considered complete if it is also marked Ready.

This leaves a gap where oc_adm_csr likely fails to renew certs that have hit 80% of their lifespan and have created new CSRs.
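
One possible way to check how close a node's certificates are to the renewal threshold (the paths below are an assumption based on a typical 3.10 node layout):

openssl x509 -noout -dates -in /etc/origin/node/certificates/kubelet-client-current.pem
openssl x509 -noout -dates -in /etc/origin/node/certificates/kubelet-server-current.pem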

Comment 35 liujia 2018-07-02 09:44:00 UTC
Should be blocked by new regression bz1597219

Comment 36 Mike Fiedler 2018-07-10 00:49:33 UTC
I intend to mark this verified unless anyone is still hitting this.   On openshift-ansible 3.10.15 this is no longer seen across many upgrades.   @liujia let me know if you disagree or have seen this recently.

Comment 37 liujia 2018-07-10 01:42:06 UTC
Hi Mike
I think you can verify the bug. I did not hit it on v3.10.15.

Comment 38 Mike Fiedler 2018-07-10 01:47:42 UTC
Marking verified on 3.10.15 - csr pending and failures to restart atomic-openshift-node on masters are no longer encountered.

Comment 41 Mike Fiedler 2018-07-13 18:41:06 UTC
Moving back to assigned per comment 39 and comment 40

Comment 42 Weihua Meng 2018-07-17 05:27:12 UTC
I have only hit this error once recently.
Removing the testblocker keyword.

Comment 43 Scott Dodson 2018-07-18 17:15:10 UTC
https://github.com/openshift/openshift-ansible/pull/9238 should resolve this. We found that control plane nodes get restarted during the join process and if they're NotReady then certificate approval fails.
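
For anyone diagnosing a similar failure, a quick way to check the two conditions involved here, node readiness and outstanding CSRs, is (illustrative only):

oc get nodes
oc get csr | awk '/Pending/ {print $1}'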

Comment 46 Vadim Rutkovsky 2018-07-20 07:55:28 UTC
Fix is available in openshift-ansible-3.10.21-1

Comment 47 liujia 2018-07-23 09:02:11 UTC
Verified on openshift-ansible-3.10.21-1.git.0.6446011.el7.noarch

Comment 48 Vadim Rutkovsky 2018-07-23 11:20:14 UTC
Haven't seen this since PR #9238 was merged.

Comment 50 errata-xmlrpc 2018-07-30 20:22:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2263

