Bug 1689243
| Summary: | Upgrade from 3.9 to 3.10 fails on openshift_control_plane: verify API server. Using wrong API port. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Bryn Ellis <bryn.ellis> |
| Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
| Installer sub component: | openshift-ansible | QA Contact: | Weihua Meng <wmeng> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, ckoep, gpei, mmccomas, mvardhan, shiywang, vrutkovs, wmeng |
| Version: | 3.9.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.10.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: several openshift vars were not used during upgrade. Consequence: 3.9 -> 3.10 upgrade would fail if custom api port is set and facts were cleared. Fix: api_port and other apiserver-related vars are being read during upgrade. Result: upgrade succeeds. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1699695 1699696 (view as bug list) | Environment: | |
| Last Closed: | 2019-06-11 09:30:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1699695, 1699696 | | |
Comment 1
Bryn Ellis
2019-03-15 14:11:59 UTC
I can confirm that by making changes to roles/openshift_facts/defaults/main.yml (changing 8443 to 443):
`openshift_master_api_port: "443"`
...AND to roles/openshift_facts/library/openshift_facts.py (changing 8443 to 443):
if 'master' in roles:
    defaults['master'] = dict(api_use_ssl=True, api_port='443',
                              controllers_port='8444',
                              console_use_ssl=True,
                              console_path='/console',
                              console_port='443',
                              portal_net='172.30.0.0/16',
                              bind_addr='0.0.0.0',
                              session_max_seconds=3600,
                              session_name='ssn')
...then running the upgrade again, it gets further.
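As a side note, the non-default port would normally be declared in the inventory rather than by patching role defaults; a minimal sketch, assuming the standard openshift-ansible variables `openshift_master_api_port` and `openshift_master_console_port`:

    # Hedged example inventory snippet - variable names assume stock openshift-ansible
    [OSEv3:vars]
    openshift_master_api_port=443
    openshift_master_console_port=443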
Contacted the reporter on Slack:
1. Facts were cleaned
2. ansible version used is 2.5.2, which might be the cause here.

Bryn, would you mind trying to reproduce that with ansible 2.4 or 2.6? I'll check if using ansible 2.5.2 makes it reproducible in my case.

I've downgraded ansible:

    ansible --version
    ansible 2.4.6.0
      config file = /root/openshift-ansible/ansible.cfg
      configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python2.7/site-packages/ansible
      executable location = /bin/ansible
      python version = 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]

Restored back to original 3.9 and ran openshift_node_group (ran successfully) and upgrade_control_plane again. Can't decide if upgrade_control_plane got further or not (I don't think it has), but I have hit a different error now. This error is already reported on Bugzilla - https://bugzilla.redhat.com/show_bug.cgi?id=1656645. I don't really know how to proceed now: for this new error I've commented out all my metrics inventory variables (I want to leave metrics and upgrade it later), and running `ansible -i inventory_file master -a 'oc whoami'` came back with all 3 masters showing 'system:admin', which the bug report seems to suggest is correct. I've done another 'git pull' and am going to restore all instances back to 3.9 and then run again to see if I hit the problem at the same spot.

OK, I don't know what happened to get the error related to bug 1656645 mentioned above, but it went through this time. It went through all 30 retries and failed, but didn't bomb out of the upgrade, it just continued. Then it failed at the usual point verifying the API, but using the correct port as per Comment 2. Looking back through the log in a bit more detail, it all seems to be related to the failure of the storage upgrade. Maybe the failure of the storage upgrade is what is causing the API verification to fail because something isn't starting properly. I'm currently losing the will to live with this upgrade! :-( Next is to restore back to 3.9 again and try one of the suggestions in Comment 6 of https://bugzilla.redhat.com/show_bug.cgi?id=1591053 to see if I can get the storage upgrade to work.

(In reply to Bryn Ellis from comment #5)
> Can't decide if upgrade_control_plane got further or not (I don't think it
> has) but I have hit a different error now.

It certainly has progressed further - storage migration doesn't start before we ensure the API is up. Since it has proceeded to that step, it means the API check has passed.

(In reply to Bryn Ellis from comment #6)
> OK, I don't know what happened to get the error related to bug 1656645
> mentioned above but it went through this time. It went through all 30
> retries and failed but didn't bomb out of the upgrade, it just continued.
> Then it failed at the usual point verifying the API but using the correct
> port as per Comment 2.

This might mean the API server (or scheduler) got stuck. Please attach the output of `master-logs api api` and `master-logs controllers controllers` from the master nodes.

Sorry, I've been trying all sorts of other things to try to get this to work since then, so I don't have those logs anymore.
The current position I'm in is:
a) I still have the 'hacks' in to make sure it uses 443 and not 8443 for the API check.
b) I've added an entry to /etc/hosts to point the fqdn it uses for the API check to the first master. I've made this /etc/hosts edit on all 3 masters plus my ansible server. The idea behind this was to bypass my AWS ELB, to make sure it wasn't causing the API check to fail by dropping the masters out of the ELB.
Unfortunately, exactly the same issue has occurred. It is using 443 for the API check but it still fails after 120 retries.
2019-04-03 14:10:07,639 p=2342 u=root | fatal: [ip-10-160-20-10.stage.ice.aws]: FAILED! => {
"attempts": 120,
"changed": false,
"cmd": [
"curl",
"--silent",
"--tlsv1.2",
"--max-time",
"2",
"--cacert",
"/etc/origin/master/ca-bundle.crt",
"https://staging-ocp-cluster.ice-technology.com:443/healthz/ready"
],
"delta": "0:00:00.010035",
"end": "2019-04-03 14:10:07.619571",
"failed": true,
"invocation": {
"module_args": {
"_raw_params": "curl --silent --tlsv1.2 --max-time 2 --cacert /etc/origin/master/ca-bundle.crt https://staging-ocp-cluster.ice-technology.com:443/healthz/ready",
"_uses_shell": false,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 7,
"start": "2019-04-03 14:10:07.609536",
"stderr": "",
"stderr_lines": [],
"stdout": "",
"stdout_lines": []
and...
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7046c175db56 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 6 minutes ago Up 5 minutes k8s_POD_master-controllers-ip-10-160-20-10.stage.ice.aws_kube-system_7258811294607c245c6d909256a719d5_0
bf5f79f17468 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 6 minutes ago Up 5 minutes k8s_POD_master-api-ip-10-160-20-10.stage.ice.aws_kube-system_df602223db6f9a5a5492230ccb5ebce9_0
7c6b4ef25a78 96cf7dd047cb "/usr/bin/service-..." 20 minutes ago Up 20 minutes k8s_controller-manager_controller-manager-vmxvm_kube-service-catalog_46fa2bfa-8f91-11e8-a5f4-0ab1f54c2b38_109
69b641629aac ff5dd2137a4f "/bin/sh -c '#!/bi..." 20 minutes ago Up 20 minutes k8s_etcd_master-etcd-ip-10-160-20-10.stage.ice.aws_kube-system_7c4462a08b9f01a3c928e36663e0f1b9_0
88a34b73ed0e docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 20 minutes ago Up 20 minutes k8s_POD_master-etcd-ip-10-160-20-10.stage.ice.aws_kube-system_7c4462a08b9f01a3c928e36663e0f1b9_0
bb610be0bbf1 96cf7dd047cb "/usr/bin/service-..." 22 minutes ago Up 22 minutes k8s_apiserver_apiserver-w42pp_kube-service-catalog_44845c61-8f91-11e8-a5f4-0ab1f54c2b38_61
f2c36c3f4041 4f63970098ae "sh run.sh" 22 minutes ago Up 22 minutes k8s_fluentd-elasticsearch_logging-fluentd-l7bbm_logging_8d0453ec-560e-11e9-b04f-0ab1f54c2b38_2
41ef39cb25f1 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_logging-fluentd-l7bbm_logging_8d0453ec-560e-11e9-b04f-0ab1f54c2b38_1
00bb2d7dda92 aa12a2fc57f7 "/usr/bin/origin-w..." 22 minutes ago Up 22 minutes k8s_webconsole_webconsole-5f649b49b5-bfltf_openshift-web-console_e488cc08-ee60-11e8-950d-0ab1f54c2b38_24
a4d3fd8f9c09 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_webconsole-5f649b49b5-bfltf_openshift-web-console_e488cc08-ee60-11e8-950d-0ab1f54c2b38_23
39bd01427954 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_controller-manager-vmxvm_kube-service-catalog_46fa2bfa-8f91-11e8-a5f4-0ab1f54c2b38_56
eca530e715ca docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_apiserver-w42pp_kube-service-catalog_44845c61-8f91-11e8-a5f4-0ab1f54c2b38_56
806667874dbc 262dbb751d6b "/bin/bash -c '#!/..." 23 minutes ago Up 22 minutes k8s_openvswitch_ovs-57rr2_openshift-sdn_0e1ad7d7-560f-11e9-89b8-02684177ea70_0
7001883698ae 262dbb751d6b "/bin/bash -c '#!/..." 23 minutes ago Up 22 minutes k8s_sdn_sdn-djk8r_openshift-sdn_0e1ad8a8-560f-11e9-89b8-02684177ea70_0
31a405d5b494 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 23 minutes ago Up 23 minutes k8s_POD_sdn-djk8r_openshift-sdn_0e1ad8a8-560f-11e9-89b8-02684177ea70_0
9c6663fc7a71 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 23 minutes ago Up 23 minutes k8s_POD_ovs-57rr2_openshift-sdn_0e1ad7d7-560f-11e9-89b8-02684177ea70_0
ad16320ea2bb 262dbb751d6b "/bin/bash -c '#!/..." 24 minutes ago Up 24 minutes k8s_sync_sync-mln2b_openshift-node_aa0c3684-560e-11e9-b04f-0ab1f54c2b38_1
7d94632b0f62 openshift/origin-pod:v3.9.0 "/usr/bin/pod" 26 minutes ago Up 26 minutes k8s_POD_sync-mln2b_openshift-node_aa0c3684-560e-11e9-b04f-0ab1f54c2b38_0
and...
netstat -nalp | grep 443
unix 2 [ ACC ] STREAM LISTENING 28443 1339/master private/proxywrite
So pods are running but nothing is listening on 443, hence the API check fails, I assume.
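Before the next retry, it may be worth confirming what the API static pod is actually configured to bind to. A minimal sketch (run on a master; assumes the stock 3.10 config path /etc/origin/master/master-config.yaml and the `master-logs` wrapper mentioned earlier):

    # Show the configured API bind address/port (servingInfo.bindAddress, e.g. 0.0.0.0:8443)
    grep -A2 'servingInfo' /etc/origin/master/master-config.yaml
    # Check whether anything is listening on the expected ports at all
    ss -tlnp | grep -E ':(443|8443)\b'
    # If the api container is running, its logs usually explain a failing /healthz/ready
    master-logs api api 2>&1 | tail -n 50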
We're gonna need more verbose logs (`ansible-playbook -vvv` output, not the log file ansible creates) to find out why the API port is not being set properly.

Created PR to 3.10 - https://github.com/openshift/openshift-ansible/pull/11487

3.11 PR - https://github.com/openshift/openshift-ansible/pull/11491

*** Bug 1699695 has been marked as a duplicate of this bug. ***

I could not reproduce the bug.
Could you help with reproduce steps?
Thanks.

(In reply to Weihua Meng from comment #21)
> I could not reproduce the bug.
> Could you help with reproduce steps?
> Thanks.

1. Install 3.9 cluster w/ custom api_port set to 443
2. Clear local and remote facts (one way to do this is sketched below)
3. Upgrade to 3.10

This worked as expected during upgrade when existing facts were reused.

Fixed.
openshift-ansible-3.10.139-1.git.0.02bc5db.el7.noarch
ansible-2.4.6.0-1.el7ae

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0786
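A minimal sketch of step 2 of the reproduce steps above (clearing facts), assuming the default fact file location used by openshift-ansible 3.x and an inventory group named `nodes`; the controller-side cache path shown is hypothetical and depends on ansible.cfg:

    # Remove cached openshift facts from all cluster hosts (remote facts)
    ansible -i inventory_file nodes -m file -a 'path=/etc/ansible/facts.d/openshift.fact state=absent'
    # Clear the controller-side fact cache, if fact_caching is enabled in ansible.cfg
    rm -rf /tmp/ansible_fact_cache    # hypothetical path - check fact_caching_connection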