Bug 1410151

Summary: Router not working after OSE upgrade from 3.3 to 3.4
Product: OpenShift Container Platform
Component: Networking
Version: 3.4.0
Hardware: x86_64
OS: Linux
Severity: urgent
Priority: unspecified
Status: CLOSED NOTABUG
Reporter: Apeksha <akhakhar>
Assignee: Ben Bennett <bbennett>
QA Contact: Meng Bo <bmeng>
CC: akhakhar, anli, aos-bugs, jokerman, mmccomas, wmeng
Target Milestone: ---
Target Release: ---
Last Closed: 2017-01-05 11:10:10 UTC
Type: Bug
Attachments: journal logs of master node (no flags)

Description Apeksha 2017-01-04 15:25:09 UTC
Created attachment 1237223 [details]
journal logs of master node

Description of problem:
Router not working after OSE upgrade from 3.3 to 3.4


Steps to Reproduce:

1. subscription-manager repos --disable="rhel-7-server-ose-3.3-rpms"
2. yum clean all
3. wget http://file.rdu.redhat.com/tdawson/repo/aos-unsigned-latest.repo -P /etc/yum.repos.d/
4. yum update atomic-openshift-utils

[root@dhcp47-115 ~]# rpm -qa | grep openshift
atomic-openshift-clients-3.3.1.7-1.git.0.0988966.el7.x86_64
atomic-openshift-node-3.3.1.7-1.git.0.0988966.el7.x86_64
openshift-ansible-callback-plugins-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-playbooks-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-3.3.1.7-1.git.0.0988966.el7.x86_64
openshift-ansible-docs-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-lookup-plugins-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-roles-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-utils-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-master-3.3.1.7-1.git.0.0988966.el7.x86_64
tuned-profiles-atomic-openshift-node-3.3.1.7-1.git.0.0988966.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.7-1.git.0.0988966.el7.x86_64
openshift-ansible-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-filter-plugins-3.4.41-1.git.0.449ee52.el7.noarch

5. yum install atomic-openshift-excluder atomic-openshift-docker-excluder

[root@dhcp47-115 ~]# rpm -qa | grep openshift
atomic-openshift-clients-3.3.1.7-1.git.0.0988966.el7.x86_64
atomic-openshift-node-3.3.1.7-1.git.0.0988966.el7.x86_64
openshift-ansible-callback-plugins-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-playbooks-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-docker-excluder-3.4.0.38-1.git.0.8561cba.el7.noarch
atomic-openshift-3.3.1.7-1.git.0.0988966.el7.x86_64
openshift-ansible-docs-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-lookup-plugins-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-roles-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-utils-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-master-3.3.1.7-1.git.0.0988966.el7.x86_64
tuned-profiles-atomic-openshift-node-3.3.1.7-1.git.0.0988966.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.7-1.git.0.0988966.el7.x86_64
openshift-ansible-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-filter-plugins-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-excluder-3.4.0.38-1.git.0.8561cba.el7.noarch

[root@dhcp47-115 ~]# rpm -qa | grep excluder
atomic-openshift-docker-excluder-3.4.0.38-1.git.0.8561cba.el7.noarch
atomic-openshift-excluder-3.4.0.38-1.git.0.8561cba.el7.noarch

6. atomic-openshift-excluder unexclude
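
Note: the unexclude step removes the atomic-openshift packages from yum's exclude list so the installer can upgrade them. Assuming the excluder script provides a status subcommand (not shown in this report), the resulting state can be checked with:

    atomic-openshift-excluder status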

7. atomic-openshift-installer upgrade
[root@dhcp47-115 ~]# atomic-openshift-installer -u -c openshift-installer.cfg.yml upgrade

        This tool will help you upgrade your existing OpenShift installation.
        Currently running: openshift-enterprise 3.3

(1) Update to latest 3.3
(2) Upgrade to next release: 3.4

Choose an option from above: 2
OpenShift will be upgraded from openshift-enterprise 3.3 to latest openshift-enterprise 3.4 on the following hosts:

  * dhcp47-115.lab.eng.blr.redhat.com
  * dhcp46-130.lab.eng.blr.redhat.com
  * dhcp46-92.lab.eng.blr.redhat.com
  * dhcp46-121.lab.eng.blr.redhat.com

Play 1/61 (Verify Ansible version is greater than or equal to 2.1.0.0)
.
Play 2/61 (localhost)
..
Play 3/61 (l_oo_all_hosts)
.
Play 4/61 (Populate config host groups)
................
Play 5/61 (Set oo_options)
.......
Play 6/61 (Ensure that all non-node hosts are accessible)
.
Play 7/61 (Initialize host facts)
...........
Play 8/61 (l_oo_all_hosts)
..
Play 9/61 (Filter list of nodes to be upgraded if necessary)
.....
Play 10/61 (Update repos and initialize facts on all hosts)
..............
Play 11/61 (Set openshift_no_proxy_internal_hostnames)
..
Play 12/61 (Verify upgrade can proceed on first master)
.....
Play 13/61 (l_oo_all_hosts)
..
Play 14/61 (Determine openshift_version to configure on first master)
..........................................................................................
Play 15/61 (Set openshift_version for all hosts)
..........................................................................................
Play 16/61 (Verify master processes)
............
Play 17/61 (Verify upgrade targets)
.... [WARNING]: Consider using yum module rather than running yum

.....
Play 18/61 (Verify docker upgrade targets)
... [WARNING]: Consider using yum, dnf or zypper module rather than running rpm

..........
Play 19/61 (Flag pre-upgrade checks complete for hosts without errors)
..
Play 20/61 (Cleanup unused Docker images)
......
Play 21/61 (Evaluate additional groups for upgrade)
..
Play 22/61 (Set master embedded_etcd fact)
.........
Play 23/61 (Populate config host groups)
................
Play 24/61 (Evaluate additional groups for etcd)
...
Play 25/61 (Backup etcd)
...................
Play 26/61 (Gate on etcd backup)
....
Play 27/61 (Backup etcd)
...................
Play 28/61 (Gate on etcd backup)
....
Play 29/61 (Upgrade master packages)
..........
Play 30/61 (Determine if service signer cert must be created)
..
Play 31/61 (Create local temp directory for syncing certs)
.
Play 32/61 (Create service signer certificate)
.....
Play 33/61 (Deploy service signer certificate)
..
Play 34/61 (Delete local temp directory)
.
Play 35/61 (Set OpenShift master facts)
...............
Play 36/61 (Upgrade master config and systemd units)
.....................................................
Play 37/61 (Set master update status to complete)
..
Play 38/61 (Gate on master update)
....
Play 39/61 (Populate config host groups)
................
Play 40/61 (Validate configuration for rolling restart)
..........
Play 41/61 (Create temp file on localhost)
.
Play 42/61 (Check if temp file exists on any masters)
..
Play 43/61 (Cleanup temp file on localhost)
.
Play 44/61 (Warn if restarting the system where ansible is running)
...
Play 45/61 (Restart masters)
.......
Play 46/61 (Reconcile Cluster Roles and Cluster Role Bindings and Security Context Constraints)
..................................................................................................................................................................................................................................................................................        
Play 47/61 (Gate on reconcile)
....
Play 48/61 (Upgrade default router and default registry)
..........................................................................................................................................................................................................................................................................................................
Play 49/61 (Check for warnings)
...
Play 50/61 (Evacuate and upgrade nodes)
.........................................................................................
Play 51/61 (Evacuate and upgrade nodes)
.........................................................................................
Play 52/61 (Evacuate and upgrade nodes)
......................................................................
...................
Play 53/61 (Evacuate and upgrade nodes)
........................................


..............................
...................
dhcp46-121.lab.eng.blr.redhat.com : ok=112  changed=14   unreachable=0    failed=0   
dhcp46-130.lab.eng.blr.redhat.com : ok=112  changed=14   unreachable=0    failed=0   
dhcp46-92.lab.eng.blr.redhat.com : ok=112  changed=14   unreachable=0    failed=0   
dhcp47-115.lab.eng.blr.redhat.com : ok=356  changed=56   unreachable=0    failed=0   
localhost                  : ok=38   changed=3    unreachable=0    failed=0   


Installation Complete: Note: Play count is an estimate and some were skipped because your install does not require them

Upgrade completed! Rebooting all hosts is recommended.


[root@dhcp47-115 ~]# rpm -qa | grep openshift
openshift-ansible-callback-plugins-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-playbooks-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-node-3.4.0.38-1.git.0.8561cba.el7.x86_64
atomic-openshift-docker-excluder-3.4.0.38-1.git.0.8561cba.el7.noarch
tuned-profiles-atomic-openshift-node-3.4.0.38-1.git.0.8561cba.el7.x86_64
atomic-openshift-3.4.0.38-1.git.0.8561cba.el7.x86_64
openshift-ansible-docs-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-lookup-plugins-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-roles-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-utils-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-sdn-ovs-3.4.0.38-1.git.0.8561cba.el7.x86_64
openshift-ansible-3.4.41-1.git.0.449ee52.el7.noarch
openshift-ansible-filter-plugins-3.4.41-1.git.0.449ee52.el7.noarch
atomic-openshift-excluder-3.4.0.38-1.git.0.8561cba.el7.noarch
atomic-openshift-clients-3.4.0.38-1.git.0.8561cba.el7.x86_64
atomic-openshift-master-3.4.0.38-1.git.0.8561cba.el7.x86_64
[root@dhcp47-115 ~]# oadm version
oadm v3.4.0.38
kubernetes v1.4.0+776c994

Server https://dhcp47-115.lab.eng.blr.redhat.com:8443
openshift v3.4.0.38
kubernetes v1.4.0+776c994

9. Update /etc/sysconfig/docker to add the registry from which the OSE 3.4 images are pulled (sketched below).
10. Reboot the nodes one after the other.
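
For reference, the registry change in step 9 is typically an ADD_REGISTRY entry in /etc/sysconfig/docker followed by a docker restart. A minimal sketch, assuming the standard Red Hat registry (the exact registry used in this environment is not recorded in the report):

    # /etc/sysconfig/docker
    ADD_REGISTRY='--add-registry registry.access.redhat.com'

    # apply the change
    systemctl restart docker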

[root@dhcp47-115 ~]# oc get pods
NAME                                                     READY     STATUS    RESTARTS   AGE
aplo-router-1-ocneo                                      1/1       Running   1          1h
glusterfs-dc-dhcp46-121.lab.eng.blr.redhat.com-1-tbutq   1/1       Running   1          1h
glusterfs-dc-dhcp46-130.lab.eng.blr.redhat.com-1-z87rg   1/1       Running   1          1h
glusterfs-dc-dhcp46-92.lab.eng.blr.redhat.com-1-zhkov    1/1       Running   1          1h
heketi-1-2b2sh                                           1/1       Running   4          1h
mongodb-1-7gp1u                                          1/1       Running   5          1h
[root@dhcp47-115 ~]# oc get dc
NAME                                             REVISION   DESIRED   CURRENT   TRIGGERED BY
aplo-router                                      2          1         0         config
glusterfs-dc-dhcp46-121.lab.eng.blr.redhat.com   1          1         1         config
glusterfs-dc-dhcp46-130.lab.eng.blr.redhat.com   1          1         1         config
glusterfs-dc-dhcp46-92.lab.eng.blr.redhat.com    1          1         1         config
heketi                                           1          1         1         config
jenkins                                          2          1         0         config,image(jenkins:latest)
mongodb                                          1          1         1         config,image(mongodb:3.2)
[root@dhcp47-115 ~]# oc describe dc aplo-router
Name:		aplo-router
Namespace:	aplo
Created:	23 hours ago
Labels:		router=aplo-router
Annotations:	<none>
Latest Version:	2
Selector:	router=aplo-router
Replicas:	1
Triggers:	Config
Strategy:	Rolling
Template:
  Labels:		router=aplo-router
  Service Account:	router
  Containers:
   router:
    Image:	openshift3/ose-haproxy-router:v3.4.0.38
    Ports:	80/TCP, 443/TCP, 1936/TCP
    Requests:
      cpu:	100m
      memory:	256Mi
    Liveness:	http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:	http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /etc/pki/tls/private from server-certificate (ro)
    Environment Variables:
      DEFAULT_CERTIFICATE_DIR:			/etc/pki/tls/private
      ROUTER_EXTERNAL_HOST_HOSTNAME:		
      ROUTER_EXTERNAL_HOST_HTTPS_VSERVER:	
      ROUTER_EXTERNAL_HOST_HTTP_VSERVER:	
      ROUTER_EXTERNAL_HOST_INSECURE:		false
      ROUTER_EXTERNAL_HOST_PARTITION_PATH:	
      ROUTER_EXTERNAL_HOST_PASSWORD:		
      ROUTER_EXTERNAL_HOST_PRIVKEY:		/etc/secret-volume/router.pem
      ROUTER_EXTERNAL_HOST_USERNAME:		
      ROUTER_SERVICE_HTTPS_PORT:		443
      ROUTER_SERVICE_HTTP_PORT:			80
      ROUTER_SERVICE_NAME:			aplo-router
      ROUTER_SERVICE_NAMESPACE:			aplo
      ROUTER_SUBDOMAIN:				
      STATS_PASSWORD:				I2gclLv7ZU
      STATS_PORT:				1936
      STATS_USERNAME:				admin
  Volumes:
   server-certificate:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	aplo-router-certs

Deployment #2 (latest):
	Created:	about an hour ago
	Status:		Failed
	Replicas:	0 current / 0 desired
Deployment #1:
	Name:		aplo-router-1
	Created:	23 hours ago
	Status:		Complete
	Replicas:	1 current / 1 desired
	Selector:	deployment=aplo-router-1,deploymentconfig=aplo-router,router=aplo-router
	Labels:		openshift.io/deployment-config.name=aplo-router,router=aplo-router
	Pods Status:	1 Running / 0 Waiting / 0 Succeeded / 0 Failed

Events:
  FirstSeen	LastSeen	Count	From				SubobjectPath	Type		Reason			Message
  ---------	--------	-----	----				-------------	--------	------			-------
  1h		1h		1	{deploymentconfig-controller }			Normal		DeploymentCreated	Created new replication controller "aplo-router-2" for version 2
  1h		1h		1	{deployments-controller }			Warning		Failed			aplo-router-2: Deployer pod "aplo-router-2-deploy" has gone missing

[root@dhcp47-115 ~]# oc describe pod aplo-router-1-ocneo
Name:			aplo-router-1-ocneo
Namespace:		aplo
Security Policy:	privileged
Node:			dhcp46-130.lab.eng.blr.redhat.com/10.70.46.130
Start Time:		Wed, 04 Jan 2017 19:27:21 +0530
Labels:			deployment=aplo-router-1
			deploymentconfig=aplo-router
			router=aplo-router
Status:			Running
IP:			10.70.46.130
Controllers:		ReplicationController/aplo-router-1
Containers:
  router:
    Container ID:	docker://595674241fa18240dd8e58ef0631bb2e73558879f4bf4f7aa1f8ca396f76ea69
    Image:		openshift3/ose-haproxy-router:v3.3.1.7
    Image ID:		docker-pullable://registry.access.redhat.com/openshift3/ose-haproxy-router@sha256:f2f75cfd2b828c3143ca8022e26593a7491ca040dab6d6472472ed040d1c1b83
    Ports:		80/TCP, 443/TCP, 1936/TCP
    Requests:
      cpu:		100m
      memory:		256Mi
    State:		Running
      Started:		Wed, 04 Jan 2017 20:26:59 +0530
    Last State:		Terminated
      Reason:		Error
      Exit Code:	1
      Started:		Wed, 04 Jan 2017 20:17:57 +0530
      Finished:		Wed, 04 Jan 2017 20:26:13 +0530
    Ready:		True
    Restart Count:	1
    Liveness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:		http-get http://localhost:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /etc/pki/tls/private from server-certificate (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from router-token-94h2m (ro)
    Environment Variables:
      DEFAULT_CERTIFICATE_DIR:			/etc/pki/tls/private
      ROUTER_EXTERNAL_HOST_HOSTNAME:		
      ROUTER_EXTERNAL_HOST_HTTPS_VSERVER:	
      ROUTER_EXTERNAL_HOST_HTTP_VSERVER:	
      ROUTER_EXTERNAL_HOST_INSECURE:		false
      ROUTER_EXTERNAL_HOST_PARTITION_PATH:	
      ROUTER_EXTERNAL_HOST_PASSWORD:		
      ROUTER_EXTERNAL_HOST_PRIVKEY:		/etc/secret-volume/router.pem
      ROUTER_EXTERNAL_HOST_USERNAME:		
      ROUTER_SERVICE_HTTPS_PORT:		443
      ROUTER_SERVICE_HTTP_PORT:			80
      ROUTER_SERVICE_NAME:			aplo-router
      ROUTER_SERVICE_NAMESPACE:			aplo
      ROUTER_SUBDOMAIN:				
      STATS_PASSWORD:				I2gclLv7ZU
      STATS_PORT:				1936
      STATS_USERNAME:				admin
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	True 
  PodScheduled 	True 
Volumes:
  server-certificate:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	aplo-router-certs
  router-token-94h2m:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	router-token-94h2m
QoS Class:	Burstable
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From						SubobjectPath	Type		Reason		Message
  ---------	--------	-----	----						-------------	--------	------		-------
  1h		1h		1	{default-scheduler }						Normal		Scheduled	Successfully assigned aplo-router-1-ocneo to dhcp46-130.lab.eng.blr.redhat.com
  51m		51m		1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}			Warning		FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "POD" with ImageInspectError: "Failed to inspect image \"openshift3/ose-pod:v3.4.0.38\": Cannot connect to the Docker daemon. Is the docker daemon running on this host?"

  1h	35m	14	{kubelet dhcp46-130.lab.eng.blr.redhat.com}		Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "POD" with ErrImagePull: "unauthorized: authentication required"

  1h	34m	199	{kubelet dhcp46-130.lab.eng.blr.redhat.com}		Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "POD" with ImagePullBackOff: "Back-off pulling image \"openshift3/ose-pod:v3.4.0.38\""

  34m	34m	1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}	spec.containers{router}	Normal	Pulling		pulling image "openshift3/ose-haproxy-router:v3.3.1.7"
  34m	34m	1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}	spec.containers{router}	Normal	Pulled		Successfully pulled image "openshift3/ose-haproxy-router:v3.3.1.7"
  33m	33m	1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}	spec.containers{router}	Normal	Created		Created container with docker id 1b6c21a5e5a1; Security:[seccomp=unconfined]
  33m	33m	1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}	spec.containers{router}	Normal	Started		Started container with docker id 1b6c21a5e5a1
  29m	29m	4	{kubelet dhcp46-130.lab.eng.blr.redhat.com}				Warning	FailedMount	MountVolume.SetUp failed for volume "kubernetes.io/secret/b61bb2ff-d285-11e6-894c-005056b329c4-router-token-94h2m" (spec.Name: "router-token-94h2m") pod "b61bb2ff-d285-11e6-894c-005056b329c4" (UID: "b61bb2ff-d285-11e6-894c-005056b329c4") with: Get https://dhcp47-115.lab.eng.blr.redhat.com:8443/api/v1/namespaces/aplo/secrets/router-token-94h2m: dial tcp 10.70.47.115:8443: getsockopt: connection refused
  29m	29m	4	{kubelet dhcp46-130.lab.eng.blr.redhat.com}				Warning	FailedMount	MountVolume.SetUp failed for volume "kubernetes.io/secret/b61bb2ff-d285-11e6-894c-005056b329c4-server-certificate" (spec.Name: "server-certificate") pod "b61bb2ff-d285-11e6-894c-005056b329c4" (UID: "b61bb2ff-d285-11e6-894c-005056b329c4") with: Get https://dhcp47-115.lab.eng.blr.redhat.com:8443/api/v1/namespaces/aplo/secrets/aplo-router-certs: dial tcp 10.70.47.115:8443: getsockopt: connection refused
  24m	24m	1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}	spec.containers{router}	Normal	Pulled		Container image "openshift3/ose-haproxy-router:v3.3.1.7" already present on machine
  24m	24m	1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}	spec.containers{router}	Normal	Created		Created container with docker id 595674241fa1; Security:[seccomp=unconfined]
  24m	24m	1	{kubelet dhcp46-130.lab.eng.blr.redhat.com}	spec.containers{router}	Normal	Started		Started container with docker id 595674241fa1


As the output above shows, the router is not working as expected: deployment #2 of aplo-router failed (the deployer pod went missing), and the running pod aplo-router-1-ocneo is still using the ose-haproxy-router v3.3.1.7 image rather than the expected v3.4.0.38 image.
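
Once the 3.4 image can be pulled, one way to recover is to trigger a fresh router deployment and confirm it completes. A sketch, assuming the aplo namespace shown above (these are not steps that were run for this report):

    # retry the failed deployment and verify the new rollout
    oc rollout latest dc/aplo-router -n aplo
    oc describe dc aplo-router -n aplo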

I have attached the journal logs of the master node.

Comment 3 Apeksha 2017-01-05 11:10:10 UTC
It worked with the steps mentioned in comment 2.