Bug 1506165 - master api&controllers did not work after split from original master during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Jan Chaloupka
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-25 10:06 UTC by liujia
Modified: 2017-11-28 22:19 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:19:09 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description liujia 2017-10-25 10:06:47 UTC
Description of problem:
Upgraded v3.6 to v3.7 on a non-HA containerized OCP deployment. The upgrade succeeds: the new master-api and master-controllers containers are created and their services are running. But this is only an illusion; in fact the original master service is still the one doing the work. After the original master service is stopped, OCP no longer works.

For example, "oc get" returns nothing because port 8443 is no longer in the LISTEN state.


===============after upgrade
# docker ps
CONTAINER ID        IMAGE                                   COMMAND                  CREATED             STATUS              PORTS               NAMES
259c947e2ec0        openshift3/ose:v3.7.0                   "/usr/bin/openshift s"   9 minutes ago       Up 8 minutes                            atomic-openshift-master-controllers
66efdbd1938e        openshift3/node:v3.7.0                  "/usr/local/bin/origi"   9 minutes ago       Up 9 minutes                            atomic-openshift-node
4016451d84d5        openshift3/ose:v3.7.0                   "/usr/bin/openshift s"   10 minutes ago      Up 10 minutes                           atomic-openshift-master-api
cec043f69378        openshift3/openvswitch:v3.7.0           "/usr/local/bin/ovs-r"   10 minutes ago      Up 10 minutes                           openvswitch
9e0d00a1c244        registry.access.redhat.com/rhel7/etcd   "/usr/bin/etcd"          10 minutes ago      Up 10 minutes                           etcd_container
a2c4b8ba4e62        openshift3/ose:v3.6.173.0.59            "/usr/bin/openshift s"   12 minutes ago      Up 12 minutes                           atomic-openshift-master

# netstat -na|grep 8443
tcp        0      0 0.0.0.0:8443            0.0.0.0:*               LISTEN     
tcp        0      0 10.240.0.85:45988       10.240.0.85:8443        ESTABLISHED
...

# oc get node
NAME                                  STATUS                     AGE       VERSION
qe-jliu-con2-master-etcd-1            Ready,SchedulingDisabled   1h        v1.7.6+a08f5eeb62
qe-jliu-con2-node-registry-router-1   Ready                      1h        v1.7.6+a08f5eeb62

===========after stopping the original master service and keeping only the api and controllers services

# systemctl stop atomic-openshift-master

# docker ps
CONTAINER ID        IMAGE                                   COMMAND                  CREATED             STATUS              PORTS               NAMES
66efdbd1938e        openshift3/node:v3.7.0                  "/usr/local/bin/origi"   14 minutes ago      Up 13 minutes                           atomic-openshift-node
4016451d84d5        openshift3/ose:v3.7.0                   "/usr/bin/openshift s"   14 minutes ago      Up 14 minutes                           atomic-openshift-master-api
cec043f69378        openshift3/openvswitch:v3.7.0           "/usr/local/bin/ovs-r"   14 minutes ago      Up 14 minutes                           openvswitch
9e0d00a1c244        registry.access.redhat.com/rhel7/etcd   "/usr/bin/etcd"          14 minutes ago      Up 14 minutes                           etcd_container


# oc get node
The connection to the server qe-jliu-con2-master-etcd-1:8443 was refused - did you specify the right host or port?

# netstat -na|grep 8443
tcp        0      0 10.240.0.85:45988       10.240.0.85:8443        TIME_WAIT  
tcp        0      0 10.240.0.85:46006       10.240.0.85:8443        TIME_WAIT  



Version-Release number of the following components:
openshift-ansible-docs-3.7.0-0.178.0.git.0.27a1039.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install containerized v3.6 as a non-HA deployment.
2. Upgrade v3.6 to v3.7.


Actual results:
The master api and master controllers services do not work.

Expected results:
The master api and master controllers services should work in place of the original master service.

Additional info:
Re-running the upgrade hits the issue too, even if the original master service is not stopped manually.
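
The LISTEN check used in the description can be scripted. The following is a minimal sketch, not part of the original report: `port_is_listening` is a hypothetical helper name, and it assumes the column layout of the `netstat -na` output shown above (local address in column 4, state in column 6).

```shell
#!/bin/sh
# Hypothetical helper (not from the report): reads `netstat -na` output on
# stdin and returns 0 if some socket is in LISTEN state on the given port.
# Assumes the layout shown in the description:
#   proto recv-q send-q local-address foreign-address state
port_is_listening() {
    awk -v port="$1" '
        # Column 6 is the state; column 4 is the local address (host:port).
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example usage on a master host:
#   netstat -na | port_is_listening 8443 || echo "nothing listens on 8443"
```

Fed the two transcripts above, the check passes for the output captured right after the upgrade and fails for the output captured after `systemctl stop atomic-openshift-master`, where only TIME_WAIT entries remain.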

Comment 1 Jan Chaloupka 2017-10-25 12:15:23 UTC
Checking my environment in which I tested the upgrade.

#### Listing master services ####
# systemctl list-units atomic-openshift-master*
  UNIT                                        LOAD   ACTIVE SUB     DESCRIPTION
  atomic-openshift-master-api.service         loaded active running Atomic OpenShift Master API
  atomic-openshift-master-controllers.service loaded active running Atomic OpenShift Master Controllers
● atomic-openshift-master.service             loaded failed failed  atomic-openshift-master.service



#### Listing nodes ####
# oc get nodes
NAME           STATUS    AGE       VERSION
172.16.186.5   Ready     2d        v1.7.0+80709908fd


#### Status of master services ####
# systemctl status atomic-openshift-master-api.service atomic-openshift-master-controllers.service atomic-openshift-master
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-23 10:21:36 EDT; 1 day 21h ago
     Docs: https://github.com/openshift/origin
 Main PID: 29553 (docker-current)
   Memory: 3.2M
   CGroup: /system.slice/atomic-openshift-master-api.service
           └─29553 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master-api --env-file=/etc/sysconfig/atomic-openshift-master-api -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/do...

Oct 25 08:12:19 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: I1025 12:12:19.981958       1 rest.go:349] Starting watch for /api/v1/secrets, rv=180901 labels= fields= timeout=6m1s
Oct 25 08:12:20 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: E1025 12:12:20.007223       1 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Oct 25 08:12:20 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: W1025 12:12:20.007476       1 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/inf...n compacted
Oct 25 08:12:20 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: I1025 12:12:20.484626       1 rest.go:349] Starting watch for /api/v1/services, rv=9402 labels= fields= timeout=5m19s
Oct 25 08:12:21 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: I1025 12:12:21.029228       1 rest.go:349] Starting watch for /api/v1/secrets, rv=181474 labels= fields= timeout=9m9s
Oct 25 08:12:23 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: I1025 12:12:23.476394       1 rest.go:349] Starting watch for /api/v1/endpoints, rv=9402 labels= fields= timeout=7m43s
Oct 25 08:12:27 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: I1025 12:12:27.311980       1 rest.go:349] Starting watch for /api/v1/serviceaccounts, rv=181070 labels= fields= timeout=8m19s
Oct 25 08:12:27 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: E1025 12:12:27.318241       1 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
Oct 25 08:12:27 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: W1025 12:12:27.318430       1 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/inf...n compacted
Oct 25 08:12:28 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-api[29553]: I1025 12:12:28.322543       1 rest.go:349] Starting watch for /api/v1/serviceaccounts, rv=181482 labels= fields= timeout=5m10s

● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-10-23 10:21:46 EDT; 1 day 21h ago
     Docs: https://github.com/openshift/origin
 Main PID: 29651 (docker-current)
   Memory: 3.2M
   CGroup: /system.slice/atomic-openshift-master-controllers.service
           └─29651 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master-controllers --env-file=/etc/sysconfig/atomic-openshift-master-controllers -v /var/lib/origin:/var/lib/origin -v /var/run/docker....

Oct 25 08:12:07 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:07.037379       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"router-1-deploy", UI...
Oct 25 08:12:07 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:07.037392       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"docker-registry-2-de...
Oct 25 08:12:09 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: W1025 12:12:09.088748       1 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/infor...n compacted
Oct 25 08:12:11 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:11.039688       1 scheduler.go:168] Failed to schedule pod: default/docker-registry-2-deploy
Oct 25 08:12:11 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:11.039743       1 factory.go:734] Updating pod condition for default/docker-registry-2-deploy to (PodScheduled==False)
Oct 25 08:12:11 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:11.039920       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"docker-registry-2-de...
Oct 25 08:12:16 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: W1025 12:12:16.501476       1 reflector.go:343] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/infor...n compacted
Oct 25 08:12:19 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:19.042191       1 scheduler.go:168] Failed to schedule pod: default/docker-registry-2-deploy
Oct 25 08:12:19 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:19.042252       1 factory.go:734] Updating pod condition for default/docker-registry-2-deploy to (PodScheduled==False)
Oct 25 08:12:19 jchaloup-openshift-master-vw9dm-r1.localdomain atomic-openshift-master-controllers[29651]: I1025 12:12:19.042277       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"docker-registry-2-de...

● atomic-openshift-master.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2017-10-23 11:17:31 EDT; 1 day 20h ago
 Main PID: 4178 (code=exited, status=2)
Hint: Some lines were ellipsized, use -l to show in full.


The atomic-openshift-master.service is not running, `oc get nodes` returns the list of nodes, and `oc get all --all-namespaces` returns various resources. Let me check your summary again.

Comment 2 Jan Chaloupka 2017-10-25 12:16:34 UTC
# docker ps
CONTAINER ID        IMAGE                                   COMMAND                  CREATED             STATUS              PORTS               NAMES
674bc4714a7c        openshift3/ose:v3.7.0                   "/usr/bin/openshift s"   45 hours ago        Up 45 hours                             atomic-openshift-master-controllers
41560663bba9        openshift3/ose:v3.7.0                   "/usr/bin/openshift s"   45 hours ago        Up 45 hours                             atomic-openshift-master-api
b6cad80a682a        registry.access.redhat.com/rhel7/etcd   "/usr/bin/etcd"          2 days ago          Up 2 days                               etcd_container

Comment 3 Jan Chaloupka 2017-10-25 12:48:17 UTC
Liujia,

can you share your inventory file?

Comment 6 Jan Chaloupka 2017-10-30 13:42:44 UTC
Upstream PR: https://github.com/openshift/openshift-ansible/pull/5929

Comment 7 openshift-github-bot 2017-10-31 18:33:24 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/fffb5e5e516d018a8d4bd063bc439a0a81447e31
Merge pull request #5929 from ingvagabund/remove-master-service-during-non-ha-to-ha-upgrade

Automatic merge from submit-queue.

remove master.service during the non-ha to ha upgrade

Bug: 1506165

Comment 9 liujia 2017-11-02 07:03:00 UTC
Verified on openshift-ansible-3.7.0-0.189.0.git.0.d497c5e.el7.noarch.

After the upgrade, checked that only the api and controllers services are running. After restarting docker, still only the api and controllers services are running.
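
A check like this one could be automated roughly as follows. This is only a sketch under assumptions, not the procedure QA actually ran: `master_split_ok` is a hypothetical helper name, and it assumes the `systemctl list-units` column layout shown in Comment 1 (UNIT, LOAD, ACTIVE, SUB).

```shell
#!/bin/sh
# Hypothetical helper (not from the report): reads the output of
#   systemctl list-units 'atomic-openshift-master*'
# on stdin and returns 0 only if the api and controllers units are active
# and the original atomic-openshift-master.service is not.
master_split_ok() {
    awk '
        {
            # The unit name is the first field ending in ".service"; a
            # leading bullet on failed units shifts the columns by one.
            unit = ""; state = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /\.service$/) { unit = $i; state = $(i + 2); break }
            }
            if (unit == "atomic-openshift-master-api.service" && state == "active") api = 1
            if (unit == "atomic-openshift-master-controllers.service" && state == "active") ctl = 1
            if (unit == "atomic-openshift-master.service" && state == "active") legacy = 1
        }
        END { exit((api && ctl && !legacy) ? 0 : 1) }
    '
}

# Example usage on a master host:
#   systemctl list-units 'atomic-openshift-master*' | master_split_ok \
#       && echo "only api and controllers are active"
```

The unit listing from Comment 1 would pass this check, since atomic-openshift-master.service is listed there as failed rather than active.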

Comment 12 errata-xmlrpc 2017-11-28 22:19:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
