Bug 1331375

Summary: the upgrade should handle aborting with a rigorous design
Product: OpenShift Container Platform Reporter: Anping Li <anli>
Component: Cluster Version Operator    Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED CURRENTRELEASE QA Contact: Anping Li <anli>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.2.0    CC: aos-bugs, bleanhar, dgoodwin, jokerman, mmccomas, tdawson
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, the OpenShift upgrade process could fail midway if the current master and node services were not all up and running. The upgrade was modified to verify that services are running correctly prior to beginning the actual upgrade. Additionally, some preliminary checks were modifying configuration files on the system, which could cause the master/node services to flip to using the new image version in a containerized environment before the upgrade was fully completed. The upgrade playbooks were modified to remove these changes from the preliminary upgrade checks.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-22 22:22:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Anping Li 2016-04-28 12:04:00 UTC
Description of problem:
During upgrade, one master hit the issue "https://bugzilla.redhat.com/show_bug.cgi?id=1310576" and the upgrade aborted with 'FATAL: all hosts have already failed -- aborting'. If we check the system status at this point, we find that only a partial upgrade had been done on the working masters and nodes.
The atomic-openshift-master-api and atomic-openshift-master-controllers services were upgraded. For the atomic-openshift-node service, the IMAGE_VERSION had been updated, but the service wasn't restarted. That left the node service in an inactive state.

Suggestion: the upgrade tools should detect broken services in advance, before the upgrade runs. Once a broken service is found, abort the whole upgrade without touching anything and ask the user to fix the broken service.
or
Leave the broken service untouched, continue upgrading the working masters/nodes, and finish the upgrade, rather than letting the running masters/nodes become broken too due to a partial upgrade.
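The first option above could be sketched as a simple pre-flight check. This is illustrative only, not code from the upgrade playbooks; the probe command is injected so the logic can be exercised without systemd (on a real host it would be "systemctl is-active"), and the service names are the ones seen in this bug.

```shell
#!/bin/sh
# Hypothetical pre-flight check (sketch, not the shipped upgrade code):
# refuse to start the upgrade if any required service is not active.
check_services() {
    probe=$1; shift   # command printing a unit's state, e.g. "systemctl is-active"
    failed=""
    for svc in "$@"; do
        state=$($probe "$svc" 2>/dev/null)
        [ "$state" = "active" ] || failed="$failed $svc"
    done
    if [ -n "$failed" ]; then
        echo "Upgrade cannot continue; inactive services:$failed" >&2
        return 1
    fi
}

# On a master you might run (service names taken from this bug report):
#   check_services "systemctl is-active" \
#       atomic-openshift-master-api atomic-openshift-master-controllers \
#       atomic-openshift-node
```

Running this on every host before any configuration is touched would have aborted the run cleanly instead of leaving hosts half-upgraded.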


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.85

How reproducible:
Always

Steps to Reproduce:
1. install native HA OSE 3.1 on Atomic Hosts
2. upgrade the ostree to include docker 1.9
   atomic host upgrade
3. modify atomic-openshift-master-api so the service always fails to start, creating a dummy scenario like BZ#1310576
4. run the upgrade
5. check the system status
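Step 3 could be simulated with a systemd drop-in that forces the unit to fail on start. This is a hypothetical method; the bug does not record exactly how the failure was injected.

```ini
# /etc/systemd/system/atomic-openshift-master-api.service.d/break.conf
# Hypothetical drop-in to make the master API unit always fail to start,
# simulating BZ#1310576 for this reproduction. Remember to run
# 'systemctl daemon-reload' after creating it.
[Service]
ExecStart=
ExecStart=/bin/false
```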

Actual results:
4. upgrade failed
  TASK: [set_fact ] *************************************************************
ok: [localhost] => {"ansible_facts": {"pre_upgrade_completed": ["atomic1master1.example.com", "atomic1master2.example.com", "atomic1node1.example.com", "atomic1node2.example.com"]}}

TASK: [set_fact ] *************************************************************
ok: [localhost] => {"ansible_facts": {"pre_upgrade_failed": ["atomic1master3.example.com"]}}

TASK: [fail ] *****************************************************************
failed: [localhost] => {"failed": true}
msg: Upgrade cannot continue. The following hosts did not complete pre-upgrade checks: atomic1master3.example.com

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/upgrade.retry

atomic1master.example.com  : ok=7    changed=1    unreachable=0    failed=0
atomic1master1.example.com : ok=46   changed=17   unreachable=0    failed=1
atomic1master2.example.com : ok=46   changed=17   unreachable=0    failed=1
atomic1master3.example.com : ok=13   changed=3    unreachable=0    failed=1
atomic1node1.example.com   : ok=73   changed=22   unreachable=0    failed=0
atomic1node2.example.com   : ok=73   changed=22   unreachable=0    failed=0
localhost                  : ok=14   changed=0    unreachable=0    failed=1

5.1 [root@atomic1master atomic1]# oc get nodes
NAME                         LABELS                                                                             STATUS                     AGE
atomic1master1.example.com   kubernetes.io/hostname=atomic1master1.example.com,region=logmetrics,zone=default   Ready,SchedulingDisabled   2h
atomic1master2.example.com   kubernetes.io/hostname=atomic1master2.example.com,region=infra,zone=default        Ready                      2h
atomic1master3.example.com   kubernetes.io/hostname=atomic1master3.example.com,region=infra,zone=default        NotReady                   2h
atomic1node1.example.com     kubernetes.io/hostname=atomic1node1.example.com,region=primary,zone=west           NotReady                   2h
atomic1node2.example.com     kubernetes.io/hostname=atomic1node2.example.com,region=primary,zone=east           NotReady                   2h

5.2 [root@atomic1master atomic1]# systemctl status atomic-openshift-node
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-04-28 09:39:39 UTC; 1h 53min ago
 Main PID: 3321 (docker)
   Memory: 4.3M
   CGroup: /system.slice/atomic-openshift-node.service
           └─3321 /usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host -v /:/rootfs:ro -e CONFIG_FILE=/etc/origin/node/node-config.yaml -e OPTIONS=--loglevel=2 -e HOST=/roo...

Apr 28 11:22:37 atomic1master1.example.com docker[3321]: I0428 11:22:37.627934    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:22:44 atomic1master1.example.com docker[3321]: I0428 11:22:44.918406    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:22:45 atomic1master1.example.com docker[3321]: I0428 11:22:45.502483    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:27:44 atomic1master1.example.com docker[3321]: I0428 11:27:44.925635    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:27:44 atomic1master1.example.com docker[3321]: W0428 11:27:44.929145    3374 reflector.go:224] pkg/kubelet/kubelet.go:224: watch of *api.Service ended with: 401: The event in requested ...7338]) [8474]
Apr 28 11:27:45 atomic1master1.example.com docker[3321]: I0428 11:27:45.580197    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:11 atomic1master1.example.com docker[3321]: I0428 11:32:11.663468    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:38 atomic1master1.example.com docker[3321]: I0428 11:32:38.628183    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:45 atomic1master1.example.com docker[3321]: I0428 11:32:45.587814    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:45 atomic1master1.example.com docker[3321]: I0428 11:32:45.942231    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Warning: atomic-openshift-node.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Hint: Some lines were ellipsized, use -l to show in full.


Expected results:

The upgrade should abort before making any modification, or it should finish all tasks on the working masters and nodes.

Comment 1 Anping Li 2016-04-29 04:03:48 UTC
If the upgrade aborts with "Upgrade packages not found", the systemctl status is as follows. I am just providing the info; it does not mean that this is wrong.

1. The node service configuration was updated (the image version 3.1.1.6 was added), but the service wasn't restarted
2. No change to the master services

Comment 2 Devan Goodwin 2016-09-08 19:21:34 UTC
I believe this is now fixed, if I'm following the bug correctly. For some time we have tried to make sure that the node/master services are running before we proceed with the upgrade. However, pre.yml DID include a pass through some systemd setup tasks that modified the IMAGE_VERSION, which should not happen in pre.yml. This was removed in the refactoring for 3.2/3.3.
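The kind of pre-upgrade guard described here could look roughly like the following Ansible task. This is an illustrative sketch, not the actual playbook code; group and service names are taken from this bug report.

```yaml
# Sketch of a pre-upgrade service check (illustrative, not the shipped playbook)
- name: Verify master and node services are running before upgrade
  hosts: masters
  tasks:
    - name: Fail early if a required service is not active
      command: systemctl is-active {{ item }}
      with_items:
        - atomic-openshift-master-api
        - atomic-openshift-master-controllers
        - atomic-openshift-node
      register: svc_state
      changed_when: false
      failed_when: svc_state.stdout != 'active'
```

The important property is that this task only inspects state; nothing (such as IMAGE_VERSION) is written to disk until every host passes.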

Sending this over to you Anping.

Comment 3 Anping Li 2016-09-14 05:47:08 UTC
Verified and passed.