Description of problem:
During an upgrade, one master hit the issue https://bugzilla.redhat.com/show_bug.cgi?id=1310576 and the upgrade was aborted with 'FATAL: all hosts have already failed -- aborting'. Checking the system status at this point, we found that only a partial upgrade had been done on the working masters and nodes. The atomic-openshift-master-api and atomic-openshift-master-controllers services were upgraded. For the atomic-openshift-node service, IMAGE_VERSION had been updated, but the service wasn't restarted, which left the node service in an inactive state.

Suggestion: the upgrade tool should detect broken services before the upgrade runs. Once a broken service is found, it should either:
- abort the whole upgrade without touching anything and ask the user to fix the broken service, or
- leave the broken host untouched, continue upgrading the working masters/nodes, and finish, rather than leaving the running masters/nodes broken too due to a partial upgrade.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.85

How reproducible:
Always

Steps to Reproduce:
1. Install native HA OSE 3.1 on Atomic Hosts.
2. Upgrade the ostree to include docker 1.9: atomic host upgrade
3. Modify atomic-openshift-master-api so that the service always fails to start, to simulate a scenario like BZ#1310576.
4. Run the upgrade.
5. Check the system status.

Actual results:
4. The upgrade failed:

TASK: [set_fact ] *************************************************************
ok: [localhost] => {"ansible_facts": {"pre_upgrade_completed": ["atomic1master1.example.com", "atomic1master2.example.com", "atomic1node1.example.com", "atomic1node2.example.com"]}}

TASK: [set_fact ] *************************************************************
ok: [localhost] => {"ansible_facts": {"pre_upgrade_failed": ["atomic1master3.example.com"]}}

TASK: [fail ] *****************************************************************
failed: [localhost] => {"failed": true}
msg: Upgrade cannot continue.
The following hosts did not complete pre-upgrade checks: atomic1master3.example.com
FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/upgrade.retry

atomic1master.example.com  : ok=7    changed=1    unreachable=0    failed=0
atomic1master1.example.com : ok=46   changed=17   unreachable=0    failed=1
atomic1master2.example.com : ok=46   changed=17   unreachable=0    failed=1
atomic1master3.example.com : ok=13   changed=3    unreachable=0    failed=1
atomic1node1.example.com   : ok=73   changed=22   unreachable=0    failed=0
atomic1node2.example.com   : ok=73   changed=22   unreachable=0    failed=0
localhost                  : ok=14   changed=0    unreachable=0    failed=1

5.1 [root@atomic1master atomic1]# oc get nodes
NAME                         LABELS                                                                             STATUS                     AGE
atomic1master1.example.com   kubernetes.io/hostname=atomic1master1.example.com,region=logmetrics,zone=default   Ready,SchedulingDisabled   2h
atomic1master2.example.com   kubernetes.io/hostname=atomic1master2.example.com,region=infra,zone=default        Ready                      2h
atomic1master3.example.com   kubernetes.io/hostname=atomic1master3.example.com,region=infra,zone=default        NotReady                   2h
atomic1node1.example.com     kubernetes.io/hostname=atomic1node1.example.com,region=primary,zone=west           NotReady                   2h
atomic1node2.example.com     kubernetes.io/hostname=atomic1node2.example.com,region=primary,zone=east           NotReady                   2h

5.2 [root@atomic1master atomic1]# systemctl status atomic-openshift-node
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-04-28 09:39:39 UTC; 1h 53min ago
 Main PID: 3321 (docker)
   Memory: 4.3M
   CGroup: /system.slice/atomic-openshift-node.service
           └─3321 /usr/bin/docker run --name atomic-openshift-node --rm --privileged --net=host --pid=host -v /:/rootfs:ro -e CONFIG_FILE=/etc/origin/node/node-config.yaml -e OPTIONS=--loglevel=2 -e
HOST=/roo...
Apr 28 11:22:37 atomic1master1.example.com docker[3321]: I0428 11:22:37.627934    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:22:44 atomic1master1.example.com docker[3321]: I0428 11:22:44.918406    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:22:45 atomic1master1.example.com docker[3321]: I0428 11:22:45.502483    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:27:44 atomic1master1.example.com docker[3321]: I0428 11:27:44.925635    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:27:44 atomic1master1.example.com docker[3321]: W0428 11:27:44.929145    3374 reflector.go:224] pkg/kubelet/kubelet.go:224: watch of *api.Service ended with: 401: The event in requested ...7338]) [8474]
Apr 28 11:27:45 atomic1master1.example.com docker[3321]: I0428 11:27:45.580197    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:11 atomic1master1.example.com docker[3321]: I0428 11:32:11.663468    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:38 atomic1master1.example.com docker[3321]: I0428 11:32:38.628183    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:45 atomic1master1.example.com docker[3321]: I0428 11:32:45.587814    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF
Apr 28 11:32:45 atomic1master1.example.com docker[3321]: I0428 11:32:45.942231    3374 iowatcher.go:102] Unexpected EOF during watch stream event decoding: unexpected EOF

Warning: atomic-openshift-node.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Hint: Some lines were ellipsized, use -l to show in full.
Expected results: The upgrade aborts before making any modification, or it finishes all tasks on the working masters and nodes.
When the upgrade aborted with 'Upgrade packages not found', the systemctl status was as follows. I'm just providing this information; it doesn't necessarily mean something is wrong.
1. The node service configuration was updated (image version 3.1.1.6 was added), but the service wasn't restarted.
2. No changes were made to the master services.
I believe this is now fixed, if I'm following the bug correctly: for some time now we have made sure that the node/master services are running before proceeding with the upgrade. However, pre.yml DID include a pass through some systemd setup tasks that modified IMAGE_VERSION, which should not happen in pre.yml. That was removed in the refactoring for 3.2/3.3. Sending this over to you, Anping.
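The preflight behavior described above (verify that node/master services are healthy on every host, and abort before touching anything if any are broken) can be sketched as a small shell helper. This is only an illustration: the `preflight` function name and the `host=status` input format are made up for the example, and a real playbook would gather the status via `systemctl is-active <service>` on each host rather than taking it as arguments.

```shell
# Illustrative preflight check. Each argument is "host=status", where
# status stands in for what `systemctl is-active <service>` would
# report on that host. If any service is not active, print the broken
# hosts and fail, so the upgrade can abort before modifying any host.
preflight() {
  failed=""
  for entry in "$@"; do
    host="${entry%%=*}"     # part before the first '='
    status="${entry#*=}"    # part after the first '='
    if [ "$status" != "active" ]; then
      failed="$failed $host"
    fi
  done
  if [ -n "$failed" ]; then
    echo "Upgrade cannot continue. Hosts with broken services:$failed"
    return 1
  fi
  echo "All services active; proceeding with upgrade."
}
```

For example, `preflight atomic1master1.example.com=active atomic1master3.example.com=failed` names the broken host and returns nonzero, while an all-active host list lets the upgrade proceed.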
Verified and passed.