Description of problem:
When doing an upgrade from 3.10 to 3.11 on atomic hosts, the upgrade failed on this task:

TASK [openshift_node : Install or Update node system container] *****************************************************************************************************************************************************************************
fatal: [master1]: FAILED! => {"changed": false, "msg": "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\n", "rc": 1}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.retry

The docker service was stopped; I restarted it manually (commands noted under Additional info below) and then relaunched the playbook:

# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.retry

However, the atomic-openshift-node service on the first master was lost, and the host was rebooted. After a scale-up from 3.10 (since the upgrade failed on the first master), the playbook is failing, and the next upgrade attempt fails again. Logs will be attached.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
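For completeness, docker was restarted on the first master before retrying; nothing beyond the stock systemctl commands was used:

# systemctl status docker
# systemctl start docker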
*** This bug has been marked as a duplicate of bug 1641245 ***
Oops, not a dupe; it just needs a 3.10 backport.
Actually, the bug description says this is an upgrade to 3.11, which would mean the version should be 3.11, not 3.10. Please confirm what your customer is doing.
My hypothesis here is that the atomic command cleans up the old systemd unit file for the node service, then attempts to create a new one and fails because docker is down. The latest 3.11 would prevent this problem from occurring in the first place due to task re-ordering, but we don't have anything in place to fix clusters that are already broken. Curiously, I'm not sure why the atomic command needs to talk to the docker daemon in the first place; perhaps a regression there? Investigating a work-around to restore the node service unit.
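To check whether a host has already hit this, verify that the node service unit is gone (standard systemd and filesystem checks):

# systemctl status atomic-openshift-node.service
# ls -la /etc/systemd/system/ | grep atomic

If systemd reports the unit as not found and no atomic-openshift-node.service file is listed, the host is in this broken state.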
https://github.com/openshift/openshift-ansible/blob/5155ab9bb8bc0bec530cf52b0fbd00c6f1684be2/roles/lib_openshift/library/oc_atomic_container.py#L103-L105

The logic of the module is to remove any existing systemd service unit if no existing containers are found. This is exactly the case when docker is stopped: the module thinks it's a fresh install and removes any existing systemd units.

Task ordering was fixed in 3.11 to prevent this condition: https://github.com/openshift/openshift-ansible/pull/10555

However, anyone who has already hit this will be affected, and the service unit will have been removed. Unfortunately, all the config changes are in place for 3.11, so copying an existing systemd unit file from another host may be problematic. I'll get a patch out with a play that someone in this scenario can run ad hoc to rectify this condition.
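To make the failure mode concrete, here is a rough sketch of the flow. This is a simplified paraphrase for illustration only; the function names and structure below are invented and are not the module's real API:

# Simplified paraphrase of the failure mode in oc_atomic_container.py.
# Names and structure are invented for illustration; this is NOT the
# module's real code.
import json
import subprocess

def container_exists(name):
    # 'atomic containers list' asks the container runtime for state; with
    # the docker daemon stopped, an installed container looks absent.
    proc = subprocess.run(['atomic', 'containers', 'list', '--json'],
                          capture_output=True, text=True)
    try:
        return any(c.get('container') == name for c in json.loads(proc.stdout))
    except ValueError:
        return False

def remove_systemd_unit(name):
    # Stub: the real module deletes the existing service unit file here.
    print('removing stale unit for %s' % name)

def install_or_update(name):
    if container_exists(name):
        print('updating %s' % name)   # normal upgrade path
    else:
        # Treated as a fresh install: any existing unit file is removed
        # first, then the install itself fails because docker is down,
        # leaving the host with no atomic-openshift-node.service at all.
        remove_systemd_unit(name)
        raise RuntimeError('atomic install failed: docker unreachable')

The key point is the else branch: the unit file is removed before the install step that would recreate it, so a failed install leaves nothing behind.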
PR created against master: https://github.com/openshift/openshift-ansible/pull/10579

To use this fix, first ensure the container runtime (either docker or crio) is running on the first master. Next, run 'playbooks/common/openshift-cluster/upgrades/v3_11/fix_first_master_node.yml' the same way as an upgrade playbook; it should only affect the first master. If that playbook completes successfully, re-run the upgrade playbook as normal.
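For reference, assuming the RPM install location used in the bug description (/usr/share/ansible/openshift-ansible) and a docker-based runtime, the full sequence looks like this (the inventory path is a placeholder for your own):

# systemctl start docker        (or: systemctl start crio)
# ansible-playbook -i <your-inventory> /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/v3_11/fix_first_master_node.yml

followed by re-running the upgrade playbook as before.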
Hi, the main issue is the upgrade from 3.10 to 3.11 itself. The node service was lost during that upgrade, and a scale-up while still on the old version doesn't work either. Thanks.
The 3.10-3.11 upgrade issue has been fixed and verified in bz1641245. This bug is used to repair the broken cluster.

Version: openshift-ansible-3.11.51-2.git.0.51c90a3.el7.noarch

Steps:
1. System container install of OCP v3.10.83 on atomic hosts.
# oc version
oc v3.10.83
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-jliu-10-master-etcd-1:8443
openshift v3.10.83
kubernetes v1.10.0+b81c8f8

2. Upgrade the above OCP without "system_images_registry" set in the hosts file, using the v3.11.16 installer (which still had the issue in bz1641245), to get a broken cluster.

3. The upgrade failed at task [openshift_node : Install or Update node system container] as expected. Checked that the atomic-openshift-node.service (on the master node) cannot be restarted because the systemd service unit is not available.
# systemctl restart atomic-openshift-node.service
Failed to restart atomic-openshift-node.service: Unit not found.
# ls -la /etc/systemd/system/ | grep atomic

4. Running the restore playbook brought the node service back successfully.
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/v3_11/fix_first_master_node.yml -v
# ls -la /etc/systemd/system | grep atomic
-rw-r--r--. 1 root root 589 Dec  6 03:19 atomic-openshift-node.service
# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-12-06 03:22:47 UTC; 12s ago
# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-jliu-10-master-etcd-1            Ready     master    1h        v1.11.0+d4cacc0
qe-jliu-10-node-1                   Ready     compute   1h        v1.10.0+b81c8f8
qe-jliu-10-node-registry-router-1   Ready     <none>    1h        v1.10.0+b81c8f8

5. Re-ran the upgrade playbook to finish the OCP upgrade with the latest installer, v3.11.51.
# oc version
oc v3.11.51
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-jliu-10-master-etcd-1:8443
openshift v3.11.51
kubernetes v1.11.0+d4cacc0

Verified the bug, and changed the target to v3.11 since the PR merged in v3.11.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024