Created attachment 1883999 [details] ansible-playbook output Version: # oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.10.12 True False 7d8h Cluster version is 4.10.12 # rpm -q openshift-ansible ansible openshift-ansible-4.10.0-202203112237.p0.g7ca0181.assembly.stream.el7.noarch ansible-2.9.27-1.el8ae.noarch Platform: baremetal UPI What happened? Existing RHEL 8.5 worker node fails to reboot during the "Reboot the host and wait for it to come back" task in the upgrade playbook Full playbook run output and node journal output included in attachments, but here's the interesting bits: --- TASK [openshift_node : Reboot the host and wait for it to come back] ******************************************************************************************* Thursday 26 May 2022 21:16:37 +0000 (0:00:00.489) 0:01:23.325 ********** fatal: [worker2.cluster1.openshift.lan]: FAILED! => {"changed": false, "elapsed": 607, "msg": "Timed out waiting for last boot time check (timeout=600)", "rebooted": true} --- From the node journal: --- May 26 17:16:40 worker2.cluster1.openshift.lan ansible-find[2906993]: Invoked with paths=['/sbin', '/usr/sbin', '/usr/local/sbin'] patterns=['shutdown'] file_type=any age_stamp=mtime recurse=False hidden=False follow=False get_checksum=False use_regex=False excludes=None contains=None age=None size=None depth=None May 26 17:16:40 worker2.cluster1.openshift.lan systemd-logind[989]: Creating /run/nologin, blocking further logins... May 26 17:16:40 worker2.cluster1.openshift.lan systemd-logind[989]: Failed to get load state of reboot.target: Access denied May 26 17:16:40 worker2.cluster1.openshift.lan systemd-logind[989]: Scheduled shutdown to reboot.target failed: Permission denied --- What did you expect to happen? The RHEL worker node is updated and rebooted successfully How to reproduce it (as minimally and precisely as possible)? # cat /root/inventory/hosts [all:vars] ansible_user=root #ansible_become=True openshift_kubeconfig_path="~/kubeconfig" [workers] worker2.cluster1.openshift.lan # cd /usr/share/ansible/openshift-ansible # ansible-playbook -v -i /root/inventory/hosts playbooks/upgrade.yml Anything else we need to know? I am running ansible from a RHEL 8 container, though there's nothing obvious to indicate that it being containerized is part of the problem. The container is built from: # cat ../openshift-ansible-container/Containerfile FROM registry.redhat.io/openshift4/ose-tools-rhel8:v4.10 COPY openshift-ansible-4.10.0-202203112237.p0.g7ca0181.assembly.stream.el7.noarch /root/. 
RUN yum -y --enablerepo=ansible-2.9-for-rhel-8-x86_64-rpms --enablerepo=rhocp-4.10-for-rhel-8-x86_64-rpms install /root/openshift-ansible-* openshift-clients jq && \ yum clean all && \ rm -rf /var/cache/dnf Node is left cordoned due to the failed upgrade: # oc get nodes -owide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME master0 Ready master 75d v1.23.5+70fb84c 192.168.130.10 <none> Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa) 4.18.0-305.45.1.el8_4.x86_64 cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8 master1 Ready master 75d v1.23.5+70fb84c 192.168.130.11 <none> Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa) 4.18.0-305.45.1.el8_4.x86_64 cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8 master2 Ready master 75d v1.23.5+70fb84c 192.168.130.12 <none> Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa) 4.18.0-305.45.1.el8_4.x86_64 cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8 worker0 Ready worker,worker-custom-storage 75d v1.23.5+70fb84c 192.168.130.20 <none> Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa) 4.18.0-305.45.1.el8_4.x86_64 cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8 worker1 Ready worker,worker-custom-storage 75d v1.23.5+70fb84c 192.168.130.21 <none> Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa) 4.18.0-305.45.1.el8_4.x86_64 cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8 worker2.cluster1.openshift.lan Ready,SchedulingDisabled worker 47d v1.23.5+b463d71 192.168.130.22 <none> Red Hat Enterprise Linux 8.5 (Ootpa) 4.18.0-372.9.1.el8.x86_64 cri-o://1.23.2-10.rhaos4.10.gitf5a1c11.el8
Created attachment 1884000 [details] node journal output during the upgrade process
There also seems to be a failure with the Network Manager dispatcher service: May 26 17:16:36 worker2.cluster1.openshift.lan systemd[1]: Started Network Manager Script Dispatcher Service. May 26 17:16:36 worker2.cluster1.openshift.lan NetworkManager[1131]: <info> [1653599796.9910] device (cni-podman0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external') May 26 17:16:36 worker2.cluster1.openshift.lan NetworkManager[1131]: <info> [1653599796.9915] device (cni-podman0): Activation: successful, device activated. May 26 17:16:36 worker2.cluster1.openshift.lan nm-dispatcher[2906656]: ** May 26 17:16:36 worker2.cluster1.openshift.lan nm-dispatcher[2906656]: GLib:ERROR:gspawn.c:1234:do_exec: assertion failed: (read_null != -1) May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: Created slice system-systemd\x2dcoredump.slice. May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: Started Process Core Dump (PID 2906657/UID 0). May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: Started libpod-conmon-0715cba49dc6291032362ee7181fb8294342adc4663ec1dce8407d22bfd6d823.scope. May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: Started libcontainer container 0715cba49dc6291032362ee7181fb8294342adc4663ec1dce8407d22bfd6d823. May 26 17:16:37 worker2.cluster1.openshift.lan systemd-coredump[2906664]: Resource limits disable core dumping for process 2906656 (nm-dispatcher). May 26 17:16:37 worker2.cluster1.openshift.lan systemd-coredump[2906664]: Process 2906656 (nm-dispatcher) of user 0 dumped core. May 26 17:16:37 worker2.cluster1.openshift.lan nm-dispatcher[2906650]: req:2 'up' [cni-podman0], "/usr/lib/NetworkManager/dispatcher.d/04-iscsi": complete: failed with Script '/usr/lib/NetworkManager/dispatcher.d/04-iscsi' killed by signal 6. May 26 17:16:37 worker2.cluster1.openshift.lan nm-dispatcher[2906728]: ** May 26 17:16:37 worker2.cluster1.openshift.lan nm-dispatcher[2906728]: GLib:ERROR:gspawn.c:1234:do_exec: assertion failed: (read_null != -1) May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: systemd-coredump: Succeeded. May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: systemd-coredump: Consumed 226ms CPU time May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: Started Process Core Dump (PID 2906729/UID 0). May 26 17:16:37 worker2.cluster1.openshift.lan hyperkube[1416]: I0526 17:16:37.420377 1416 dynamic_cafile_content.go:210] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent inotify watch for: /etc/kubernetes/kubelet-ca.crt" May 26 17:16:37 worker2.cluster1.openshift.lan systemd[1]: Reloading. May 26 17:16:37 worker2.cluster1.openshift.lan systemd-coredump[2906737]: Resource limits disable core dumping for process 2906728 (nm-dispatcher). May 26 17:16:37 worker2.cluster1.openshift.lan systemd-coredump[2906737]: Process 2906728 (nm-dispatcher) of user 0 dumped core. May 26 17:16:37 worker2.cluster1.openshift.lan nm-dispatcher[2906650]: req:2 'up' [cni-podman0], "/etc/NetworkManager/dispatcher.d/20-chrony-dhcp": complete: failed with Script '/etc/NetworkManager/dispatcher.d/20-chrony-dhcp' killed by signal 6. May 26 17:16:37 worker2.cluster1.openshift.lan nm-dispatcher[2906767]: ** May 26 17:16:37 worker2.cluster1.openshift.lan nm-dispatcher[2906767]: GLib:ERROR:gspawn.c:1234:do_exec: assertion failed: (read_null != -1) Can you do further troubleshooting of the node in question? This doesn't appear to be a bug with the playbooks.
The same issue was caught in QE upgrade CI. After upgrading the cluster (12_Disconnected UPI on Baremetal with RHEL8.5 & OVN IPSEC) from v4.10 to v4.11, the RHEL 8.5 worker upgrade failed.

...
TASK [openshift_node : Reboot the host and wait for it to come back] ***********
Saturday 11 June 2022  20:19:59 +0800 (0:00:00.435)       0:04:17.108 *********
fatal: [ip-10-0-51-94.us-east-2.compute.internal]: FAILED! => {"changed": false, "elapsed": 609, "msg": "Timed out waiting for last boot time check (timeout=600)", "rebooted": true}

PLAY RECAP *********************************************************************
ip-10-0-48-37.us-east-2.compute.internal : ok=3   changed=0   unreachable=0   failed=0   skipped=1   rescued=0   ignored=0
ip-10-0-51-94.us-east-2.compute.internal : ok=46  changed=20  unreachable=0   failed=1   skipped=6   rescued=0   ignored=0
ip-10-0-55-14.us-east-2.compute.internal : ok=3   changed=0   unreachable=0   failed=0   skipped=1   rescued=0   ignored=0
localhost                                : ok=0   changed=0   unreachable=0   failed=0   skipped=1   rescued=0   ignored=0

06-11 20:30:12.201  NAME                                       STATUS                     ROLES    AGE     VERSION
06-11 20:30:12.201  ip-10-0-48-37.us-east-2.compute.internal   Ready                      worker   3h20m   v1.23.5+3afdacb
06-11 20:30:12.201  ip-10-0-51-94.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   3h20m   v1.24.0+cb71478
06-11 20:30:12.201  ip-10-0-55-14.us-east-2.compute.internal   Ready                      worker   3h20m   v1.23.5+3afdacb
06-11 20:30:12.201  ip-10-0-62-169                             Ready                      master   4h18m   v1.24.0+bb9c2f1
06-11 20:30:12.201  ip-10-0-62-75                              Ready                      master   4h18m   v1.24.0+bb9c2f1
06-11 20:30:12.201  ip-10-0-68-241                             Ready                      master   4h18m   v1.24.0+bb9c2f1
(In reply to Patrick Dillon from comment #2) > Can you do further troubleshooting of the node in question? This doesn't > appear to be a bug with the playbooks. Sure, I still have the node available for testing. # oc get nodes NAME STATUS ROLES AGE VERSION master0 Ready master 117d v1.23.5+70fb84c master1 Ready master 117d v1.23.5+70fb84c master2 Ready master 117d v1.23.5+70fb84c worker0 Ready worker,worker-custom-storage 117d v1.23.5+70fb84c worker1 Ready worker,worker-custom-storage 117d v1.23.5+70fb84c worker2.cluster1.openshift.lan Ready,SchedulingDisabled worker 88d v1.23.5+b463d71 # ssh worker2.cluster1.openshift.lan shutdown --reboot Shutdown scheduled for Thu 2022-07-07 11:26:02 EDT, use 'shutdown -c' to cancel. # ssh worker2.cluster1.openshift.lan System is going down. Unprivileged users are not permitted to log in anymore. For technical details, see pam_nologin(8). Last login: Thu Jul 7 11:23:53 2022 from 192.168.130.1 # journalctl --since -3m | tail Jul 07 11:25:13 worker2.cluster1.openshift.lan systemd[1087701]: Listening on D-Bus User Message Bus Socket. Jul 07 11:25:13 worker2.cluster1.openshift.lan systemd[1087701]: Reached target Sockets. Jul 07 11:25:13 worker2.cluster1.openshift.lan systemd[1087701]: Reached target Basic System. Jul 07 11:25:13 worker2.cluster1.openshift.lan systemd[1087701]: Reached target Default. Jul 07 11:25:13 worker2.cluster1.openshift.lan systemd[1087701]: Startup finished in 132ms. Jul 07 11:25:13 worker2.cluster1.openshift.lan systemd[1]: Started User Manager for UID 0. Jul 07 11:25:13 worker2.cluster1.openshift.lan systemd[1]: Started Session 14 of user root. Jul 07 11:25:13 worker2.cluster1.openshift.lan sshd[1087695]: pam_unix(sshd:session): session opened for user root by (uid=0) Jul 07 11:26:02 worker2.cluster1.openshift.lan systemd-logind[989]: Failed to get load state of reboot.target: Access denied Jul 07 11:26:02 worker2.cluster1.openshift.lan systemd-logind[989]: Scheduled shutdown to reboot.target failed: Permission denied # systemd-inhibit --list Who: NetworkManager (UID 0/root, PID 1131/NetworkManager) What: sleep Why: NetworkManager needs to turn off networks Mode: delay 1 inhibitors listed. I'm not familiar enough with systemd-inhibit to know how/why this is being set or whether it's actually what is delaying/blocking the shutdown/reboot. I could not find any reported bugs related to nm-dispatcher, CNI, etc. that match this behavior, so that leads me to believe that OCP plays into this somehow (as opposed to just being a generic RHEL bug). Should we move this over to the NM team at this point?
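For my own notes: the Mode field in that output distinguishes "delay" inhibitors (which only postpone shutdown briefly) from "block" inhibitors (which can actually prevent it). To see whether anything is block-inhibiting shutdown, standard logind tooling should be enough; these are generic commands, not specific to this node:

# systemd-inhibit --list
# busctl get-property org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.login1.Manager BlockInhibited

The second call asks logind directly which operations currently have a "block" inhibitor; the property is a colon-separated list such as "shutdown:sleep", or empty when nothing is blocked.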
Right after a reboot [root@worker2 ~]# uptime 17:38:36 up 0 min, 1 user, load average: 0.57, 0.14, 0.05 [root@worker2 ~]# systemd-inhibit --list Who: NetworkManager (UID 0/root, PID 1128/NetworkManager) What: sleep Why: NetworkManager needs to turn off networks Mode: delay 1 inhibitors listed. [root@worker2 ~]# journalctl -u NetworkManager --boot --no-pager -- Logs begin at Fri 2022-04-08 16:06:18 EDT, end at Mon 2022-08-01 17:39:15 EDT. -- Aug 01 17:38:24 worker2.cluster1.openshift.lan systemd[1]: Starting Network Manager... Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.5221] NetworkManager (version 1.36.0-4.el8) is starting... (for the first time) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.5227] Read config: /etc/NetworkManager/NetworkManager.conf (etc: 20-keyfiles.conf, 99-keyfile.conf, sdn.conf) Aug 01 17:38:24 worker2.cluster1.openshift.lan systemd[1]: Started Network Manager. Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.5265] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager" Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.5405] manager[0x56193187c040]: monitoring kernel firmware directory '/lib/firmware'. Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6781] hostname: hostname: using hostnamed Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6783] hostname: static hostname changed from (none) to "worker2.cluster1.openshift.lan" Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6788] dns-mgr[0x561931857850]: init: dns=default,systemd-resolved rc-manager=symlink Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6893] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.36.0-4.el8/libnm-device-plugin-team.so) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6916] Loaded device plugin: NMOvsFactory (/usr/lib64/NetworkManager/1.36.0-4.el8/libnm-device-plugin-ovs.so) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6918] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6919] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6920] manager: Networking is enabled by state file Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6929] settings: Loaded settings plugin: keyfile (internal) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6948] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.36.0-4.el8/libnm-settings-plugin-ifcfg-rh.so") Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <warn> [1659389904.6958] ifcfg-rh: load[/etc/sysconfig/network-scripts/ifcfg-enp1s0]: failure to read file: Could not read file '/etc/sysconfig/network-scripts/ifcfg-enp1s0': Permission denied Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6975] dhcp-init: Using DHCP client 'internal' Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6975] device (lo): carrier: link connected 
Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6979] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.6991] manager: (br0): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/2) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7003] manager: (enp1s0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/3) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7037] settings: (enp1s0): created default wired connection 'Wired connection 1' Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7039] device (enp1s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7119] device (enp1s0): carrier: link connected Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7135] manager: (tun0): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/4) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7145] manager: (vxlan_sys_4789): new Vxlan device (/org/freedesktop/NetworkManager/Devices/5) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7153] device (vxlan_sys_4789): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7241] manager: (veth42f630a0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/6) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7278] manager: (veth02437451): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/7) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7287] manager: (br0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/8) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7293] manager: (vethb4b6c3e5): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/9) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7298] manager: (vethe0d67c26): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/10) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7305] manager: (vxlan0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/11) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7327] manager: (tun0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/12) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7336] manager: (br0): new Open vSwitch Bridge device (/org/freedesktop/NetworkManager/Devices/13) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7363] device (enp1s0): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7371] device (vxlan_sys_4789): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 
worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7383] policy: auto-activating connection 'Wired connection 1' (5ff96dd5-e1fc-3287-974e-9364139a4494) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7388] device (enp1s0): Activation: starting connection 'Wired connection 1' (5ff96dd5-e1fc-3287-974e-9364139a4494) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7389] device (enp1s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7391] manager: NetworkManager state is now CONNECTING Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7393] device (enp1s0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7423] device (enp1s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7444] dhcp4 (enp1s0): activation: beginning transaction (timeout in 45 seconds) Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7516] dhcp4 (enp1s0): state changed new lease, address=192.168.130.22 Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7579] device (enp1s0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7597] device (enp1s0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7599] device (enp1s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed') Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7602] manager: NetworkManager state is now CONNECTED_LOCAL Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7606] manager: NetworkManager state is now CONNECTED_SITE Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7607] policy: set 'Wired connection 1' (enp1s0) as default for IPv4 routing and DNS Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7663] device (enp1s0): Activation: successful, device activated. 
Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7668] manager: NetworkManager state is now CONNECTED_GLOBAL Aug 01 17:38:24 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389904.7673] manager: startup complete Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <warn> [1659389921.4468] platform-linux: do-delete-link[5]: failure 95 (Operation not supported) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.4702] device (vxlan_sys_4789): state change: disconnected -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed') Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.5074] manager: (br0): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/14) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.5102] manager: (br0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/15) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.5129] manager: (br0): new Open vSwitch Bridge device (/org/freedesktop/NetworkManager/Devices/16) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.5719] manager: (vxlan0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/17) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.5747] manager: (vxlan_sys_4789): new Vxlan device (/org/freedesktop/NetworkManager/Devices/18) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.5879] device (vxlan_sys_4789): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external') Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.5883] device (vxlan_sys_4789): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external') Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.6079] manager: (tun0): new Open vSwitch Interface device (/org/freedesktop/NetworkManager/Devices/19) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.6091] manager: (tun0): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/20) Aug 01 17:38:41 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389921.6993] device (tun0): carrier: link connected Aug 01 17:38:42 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389922.9377] manager: (vethe58e24f1): new Veth device (/org/freedesktop/NetworkManager/Devices/21) Aug 01 17:38:42 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389922.9392] device (vethe58e24f1): carrier: link connected Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.0073] manager: (vethe58e24f1): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/22) Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.5102] manager: (vethdf25e687): new Veth device (/org/freedesktop/NetworkManager/Devices/23) Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.5108] device (vethdf25e687): carrier: link connected Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.6207] manager: (vethdf25e687): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/24) Aug 01 17:38:43 
worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.8745] manager: (veth5249e8d3): new Veth device (/org/freedesktop/NetworkManager/Devices/25) Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.8802] device (veth5249e8d3): carrier: link connected Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.9622] manager: (veth5249e8d3): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/26) Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.9826] manager: (veth6a660092): new Veth device (/org/freedesktop/NetworkManager/Devices/27) Aug 01 17:38:43 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389923.9941] device (veth6a660092): carrier: link connected Aug 01 17:38:44 worker2.cluster1.openshift.lan NetworkManager[1128]: <info> [1659389924.2483] manager: (veth6a660092): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/28) [root@worker2 ~]# ssh worker2.cluster1.openshift.lan shutdown --reboot The authenticity of host 'worker2.cluster1.openshift.lan (192.168.130.22)' can't be established. ECDSA key fingerprint is SHA256:kz+0x9COxAs/lTtCQbL1UvRWjEKyGCp+G7H5qXpK3Sg. Are you sure you want to continue connecting (yes/no/[fingerprint])? yes Warning: Permanently added 'worker2.cluster1.openshift.lan,192.168.130.22' (ECDSA) to the list of known hosts. root.openshift.lan's password: Broadcast message from root.openshift.lan (Mon 2022-08-01 17:44:34 EDT): The system is going down for reboot at Mon 2022-08-01 17:45:34 EDT! Shutdown scheduled for Mon 2022-08-01 17:45:34 EDT, use 'shutdown -c' to cancel. [root@worker2 ~]# journalctl --since -2m -- Logs begin at Fri 2022-04-08 16:06:18 EDT, end at Mon 2022-08-01 17:45:51 EDT. -- Aug 01 17:44:26 worker2.cluster1.openshift.lan systemd[1]: Starting SSSD Kerberos Cache Manager... Aug 01 17:44:26 worker2.cluster1.openshift.lan systemd[1]: Started SSSD Kerberos Cache Manager. Aug 01 17:44:26 worker2.cluster1.openshift.lan sssd_kcm[7835]: Starting up Aug 01 17:44:34 worker2.cluster1.openshift.lan sshd[7800]: Accepted password for root from 192.168.130.22 port 49810 ssh2 Aug 01 17:44:34 worker2.cluster1.openshift.lan systemd[1]: Started Session 3 of user root. Aug 01 17:44:34 worker2.cluster1.openshift.lan systemd-logind[979]: New session 3 of user root. Aug 01 17:44:34 worker2.cluster1.openshift.lan sshd[7800]: pam_unix(sshd:session): session opened for user root by (uid=0) Aug 01 17:44:34 worker2.cluster1.openshift.lan systemd-logind[979]: Creating /run/nologin, blocking further logins... Aug 01 17:44:34 worker2.cluster1.openshift.lan sshd[7930]: Received disconnect from 192.168.130.22 port 49810:11: disconnected by user Aug 01 17:44:34 worker2.cluster1.openshift.lan sshd[7930]: Disconnected from user root 192.168.130.22 port 49810 Aug 01 17:44:34 worker2.cluster1.openshift.lan sshd[7800]: pam_unix(sshd:session): session closed for user root Aug 01 17:44:34 worker2.cluster1.openshift.lan systemd-logind[979]: Session 3 logged out. Waiting for processes to exit. Aug 01 17:44:34 worker2.cluster1.openshift.lan systemd[1]: session-3.scope: Succeeded. Aug 01 17:44:34 worker2.cluster1.openshift.lan systemd[1]: session-3.scope: Consumed 41ms CPU time Aug 01 17:44:34 worker2.cluster1.openshift.lan systemd-logind[979]: Removed session 3. 
Aug 01 17:45:34 worker2.cluster1.openshift.lan systemd-logind[979]: Failed to get load state of reboot.target: Access denied
Aug 01 17:45:34 worker2.cluster1.openshift.lan systemd-logind[979]: Scheduled shutdown to reboot.target failed: Permission denied

So, the inhibitor is there at boot, there's no obvious problem with the NM service, and yet systemd continues to block the reboot. I wonder whether this is expected behavior when trying to do a remote reboot or just a systemd bug.
(In reply to Derrick Ornelas from comment #4)
> # systemd-inhibit --list
> Who: NetworkManager (UID 0/root, PID 1131/NetworkManager)
> What: sleep
> Why: NetworkManager needs to turn off networks
> Mode: delay
>
> 1 inhibitors listed.
>
> I'm not familiar enough with systemd-inhibit to know how/why this is being
> set or whether it's actually what is delaying/blocking the shutdown/reboot.

This inhibitor would only delay the reboot by some time (the default is 5s, IIRC). Also, it's set by NM, not by systemd. But that's irrelevant, because it's not the problem here. The error happens before inhibitors are checked.

> Jul 07 11:26:02 worker2.cluster1.openshift.lan systemd-logind[989]: Failed
> to get load state of reboot.target: Access denied
> Jul 07 11:26:02 worker2.cluster1.openshift.lan systemd-logind[989]:
> Scheduled shutdown to reboot.target failed: Permission denied

This happens when systemd-logind tries to retrieve the org.freedesktop.systemd1.Unit#LoadState property of reboot.target. I very much doubt it's a bug in systemd; most likely the D-Bus call is blocked by SELinux. I cannot reproduce the issue locally, though. Could you try it with SELinux set to "permissive"?
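If it helps, the failing property read can be reproduced by hand with busctl (a rough sketch; systemd escapes "." in unit names as "_2e" in the object path):

# busctl get-property org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/reboot_2etarget org.freedesktop.systemd1.Unit LoadState

Note that this runs in your shell's SELinux context rather than systemd-logind's, so it may succeed even while logind is denied; it does show exactly which object and property are involved, though.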
[root@worker2 ~]# getenforce Enforcing [root@worker2 ~]# ssh worker2.cluster1.openshift.lan shutdown --reboot root.openshift.lan's password: Broadcast message from root.openshift.lan (Mon 2022-08-08 11:40:44 EDT): The system is going down for reboot at Mon 2022-08-08 11:41:44 EDT! Shutdown scheduled for Mon 2022-08-08 11:41:44 EDT, use 'shutdown -c' to cancel. [root@worker2 ~]# journalctl --since -2m -- Logs begin at Fri 2022-04-08 16:06:18 EDT, end at Mon 2022-08-08 11:41:46 EDT. -- Aug 08 11:40:37 worker2.cluster1.openshift.lan systemd[1]: Starting SSSD Kerberos Cache Manager... Aug 08 11:40:37 worker2.cluster1.openshift.lan systemd[1]: Started SSSD Kerberos Cache Manager. Aug 08 11:40:37 worker2.cluster1.openshift.lan sssd_kcm[2290271]: Starting up Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290267]: Accepted password for root from 192.168.130.22 port 50056 ssh2 Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: New session 4 of user root. Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd[1]: Started Session 4 of user root. Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290267]: pam_unix(sshd:session): session opened for user root by (uid=0) Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: Creating /run/nologin, blocking further logins... Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290352]: Received disconnect from 192.168.130.22 port 50056:11: disconnected by user Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290352]: Disconnected from user root 192.168.130.22 port 50056 Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290267]: pam_unix(sshd:session): session closed for user root Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd[1]: session-4.scope: Succeeded. Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: Session 4 logged out. Waiting for processes to exit. Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd[1]: session-4.scope: Consumed 35ms CPU time Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: Removed session 4. Aug 08 11:41:13 worker2.cluster1.openshift.lan ovs-vswitchd[1073]: ovs|04638|connmgr|INFO|br0<->unix#18658: 2 flow_mods in the last 0 s (2 deletes) Aug 08 11:41:29 worker2.cluster1.openshift.lan ovs-vswitchd[1073]: ovs|04639|connmgr|INFO|br0<->unix#18661: 2 flow_mods in the last 0 s (2 deletes) Aug 08 11:41:45 worker2.cluster1.openshift.lan systemd-logind[979]: Failed to get load state of reboot.target: Access denied Aug 08 11:41:45 worker2.cluster1.openshift.lan systemd-logind[979]: Scheduled shutdown to reboot.target failed: Permission denied [root@worker2 ~]# setenforce 0 [root@worker2 ~]# ssh worker2.cluster1.openshift.lan shutdown --reboot root.openshift.lan's password: Broadcast message from root.openshift.lan (Mon 2022-08-08 11:42:41 EDT): The system is going down for reboot at Mon 2022-08-08 11:43:41 EDT! Shutdown scheduled for Mon 2022-08-08 11:43:41 EDT, use 'shutdown -c' to cancel. [root@worker2 ~]# journalctl --since -3m -- Logs begin at Fri 2022-04-08 16:06:18 EDT, end at Mon 2022-08-08 11:43:31 EDT. -- Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290267]: Accepted password for root from 192.168.130.22 port 50056 ssh2 Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: New session 4 of user root. Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd[1]: Started Session 4 of user root. 
Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290267]: pam_unix(sshd:session): session opened for user root by (uid=0) Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: Creating /run/nologin, blocking further logins... Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290352]: Received disconnect from 192.168.130.22 port 50056:11: disconnected by user Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290352]: Disconnected from user root 192.168.130.22 port 50056 Aug 08 11:40:44 worker2.cluster1.openshift.lan sshd[2290267]: pam_unix(sshd:session): session closed for user root Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd[1]: session-4.scope: Succeeded. Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: Session 4 logged out. Waiting for processes to exit. Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd[1]: session-4.scope: Consumed 35ms CPU time Aug 08 11:40:44 worker2.cluster1.openshift.lan systemd-logind[979]: Removed session 4. Aug 08 11:41:13 worker2.cluster1.openshift.lan ovs-vswitchd[1073]: ovs|04638|connmgr|INFO|br0<->unix#18658: 2 flow_mods in the last 0 s (2 deletes) Aug 08 11:41:29 worker2.cluster1.openshift.lan ovs-vswitchd[1073]: ovs|04639|connmgr|INFO|br0<->unix#18661: 2 flow_mods in the last 0 s (2 deletes) Aug 08 11:41:45 worker2.cluster1.openshift.lan systemd-logind[979]: Failed to get load state of reboot.target: Access denied Aug 08 11:41:45 worker2.cluster1.openshift.lan systemd-logind[979]: Scheduled shutdown to reboot.target failed: Permission denied Aug 08 11:41:46 worker2.cluster1.openshift.lan systemd[1]: run-runc-1f465ebcfe249781a55f4541e1f1af4fb331717591c66f7fa145e6eb7af23fa4-runc.fiGPdE.mount: Succeed> Aug 08 11:42:00 worker2.cluster1.openshift.lan ovs-vswitchd[1073]: ovs|04640|connmgr|INFO|br0<->unix#18665: 2 flow_mods in the last 0 s (2 deletes) Aug 08 11:42:01 worker2.cluster1.openshift.lan ovs-vswitchd[1073]: ovs|04641|connmgr|INFO|br0<->unix#18668: 2 flow_mods in the last 0 s (2 deletes) Aug 08 11:42:41 worker2.cluster1.openshift.lan sshd[2291607]: Accepted password for root from 192.168.130.22 port 50426 ssh2 Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd-logind[979]: New session 5 of user root. Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd[1]: Started Session 5 of user root. Aug 08 11:42:41 worker2.cluster1.openshift.lan sshd[2291607]: pam_unix(sshd:session): session opened for user root by (uid=0) Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd-logind[979]: Creating /run/nologin, blocking further logins... Aug 08 11:42:41 worker2.cluster1.openshift.lan sshd[2291663]: Received disconnect from 192.168.130.22 port 50426:11: disconnected by user Aug 08 11:42:41 worker2.cluster1.openshift.lan sshd[2291663]: Disconnected from user root 192.168.130.22 port 50426 Aug 08 11:42:41 worker2.cluster1.openshift.lan sshd[2291607]: pam_unix(sshd:session): session closed for user root Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd[1]: session-5.scope: Succeeded. Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd-logind[979]: Session 5 logged out. Waiting for processes to exit. Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd[1]: session-5.scope: Consumed 38ms CPU time Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd-logind[979]: Removed session 5. 
Aug 08 11:42:41 worker2.cluster1.openshift.lan systemd[1]: run-runc-1f465ebcfe249781a55f4541e1f1af4fb331717591c66f7fa145e6eb7af23fa4-runc.MfcWXH.mount: Succeed> Aug 08 11:42:56 worker2.cluster1.openshift.lan systemd[1]: Starting dnf makecache... Aug 08 11:42:56 worker2.cluster1.openshift.lan dnf[2291848]: Updating Subscription Management repositories. Aug 08 11:43:01 worker2.cluster1.openshift.lan dnf[2291848]: Red Hat Enterprise Linux 8 for x86_64 - BaseOS 20 kB/s | 4.1 kB 00:00 Aug 08 11:43:03 worker2.cluster1.openshift.lan dnf[2291848]: Red Hat Enterprise Linux 8 for x86_64 - BaseOS 36 MB/s | 51 MB 00:01 Aug 08 11:43:06 worker2.cluster1.openshift.lan systemd[1]: run-runc-1f465ebcfe249781a55f4541e1f1af4fb331717591c66f7fa145e6eb7af23fa4-runc.tGaQa6.mount: Succeed> Aug 08 11:43:16 worker2.cluster1.openshift.lan dnf[2291848]: Red Hat OpenShift Container Platform 4.10 for R 25 kB/s | 4.0 kB 00:00 Aug 08 11:43:16 worker2.cluster1.openshift.lan dnf[2291848]: Fast Datapath for RHEL 8 x86_64 (RPMs) 28 kB/s | 4.0 kB 00:00 Aug 08 11:43:17 worker2.cluster1.openshift.lan dnf[2291848]: Red Hat Enterprise Linux 8 for x86_64 - AppStre 31 kB/s | 4.5 kB 00:00 Aug 08 11:43:20 worker2.cluster1.openshift.lan dnf[2291848]: Metadata cache created. Aug 08 11:43:20 worker2.cluster1.openshift.lan systemd[1]: dnf-makecache.service: Succeeded. Aug 08 11:43:20 worker2.cluster1.openshift.lan systemd[1]: Started dnf makecache. lines 1-44 Broadcast message from root.openshift.lan (Mon 2022-08-08 11:43:41 EDT): The system is going down for reboot NOW! Connection to worker2.cluster1.openshift.lan closed by remote host. Connection to worker2.cluster1.openshift.lan closed. Checking audit logs I see: type=USER_AVC msg=audit(1653599800.875:433): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: denied { status } for auid=n/a uid=0 gid=0 path="/usr/lib/systemd/system/reboot.target" cmdline="/usr/lib/systemd/systemd-logind" scontext=system_u:system_r:systemd_logind_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=service permissive=0 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?' type=USER_AVC msg=audit(1659973421.876:424): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: denied { status } for auid=n/a uid=0 gid=0 path="/usr/lib/systemd/system/reboot.target" cmdline="/usr/lib/systemd/systemd-logind" scontext=system_u:system_r:systemd_logind_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=service permissive=1 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?' [root@worker2 ~]# ls -lZ /usr/lib/systemd/systemd /usr/lib/systemd/systemd-logind /usr/lib/systemd/system/reboot.target -rwxr-xr-x. 1 root root system_u:object_r:init_exec_t:s0 1601312 Feb 8 06:33 /usr/lib/systemd/systemd -rwxr-xr-x. 1 root root system_u:object_r:systemd_logind_exec_t:s0 259272 Feb 8 06:33 /usr/lib/systemd/systemd-logind -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 583 Jun 22 2018 /usr/lib/systemd/system/reboot.target well, reboot.target has the wrong context...
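For anyone checking other nodes: those records live in the audit log, and something like the following should pull them with standard auditd tooling (the flags here are just a suggestion):

# ausearch -m USER_AVC,AVC -ts recent
# ausearch -m USER_AVC -ts recent | audit2why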
Fixing the context should resolve the issue [root@worker2 ~]# getenforce Enforcing [root@worker2 ~]# ls -lZ /usr/lib/systemd/system/reboot.target -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 583 Jun 22 2018 /usr/lib/systemd/system/reboot.target [root@worker2 ~]# ssh worker2.cluster1.openshift.lan shutdown --reboot root.openshift.lan's password: Broadcast message from root.openshift.lan (Mon 2022-08-08 12:10:23 EDT): The system is going down for reboot at Mon 2022-08-08 12:11:23 EDT! Shutdown scheduled for Mon 2022-08-08 12:11:23 EDT, use 'shutdown -c' to cancel. [root@worker2 ~]# journalctl --boot -u systemd-logind -f -- Logs begin at Fri 2022-04-08 16:06:18 EDT. -- Aug 08 11:44:15 worker2.cluster1.openshift.lan systemd[1]: Starting Login Service... Aug 08 11:44:15 worker2.cluster1.openshift.lan systemd-logind[987]: New seat seat0. Aug 08 11:44:15 worker2.cluster1.openshift.lan systemd-logind[987]: Watching system buttons on /dev/input/event0 (Power Button) Aug 08 11:44:15 worker2.cluster1.openshift.lan systemd-logind[987]: Watching system buttons on /dev/input/event1 (AT Translated Set 2 keyboard) Aug 08 11:44:15 worker2.cluster1.openshift.lan systemd[1]: Started Login Service. Aug 08 11:45:36 worker2.cluster1.openshift.lan systemd-logind[987]: New session 1 of user root. Aug 08 12:10:23 worker2.cluster1.openshift.lan systemd-logind[987]: New session 3 of user root. Aug 08 12:10:23 worker2.cluster1.openshift.lan systemd-logind[987]: Creating /run/nologin, blocking further logins... Aug 08 12:10:23 worker2.cluster1.openshift.lan systemd-logind[987]: Session 3 logged out. Waiting for processes to exit. Aug 08 12:10:23 worker2.cluster1.openshift.lan systemd-logind[987]: Removed session 3. Aug 08 12:11:23 worker2.cluster1.openshift.lan systemd-logind[987]: Failed to get load state of reboot.target: Access denied Aug 08 12:11:23 worker2.cluster1.openshift.lan systemd-logind[987]: Scheduled shutdown to reboot.target failed: Permission denied ^C type=USER_AVC msg=audit(1659975083.651:128): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: denied { status } for auid=n/a uid=0 gid=0 path="/usr/lib/systemd/system/reboot.target" cmdline="/usr/lib/systemd/systemd-logind" scontext=system_u:system_r:systemd_logind_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=service permissive=0 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?' [root@worker2 ~]# restorecon -v /usr/lib/systemd/system/reboot.target /usr/lib/systemd/system/reboot.target not reset as customized by admin to system_u:object_r:container_file_t:s0 [root@worker2 ~]# restorecon -vF /usr/lib/systemd/system/reboot.target Relabeled /usr/lib/systemd/system/reboot.target from system_u:object_r:container_file_t:s0 to system_u:object_r:power_unit_file_t:s0 [root@worker2 ~]# ssh worker2.cluster1.openshift.lan shutdown --reboot root.openshift.lan's password: Broadcast message from root.openshift.lan (Mon 2022-08-08 12:16:36 EDT): The system is going down for reboot at Mon 2022-08-08 12:17:36 EDT! Shutdown scheduled for Mon 2022-08-08 12:17:36 EDT, use 'shutdown -c' to cancel. [root@worker2 ~]# journalctl --since -1m -u systemd-logind -f -- Logs begin at Fri 2022-04-08 16:06:18 EDT. -- Aug 08 12:16:36 worker2.cluster1.openshift.lan systemd-logind[987]: New session 4 of user root. Aug 08 12:16:36 worker2.cluster1.openshift.lan systemd-logind[987]: Creating /run/nologin, blocking further logins... 
Aug 08 12:16:36 worker2.cluster1.openshift.lan systemd-logind[987]: Session 4 logged out. Waiting for processes to exit.
Aug 08 12:16:36 worker2.cluster1.openshift.lan systemd-logind[987]: Removed session 4.

Broadcast message from root.openshift.lan (Mon 2022-08-08 12:17:36 EDT):

The system is going down for reboot NOW!

Connection to worker2.cluster1.openshift.lan closed by remote host.
Connection to worker2.cluster1.openshift.lan closed.

So, systemd-logind was denied access to reboot.target because the context on reboot.target was incorrect. The fact that it was manually set to container_file_t very likely means that it was improperly mounted into a container. I'm sensing a much bigger problem.
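A quick, non-destructive way to ask the policy what the label on that path should be (matchpathcon comes from libselinux-utils):

# matchpathcon -V /usr/lib/systemd/system/reboot.target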
After the system rebooted the file context for reboot.target was changed back to container_file_t. Looking at other files in /usr/lib/systemd/ I see the same issue: [root@worker2 ~]# ls -lZ /usr/lib/systemd/system/ | grep container_file_t -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 1512 Jan 8 2020 auditd.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 551 Jun 15 2021 chronyd.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 197 May 6 04:38 crio-subid.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 356 Jun 12 2019 crond.service -r--r--r--. 1 root root system_u:object_r:container_file_t:s0 248 Jan 4 2022 dm-event.socket -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 320 Apr 12 2021 dnf-makecache.timer -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 1975 Feb 8 06:32 getty@.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 441 Feb 15 2021 import-state.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 224 Jan 30 2021 irqbalance.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 175 Mar 11 2021 iscsid.socket -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 366 Aug 9 2021 iscsi-onboot.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 594 Aug 9 2021 iscsi.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 165 Mar 11 2021 iscsiuio.socket -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 388 Nov 2 2021 kdump.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 355 Feb 15 2021 loadmodules.service -r--r--r--. 1 root root system_u:object_r:container_file_t:s0 239 Jan 4 2022 lvm2-lvmpolld.socket -r--r--r--. 1 root root system_u:object_r:container_file_t:s0 575 Jan 4 2022 lvm2-monitor.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 310 Feb 23 23:13 mdmonitor.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 284 Feb 14 14:50 microcode.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 831 Feb 7 20:09 multipathd.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 304 Feb 7 20:09 multipathd.socket -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 532 Jun 22 2018 multi-user.target -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 641 Mar 21 13:16 NetworkManager-dispatcher.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 1351 Mar 21 13:16 NetworkManager.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 1148 Mar 21 13:16 NetworkManager-wait-online.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 413 Mar 12 14:59 nfs-client.target -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 378 Aug 12 2018 nis-domainname.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 401 Feb 15 2021 openvswitch.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 522 Feb 8 19:21 qemu-guest-agent.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 583 Jun 22 2018 reboot.target -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 522 Jun 22 2018 remote-fs.target -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 184 Feb 8 13:49 rhsmcertd.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 544 Nov 24 2020 rpcbind.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 368 Nov 24 2020 rpcbind.socket -rw-r--r--. 
1 root root system_u:object_r:container_file_t:s0 584 Aug 6 2021 rsyslog.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 406 Feb 23 06:35 selinux-autorelabel-mark.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 456 Oct 26 2021 sshd.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 186 Apr 21 16:02 sssd-kcm.socket -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 641 Apr 21 16:02 sssd.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 238 Nov 7 2017 timedatex.service -rw-r--r--. 1 root root system_u:object_r:container_file_t:s0 346 Apr 30 2021 unbound-anchor.timer I assume either a podman container or a pod is volume mounting some/all of these files in and having them relabeled automatically on boot. I can't find a service or recent pod that is doing this, though. The relabeling of system files, especially under /usr, shouldn't be performed. I would bet this issue (incorrect mounting) exists on all clusters but doesn't manifest on RHCOS because /usr is read-only.
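For reference, a dry run of restorecon should enumerate everything that has drifted from the policy default without changing anything (-n makes no changes, -F compares against the policy rather than honoring the "customized by admin" flag):

# restorecon -R -n -v -F /usr/lib/systemd /etc/systemd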
The OpenShift Update Service is not responsible for upgrading RHEL worker nodes and has no role in it. We are trying to figure out which component this bug should be moved to.
Only because running the playbooks seems to be the triggering event that relabels the files, we'll put the bug there for now. We need to help Derrick with this research. Derrick, just to confirm: based on some of your more recent output, this happens when the OS has been updated to 8.6? Does this by chance not happen if the host OS is 8.4? Just trying to figure out which parts of the system we can pin to components we know didn't produce this problem in the past.
This was a fresh RHEL 8.5 install. I did an initial OS update to the latest 8.5.z, but since then I've not done any OS updates manually. The node is still in the state it was when I reported this bug # yum history list Updating Subscription Management repositories. ID | Command line | Date and time | Action(s) | Altered ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 8 | install lsof | 2022-08-08 12:21 | Install | 1 7 | install tree | 2022-07-07 11:48 | Install | 1 6 | | 2022-05-26 17:15 | Upgrade | 1 5 | | 2022-05-19 08:55 | I, U | 101 4 | install runc cri-o podman skopeo | 2022-04-11 15:36 | Install | 15 3 | remove runc | 2022-04-11 13:54 | Removed | 15 EE 2 | | 2022-04-08 16:02 | Install | 101 1 | | 2022-04-08 14:18 | Install | 393 EE I believe 2022-04-11 is when I added the node to the existing cluster, and 2022-05-19 is when I first tried to run the upgrade playbook # rpm -q redhat-release runc cri-o podman skopeo redhat-release-8.5-0.8.el8.x86_64 runc-1.0.3-2.module+el8.6.0+14877+f643d2d6.x86_64 cri-o-1.23.2-10.rhaos4.10.gitf5a1c11.el8.x86_64 podman-4.0.2-6.module+el8.6.0+14877+f643d2d6.x86_64 skopeo-1.6.1-2.module+el8.6.0+14877+f643d2d6.x86_64 # ./oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.10.12 True False 98d Cluster version is 4.10.12 # ./oc get nodes NAME STATUS ROLES AGE VERSION master0 Ready master 166d v1.23.5+70fb84c master1 Ready master 166d v1.23.5+70fb84c master2 Ready master 166d v1.23.5+70fb84c worker0 Ready worker,worker-custom-storage 166d v1.23.5+70fb84c worker1 Ready worker,worker-custom-storage 166d v1.23.5+70fb84c worker2.cluster1.openshift.lan Ready,SchedulingDisabled worker 137d v1.23.5+3afdacb
This looks like something relabeled /usr/lib/systemd/system/ as if it were inside of a container, which is not something OpenShift should be doing.

Run `sudo restorecon -R -v -F /usr/lib/systemd`
(In reply to Daniel Walsh from comment #18) > This looks like something relabeled /usr/lib/systemd/system/ to be inside of > a container, which is not something OpenShift should be doing. > > Run `sudo restorecon -R -v -F /usr/lib/systemd` Yes, it's possible to restore the correct file context, but it will be changed back to container_file_t every time kubelet is started. Something about starting kubelet.service runs some container with podman, and this relabels the files. As strange as it sounds, I can't find where/how podman is being run. I think we need someone from the Node team to help determine why this is happening. Once we find the offending container we can work with the team that owns that command/container to stop calling for a relabel on their volume mount.
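As a starting point for whoever picks this up, a rough way to hunt for podman invocations that mount host paths with a relabel option out of the systemd units on the node (purely a heuristic grep; the relabel could also be coming from a pod spec rather than a unit file):

# grep -rl 'podman run' /etc/systemd/system /usr/lib/systemd/system
# grep -rn -- '--volume .*:z' /etc/systemd/system /usr/lib/systemd/system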
I think this really is https://github.com/openshift/machine-config-operator/blob/release-4.10/templates/common/on-prem/units/nodeip-configuration.service.yaml --- ExecStart=/bin/bash -c " \ until \ /usr/bin/podman run --rm \ --authfile /var/lib/kubelet/config.json \ --net=host \ --volume /etc/systemd/system:/etc/systemd/system:z \ {{ .Images.baremetalRuntimeCfgImage }} \ node-ip \ set --retry-on-failure \ {{ onPremPlatformAPIServerInternalIP . }}; \ do \ sleep 5; \ done" --- That relabels /usr/lib/systemd/system/reboot.target via the /etc/systemd/system/ctrl-alt-del.target symlink # ls -lR /etc/systemd/system/ctrl-alt-del.target lrwxrwxrwx. 1 root root 37 Oct 20 22:15 /etc/systemd/system/ctrl-alt-del.target -> /usr/lib/systemd/system/reboot.target I knew that services like nodeip-configuration were doing bad relabels on /etc/systemd/system, but I was so focused on /usr/lib/systemd that I didn't put the two pieces together until recently. Colin fixed this with https://github.com/openshift/machine-config-operator/commit/10128e8101e40a42f8fd70e391237ca0ed1bb077, so now we have https://github.com/openshift/machine-config-operator/blob/release-4.11/templates/common/on-prem/units/nodeip-configuration.service.yaml --- ExecStart=/bin/bash -c " \ until \ /usr/bin/podman run --rm \ --authfile /var/lib/kubelet/config.json \ --net=host \ --security-opt label=disable \ --volume /etc/systemd/system:/etc/systemd/system \ {{ .Images.baremetalRuntimeCfgImage }} \ node-ip \ set --retry-on-failure \ {{ onPremPlatformAPIServerInternalIP . }}; \ do \ sleep 5; \ done" --- I guess we can close this CURRENTRELEASE, unless there's other templates/services doing relabels of system files that we think still need to be fixed? I haven't tried this on 4.11+, and I don't know when I'll find the time to do that.
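For anyone who wants to confirm the mechanism on a throwaway RHEL 8 box rather than a cluster node, a minimal sketch (the image here is just an arbitrary small one, not the baremetalRuntimeCfg image, and the run really does relabel host unit files, so don't try it anywhere you care about):

# ls -lZ /usr/lib/systemd/system/reboot.target
# podman run --rm --volume /etc/systemd/system:/etc/systemd/system:z registry.access.redhat.com/ubi8/ubi-minimal true
# ls -lZ /usr/lib/systemd/system/reboot.target

If the symlink explanation above is right, the second ls should show reboot.target flipped to container_file_t, reached through /etc/systemd/system/ctrl-alt-del.target.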
Let's mark it ON_QA with a target release of 4.11. Once QE has verified it, we can look at backporting the change to 4.10.z if that hasn't been done already. Anyone who can reproduce this on 4.10 and confirm that it doesn't happen on 4.11 is welcome to move this bug to CLOSED CURRENTRELEASE and clone it for 4.10.z.
Trying to reproduce the bug, but I'm stuck on this Ansible task:

TASK [openshift_node : Get available cri-o RPM versions] *************************************************************************************************************************
Monday 05 December 2022  03:29:48 -0500 (0:00:00.061)       0:01:22.372 *******
fatal: [worker-0-1]: FAILED! => {"changed": false, "msg": "The Python 2 bindings for rpm are needed for this module. If you require Python 3 support use the `dnf` Ansible module instead.. The Python 2 yum module is needed for this module. If you require Python 3 support use the `dnf` Ansible module instead."}

[kni@provisionhost-0-0 openshift-ansible]$ rpm -q openshift-ansible ansible
openshift-ansible-4.10.0-202207291637.p0.g1e68436.assembly.stream.el7.noarch
ansible-2.9.27-1.el8ae.noarch

(python39-venv) [# ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.12   True        False         114m    Cluster version is 4.10.12

@dornelas please advise
Rei, you're running the scaleup playbook, correct?

I had to install a new 4.10.12 cluster to run through this again. I have 3 masters and 2 workers, all running RHCOS:

# oc get nodes -owide
NAME      STATUS   ROLES    AGE   VERSION           INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
master0   Ready    master   23h   v1.23.5+70fb84c   192.168.140.10   <none>        Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa)   4.18.0-305.45.1.el8_4.x86_64   cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8
master1   Ready    master   23h   v1.23.5+70fb84c   192.168.140.11   <none>        Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa)   4.18.0-305.45.1.el8_4.x86_64   cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8
master2   Ready    master   23h   v1.23.5+70fb84c   192.168.140.12   <none>        Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa)   4.18.0-305.45.1.el8_4.x86_64   cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8
worker0   Ready    worker   23h   v1.23.5+70fb84c   192.168.140.20   <none>        Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa)   4.18.0-305.45.1.el8_4.x86_64   cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8
worker1   Ready    worker   23h   v1.23.5+70fb84c   192.168.140.21   <none>        Red Hat Enterprise Linux CoreOS 410.84.202204261500-0 (Ootpa)   4.18.0-305.45.1.el8_4.x86_64   cri-o://1.23.2-8.rhaos4.10.git8ad5d25.el8

When adding a RHEL 8.5 node as a 3rd worker I can run the scaleup playbook successfully:

---
# ansible-playbook -i /root/inventory/hosts playbooks/scaleup.yml

PLAY [Pre-scaleup hostfile checks] *****************************************************************************************************************************

TASK [openshift_node : Ensure [new_workers] group is populated] ************************************************************************************************
Tuesday 06 December 2022  21:59:49 +0000 (0:00:00.065)       0:00:00.065 ******
skipping: [localhost]

TASK [openshift_node : Get cluster nodes] **********************************************************************************************************************
Tuesday 06 December 2022  21:59:49 +0000 (0:00:00.069)       0:00:00.135 ******
ok: [localhost]

TASK [openshift_node : Check for nodes which are already part of the cluster] **********************************************************************************
Tuesday 06 December 2022  21:59:53 +0000 (0:00:04.117)       0:00:04.253 ******
skipping: [localhost] => (item=worker2.cluster2.openshift.lan)

TASK [openshift_node : Fail if new_workers group contains active nodes] ****************************************************************************************
Tuesday 06 December 2022  21:59:53 +0000 (0:00:00.056)       0:00:04.309 ******
skipping: [localhost]

PLAY [Pre-scaleup checks] **************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************
Tuesday 06 December 2022  21:59:53 +0000 (0:00:00.048)       0:00:04.357 ******
ok: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Get cluster version] ********************************************************************************************************************
Tuesday 06 December 2022  21:59:55 +0000 (0:00:01.678)       0:00:06.036 ******
ok: [worker2.cluster2.openshift.lan -> localhost]

TASK [openshift_node : Set fact l_cluster_version] *************************************************************************************************************
Tuesday 06 December 2022  21:59:55 +0000 (0:00:00.381)       0:00:06.418 ******
ok: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Fail if not using RHEL8 beginning with version 4.10] ************************************************************************************
Tuesday 06 December 2022  21:59:55 +0000 (0:00:00.057)       0:00:06.475 ******
skipping: [worker2.cluster2.openshift.lan]

PLAY [install nodes] *******************************************************************************************************************************************

TASK [openshift_node : include_tasks] **************************************************************************************************************************
Tuesday 06 December 2022  21:59:55 +0000 (0:00:00.065)       0:00:06.540 ******
included: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/install.yml for worker2.cluster2.openshift.lan

TASK [openshift_node : Retrieve rendered-worker name] **********************************************************************************************************
Tuesday 06 December 2022  21:59:55 +0000 (0:00:00.099)       0:00:06.640 ******
ok: [worker2.cluster2.openshift.lan -> localhost]

TASK [openshift_node : Check cluster FIPS status] **************************************************************************************************************
Tuesday 06 December 2022  21:59:55 +0000 (0:00:00.364)       0:00:07.004 ******
ok: [worker2.cluster2.openshift.lan -> localhost]

TASK [openshift_node : Fail if host FIPS status does not match cluster FIPS status] ****************************************************************************
Tuesday 06 December 2022  21:59:56 +0000 (0:00:00.394)       0:00:07.399 ******
skipping: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Get cluster version] ********************************************************************************************************************
Tuesday 06 December 2022  21:59:56 +0000 (0:00:00.055)       0:00:07.454 ******
ok: [worker2.cluster2.openshift.lan -> localhost]

TASK [openshift_node : Set fact l_cluster_version] *************************************************************************************************************
Tuesday 06 December 2022  21:59:56 +0000 (0:00:00.393)       0:00:07.848 ******
ok: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Get kubernetes server version] **********************************************************************************************************
Tuesday 06 December 2022  21:59:56 +0000 (0:00:00.062)       0:00:07.911 ******
ok: [worker2.cluster2.openshift.lan -> localhost]

TASK [openshift_node : Set fact l_kubernetes_server_version] ***************************************************************************************************
Tuesday 06 December 2022  21:59:57 +0000 (0:00:00.421)       0:00:08.332 ******
ok: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Get available cri-o RPM versions] *******************************************************************************************************
Tuesday 06 December 2022  21:59:57 +0000 (0:00:00.071)       0:00:08.404 ******
ok: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Set fact crio_latest] *******************************************************************************************************************
Tuesday 06 December 2022  22:00:06 +0000 (0:00:09.538)       0:00:17.942 ******
ok: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Fail if cri-o is less than current kubernetes server version] ***************************************************************************
Tuesday 06 December 2022  22:00:06 +0000 (0:00:00.063)       0:00:18.006 ******
skipping: [worker2.cluster2.openshift.lan]

TASK [openshift_node : Install openshift packages] *************************************************************************************************************
Tuesday 06 December 2022  22:00:07 +0000 (0:00:00.064)       0:00:18.070 ******

[...]

TASK [openshift_node : Wait for node to report ready] **********************************************************************************************************
Tuesday 06 December 2022  22:09:40 +0000 (0:00:12.052)       0:09:51.513 ******
FAILED - RETRYING: Wait for node to report ready (30 retries left).
FAILED - RETRYING: Wait for node to report ready (29 retries left).
ok: [worker2.cluster2.openshift.lan -> localhost]

PLAY RECAP *****************************************************************************************************************************************************
localhost                      : ok=1    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
worker2.cluster2.openshift.lan : ok=46   changed=28   unreachable=0    failed=0    skipped=9    rescued=0    ignored=0

Tuesday 06 December 2022  22:10:21 +0000 (0:00:41.202)       0:10:32.716 ******
===============================================================================
openshift_node : Install openshift packages ----------------------------------------------------------------------------------------------------------- 364.75s
openshift_node : Reboot the host and wait for it to come back ----------------------------------------------------------------------------------------- 147.84s
openshift_node : Wait for node to report ready --------------------------------------------------------------------------------------------------------- 41.20s
openshift_node : Approve node CSRs --------------------------------------------------------------------------------------------------------------------- 12.05s
openshift_node : Pull release image -------------------------------------------------------------------------------------------------------------------- 10.91s
openshift_node : Get available cri-o RPM versions ------------------------------------------------------------------------------------------------------- 9.54s
openshift_node : Pull MCD image ------------------------------------------------------------------------------------------------------------------------- 4.99s
openshift_node : Get cluster nodes ---------------------------------------------------------------------------------------------------------------------- 4.12s
openshift_node : Check for cluster http proxy ----------------------------------------------------------------------------------------------------------- 3.58s
openshift_node : Check for cluster https proxy ---------------------------------------------------------------------------------------------------------- 3.52s
openshift_node : Apply ignition manifest ---------------------------------------------------------------------------------------------------------------- 3.09s
openshift_node : Check for cluster no proxy ------------------------------------------------------------------------------------------------------------- 2.99s
openshift_node : Setting sebool container_manage_cgroup ------------------------------------------------------------------------------------------------- 2.24s
openshift_node : Restart the CRI-O service -------------------------------------------------------------------------------------------------------------- 2.12s
openshift_node : Setting sebool virt_use_samba ---------------------------------------------------------------------------------------------------------- 1.94s
openshift_node : Setting sebool container_use_cephfs ---------------------------------------------------------------------------------------------------- 1.86s
Gathering Facts ----------------------------------------------------------------------------------------------------------------------------------------- 1.68s
openshift_node : Get machine controller daemon image from release image --------------------------------------------------------------------------------- 1.31s
openshift_node : Creating NM config file ---------------------------------------------------------------------------------------------------------------- 1.11s
openshift_node : Fetch bootstrap ignition file locally -------------------------------------------------------------------------------------------------- 1.04s
---

Note that I'm now running openshift-ansible-4.10.0-202210242129.p0.gb1f6fe3.assembly.stream.el8.noarch from the rhocp-4.10-for-rhel-8-x86_64-rpms repo and not the el7 package:

# rpm -q openshift-ansible ansible
openshift-ansible-4.10.0-202210242129.p0.gb1f6fe3.assembly.stream.el8.noarch
ansible-2.9.27-1.el8ae.noarch

When I first tried this we weren't producing/shipping el8 builds of openshift-ansible yet.

I'm not an ansible expert, but my guess is that 1) there is a python version issue either on your client or the node, or 2) you may not be referencing a RHEL 8 host in your inventory.
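If it is guess 1, pinning the RHEL 8 platform interpreter for the host in the inventory should rule it out. A minimal sketch, assuming the host sits in the [new_workers] group that scaleup.yml reads (adjust the host name to your environment):

---
[new_workers]
worker-0-1

[new_workers:vars]
# force the RHEL 8 platform interpreter so the package tasks get the python3 dnf bindings
ansible_python_interpreter=/usr/libexec/platform-python
---

For guess 2, running cat /etc/redhat-release on the inventory host will confirm whether it is actually a RHEL 8 machine.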
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-12-09-094312   True        False         16h     Cluster version is 4.11.0-0.nightly-2022-12-09-094312

Baremetal UPI environment

Scale up playbook:

PLAY RECAP *************************************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
worker-0-3                 : ok=49   changed=29   unreachable=0    failed=0    skipped=6    rescued=0    ignored=0

Monday 12 December 2022  10:39:39 -0500 (0:01:01.370)       0:12:06.047 *******
===============================================================================
openshift_node : Install openshift packages ------------------------------------------------------------------------------------------------------- 424.93s
openshift_node : Reboot the host and wait for it to come back ------------------------------------------------------------------------------------- 147.98s
openshift_node : Wait for node to report ready ----------------------------------------------------------------------------------------------------- 61.37s
openshift_node : Get available cri-o RPM versions -------------------------------------------------------------------------------------------------- 24.63s
openshift_node : Approve node CSRs ----------------------------------------------------------------------------------------------------------------- 17.34s
openshift_node : Pull release image ---------------------------------------------------------------------------------------------------------------- 11.77s
openshift_node : Pull MCD image --------------------------------------------------------------------------------------------------------------------- 8.63s
openshift_node : Apply ignition manifest ------------------------------------------------------------------------------------------------------------ 4.31s
openshift_node : Restart the CRI-O service ---------------------------------------------------------------------------------------------------------- 2.04s
openshift_node : Setting sebool container_manage_cgroup --------------------------------------------------------------------------------------------- 1.54s
Gathering Facts ------------------------------------------------------------------------------------------------------------------------------------- 1.49s
openshift_node : Setting sebool container_use_cephfs ------------------------------------------------------------------------------------------------ 1.43s
openshift_node : Setting sebool virt_use_samba ------------------------------------------------------------------------------------------------------ 1.38s
openshift_node : Get machine controller daemon image from release image ----------------------------------------------------------------------------- 1.31s
openshift_node : Creating NM config file ------------------------------------------------------------------------------------------------------------ 1.21s
openshift_node : Fetch bootstrap ignition file locally ---------------------------------------------------------------------------------------------- 1.02s
openshift_node : Write /etc/containers/registries.conf ---------------------------------------------------------------------------------------------- 0.99s
openshift_node : Copy additional trust bundle to system CA trust ------------------------------------------------------------------------------------ 0.94s
openshift_node : Write pull-secret to file ---------------------------------------------------------------------------------------------------------- 0.94s
openshift_node : Enable the CRI-O service ----------------------------------------------------------------------------------------------------------- 0.87s

Upgrade playbook:

PLAY RECAP *************************************************************************************************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
worker-0-3                 : ok=36   changed=18   unreachable=0    failed=0    skipped=14   rescued=0    ignored=0

Monday 12 December 2022  10:51:05 -0500 (0:00:00.050)       0:02:29.292 *******
===============================================================================
openshift_node : Reboot the host and wait for it to come back -------------------------------------------------------------------------------------- 87.93s
openshift_node : Install openshift packages -------------------------------------------------------------------------------------------------------- 31.35s
openshift_node : Get available cri-o RPM versions --------------------------------------------------------------------------------------------------- 8.49s
openshift_node : Pull MCD image --------------------------------------------------------------------------------------------------------------------- 4.21s
Gathering Facts ------------------------------------------------------------------------------------------------------------------------------------- 2.08s
openshift_node : Apply machine config --------------------------------------------------------------------------------------------------------------- 2.00s
openshift_node : Write worker ignition config to file ----------------------------------------------------------------------------------------------- 1.11s
openshift_node : Copy additional trust bundle to system CA trust ------------------------------------------------------------------------------------ 0.90s
openshift_node : Drain node prior to upgrade -------------------------------------------------------------------------------------------------------- 0.87s
openshift_node : Enable the CRI-O service ----------------------------------------------------------------------------------------------------------- 0.86s
openshift_node : Update CA trust -------------------------------------------------------------------------------------------------------------------- 0.64s
openshift_node : Get cluster version ---------------------------------------------------------------------------------------------------------------- 0.56s
openshift_node : Check for cluster https proxy ------------------------------------------------------------------------------------------------------ 0.51s
openshift_node : Check for cluster http proxy ------------------------------------------------------------------------------------------------------- 0.49s
openshift_node : Check for cluster no proxy --------------------------------------------------------------------------------------------------------- 0.45s
openshift_node : Retrieve rendered-worker name ------------------------------------------------------------------------------------------------------ 0.45s
openshift_node : Get worker ignition config --------------------------------------------------------------------------------------------------------- 0.45s
openshift_node : Cordon node prior to upgrade ------------------------------------------------------------------------------------------------------- 0.40s
openshift_node : Check cluster FIPS status ---------------------------------------------------------------------------------------------------------- 0.40s
openshift_node : Create temp directory -------------------------------------------------------------------------------------------------------------- 0.39s

oc get node -owide
NAME         STATUS   ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
master-0-0   Ready    master   3h35m   v1.24.6+5658434   192.168.      <none>        Red Hat Enterprise Linux CoreOS 411.86.202212072103-0 (Ootpa)   4.18.0-372.36.1.el8_6.x86_64   cri-o://1.24.3-7.rhaos4.11.gitae53655.el8
master-0-1   Ready    master   3h33m   v1.24.6+5658434   192.168.      <none>        Red Hat Enterprise Linux CoreOS 411.86.202212072103-0 (Ootpa)   4.18.0-372.36.1.el8_6.x86_64   cri-o://1.24.3-7.rhaos4.11.gitae53655.el8
master-0-2   Ready    master   3h31m   v1.24.6+5658434   192.168.      <none>        Red Hat Enterprise Linux CoreOS 411.86.202212072103-0 (Ootpa)   4.18.0-372.36.1.el8_6.x86_64   cri-o://1.24.3-7.rhaos4.11.gitae53655.el8
worker-0-0   Ready    worker   3h8m    v1.24.6+5658434   192.168.      <none>        Red Hat Enterprise Linux CoreOS 411.86.202212072103-0 (Ootpa)   4.18.0-372.36.1.el8_6.x86_64   cri-o://1.24.3-7.rhaos4.11.gitae53655.el8
worker-0-1   Ready    worker   3h8m    v1.24.6+5658434   192.168.      <none>        Red Hat Enterprise Linux CoreOS 411.86.202212072103-0 (Ootpa)   4.18.0-372.36.1.el8_6.x86_64   cri-o://1.24.3-7.rhaos4.11.gitae53655.el8
worker-0-2   Ready    worker   46m     v1.24.6+5658434   192.168.      <none>        Red Hat Enterprise Linux 8.5 (Ootpa)                             4.18.0-425.3.1.el8.x86_64      cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
worker-0-3   Ready    worker   13m     v1.24.6+5658434   192.168.      <none>        Red Hat Enterprise Linux 8.5 (Ootpa)                             4.18.0-425.3.1.el8.x86_64      cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8

[kni@provisionhost-0-0 ~]$ rpm -q openshift-ansible ansible
openshift-ansible-4.10.0-202207291637.p0.g1e68436.assembly.stream.el7.noarch
ansible-2.9.27-1.el8ae.noarch
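As one last sanity check (a sketch only, reusing the node name from the output above), the upgraded RHEL worker can be confirmed uncordoned and on the expected cri-o build with:

---
$ oc get node worker-0-3 -o jsonpath='{.spec.unschedulable}{"\n"}'
$ oc get node worker-0-3 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}{"\n"}'
---

Empty output from the first command means the node is schedulable again; the second should report the cri-o version shown in the table above.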
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.11.20 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:8893