Created attachment 1441310
undercloud_Update.log

Description of problem:
OSP10: undercloud update gets stuck when updating from GA (RHEL 7.3) to latest (RHEL 7.5).

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 GA
2. Minor update to OSP10 latest (including RHEL update from 7.3 to 7.5)

Actual results:
Undercloud update gets stuck.

Expected results:
Undercloud update is successful.

Additional info:
Side note: running 'sysctl -a' gets stuck:

[root@undercloud-0 ~]# sysctl -a
abi.vsyscall32 = 1
crypto.fips_enabled = 0
debug.exception-trace = 1
debug.kprobes-optimization = 1
dev.hpet.max-user-freq = 64
dev.mac_hid.mouse_button2_keycode = 97
dev.mac_hid.mouse_button3_keycode = 100
dev.mac_hid.mouse_button_emulation = 0
dev.parport.default.spintime = 500
dev.parport.default.timeslice = 200
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
dev.scsi.logging_level = 0
fs.aio-max-nr = 1048576
fs.aio-nr = 0

Attaching undercloud update output.
Note: trying to generate the sosreport gets stuck as well.
[May29 05:09] INFO: task fuser:32191 blocked for more than 120 seconds.
[ +0.003212] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.003369] fuser D ffff8804fbb14500 0 32191 1 0x00000084
[ +0.003227] ffff880458f23a30 0000000000000086 ffff880228283ec0 ffff880458f23fd8
[ +0.003467] ffff880458f23fd8 ffff880458f23fd8 ffff880228283ec0 ffff880512fb0a18
[ +0.003326] ffff880512fb0a20 7fffffffffffffff ffff880228283ec0 ffff8804fbb14500
[ +0.003653] Call Trace:
[ +0.002500] [<ffffffff8168c6f9>] schedule+0x29/0x70
[ +0.002937] [<ffffffff8168a139>] schedule_timeout+0x239/0x2c0
[ +0.003023] [<ffffffff810b18e6>] ? finish_wait+0x56/0x70
[ +0.002947] [<ffffffff8168a882>] ? mutex_lock+0x12/0x2f
[ +0.003015] [<ffffffff8128a052>] ? autofs4_wait+0x3f2/0x900
[ +0.002952] [<ffffffff8168cad6>] wait_for_completion+0x116/0x170
[ +0.002973] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
[ +0.002821] [<ffffffff8128b1cb>] autofs4_expire_wait+0x6b/0x110
[ +0.002882] [<ffffffff81288282>] do_expire_wait+0x172/0x190
[ +0.002746] [<ffffffff8128847f>] autofs4_d_manage+0x6f/0x170
[ +0.002733] [<ffffffff812092e5>] follow_managed+0xb5/0x300
[ +0.002744] [<ffffffff81209c4b>] lookup_fast+0x19b/0x2e0
[ +0.002722] [<ffffffff8120c535>] path_lookupat+0x165/0x7a0
[ +0.002687] [<ffffffff81686062>] ? avc_alloc_node+0x116/0x125
[ +0.002677] [<ffffffff811de835>] ? kmem_cache_alloc+0x35/0x1e0
[ +0.002715] [<ffffffff8120f48f>] ? getname_flags+0x4f/0x1a0
[ +0.002705] [<ffffffff8120cb9b>] filename_lookup+0x2b/0xc0
[ +0.002507] [<ffffffff812105b7>] user_path_at_empty+0x67/0xc0
[ +0.002541] [<ffffffff81114bb2>] ? from_kgid_munged+0x12/0x20
[ +0.002511] [<ffffffff81203f9f>] ? cp_new_stat+0x14f/0x180
[ +0.002502] [<ffffffff81210621>] user_path_at+0x11/0x20
[ +0.002452] [<ffffffff81203a93>] vfs_fstatat+0x63/0xc0
[ +0.002338] [<ffffffff81203ffe>] SYSC_newstat+0x2e/0x60
[ +0.002361] [<ffffffff8111f486>] ? __audit_syscall_exit+0x1e6/0x280
[ +0.002492] [<ffffffff812042de>] SyS_newstat+0xe/0x10
[ +0.002272] [<ffffffff81697709>] system_call_fastpath+0x16/0x1b
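The hung-task message above reports a single blocked process (fuser), but other commands such as sysctl and ls were blocked in the same way. As a rough sketch for spotting all of them at once, the helper below filters `ps` output for tasks in uninterruptible sleep (state D). The function name `list_dstate_tasks` is made up for illustration, and it is an assumption that `ps` itself still runs on an affected host.

```shell
#!/usr/bin/env bash
# Sketch only: list_dstate_tasks is a hypothetical helper, not part of any tool.
# It reads lines of the form "PID STAT COMMAND" on stdin and prints the PID and
# command name of every task whose state starts with D (uninterruptible sleep).
list_dstate_tasks() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Example on a live system (header suppressed by the empty "=" column titles):
#   ps -o pid= -o stat= -o comm= -e | list_dstate_tasks
```

Any process listed here that was touching /proc/sys/fs is a likely victim of the same autofs wait shown in the call trace.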
[root@undercloud-0 ~]# ip netns exec qdhcp-8b0d5db6-e61f-435a-bf0f-fb8b91de50a6 ls /proc/sys/fs
aio-max-nr binfmt_misc dir-notify-enable file-max inode-nr inotify leases-enable nfs overflowgid pipe-max-size pipe-user-pages-soft protected_symlinks suid_dumpable
aio-nr dentry-state epoll file-nr inode-state lease-break-time mqueue nr_open overflowuid pipe-user-pages-hard protected_hardlinks quota xfs

[root@undercloud-0 ~]# ls /proc/sys/fs
... HANGS
[stack@undercloud-0 ~]$ mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=9818928k,nr_inodes=2454732,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/vda1 on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
tmpfs on /run/user/1001 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=1967712k,mode=700,uid=1001,gid=1001)
tmpfs on /run/netns type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
proc on /run/netns/qdhcp-8b0d5db6-e61f-435a-bf0f-fb8b91de50a6 type proc (rw,nosuid,nodev,noexec,relatime)
proc on /run/netns/qdhcp-8b0d5db6-e61f-435a-bf0f-fb8b91de50a6 type proc (rw,nosuid,nodev,noexec,relatime)
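The only autofs entry in the mount output above is systemd-1 on /proc/sys/fs/binfmt_misc, which matches the autofs4_* frames in the call trace. A small sketch for picking such entries out of `mount` output without touching the mount point itself (which is what hangs) could look like this. The function name `list_autofs_mounts` is hypothetical, and the parsing assumes mount points without spaces.

```shell
#!/usr/bin/env bash
# Sketch only: list_autofs_mounts is a hypothetical helper.
# Reads mount(8)-style lines on stdin:
#   SOURCE on TARGET type FSTYPE (OPTIONS)
# and prints TARGET for every line whose FSTYPE is autofs.
# Assumption: mount points contain no whitespace.
list_autofs_mounts() {
    awk '$2 == "on" && $4 == "type" && $5 == "autofs" { print $3 }'
}

# Example on a live system:
#   mount | list_autofs_mounts
```

Reading the mount table this way only parses text, so it should not trigger the path lookup that blocks in autofs4_expire_wait.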
[root@undercloud-0 ~]# lsof | grep proc | grep fs
systemd 1 root 23r DIR 0,36 0 12511 /proc/sys/fs/binfmt_misc
systemd 1 27015 root 23r DIR 0,36 0 12511 /proc/sys/fs/binfmt_misc
kdevtmpfs 17 root txt unknown /proc/17/exe
fsnotify_ 33 root txt unknown /proc/33/exe
xfsalloc 279 root txt unknown /proc/279/exe
xfs_mru_c 280 root txt unknown /proc/280/exe
xfs-buf/v 281 root txt unknown /proc/281/exe
xfs-data/ 282 root txt unknown /proc/282/exe
xfs-conv/ 283 root txt unknown /proc/283/exe
xfs-cil/v 284 root txt unknown /proc/284/exe
xfs-recla 285 root txt unknown /proc/285/exe
xfs-log/v 286 root txt unknown /proc/286/exe
xfs-eofbl 287 root txt unknown /proc/287/exe
xfsaild/v 288 root txt unknown /proc/288/exe
ls 11180 stack 3r DIR 0,3 0 8672 /proc/sys/fs
ls 20238 root 3r DIR 0,3 0 8672 /proc/sys/fs
sysctl 24338 root 4r DIR 0,3 0 8672 /proc/sys/fs
sysctl 32648 root 4r DIR 0,3 0 8672 /proc/sys/fs
I realized this is probably a systemd issue. I tried a downgrade:

yum downgrade systemd* libgudev*

and the system got unstuck.
Resolving Dependencies
--> Running transaction check
---> Package libgudev1.x86_64 0:219-42.el7_4.10 will be a downgrade
---> Package libgudev1.x86_64 0:219-57.el7 will be erased
---> Package systemd.x86_64 0:219-42.el7_4.10 will be a downgrade
---> Package systemd.x86_64 0:219-57.el7 will be erased
---> Package systemd-libs.i686 0:219-42.el7_4.10 will be a downgrade
---> Package systemd-libs.x86_64 0:219-42.el7_4.10 will be a downgrade
---> Package systemd-libs.i686 0:219-57.el7 will be erased
---> Package systemd-libs.x86_64 0:219-57.el7 will be erased
---> Package systemd-sysv.x86_64 0:219-42.el7_4.10 will be a downgrade
---> Package systemd-sysv.x86_64 0:219-57.el7 will be erased
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package         Arch      Version            Repository                  Size
================================================================================
Downgrading:
 libgudev1       x86_64    219-42.el7_4.10    rhelosp-rhel-7.5-server     85 k
 systemd         x86_64    219-42.el7_4.10    rhelosp-rhel-7.5-server    5.2 M
 systemd-libs    i686      219-42.el7_4.10    rhelosp-rhel-7.5-server    378 k
 systemd-libs    x86_64    219-42.el7_4.10    rhelosp-rhel-7.5-server    378 k
 systemd-sysv    x86_64    219-42.el7_4.10    rhelosp-rhel-7.5-server     72 k

Transaction Summary
================================================================================
Downgrade  5 Packages
Reproduced the issue on Overcloud too.
for i in `nova list|awk '/ACTIVE/ {print $(NF-1)}' |awk -F"=" '{print $NF}'`; do echo $i; ssh -o StrictHostKeyChecking=no heat-admin@$i "sudo yum versionlock del lib* system* ; sudo yum versionlock add systemd-219-42.el7_4.10 systemd-libs-219-42.el7_4.10 libgudev1-219-42.el7_4.10 systemd-sysv-219-42.el7_4.10; sudo yum -y downgrade libgudev1 systemd*"; done
Workaround that allowed me to complete the undercloud upgrade:

sudo yum install -y yum-plugin-versionlock
sudo yum versionlock add systemd systemd-libs libgudev1 systemd-sysv rsyslog
sudo systemctl stop 'openstack-*' 'neutron-*' httpd
sudo yum update python-tripleoclient -y
openstack undercloud upgrade

## wait for the upgrade to fail because of:
2018-05-31 21:24:06 - Error: Could not start Service[docker]: Execution of '/bin/systemctl start docker' returned 1: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
2018-05-31 21:24:06 - Error: /Stage[main]/Tripleo::Profile::Base::Docker_registry/Service[docker]/ensure: change from stopped to running failed: Could not start Service[docker]: Execution of '/bin/systemctl start docker' returned 1: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

## delete the version lock and update the remaining packages
sudo yum versionlock del systemd systemd-libs libgudev1 systemd-sysv rsyslog
sudo yum update -y
reboot

## re-run undercloud upgrade
openstack undercloud upgrade
Instructions for overcloud update:

apply https://review.openstack.org/#/c/571482/

## Before starting the minor update
source ~/stackrc
for address in $(openstack server list -f json | jq -r -c '.[] | .Networks' | grep -oP '[0-9.]+'); do \
ssh -q -o StrictHostKeyChecking=no heat-admin@$address \
'sudo yum install -y yum-plugin-versionlock; \
sudo yum versionlock add systemd systemd-libs libgudev1 systemd-sysv rsyslog;'
done

## Run the overcloud minor update

## After completing the minor update
source ~/stackrc
for address in $(openstack server list -f json | jq -r -c '.[] | .Networks' | grep -oP '[0-9.]+'); do \
ssh -q -o StrictHostKeyChecking=no heat-admin@$address \
'sudo yum versionlock del systemd systemd-libs libgudev1 systemd-sysv rsyslog; sudo yum update -y'
done

## Reboot
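One caveat on the loops above: the pattern '[0-9.]+' matches any run of digits and dots, so a network name that contains a digit would also produce a bogus "address". A stricter variant, as a sketch only, matches dotted quads. The function name `extract_ipv4` is made up, and the `name=address` shape of the Networks field (for example ctlplane=192.0.2.15) is an assumption based on the nova list parsing earlier in this bug.

```shell
#!/usr/bin/env bash
# Sketch only: extract_ipv4 is a hypothetical helper.
# Reads text on stdin and prints every dotted-quad sequence, one per line.
# It does not validate that each octet is <= 255.
extract_ipv4() {
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}'
}

# Example, assuming Networks values look like "ctlplane=192.0.2.15":
#   openstack server list -f json | jq -r '.[] | .Networks' | extract_ipv4
```

Swapping this in for the grep -oP stage leaves the rest of the loop unchanged.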
Note for undercloud upgrade: if the undercloud upgrade fails with:

2018-06-07 10:35:36 - Error: Could not start Service[nova-compute]: Execution of '/bin/systemctl start openstack-nova-compute' returned 1: Job for openstack-nova-compute.service failed because the control process exited with error code. See "systemctl status openstack-nova-compute.service" and "journalctl -xe" for details.
2018-06-07 10:35:36 - Error: /Stage[main]/Nova::Compute/Nova::Generic_service[compute]/Service[nova-compute]/ensure: change from stopped to running failed: Could not start Service[nova-compute]: Execution of '/bin/systemctl start openstack-nova-compute' returned 1: Job for openstack-nova-compute.service failed because the control process exited with error code. See "systemctl status openstack-nova-compute.service" and "journalctl -xe" for details.

re-running 'openstack undercloud upgrade' after the failure allows it to complete.
*** Bug 1557176 has been marked as a duplicate of this bug. ***
This issue was resolved in RHEL 7.6.