Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1557176

Summary: [OSP 10] overcloud minor update failed at ControllerDeployment_Step1.1 with RHEL 7.5
Product: Red Hat OpenStack
Reporter: Peng Liu <pliu>
Component: openstack-tripleo
Assignee: Carlos Camacho <ccamacho>
Status: CLOSED DUPLICATE
QA Contact: Arik Chernetsky <achernet>
Severity: medium
Docs Contact:
Priority: medium
Version: 10.0 (Newton)
CC: ccamacho, lbezdick, mburns, pliu, rhel-osp-director-maint
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-08 11:04:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Peng Liu 2018-03-16 06:51:31 UTC
Description of problem:
When running the OSP 10 minor update with RHEL 7.5, the update process hung and then failed on timeout at ControllerDeployment_Step1.1.

Some findings:
1. On controller-0, os-collect-config hangs after this log line:
Mar 14 04:29:04 overcloud-controller-0.localdomain os-collect-config[3018]: [2018-03-14 04:29:04,777] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/puppet < /var/lib/heat-config/deployed/7c929c76-1ed5-42ca-ad8b-ab1e9f92ddf1.json

which executes this puppet command underneath:
puppet apply --detailed-exitcodes --logdest console --modulepath /etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules /var/lib/heat-config/heat-config-puppet/7c929c76-1ed5-42ca-ad8b-ab1e9f92ddf1.pp

2. The puppet log on controller-0 stops at:

Notice: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Tripleo::Pacemaker::Resource_restart_flag[galera-master]/File[/var/lib/tripleo/pacemaker-restarts]/ensure: created
Notice: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Tripleo::Pacemaker::Resource_restart_flag[galera-master]/Exec[galera-master resource restart flag]: Triggered 'refresh' from 1 events

I found that the puppet run hung because of its child process '/sbin/sysctl -a'. I think this could be the root cause of the update failure.
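
One way to confirm which child process is stuck is to list processes in uninterruptible sleep (state D) straight from /proc. The following is a minimal sketch, not part of the original report; the function names parse_stat_line and blocked_processes are hypothetical, and it assumes a Linux /proc layout.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            # The process exited while we were scanning, or the file
            # was unreadable; skip it.
            continue
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling blocked_processes() on an affected controller would be expected to include the stuck sysctl process, assuming it is still blocked at the time of the scan.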

How reproducible:
I tried twice. It happens every time.

Steps to Reproduce:
1. Deploy an OSP 10 overcloud with z3 software
2. Set the repo to RHEL 7.5 with rhos-release
3. Run the minor update.

Actual results:
The update timed out and then failed.

Expected results:
The update succeeds.

Additional info:

Comment 1 Peng Liu 2018-03-16 07:45:34 UTC
Manually executing '/sbin/sysctl -a' on controller-1 and controller-2 hung too, but it works well on the compute nodes. P.S. The controllers are virtual nodes; the computes are bare metal.

[root@overcloud-controller-1 ~]# sysctl -a
abi.vsyscall32 = 1
crypto.fips_enabled = 0
debug.exception-trace = 1
debug.kprobes-optimization = 1
dev.hpet.max-user-freq = 64
dev.mac_hid.mouse_button2_keycode = 97
dev.mac_hid.mouse_button3_keycode = 100
dev.mac_hid.mouse_button_emulation = 0
dev.parport.default.spintime = 500
dev.parport.default.timeslice = 200
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
dev.scsi.logging_level = 0
fs.aio-max-nr = 1048576
fs.aio-nr = 2661 

rpm -qa kernel
kernel-3.10.0-693.el7.x86_64
rpm -qa procps-ng
procps-ng-3.3.10-17.el7.x86_64
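
To repeat this check on several nodes without tying up a terminal, the command can be run under a timeout. The sketch below is only an illustration under stated assumptions: finishes_within is a hypothetical helper name and the timeout value is arbitrary. It deliberately does not wait for the child after killing it, because a process in uninterruptible sleep does not exit on SIGKILL until it wakes up, and waiting for it would hang the checker too.

```python
import subprocess
import sys


def finishes_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, else False."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait for the child afterwards: a task
        # blocked in the kernel (state D) may never be reaped, and
        # waiting on it would block this script as well.
        proc.kill()
        return False
```

For example, finishes_within(["/sbin/sysctl", "-a"], 30) would be expected to return False on the affected controllers and True on the compute nodes, based on the behaviour described in this comment.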

Comment 2 Carlos Camacho 2018-03-19 13:49:59 UTC
Hey Peng, this looks like a system issue; sysctl shouldn't hang.

Can you confirm this deployment is working fine? 

Do you have an SOS report?

Comment 3 Peng Liu 2018-03-19 13:56:52 UTC
I think the deployment worked fine before the minor update. I tried updating to RHEL 7.4 from the same starting point several times, and it worked without any issue.

Sorry, I don't have the sos_report, because collecting it hung too. I think I can reproduce it in my env if you need.

Comment 4 Peng Liu 2018-03-21 07:30:15 UTC
Found something in dmesg about sysctl when it hung.

[23160.311644] INFO: task sysctl:627160 blocked for more than 120 seconds.
[23160.312987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23160.313781] sysctl          D ffff880426cecb00     0 627160      1 0x00000084
[23160.313785]  ffff880262eeba30 0000000000000086 ffff880260ba6dd0 ffff880262eebfd8
[23160.313788]  ffff880262eebfd8 ffff880262eebfd8 ffff880260ba6dd0 ffff880424e2f098
[23160.313790]  ffff880424e2f0a0 7fffffffffffffff ffff880260ba6dd0 ffff880426cecb00
[23160.313792] Call Trace:
[23160.313800]  [<ffffffff8168c4d9>] schedule+0x29/0x70
[23160.313803]  [<ffffffff81689f19>] schedule_timeout+0x239/0x2c0
[23160.313807]  [<ffffffff810b18e6>] ? finish_wait+0x56/0x70
[23160.313810]  [<ffffffff8168a662>] ? mutex_lock+0x12/0x2f
[23160.313813]  [<ffffffff81289e72>] ? autofs4_wait+0x3f2/0x900
[23160.313816]  [<ffffffff8168c8b6>] wait_for_completion+0x116/0x170
[23160.313819]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
[23160.313822]  [<ffffffff8128afeb>] autofs4_expire_wait+0x6b/0x110
[23160.313824]  [<ffffffff812880a2>] do_expire_wait+0x172/0x190
[23160.313826]  [<ffffffff8128829f>] autofs4_d_manage+0x6f/0x170
[23160.313829]  [<ffffffff81209115>] follow_managed+0xb5/0x300
[23160.313830]  [<ffffffff81209a7b>] lookup_fast+0x19b/0x2e0
[23160.313833]  [<ffffffff8120c365>] path_lookupat+0x165/0x7a0
[23160.313836]  [<ffffffff8118f64e>] ? release_pages+0x24e/0x430
[23160.313840]  [<ffffffff811de665>] ? kmem_cache_alloc+0x35/0x1e0
[23160.313842]  [<ffffffff8120f2bf>] ? getname_flags+0x4f/0x1a0
[23160.313844]  [<ffffffff8120c9cb>] filename_lookup+0x2b/0xc0
[23160.313846]  [<ffffffff812103e7>] user_path_at_empty+0x67/0xc0
[23160.313849]  [<ffffffff811b4944>] ? unmap_region+0xf4/0x140
[23160.313851]  [<ffffffff81210451>] user_path_at+0x11/0x20
[23160.313854]  [<ffffffff812038c3>] vfs_fstatat+0x63/0xc0
[23160.313856]  [<ffffffff81203e2e>] SYSC_newstat+0x2e/0x60
[23160.313859]  [<ffffffff8111f476>] ? __audit_syscall_exit+0x1e6/0x280
[23160.313862]  [<ffffffff8120410e>] SyS_newstat+0xe/0x10
[23160.313864]  [<ffffffff816974c9>] system_call_fastpath+0x16/0x1b
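
The trace above shows sysctl blocked in autofs4_expire_wait during a path lookup for a stat call. When comparing several such reports, it can help to extract the task name, PID, and call-trace frames automatically. The following is a rough sketch; parse_hung_task and the two regular expressions are hypothetical and only cover the dmesg layout shown in this comment. Frames prefixed with "?" are skipped, since the kernel marks those as addresses found on the stack that may not be part of the real call chain.

```python
import re

# Matches the "task <name>:<pid> blocked for more than <n> seconds" header.
HUNG_RE = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)
# Matches one call-trace frame; group 1 is set when the frame is marked "?".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x")


def parse_hung_task(dmesg_text):
    """Return task, pid, seconds and reliable frames, or None if no report."""
    header = HUNG_RE.search(dmesg_text)
    if header is None:
        return None
    frames = []
    for match in FRAME_RE.finditer(dmesg_text):
        if match.group(1) is None:  # keep only frames not marked "?"
            frames.append(match.group(2))
    return {
        "task": header.group(1),
        "pid": int(header.group(2)),
        "seconds": int(header.group(3)),
        "frames": frames,
    }
```

Filtering the returned frames for names starting with "autofs4_" gives a quick way to spot whether other hung-task reports block in the same subsystem.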

Comment 5 Lukas Bezdicka 2019-04-08 11:04:51 UTC

*** This bug has been marked as a duplicate of bug 1582338 ***