Bug 2087007

Summary: OSP17.0 is failing on overcloud deployment when using FDP repo
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Eran Kuris <ekuris>
Assignee: Slawek Kaplonski <skaplons>
QA Contact: Eran Kuris <ekuris>
CC: apevec, chrisw, cjeanner, ctrautma, ihrachys, jiji, jjoyce, lhh, majopela, mmichels, mtomaska, scohen
Keywords: TestOnly, Triaged
Type: Bug
Last Closed: 2022-09-21 12:21:34 UTC
Bug Depends On: 2016183, 2162194

Comment 1 Ihar Hrachyshka 2022-05-17 14:42:30 UTC
I'm not an expert in tripleo deployments, but I see the following errors in the log that suggest the issue is not OVN related; rather, pacemaker failed to start because of user / group management errors. See below:

May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Stage[main]/Pacemaker::Service/Service[pacemaker]/enable: enable changed 'false' to 'true'
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: /Service[pacemaker]: The container Class[Pacemaker::Service] will propagate my refresh event
May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Stage[main]/Pacemaker::Corosync/File_line[pcsd_bind_addr]/ensure: created
May 11 21:07:20 controller-2 puppet-user[18910]: Info: /Stage[main]/Pacemaker::Corosync/File_line[pcsd_bind_addr]: Scheduling refresh of Service[pcsd]
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: /Stage[main]/Pacemaker::Corosync/File_line[pcsd_bind_addr]: The container Class[Pacemaker::Corosync] will propagate my refresh event
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Executing: '[redacted]'
May 11 21:07:20 controller-2 puppet-user[18910]: Error: chpasswd said [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: Error: /Stage[main]/Pacemaker::Corosync/User[hacluster]/password: change from [redacted] to [redacted] failed: chpasswd said [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Executing: '/sbin/usermod -G haclient hacluster'
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]/groups: groups changed  to ['haclient']
May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Service[pcsd]: Dependency User[hacluster] has failures: true
May 11 21:07:20 controller-2 puppet-user[18910]: Warning: /Service[pcsd]: Skipping because of failed dependencies
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: /Service[pcsd]: Resource is being skipped, unscheduling all events
May 11 21:07:20 controller-2 puppet-user[18910]: Info: /Service[pcsd]: Unscheduling all events on Service[pcsd]
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Class[Pacemaker::Service]: Resource is being skipped, unscheduling all events
May 11 21:07:20 controller-2 puppet-user[18910]: Info: Class[Pacemaker::Service]: Unscheduling all events on Class[Pacemaker::Service]
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Exec[check-for-local-authentication](provider=posix): Executing check '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Executing: '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
May 11 21:07:21 controller-2 puppet-user[18910]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't be executed because of failed check 'onlyif'

etc. etc.

I also see the following AVC denials in selinux logs:

type=AVC msg=audit(1652303544.692:7870): avc:  denied  { getattr } for  pid=33759 comm="sss_cache" path="/var/lib/sss/db/config.ldb" dev="dm-2" ino=732 scontext=unconfined_u:unconfined_r:sssd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0
type=AVC msg=audit(1652303544.692:7871): avc:  denied  { read write } for  pid=33759 comm="sss_cache" name="config.ldb" dev="dm-2" ino=732 scontext=unconfined_u:unconfined_r:sssd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0

It seems that because of the SELinux denials on the sssd database, chpasswd failed for the hacluster user, which in turn failed the rest of the dependency tree, including pcsd.
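
For reference, a quick check along these lines (the commands are my suggestion, not taken from the logs) would confirm the mislabeling and the fix:

# Inspect recent AVC denials recorded for sss_cache (ausearch ships with audit)
ausearch -m AVC -ts recent -c sss_cache
# Check the current label; per the AVCs above it is unlabeled_t rather than
# the sssd_var_lib_t context the targeted policy expects
ls -Z /var/lib/sss/db/config.ldb
# Relabeling the tree should restore the expected context
restorecon -Rv /var/lib/sss/db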

Comment 2 Ihar Hrachyshka 2022-05-18 17:37:50 UTC
OK, again, I am not an expert in deployment, but the SELinux denials and the failure of puppet to chpasswd the hacluster user seem to be the root cause. Why this doesn't happen in the non-FDP run is not clear yet.

I wondered if some packages were updated / installed in the FDP run but not in the other. I see the following in dnf.rpm.log on the controllers that failed:

2022-05-11T21:12:24+0000 SUBDEBUG Installed: setroubleshoot-server-3.3.28-3.el9_0.x86_64
2022-05-11T21:12:24+0000 INFO [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains
[sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains
[sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains
[sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains

This happens after the ansible-dnf module is invoked as follows:
May 11 21:10:42 controller-2 python3[26004]: ansible-dnf Invoked with name=['setools', 'setroubleshoot'] state=present allow_downgrade=False autoremove=False bugfix=False disable_gpg_check=False disable_plugin=[] disablerepo=[] download_only=False enable_plugin=[] enablerepo=[] exclude=[] installroot=/ install_repoquery=True install_weak_deps=True security=False skip_broken=False update_cache=False update_only=False validate_certs=True lock_timeout=30 conf_file=None disable_excludes=None download_dir=None list=None releasever=None

The same command is issued in the non-FDP run, but there it doesn't trigger any RPM installations / upgrades (dnf.rpm.log is empty).

Is there any difference between the runs in how setools / setroubleshoot get installed? Perhaps they are pre-installed from older OSP / RHEL repos in the non-FDP run and we never issue a dnf upgrade anywhere to get them bumped? Is there a bug in the setroubleshoot-server / sssd selinux policies?
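
A package-level diff between the two deployments (file names below are just illustrative) would answer the first question:

# On a controller deployed with the FDP repo:
rpm -qa | sort > packages-fdp.txt
# On a controller from the non-FDP deployment:
rpm -qa | sort > packages-nofdp.txt
# Compare the two package lists:
diff packages-fdp.txt packages-nofdp.txt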

Comment 5 Cédric Jeanneret 2022-05-24 13:19:52 UTC
Thanks to Julie, I think I have the right reason:
it really smells like an image edit done without the "--selinux-relabel" option passed to virt-sysprep, virt-customize, or any other libguestfs tool.

The right way to squash this issue is to find where that edit happens (apparently when the FDP repos are injected/copied) and add that missing parameter.
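
For illustration only (the image name and repo file are made up, since we don't know yet where the edit happens), the fixed injection step would look something like:

# Copy the FDP repo file into the image and relabel the filesystem afterwards,
# so the new files don't end up as unlabeled_t:
virt-customize -a overcloud-full.qcow2 \
    --copy-in fdp.repo:/etc/yum.repos.d \
    --selinux-relabel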

Now, please also have a look at this brand-new libguestfs issue that may hit us sooner or later on el9: https://bugzilla.redhat.com/show_bug.cgi?id=2089748

Cheers,

C.

Comment 16 errata-xmlrpc 2022-09-21 12:21:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543