Bug 2087007 - OSP17.0 is failing on overcloud deployment when using FDP repo
Summary: OSP17.0 is failing on overcloud deployment when using FDP repo
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Slawek Kaplonski
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 2016183 2162194
Blocks:
 
Reported: 2022-05-17 06:09 UTC by Eran Kuris
Modified: 2023-01-19 04:33 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:21:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID                                  Last Updated
Red Hat Issue Tracker FD-1968              2022-05-17 06:24:37 UTC
Red Hat Issue Tracker OSP-15539            2022-06-06 07:49:55 UTC
Red Hat Product Errata RHEA-2022:6543      2022-09-21 12:21:57 UTC

Comment 1 Ihar Hrachyshka 2022-05-17 14:42:30 UTC
I'm not an expert in TripleO deployments, but I see the following errors in the log suggesting the issue is not OVN related; rather, pacemaker failed to start because of user / group management errors. See below:

May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Stage[main]/Pacemaker::Service/Service[pacemaker]/enable: enable changed 'false' to 'true'
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: /Service[pacemaker]: The container Class[Pacemaker::Service] will propagate my refresh event
May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Stage[main]/Pacemaker::Corosync/File_line[pcsd_bind_addr]/ensure: created
May 11 21:07:20 controller-2 puppet-user[18910]: Info: /Stage[main]/Pacemaker::Corosync/File_line[pcsd_bind_addr]: Scheduling refresh of Service[pcsd]
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: /Stage[main]/Pacemaker::Corosync/File_line[pcsd_bind_addr]: The container Class[Pacemaker::Corosync] will propagate my refresh event
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Executing: '[redacted]'
May 11 21:07:20 controller-2 puppet-user[18910]: Error: chpasswd said [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: Error: /Stage[main]/Pacemaker::Corosync/User[hacluster]/password: change from [redacted] to [redacted] failed: chpasswd said [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
May 11 21:07:20 controller-2 puppet-user[18910]: Could not open available domains
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Executing: '/sbin/usermod -G haclient hacluster'
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Falling back to Puppet::Etc.group: cannot load such file -- ffi
May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]/groups: groups changed  to ['haclient']
May 11 21:07:20 controller-2 puppet-user[18910]: Notice: /Service[pcsd]: Dependency User[hacluster] has failures: true
May 11 21:07:20 controller-2 puppet-user[18910]: Warning: /Service[pcsd]: Skipping because of failed dependencies
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: /Service[pcsd]: Resource is being skipped, unscheduling all events
May 11 21:07:20 controller-2 puppet-user[18910]: Info: /Service[pcsd]: Unscheduling all events on Service[pcsd]
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Class[Pacemaker::Service]: Resource is being skipped, unscheduling all events
May 11 21:07:20 controller-2 puppet-user[18910]: Info: Class[Pacemaker::Service]: Unscheduling all events on Class[Pacemaker::Service]
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Exec[check-for-local-authentication](provider=posix): Executing check '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
May 11 21:07:20 controller-2 puppet-user[18910]: Debug: Executing: '/sbin/pcs status pcsd controller-2 2>&1 | grep 'Unable to authenticate''
May 11 21:07:21 controller-2 puppet-user[18910]: Debug: /Stage[main]/Pacemaker::Corosync/Exec[check-for-local-authentication]: '/bin/echo 'local pcsd auth failed, triggering a reauthentication'' won't be executed because of failed check 'onlyif'

etc. etc.

I also see the following AVC denials in the SELinux logs:

type=AVC msg=audit(1652303544.692:7870): avc:  denied  { getattr } for  pid=33759 comm="sss_cache" path="/var/lib/sss/db/config.ldb" dev="dm-2" ino=732 scontext=unconfined_u:unconfined_r:sssd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0
type=AVC msg=audit(1652303544.692:7871): avc:  denied  { read write } for  pid=33759 comm="sss_cache" name="config.ldb" dev="dm-2" ino=732 scontext=unconfined_u:unconfined_r:sssd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0

It seems that, because of the SELinux denials on the sss database, chpasswd failed for the hacluster user, which in turn failed the rest of the dependency tree, including pcs.
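
For what it's worth, a quick way to confirm the bad labels on an affected controller would be something like the following (standard SELinux tooling, paths taken from the denials above; I have not run this on the failed deployment):

  # compare the actual label with the expected one for the sssd database
  ls -Z /var/lib/sss/db/config.ldb
  matchpathcon /var/lib/sss/db/config.ldb

  # list the recent denials for sss_cache
  ausearch -m avc -ts recent | grep sss_cache

  # restore the default file contexts so sss_cache can open the db again
  restorecon -Rv /var/lib/sss/db

That would only confirm / work around the mislabeling; it doesn't explain why the files are unlabeled in the first place.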

Comment 2 Ihar Hrachyshka 2022-05-18 17:37:50 UTC
OK, again, I am not an expert in deployment, but the SELinux denials and the failure of puppet to change the password for the hacluster user seem to be the root cause. Why this doesn't happen in the non-FDP run is not clear yet.

I wondered if some packages were updated / installed in the FDP run but not in the other. I see the following in dnf.rpm.log on the controllers that failed:

2022-05-11T21:12:24+0000 SUBDEBUG Installed: setroubleshoot-server-3.3.28-3.el9_0.x86_64
2022-05-11T21:12:24+0000 INFO [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains
[sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains
[sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains
[sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Could not open available domains

This happens after the ansible-dnf module is triggered as follows:
May 11 21:10:42 controller-2 python3[26004]: ansible-dnf Invoked with name=['setools', 'setroubleshoot'] state=present allow_downgrade=False autoremove=False bugfix=False disable_gpg_check=False disable_plugin=[] disablerepo=[] download_only=False enable_plugin=[] enablerepo=[] exclude=[] installroot=/ install_repoquery=True install_weak_deps=True security=False skip_broken=False update_cache=False update_only=False validate_certs=True lock_timeout=30 conf_file=None disable_excludes=None download_dir=None list=None releasever=None
The same command is issued in the non-FDP run but it doesn't trigger any RPM installations / upgrades (dnf.rpm.log is empty).
Is there any difference between the runs in how setools / setroubleshoot get installed? Perhaps they are pre-installed from older OSP / RHEL repos in the non-FDP run and we never issue a dnf upgrade anywhere to get them bumped? Is there a bug in the setroubleshoot-server / sssd SELinux policies?
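
One way to answer that would be to compare the package state on a controller from each run; the queries below are a sketch using standard rpm / dnf commands (the package names are the obvious candidates, not a verified list):

  # versions of the suspect packages on the failed (FDP) controller
  rpm -q setroubleshoot setroubleshoot-server setools sssd-common selinux-policy-targeted

  # what the dnf transaction around 21:12 actually did
  dnf history list setroubleshoot
  dnf history info last

Running the same queries on a controller from the non-FDP run should show whether the packages were pre-installed there or simply never touched.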

Comment 5 Cédric Jeanneret 2022-05-24 13:19:52 UTC
Thanks to Julie, I think I have the right explanation:
it really smells like an image edit done without the "--selinux-relabel" option/parameter passed to virt-sysprep, virt-customize, or anything from libguestfs.

The right way to squash this issue is to find where that edit happens (apparently when the FDP repos are injected / copied) and add the missing parameter; see the sketch below.
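
For illustration only, the repo-injection step should end up looking roughly like this (the image and repo file names are made up; the options are standard virt-customize ones):

  virt-customize -a overcloud-full.qcow2 \
      --copy-in fdp.repo:/etc/yum.repos.d \
      --selinux-relabel

Without the relabel step, files written into the image during customization are left without a proper SELinux label, which matches the unlabeled_t contexts in the AVC denials above. virt-sysprep takes the same option.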

Now, please also have a look at this brand new issue in libguestfs that may hit sooner or later on el9: https://bugzilla.redhat.com/show_bug.cgi?id=2089748

Cheers,

C.

Comment 16 errata-xmlrpc 2022-09-21 12:21:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

