Bug 2248873 - [OSP17.1] After FFU, resize doesn't work because /proc/$PID/loginuid already set to 1001.
Summary: [OSP17.1] After FFU, resize doesn't work because /proc/$PID/loginuid already ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: z3
: 17.1
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-11-09 12:41 UTC by ggrimaux
Modified: 2024-09-20 04:25 UTC (History)
22 users (show)

Fixed In Version: openstack-tripleo-common-15.4.1-17.1.20230927010821.el9ost
Doc Type: Bug Fix
Doc Text:
With some versions of Pluggable Authentication Modules (PAM), the pam_loginuid module `/proc/self/loginuid` must be writable. This is not the case in the `sshd` container used for migrations. Migrations failed because SSH login between Compute hosts was failing. With this update, the `pam_loginuid` module has been removed from the PAM config and, as a result, SSH login between Compute hosts, and migrations, work again.
Clone Of:
Environment:
Last Closed: 2024-05-22 20:39:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 900525 0 None MERGED Containerized sshd does not work with older PAM 2024-02-09 20:12:06 UTC
Red Hat Issue Tracker OSP-30327 0 None None None 2023-11-09 12:41:59 UTC
Red Hat Product Errata RHBA-2024:2741 0 None None None 2024-05-22 20:39:42 UTC

Description ggrimaux 2023-11-09 12:41:42 UTC
Description of problem:
Client recently FFUed from 16.2 to 17.1.1.

When a resize is attempted it fails with:
2023-10-30 14:58:30.594 2 ERROR oslo_messaging.rpc.server [req-09756528-71cc-426b-a66c-e864618ad6c5 0bce38be6baabd3ba0904aa6a4b3610e5505cc6555c02ab24496b4eea980968e 2d72e751f005438a880df556f80e0b18 - $id $id ] Exception during message handling: nova.exception.ResizeError: Resize error: not able to execute ssh command: Unexpected error while running command.
Command: ssh -o BatchMode=yes $ipv6 mkdir -p /var/lib/nova/instances/fd65d0a9-ba9c-4cd2-a206-881cdd4b4652
Exit code: 254

Oct 30 14:58:30 computesl-1 sshd[198900]: Accepted publickey for nova_migration from $ipv6 port 55538 ssh2: RSA SHA256:hpYcvLMW1x7fU9IwdOzVizOGGVKJwOA7rRMyxGsuG5s
Oct 30 14:58:30 computesl-1 sshd[198900]: pam_loginuid(sshd:session): Error writing /proc/self/loginuid: Operation not permitted
Oct 30 14:58:30 computesl-1 sshd[198900]: pam_loginuid(sshd:session): set_loginuid failed
Oct 30 14:58:30 computesl-1 sshd[198900]: pam_unix(sshd:session): session opened for user nova_migration(uid=989) by (uid=0)
Oct 30 14:58:30 computesl-1 sshd[198900]: pam_lastlog(sshd:session): unable to open /var/log/lastlog: No such file or directory
Oct 30 14:58:30 computesl-1 sshd[198900]: error: PAM: pam_open_session(): Cannot make/remove an entry for the specified session
Oct 30 14:58:30 computesl-1 sshd[198902]: Received disconnect from 2607:f160:a:a02c:cd::118 port 55538:11: disconnected by user
Oct 30 14:58:30 computesl-1 sshd[198902]: Disconnected from user nova_migration $ipv6 port 55538
Oct 30 14:58:30 computesl-1 sshd[198904]: Accepted publickey for nova_migration from $ipv6 port 55544 ssh2: RSA SHA256:hpYcvLMW1x7fU9IwdOzVizOGGVKJwOA7rRMyxGsuG5s
Oct 30 14:58:30 computesl-1 sshd[198904]: pam_loginuid(sshd:session): Error writing /proc/self/loginuid: Operation not permitted
Oct 30 14:58:30 computesl-1 sshd[198904]: pam_loginuid(sshd:session): set_loginuid failed
Oct 30 14:58:30 computesl-1 sshd[198904]: pam_unix(sshd:session): session opened for user nova_migration(uid=989) by (uid=0)
Oct 30 14:58:30 computesl-1 sshd[198904]: pam_lastlog(sshd:session): unable to open /var/log/lastlog: No such file or directory
Oct 30 14:58:30 computesl-1 sshd[198904]: error: PAM: pam_open_session(): Cannot make/remove an entry for the specified session
Oct 30 14:58:30 computesl-1 sshd[198906]: Received disconnect from $ipv6 port 55544:11: disconnected by user
Oct 30 14:58:30 computesl-1 sshd[198906]: Disconnected from user nova_migration $ipv6 port 55544

Commenting out pam_loginuid.so inside the containers /etc/pam.d/sshd allows the connection and related commands to work:
....
session    required     pam_loginuid.so   
....


Further troubleshooting done:
- /proc/self/loginuid can be changed only from 4294967295 (which is -1  or  "unset") to something like 1001, but only once. Any subsequent writes will fail, as designed in kernel. Read [1] for details.
- The moment it is set it's not possible to change it to something else. E.g. [2] to see how it happens. It's not a permission error, but a failed write. UNIX permissions, POSIX ACLs, SELinux are fine.
- When loginuid is set we should have an autit log entry see [3] for details.
- We have all processes from inside the container, but also conmon too (the process which manages the container) having "1001" inside /proc/$PID/loginuid on broken compute machines. This is not ok, as ssh will fail to set loginuid to "989". See [2]. This means that the container was started with /proc/$PID/loginuid allready set to 1001. This also means that the user who did the atrocity has the UID 1001 on the compute host (tripleo-ansible). This points to a bug in the upgrade process (either openstack upgrade, OS upgrade or container upgrade).


From .tripleo/history, this task ran and launched those sshd processes:
2023-10-18 00:59:38.484442 overcloud-upgrade-run limit=compute*, playbook=None, static_inventory=None, ssh_user=tripleo-admin, tags=, skip_tags=, stack=$name, yes=True, ansible_forks=None


We need your help to identify why it happens.

Thank you.


[1] https://www.kernel.org/doc/Documentation/ABI/stable/procfs-audit_loginuid
What:   Audit Login UID
Date:   2005-02-01
KernelVersion:  2.6.11-rc2 1e2d1492e178 ("[PATCH] audit: handle loginuid through proc")
Contact:  linux-audit
Users:    audit and login applications
Description:
    The /proc/$pid/loginuid pseudofile is written to set and
    read to get the audit login UID of process $pid as a
    decimal unsigned int (%u, u32).  If it is unset,
    permissions are not needed to set it.  The accessor must
    have CAP_AUDIT_CONTROL in the initial user namespace to
    write it if it has been set.  It cannot be written again
    if AUDIT_FEATURE_LOGINUID_IMMUTABLE is enabled.  It
    cannot be unset if AUDIT_FEATURE_ONLY_UNSET_LOGINUID is
    enabled.

[2] You can not set the loginuid after it was set:
951255 openat(AT_FDCWD</>, "/proc/self/loginuid", O_RDWR|O_NOFOLLOW) = 3</proc/314908/loginuid>   <--- sshd is allowed to open this file for writing, so it's not a UNIX permission issue.
951255 read(3</proc/314908/loginuid>, "1001", 24) = 4     <--- it is already set, not having anymore 4294967295, but 1001, which is the UID of tripleo-ansible on the compute host.
951255 read(3</proc/314908/loginuid>, "", 20) = 0
951255 lseek(3</proc/314908/loginuid>, 0, SEEK_SET) = 0
951255 ftruncate(3</proc/314908/loginuid>, 0) = 0
951255 write(3</proc/314908/loginuid>, "989", 3) = -1 EPERM (Operation not permitted)   <--- expected behaviour - the kernel is refusing to change it again. So it is refusing this value, as the single permitted value is "1001".
951255 getpid()                         = 314908
951255 socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 5<UNIX:[559936290]>
951255 connect(5<UNIX:[559936290]>, {sa_family=AF_UNIX, sun_path="/dev/log"}, 110) = 0
951255 sendto(5<UNIX:[559936290->21800]>, "<83>Nov  6 22:23:30 sshd[314908]: pam_loginuid(sshd:session): Error writing /proc/self/loginuid: Operation not permitted", 120, MSG_NOSIGNAL, NULL, 0) = 120              <---- ssh is refusing to authenticate, as requested by pam configuration, which has the loginuid are "session required"
951255 close(3</proc/314908/loginuid>)  = 0
951255 getpid()                         = 314908
951255 sendto(5<UNIX:[559936290->21800]>, "<83>Nov  6 22:23:30 sshd[314908]: pam_loginuid(sshd:session): set_loginuid failed", 81, MSG_NOSIGNAL, NULL, 0) = 81
951255 futex(0x7ff7f2e1f644, FUTEX_WAKE_PRIVATE, 2147483647) = 0
951255 openat(AT_FDCWD</>, "/proc/thread-self/attr/exec", O_RDONLY|O_CLOEXEC) = 3</proc/314908/task/314908/attr/exec>
951255 read(3</proc/314908/task/314908/attr/exec>, "", 4095) = 0
951255 close(3</proc/314908/task/314908/attr/exec>) = 0


[3] https://github.com/torvalds/linux/blob/master/kernel/audit.c#L2307
static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid,
           unsigned int oldsessionid,
           unsigned int sessionid, int rc)
{
  struct audit_buffer *ab;
  uid_t uid, oldloginuid, loginuid;
  struct tty_struct *tty;

  if (!audit_enabled)
    return;

  ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_LOGIN);
  if (!ab)
    return;

  uid = from_kuid(&init_user_ns, task_uid(current));
  oldloginuid = from_kuid(&init_user_ns, koldloginuid);
  loginuid = from_kuid(&init_user_ns, kloginuid);
  tty = audit_get_tty();

  audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid);
  audit_log_task_context(ab);
  audit_log_format(ab, " old-auid=%u auid=%u tty=%s old-ses=%u ses=%u res=%d",
       oldloginuid, loginuid, tty ? tty_name(tty) : "(none)",
       oldsessionid, sessionid, !rc);
  audit_put_tty(tty);
  audit_log_end(ab);
}

Version-Release number of selected component (if applicable):


How reproducible:
In this environment 100%

Steps to Reproduce:
1. FFU 16.2 -> 17.1.1
2. Try live migration
3.

Actual results:
Resize fails

Expected results:
Resize succeed

Additional info:
We have sosreport and logs.

Comment 1 smooney 2023-11-09 13:11:01 UTC
this would appar to be the same issue as reported here https://bugs.launchpad.net/kolla/+bug/1651395

in 16.1 and older we used to build our image using kolla images as a base although not that spciric image

https://opendev.org/openstack/kolla/commit/aae9193675ed214f7a4455708409532ea1665255

looking at the tcib defs in triplo-common i do not see a similar workaround in place

my guess is in the past this was optional in the base image either due to a rpm we installed our a customisation in the lower level rhel/oepnstack images
and that has change recently.

we could take the same approach as openstack-ansibel and kolla and set this to optional by adding a sed command here
https://github.com/openstack/tripleo-common/blob/stable/wallaby/container-images/tcib/base/os/nova-base/nova-compute/nova-compute.yaml#L9

we use the nova_compute image to run the ssh migration target container and do not have a dedicated nova-ssh contienr as kolla did.

I'm not sure why this is now broken but presumabel either rheldel removed a downstream customiastion, a package update change this to required or 
something else we have inherited for the generic container building workflow.

Comment 2 smooney 2023-11-09 13:13:33 UTC
it looks like they recently fixed this for chrond as part of 2219765

https://github.com/openstack/tripleo-common/commit/dfd9325672f24ed4954b248815eef2776f2605c4 ill comment there but
it looks like they missed fixing this in https://bugzilla.redhat.com/show_bug.cgi?id=2219765

Comment 16 Joanne O'Flynn 2024-05-02 14:56:48 UTC
Updated doc text: 

With some versions of Pluggable Authentication Modules (PAM), the pam_loginuid module must /proc/self/loginuid be writable. This is not the case in our sshd container used for migrations. Migrations failed because SSH login between Compute hosts was failing.With this update, the pam_loginuid module has been removed from the PAM config and, as a result, SSH login between Compute hosts, and migrations, work again.


=====================================================
Previous doc text:

Cause: 

With some version of PAM, the pam_loginuid module /proc/self/loginuid needs to be writable. This is not the case in our sshd container used for migrations.

Consequence:

Migrations would fail because SSH login between compute host was failing.

Fix:

Remove the pam_loginuid module from the PAM config.

Result:

SSH login between compute hosts, and migrations, work again.

Comment 23 errata-xmlrpc 2024-05-22 20:39:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2741

Comment 24 Red Hat Bugzilla 2024-09-20 04:25:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days


Note You need to log in before you can comment on or make changes to this bug.