Description of problem: Client recently FFUed from 16.2 to 17.1.1. When a resize is attempted it fails with: 2023-10-30 14:58:30.594 2 ERROR oslo_messaging.rpc.server [req-09756528-71cc-426b-a66c-e864618ad6c5 0bce38be6baabd3ba0904aa6a4b3610e5505cc6555c02ab24496b4eea980968e 2d72e751f005438a880df556f80e0b18 - $id $id ] Exception during message handling: nova.exception.ResizeError: Resize error: not able to execute ssh command: Unexpected error while running command. Command: ssh -o BatchMode=yes $ipv6 mkdir -p /var/lib/nova/instances/fd65d0a9-ba9c-4cd2-a206-881cdd4b4652 Exit code: 254 Oct 30 14:58:30 computesl-1 sshd[198900]: Accepted publickey for nova_migration from $ipv6 port 55538 ssh2: RSA SHA256:hpYcvLMW1x7fU9IwdOzVizOGGVKJwOA7rRMyxGsuG5s Oct 30 14:58:30 computesl-1 sshd[198900]: pam_loginuid(sshd:session): Error writing /proc/self/loginuid: Operation not permitted Oct 30 14:58:30 computesl-1 sshd[198900]: pam_loginuid(sshd:session): set_loginuid failed Oct 30 14:58:30 computesl-1 sshd[198900]: pam_unix(sshd:session): session opened for user nova_migration(uid=989) by (uid=0) Oct 30 14:58:30 computesl-1 sshd[198900]: pam_lastlog(sshd:session): unable to open /var/log/lastlog: No such file or directory Oct 30 14:58:30 computesl-1 sshd[198900]: error: PAM: pam_open_session(): Cannot make/remove an entry for the specified session Oct 30 14:58:30 computesl-1 sshd[198902]: Received disconnect from 2607:f160:a:a02c:cd::118 port 55538:11: disconnected by user Oct 30 14:58:30 computesl-1 sshd[198902]: Disconnected from user nova_migration $ipv6 port 55538 Oct 30 14:58:30 computesl-1 sshd[198904]: Accepted publickey for nova_migration from $ipv6 port 55544 ssh2: RSA SHA256:hpYcvLMW1x7fU9IwdOzVizOGGVKJwOA7rRMyxGsuG5s Oct 30 14:58:30 computesl-1 sshd[198904]: pam_loginuid(sshd:session): Error writing /proc/self/loginuid: Operation not permitted Oct 30 14:58:30 computesl-1 sshd[198904]: pam_loginuid(sshd:session): set_loginuid failed Oct 30 14:58:30 computesl-1 sshd[198904]: pam_unix(sshd:session): session opened for user nova_migration(uid=989) by (uid=0) Oct 30 14:58:30 computesl-1 sshd[198904]: pam_lastlog(sshd:session): unable to open /var/log/lastlog: No such file or directory Oct 30 14:58:30 computesl-1 sshd[198904]: error: PAM: pam_open_session(): Cannot make/remove an entry for the specified session Oct 30 14:58:30 computesl-1 sshd[198906]: Received disconnect from $ipv6 port 55544:11: disconnected by user Oct 30 14:58:30 computesl-1 sshd[198906]: Disconnected from user nova_migration $ipv6 port 55544 Commenting out pam_loginuid.so inside the containers /etc/pam.d/sshd allows the connection and related commands to work: .... session required pam_loginuid.so .... Further troubleshooting done: - /proc/self/loginuid can be changed only from 4294967295 (which is -1 or "unset") to something like 1001, but only once. Any subsequent writes will fail, as designed in kernel. Read [1] for details. - The moment it is set it's not possible to change it to something else. E.g. [2] to see how it happens. It's not a permission error, but a failed write. UNIX permissions, POSIX ACLs, SELinux are fine. - When loginuid is set we should have an autit log entry see [3] for details. - We have all processes from inside the container, but also conmon too (the process which manages the container) having "1001" inside /proc/$PID/loginuid on broken compute machines. This is not ok, as ssh will fail to set loginuid to "989". See [2]. This means that the container was started with /proc/$PID/loginuid allready set to 1001. This also means that the user who did the atrocity has the UID 1001 on the compute host (tripleo-ansible). This points to a bug in the upgrade process (either openstack upgrade, OS upgrade or container upgrade). From .tripleo/history, this task ran and launched those sshd processes: 2023-10-18 00:59:38.484442 overcloud-upgrade-run limit=compute*, playbook=None, static_inventory=None, ssh_user=tripleo-admin, tags=, skip_tags=, stack=$name, yes=True, ansible_forks=None We need your help to identify why it happens. Thank you. [1] https://www.kernel.org/doc/Documentation/ABI/stable/procfs-audit_loginuid What: Audit Login UID Date: 2005-02-01 KernelVersion: 2.6.11-rc2 1e2d1492e178 ("[PATCH] audit: handle loginuid through proc") Contact: linux-audit Users: audit and login applications Description: The /proc/$pid/loginuid pseudofile is written to set and read to get the audit login UID of process $pid as a decimal unsigned int (%u, u32). If it is unset, permissions are not needed to set it. The accessor must have CAP_AUDIT_CONTROL in the initial user namespace to write it if it has been set. It cannot be written again if AUDIT_FEATURE_LOGINUID_IMMUTABLE is enabled. It cannot be unset if AUDIT_FEATURE_ONLY_UNSET_LOGINUID is enabled. [2] You can not set the loginuid after it was set: 951255 openat(AT_FDCWD</>, "/proc/self/loginuid", O_RDWR|O_NOFOLLOW) = 3</proc/314908/loginuid> <--- sshd is allowed to open this file for writing, so it's not a UNIX permission issue. 951255 read(3</proc/314908/loginuid>, "1001", 24) = 4 <--- it is already set, not having anymore 4294967295, but 1001, which is the UID of tripleo-ansible on the compute host. 951255 read(3</proc/314908/loginuid>, "", 20) = 0 951255 lseek(3</proc/314908/loginuid>, 0, SEEK_SET) = 0 951255 ftruncate(3</proc/314908/loginuid>, 0) = 0 951255 write(3</proc/314908/loginuid>, "989", 3) = -1 EPERM (Operation not permitted) <--- expected behaviour - the kernel is refusing to change it again. So it is refusing this value, as the single permitted value is "1001". 951255 getpid() = 314908 951255 socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 5<UNIX:[559936290]> 951255 connect(5<UNIX:[559936290]>, {sa_family=AF_UNIX, sun_path="/dev/log"}, 110) = 0 951255 sendto(5<UNIX:[559936290->21800]>, "<83>Nov 6 22:23:30 sshd[314908]: pam_loginuid(sshd:session): Error writing /proc/self/loginuid: Operation not permitted", 120, MSG_NOSIGNAL, NULL, 0) = 120 <---- ssh is refusing to authenticate, as requested by pam configuration, which has the loginuid are "session required" 951255 close(3</proc/314908/loginuid>) = 0 951255 getpid() = 314908 951255 sendto(5<UNIX:[559936290->21800]>, "<83>Nov 6 22:23:30 sshd[314908]: pam_loginuid(sshd:session): set_loginuid failed", 81, MSG_NOSIGNAL, NULL, 0) = 81 951255 futex(0x7ff7f2e1f644, FUTEX_WAKE_PRIVATE, 2147483647) = 0 951255 openat(AT_FDCWD</>, "/proc/thread-self/attr/exec", O_RDONLY|O_CLOEXEC) = 3</proc/314908/task/314908/attr/exec> 951255 read(3</proc/314908/task/314908/attr/exec>, "", 4095) = 0 951255 close(3</proc/314908/task/314908/attr/exec>) = 0 [3] https://github.com/torvalds/linux/blob/master/kernel/audit.c#L2307 static void audit_log_set_loginuid(kuid_t koldloginuid, kuid_t kloginuid, unsigned int oldsessionid, unsigned int sessionid, int rc) { struct audit_buffer *ab; uid_t uid, oldloginuid, loginuid; struct tty_struct *tty; if (!audit_enabled) return; ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_LOGIN); if (!ab) return; uid = from_kuid(&init_user_ns, task_uid(current)); oldloginuid = from_kuid(&init_user_ns, koldloginuid); loginuid = from_kuid(&init_user_ns, kloginuid); tty = audit_get_tty(); audit_log_format(ab, "pid=%d uid=%u", task_tgid_nr(current), uid); audit_log_task_context(ab); audit_log_format(ab, " old-auid=%u auid=%u tty=%s old-ses=%u ses=%u res=%d", oldloginuid, loginuid, tty ? tty_name(tty) : "(none)", oldsessionid, sessionid, !rc); audit_put_tty(tty); audit_log_end(ab); } Version-Release number of selected component (if applicable): How reproducible: In this environment 100% Steps to Reproduce: 1. FFU 16.2 -> 17.1.1 2. Try live migration 3. Actual results: Resize fails Expected results: Resize succeed Additional info: We have sosreport and logs.
this would appar to be the same issue as reported here https://bugs.launchpad.net/kolla/+bug/1651395 in 16.1 and older we used to build our image using kolla images as a base although not that spciric image https://opendev.org/openstack/kolla/commit/aae9193675ed214f7a4455708409532ea1665255 looking at the tcib defs in triplo-common i do not see a similar workaround in place my guess is in the past this was optional in the base image either due to a rpm we installed our a customisation in the lower level rhel/oepnstack images and that has change recently. we could take the same approach as openstack-ansibel and kolla and set this to optional by adding a sed command here https://github.com/openstack/tripleo-common/blob/stable/wallaby/container-images/tcib/base/os/nova-base/nova-compute/nova-compute.yaml#L9 we use the nova_compute image to run the ssh migration target container and do not have a dedicated nova-ssh contienr as kolla did. I'm not sure why this is now broken but presumabel either rheldel removed a downstream customiastion, a package update change this to required or something else we have inherited for the generic container building workflow.
it looks like they recently fixed this for chrond as part of 2219765 https://github.com/openstack/tripleo-common/commit/dfd9325672f24ed4954b248815eef2776f2605c4 ill comment there but it looks like they missed fixing this in https://bugzilla.redhat.com/show_bug.cgi?id=2219765
Updated doc text: With some versions of Pluggable Authentication Modules (PAM), the pam_loginuid module must /proc/self/loginuid be writable. This is not the case in our sshd container used for migrations. Migrations failed because SSH login between Compute hosts was failing.With this update, the pam_loginuid module has been removed from the PAM config and, as a result, SSH login between Compute hosts, and migrations, work again. ===================================================== Previous doc text: Cause: With some version of PAM, the pam_loginuid module /proc/self/loginuid needs to be writable. This is not the case in our sshd container used for migrations. Consequence: Migrations would fail because SSH login between compute host was failing. Fix: Remove the pam_loginuid module from the PAM config. Result: SSH login between compute hosts, and migrations, work again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 17.1.3 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:2741
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days