Bug 1463593
Summary: [ESXi][RHEL7.5] unknown delay on pre-freeze script via vm-tool

Product: Red Hat Enterprise Linux 7
Component: selinux-policy
Version: 7.3
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Jaroslav Spanko <jspanko>
Assignee: Lukas Vrabec <lvrabec>
QA Contact: Milos Malik <mmalik>
CC: ailan, apanagio, boyang, cavery, cww, dapospis, hhei, jsavanyo, jspanko, ldu, leiwang, lvrabec, mgrepl, mmalik, plautrba, rainer.traut, ravindrakumar, ribarry, rjones, rmetrich, rpalco, ssekidde, yacao, ybhasin
Target Milestone: rc
Target Release: 7.6
Hardware: Unspecified
OS: Unspecified
Type: Bug
Bug Blocks: 1477664
Last Closed: 2018-10-30 10:00:43 UTC
Description
Jaroslav Spanko
2017-06-21 09:30:22 UTC
These are the delays mentioned:

[Jun 20 09:57:59.808] [ debug] [vmbackup] *** VmBackupStartScripts
[Jun 20 09:57:59.808] [ debug] [vmbackup] Trying to run scripts from /etc/vmware-tools/backupScripts.d
[Jun 20 09:57:59.808] [ debug] [vmbackup] Running script: "/usr/sbin/pre-freeze-script" freeze
[Jun 20 09:57:59.808] [ debug] [vmsvc] Executing async command: '"/usr/sbin/pre-freeze-script" freeze' in working dir ''
[Jun 20 09:57:59.811] [ debug] [vmsvc] VMTools_ConfigGetInteger: Returning default value for '[vmbackup] timeout'=900.
[Jun 20 10:01:30.998] [ debug] [vmbackup] *** VmBackupAsyncCallback
[Jun 20 10:01:30.998] [ debug] [vmbackup] *** VmBackupPostProcessCurrentOp
[Jun 20 10:01:30.998] [ debug] [vmbackup] VmBackupPostProcessCurrentOp: checking VmBackupOnFreeze
[Jun 20 10:01:30.998] [ debug] [vmsvc] Child w/ fd 8 exited with code=0
[Jun 20 10:01:30.998] [ debug] [vmsvc] Waiting on pid 9367 to de-zombify it
[Jun 20 10:01:30.998] [ debug] [vmbackup] Async request 'VmBackupOnFreeze' completed
[Jun 20 10:01:30.998] [ debug] [vmbackup] *** VmBackupEnableSyncWait

Is there a possibility to debug this further to investigate what causes the delays? Thanks.

If there's nothing sensitive in the debug log, please make it public so that Ravindra can see it.

Ack, done.

(In reply to Jaroslav Spanko from comment #0)
> The "pre-freeze" and "post-thaw" script take about 5 minutes each to
> complete. When you run the scripts manually, they finish in 20 seconds.

The time taken by the scripts will also depend on the activity going on in the DB at that time. Have you confirmed that the script executes within 20 seconds around the time of failure? And you need to try it without attempting a quiesced snapshot, because that will alter the system state.

> On RHEL6 with the native vmware-tools (OSP), the scripts only take about
> 20/30 seconds.

Could you please confirm the VMware Tools versions tried by the customer? For the same version the code behavior should be the same.
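For reference, the 900-second value in the 'VMTools_ConfigGetInteger' line above is the vmbackup script timeout, which can be overridden in the tools configuration file. A minimal sketch follows; the section name and key come from that log line, while the file path is the usual open-vm-tools location and should be treated as an assumption for this install:

```ini
; /etc/vmware-tools/tools.conf  (usual open-vm-tools location; assumption)
[vmbackup]
; Seconds vmtoolsd waits for pre-freeze/post-thaw scripts before giving
; up on the quiesced snapshot. The debug log above shows the default, 900.
timeout = 900
```

Note that raising this value only avoids the abort; the snapshot still stalls for however long the scripts actually run.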
> Version-Release number of selected component (if applicable):
> Tested with
> open-vm-tools-10.0.5-4.el7_3.x86_64
> open-vm-tools-10.1.5-3.el7.x86_64.rpm

Is it reproducible with open-vm-tools 10.0.5?

> Actual results:
> There are 4/5 minutes time-outs in compare of manually running

...
[Jun 20 09:57:59.808] [ debug] [vmbackup] *** VmBackupStartScripts
[Jun 20 09:57:59.808] [ debug] [vmbackup] Trying to run scripts from /etc/vmware-tools/backupScripts.d
[Jun 20 09:57:59.808] [ debug] [vmbackup] Running script: "/usr/sbin/pre-freeze-script" freeze
[Jun 20 09:57:59.808] [ debug] [vmsvc] Executing async command: '"/usr/sbin/pre-freeze-script" freeze' in working dir ''
...
[Jun 20 10:01:30.998] [ debug] [vmbackup] VmBackupPostProcessCurrentOp: checking VmBackupOnFreeze
[Jun 20 10:01:30.998] [ debug] [vmsvc] Child w/ fd 8 exited with code=0
[Jun 20 10:01:30.998] [ debug] [vmsvc] Waiting on pid 9367 to de-zombify it
[Jun 20 10:01:30.998] [ debug] [vmbackup] Async request 'VmBackupOnFreeze' completed
...

The script above took close to 3 min 30 sec.

...
[Jun 20 10:01:34.026] [ debug] [vmbackup] *** VmBackupStartScripts
[Jun 20 10:01:34.026] [ debug] [vmbackup] Trying to run scripts from /etc/vmware-tools/backupScripts.d
[Jun 20 10:01:34.026] [ debug] [vmbackup] Running script: "/usr/sbin/post-thaw-script" thaw
[Jun 20 10:01:34.026] [ debug] [vmsvc] Executing async command: '"/usr/sbin/post-thaw-script" thaw' in working dir ''
...
[Jun 20 10:06:14.326] [ debug] [vmbackup] VmBackupPostProcessCurrentOp: checking VmBackupOnThaw
[Jun 20 10:06:14.326] [ debug] [vmsvc] Child w/ fd 8 exited with code=0
[Jun 20 10:06:14.326] [ debug] [vmsvc] Waiting on pid 12266 to de-zombify it
[Jun 20 10:06:14.326] [ debug] [vmbackup] Async request 'VmBackupOnThaw' completed
...

The script above took close to 4 min 40 sec. That totals to more than 8 minutes of script execution time. It is odd that the script execution took that long; the best guess I have is DB activity at the time of the failure.
> Additional info:
> In attachments debug logs

I don't see the version in the logs; could you please confirm the open-vm-tools version used for collecting the debug logs attached to the bug?

Hi Ravindra, the attached logs are from version open-vm-tools-10.1.5-3. The customer had originally installed open-vm-tools-10.0.5-4; we requested the newest version for testing to see if it helps with the timeouts. The problem is reproducible on both versions. Sorry for the incomplete answers earlier.

The problem here is that the same script takes much more time when it is invoked via /usr/local/etc/snapshot.d/pre than when it is run manually. Timestamps from the automatic/manual runs (edited):

## 40-pre-freeze-script.log
START xxxxxxx 02/06/2017 14:46:21 xxxxxx
............................................ [OK] xxxxxxx 02/06/2017 14:47:40 xxxxxx
........................ [OK] xxxxxxx 02/06/2017 14:48:33 xxxxxx
............................................ [OK] xxxxxxx 02/06/2017 14:48:59 xxxxxx
........................ [OK] xxxxxxx 02/06/2017 14:49:52 xxxxxx
......................... [OK] END xxxxxxx 02/06/2017 14:50:44 xxxxxx

## 80-manual-pre-freeze-script.log
START xxxxxxx 02/06/2017 15:07:39 xxxxxx
............................................ [OK] xxxxxxx 02/06/2017 15:07:40 xxxxxx
......................... [OK] xxxxxxx 02/06/2017 15:07:42 xxxxxx
........................ [OK] xxxxxxx 02/06/2017 15:07:45 xxxxxx
............................................ [OK] xxxxxxx 02/06/2017 15:07:46 xxxxxx
........................ [OK] xxxxxxx 02/06/2017 15:07:48 xxxxxx
......................... [OK] END xxxxxxx 02/06/2017 15:07:50 xxxxxx

If you need more, please let me know. Thanks.

Thanks Jaroslav. I don't think it is specific to open-vm-tools. I also noticed that the customer compared VMware Tools on RHEL 6 with RHEL 7; RHEL 6 to RHEL 7 is a big variation in itself. Do we have access to the scripts? I would also like to know how consistently reproducible it is.
If it is reproducible enough, I'd like to run each step in the script under strace, e.g. "strace -ttt -T -o <logfile> <command to be run>". Then we can collect all the <logfile>s generated to see where the script has spent most of its time. I assume you have enough experience with 'strace'; if you have any questions, please let me know.

Created attachment 1292872 [details]
strace post
Created attachment 1292873 [details]
strace pre
Hi Ravindra, sorry for the delay; I uploaded everything we have. Thank you!

Created attachment 1292874 [details]
pre script
Created attachment 1292876 [details]
post script
Thanks Jaroslav for the data collected.

From the pre-script strace (https://bugzilla.redhat.com/attachment.cgi?id=1292873):

...
1498637878.056838 rt_sigaction(SIGINT, {0x41a760, [], SA_RESTORER|SA_INTERRUPT, 0x7f62ff067250}, {SIG_DFL, [], 0}, 8) = 0 <0.000004>
1498637878.056860 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 4901 <263.327879>
1498638141.384792 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4901, si_status=0, si_utime=1, si_stime=1} ---
...

From the post-script strace (https://bugzilla.redhat.com/attachment.cgi?id=1292872):

...
1498638145.231333 rt_sigaction(SIGINT, {0x41a760, [], SA_RESTORER|SA_INTERRUPT, 0x7f58ba6eb250}, {SIG_DFL, [], 0}, 8) = 0 <0.000004>
1498638145.231353 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WSTOPPED|WCONTINUED, NULL) = 8317 <262.912505>
1498638408.143921 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8317, si_status=0, si_utime=0, si_stime=1} ---
...

In both cases, most of the time went into wait4(). To capture what the child process was doing we will unfortunately have to collect the strace output again, this time with the '-f' option to follow child processes, i.e. "strace -f -ttt -T -o <logfile> <command to be run>".

I also looked at the pre and post scripts. They call other scripts from the /usr/local/etc/snapshot.d/pre/ and /usr/local/etc/snapshot.d/post/ directories, respectively. For example, the scripts that are stuck are "/usr/local/etc/snapshot.d/pre/200-pre_snapshot_db2_suspend_databases.ksh" and "/usr/local/etc/snapshot.d/post/200-post_snapshot_db2_resume_databases.ksh". We can add 'strace' to these scripts only, and also get a copy of the scripts if possible.
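Rather than eyeballing the collected logs by hand, the per-call durations that '-T' appends can be mined mechanically. A small sketch (not part of the bug's tooling; it assumes the "strace -f -ttt -T" format requested above, where each line optionally starts with a PID and ends with the call duration in angle brackets):

```python
import re

# strace -f -ttt -T lines look like:
#   "3201 1498822184.789385 ppoll([...], 1, {24, ...}, NULL, 8) = 0 (Timeout) <25.000833>"
# Optional PID, fractional timestamp, syscall name, then the duration in <...>.
DURATION = re.compile(r"^(?:(\d+)\s+)?([\d.]+)\s+(\w+)\(.*<([\d.]+)>\s*$")

def slow_syscalls(lines, threshold=1.0):
    """Return (pid, timestamp, syscall, seconds) tuples for calls that took
    at least `threshold` seconds, slowest first."""
    hits = []
    for line in lines:
        m = DURATION.match(line.strip())
        if m:
            pid, ts, name, dur = m.groups()
            if float(dur) >= threshold:
                hits.append((pid, ts, name, float(dur)))
    return sorted(hits, key=lambda h: -h[3])
```

Run over a log such as '200-pre_snapshot_db2_suspend_databases.ksh (1).log', this would surface the long wait4() and the repeated 25-second ppoll() timeouts directly, without reading the whole trace.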
Jaroslav, I am a QE engineer testing VMware-related drivers, and our team has a test environment for the ESXi platform. I tried to reproduce this issue in our test environment, but the pre and post scripts cannot run successfully; as Ravindra said, they call other scripts. Could you help ask the customer to provide those scripts? Thanks, ldu

Hi Ravindra, ldu, I uploaded the strace output from the original scripts and also from the called ones. For now I only have access to the pre script; I uploaded it as well. Thanks a lot!

Created attachment 1293222 [details]
all strace logs, called from script and direct also
Created attachment 1293224 [details]
pre script
(In reply to Robert Palco from comment #21)
> Hi Ravindra, Idu,
>
> I am an engineer working on the case together with Jaroslav.
> Could I get some update on the current status of the bugzilla?
> I would like to update the customer.
>
> Thank you in advance.
>
> Regards,
> Robert

Hi Robert, we have not been able to reproduce this issue in our test environment: the pre-freeze-script and post-thaw-script cannot run successfully because we don't have the scripts they call. Could you ask the customer to provide all of the scripts? Also, I am not sure of the exact environment the customer uses; we don't have an IBM database. Do these scripts run in the guest OS? Could you help ask the customer to provide more detailed information? Thanks very much! ldu

Hi Ravindra Kumar, could you download the attachments in comments 19 and 20? Have you analyzed the strace logs? Could you give some update from your side? Thanks, ldu

(In reply to ldu from comment #22)
> Hi Ravindra Kumar,
>
> Could you download the attachment in comments 19 and 20, have you analyzed
> the strace log?
> could you give some update form your side?

Thanks @ldu. I have been busy with some other work, so it will take me some time to get to this (maybe next week). If you could help look at the 'strace' logs in the meantime, that would be great. As there seems to be something slow in the customer scripts, please help go through the 'strace' logs to find out which syscall is taking long. This is a generic 'strace' log, nothing specific to VMware as such. We just need to find the slow syscall and then look into what is making it slow. Thanks for the help.

Never mind @ldu, I just went through the 'strace' logs myself. Thanks Jaroslav for collecting them. I went through the two logs '200-pre_snapshot_db2_suspend_databases.ksh (1).log' and '200-post_snapshot_db2_resume_databases.ksh (1).log'. Both have the same pattern. Below is the pattern I captured from '200-pre_snapshot_db2_suspend_databases.ksh (1).log':

...
3201 1498822184.786250 socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 8 <0.000019> 3201 1498822184.786279 setsockopt(8, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0 <0.000004> 3201 1498822184.786293 setsockopt(8, SOL_SOCKET, 0x22 /* SO_??? */, [0], 4) = 0 <0.000003> 3201 1498822184.786305 getsockopt(8, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0 <0.000003> 3201 1498822184.786317 setsockopt(8, SOL_SOCKET, 0x21 /* SO_??? */, [8388608], 4) = 0 <0.000004> 3201 1498822184.786329 getsockopt(8, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0 <0.000003> 3201 1498822184.786341 setsockopt(8, SOL_SOCKET, 0x20 /* SO_??? */, [8388608], 4) = 0 <0.000004> 3201 1498822184.786353 connect(8, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0 <0.000590> 3201 1498822184.786966 getsockopt(8, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0 <0.000004> 3201 1498822184.786982 getsockopt(8, SOL_SOCKET, SO_PEERSEC, "system_u:system_r:system_dbusd_t"..., [36]) = 0 <0.000005> 3201 1498822184.786999 fstat(8, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 <0.000004> 3201 1498822184.787013 getsockopt(8, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0 <0.000004> 3201 1498822184.787025 getsockname(8, {sa_family=AF_LOCAL, NULL}, [2]) = 0 <0.000004> 3201 1498822184.787038 geteuid() = 0 <0.000003> 3201 1498822184.787052 sendmsg(8, {msg_name(0)=NULL, msg_iov(3)=[{"\0AUTH EXTERNAL ", 15}, {"30", 2}, {"\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 28}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 45 <0.000049> 3201 1498822184.787135 gettid() = 3201 <0.000003> ... 
3201 1498822184.787242 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"OK 53cbadaeed359824c3f7568859254"..., 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 52 <0.000006> 3201 1498822184.787272 sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/fre"..., 128}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 128 <0.001944> 3201 1498822184.789244 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\16\0\0\0\1\0\0\0E\0\0\0\6\1s\0\t\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 <0.000005> 3201 1498822184.789266 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{":1.173285\0\0\0\0\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1s\0\0"..., 78}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 78 <0.000005> 3201 1498822184.789299 sendmsg(8, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1x\0\0\0\2\0\0\0\230\0\0\0\1\1o\0\27\0\0\0/org/fre"..., 168}, {"\370\351\0\0\201\f\0\0\4\0\0\0su-l\0\0\0\0\v\0\0\0unspecif"..., 120}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 288 <0.000007> 3201 1498822184.789321 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1\16\0\0\0\2\0\0\0\225\0\0\0\1\1o\0\25\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 <0.000004> 3201 1498822184.789336 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0"..., 158}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 158 <0.000004> 3201 1498822184.789367 recvmsg(8, 0x7ffc6d3a3660, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) <0.000004> 3201 1498822184.789385 ppoll([{fd=8, events=POLLIN}], 1, {24, 999938000}, NULL, 8) = 0 (Timeout) <25.000833> 3201 1498822209.790312 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 9 
<0.000018> ... 4663 1498822210.027858 close(5 <unfinished ...> 4583 1498822210.027864 nanosleep({1, 0}, <unfinished ...> 4663 1498822210.027874 <... close resumed> ) = -1 EBADF (Bad file descriptor) <0.000012> 4663 1498822210.027881 close(6) = -1 EBADF (Bad file descriptor) <0.000003> 4663 1498822210.027891 close(7) = -1 EBADF (Bad file descriptor) <0.000003> 4663 1498822210.027902 close(8) = -1 EBADF (Bad file descriptor) <0.000003> 4663 1498822210.027912 close(9) = -1 EBADF (Bad file descriptor) <0.000003> ... 4663 1498822210.728899 close(65530) = -1 EBADF (Bad file descriptor) <0.000003> 4663 1498822210.728910 close(65531) = -1 EBADF (Bad file descriptor) <0.000003> 4663 1498822210.728920 close(65532) = -1 EBADF (Bad file descriptor) <0.000004> 4663 1498822210.728930 close(65533) = -1 EBADF (Bad file descriptor) <0.000003> 4663 1498822210.728943 close(0) = -1 EBADF (Bad file descriptor) <0.000003> ... The script is waiting for polling a socket for 25 seconds and then closes all possible file descriptors. The exact same pattern gets repeated 10 times. ... 4670 1498822211.113326 socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 8 <0.000019> ... 4670 1498822211.113443 connect(8, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0 <0.000505> ... 4670 1498822211.116479 recvmsg(8, 0x7ffebd17f980, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) <0.000004> 4670 1498822211.116497 ppoll([{fd=8, events=POLLIN}], 1, {24, 999948000}, NULL, 8 <unfinished ...> 4663 1498822216.064901 <... msgrcv resumed> 0x7fc20a742200, 8183, 0, MSG_NOERROR) = ? ERESTARTNOHAND (To be restarted if no handler) <5.000037> ... 4670 1498822236.116492 <... ppoll resumed> ) = 0 (Timeout) <24.999981> ... 4860 1498822237.660838 socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 8 <0.000018> ... 4860 1498822237.660938 connect(8, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0 <0.000596> ... 
4860 1498822237.663148 recvmsg(8, 0x7ffccbb14100, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) <0.000006> 4860 1498822237.663170 ppoll([{fd=8, events=POLLIN}], 1, {24, 999856000}, NULL, 8 <unfinished ...> 4805 1498822242.405760 <... msgrcv resumed> 0x7f62d64c85a0, 8183, 0, MSG_NOERROR) = ? ERESTARTNOHAND (To be restarted if no handler) <5.000027> ... 4805 1498822242.408728 munmap(0x7f62d64dc000, 327680) = 0 <0.000009> 4805 1498822242.408873 exit_group(0) = ? 4805 1498822242.409853 +++ exited with 0 +++ 4860 1498822262.663888 <... ppoll resumed> ) = 0 (Timeout) <25.000705> ... <Following close pattern is repeated thousands of times in every instance of timeout> 5074 1498822262.991749 close(9270) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991760 close(9271) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991770 close(9272) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991781 close(9273) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991791 close(9274) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991802 close(9275) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991812 close(9276) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991823 close(9277) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991833 close(9278) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991844 close(9279) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991854 close(9280) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991865 close(9281) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991875 close(9282) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991886 close(9283) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991896 close(9284) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991906 close(9285) = -1 EBADF (Bad file descriptor) <0.000003> 5074 
1498822262.991917 close(9286) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991927 close(9287) = -1 EBADF (Bad file descriptor) <0.000004> 5074 1498822262.991938 close(9288) = -1 EBADF (Bad file descriptor) <0.000003> 5074 1498822262.991948 close(9289) = -1 EBADF (Bad file descriptor) <0.000003> ... 5087 1498822264.042078 socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 8 <0.000019> ... 5087 1498822264.042188 connect(8, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0 <0.000568> ... 5087 1498822264.044917 recvmsg(8, 0x7ffe0b5475f0, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) <0.000004> 5087 1498822264.044933 ppoll([{fd=8, events=POLLIN}], 1, {24, 999952000}, NULL, 8 <unfinished ...> 5074 1498822268.991040 <... msgrcv resumed> 0x7f8cbe02b5a0, 8183, 0, MSG_NOERROR) = ? ERESTARTNOHAND (To be restarted if no handler) <5.000023> ... 5074 1498822268.993568 munmap(0x7f8cbe03f000, 327680) = 0 <0.000017> 5074 1498822268.993728 exit_group(0) = ? 5074 1498822268.994623 +++ exited with 0 +++ 5087 1498822289.045681 <... ppoll resumed> ) = 0 (Timeout) <25.000736> 5087 1498822289.045748 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 9 <0.000016> ... <Closing all possible descriptors> 5216 1498822289.977641 close(65531) = -1 EBADF (Bad file descriptor) <0.000003> 5216 1498822289.977652 close(65532) = -1 EBADF (Bad file descriptor) <0.000003> 5216 1498822289.977662 close(65533) = -1 EBADF (Bad file descriptor) <0.000003> 5216 1498822289.977675 close(0) = -1 EBADF (Bad file descriptor) <0.000003> ... 5216 1498822295.331271 munmap(0x7fc365483000, 327680) = 0 <0.000011> 5216 1498822295.331410 exit_group(0) = ? 5216 1498822295.332253 +++ exited with 0 +++ 5221 1498822315.592161 <... ppoll resumed> ) = 0 (Timeout) <24.999989> ... 5344 1498822321.915591 munmap(0x7fae4b980000, 327680) = 0 <0.000014> 5344 1498822321.915744 exit_group(0) = ? 
5344 1498822321.916458 +++ exited with 0 +++ 5357 1498822341.972559 <... ppoll resumed> ) = 0 (Timeout) <25.000694> ... 5563 1498822348.242026 munmap(0x7f7c18283000, 327680) = 0 <0.000012> 5563 1498822348.242166 exit_group(0) = ? 5563 1498822348.242991 +++ exited with 0 +++ 5570 1498822368.568693 <... ppoll resumed> ) = 0 (Timeout) <25.000791> ... 5717 1498822374.864786 munmap(0x7fe63376b000, 327680) = 0 <0.000011> 5717 1498822374.864927 exit_group(0) = ? 5717 1498822374.865769 +++ exited with 0 +++ 5722 1498822394.899681 <... ppoll resumed> ) = 0 (Timeout) <25.000005> ... 5911 1498822401.231353 munmap(0x7fc1a2754000, 327680) = 0 <0.000012> 5911 1498822401.231493 exit_group(0) = ? 5911 1498822401.232336 +++ exited with 0 +++ 5922 1498822421.501852 <... ppoll resumed> ) = 0 (Timeout) <24.999995> ... 6065 1498822422.809733 socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 8 <0.000018> 6065 1498822422.809762 setsockopt(8, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0 <0.000005> 6065 1498822422.809775 setsockopt(8, SOL_SOCKET, 0x22 /* SO_??? */, [0], 4) = 0 <0.000003> 6065 1498822422.809787 getsockopt(8, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0 <0.000004> 6065 1498822422.809799 setsockopt(8, SOL_SOCKET, 0x21 /* SO_??? */, [8388608], 4) = 0 <0.000004> 6065 1498822422.809810 getsockopt(8, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0 <0.000004> 6065 1498822422.809821 setsockopt(8, SOL_SOCKET, 0x20 /* SO_??? 
*/, [8388608], 4) = 0 <0.000004> 6065 1498822422.809833 connect(8, {sa_family=AF_LOCAL, sun_path="/var/run/dbus/system_bus_socket"}, 33) = 0 <0.000654> 6065 1498822422.810504 getsockopt(8, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0 <0.000004> 6065 1498822422.810519 getsockopt(8, SOL_SOCKET, SO_PEERSEC, "system_u:system_r:system_dbusd_t"..., [36]) = 0 <0.000005> 6065 1498822422.810534 fstat(8, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 <0.000004> 6065 1498822422.810548 getsockopt(8, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0 <0.000003> 6065 1498822422.810559 getsockname(8, {sa_family=AF_LOCAL, NULL}, [2]) = 0 <0.000004> 6065 1498822422.810589 geteuid() = 0 <0.000004> 6065 1498822422.810634 sendmsg(8, {msg_name(0)=NULL, msg_iov(3)=[{"\0AUTH EXTERNAL ", 15}, {"30", 2}, {"\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 28}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 45 <0.000057> ... 6065 1498822422.810819 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"OK 53cbadaeed359824c3f7568859254"..., 256}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 52 <0.000005> 6065 1498822422.810847 sendmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/fre"..., 128}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 128 <0.000659> 6065 1498822422.811530 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"l\2\1\1\16\0\0\0\1\0\0\0E\0\0\0\6\1s\0\t\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 <0.000004> 6065 1498822422.811550 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{":1.173310\0\0\0\0\0\0\0\5\1u\0\1\0\0\0\10\1g\0\1s\0\0"..., 78}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 78 <0.000004> 6065 1498822422.811597 sendmsg(8, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1x\0\0\0\2\0\0\0\230\0\0\0\1\1o\0\27\0\0\0/org/fre"..., 168}, 
{"\372\351\0\0\261\27\0\0\4\0\0\0su-l\0\0\0\0\v\0\0\0unspecif"..., 120}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 288 <0.000547>
6065 1498822422.812174 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1\16\0\0\0\2\0\0\0\225\0\0\0\1\1o\0\25\0\0\0", 24}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24 <0.000004>
6065 1498822422.814666 recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0"..., 158}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 158 <0.000010>
6065 1498822422.814700 recvmsg(8, 0x7ffd3cd15d90, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable) <0.000004>
6065 1498822422.814722 ppoll([{fd=8, events=POLLIN}], 1, {24, 997453000}, NULL, 8 <unfinished ...>
6060 1498822427.776568 <... msgrcv resumed> 0x7f302d34a5a0, 8183, 0, MSG_NOERROR) = ? ERESTARTNOHAND (To be restarted if no handler) <5.000041>
6060 1498822427.776609 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL, si_value={int=0, ptr=0x7f1200000000}} ---
...
6060 1498822427.778727 munmap(0x7f302d35e000, 327680) = 0 <0.000014>
6060 1498822427.778901 exit_group(0) = ?
6060 1498822427.780247 +++ exited with 0 +++
6065 1498822447.812228 <... ppoll resumed> ) = 0 (Timeout) <24.997494>
...

It looks like the script makes 10 such attempts, which amounts to roughly 250 seconds. There is nothing related to open-vm-tools here, because '200-pre_snapshot_db2_suspend_databases.ksh (1).log' gets generated even before we freeze the file systems. We need to know:

1. What does this script need from the socket?
2. Why is the socket returning EAGAIN?
3. Why is the service listening on this socket not responding?

Some of the answers may be available in the log messages of the service listening on the socket.
Either the customer should debug this from the service side themselves (based on their expertise with the service) or get the vendor of the service involved in debugging why the service is not responding in a timely fashion.

Closing all possible handles is another weird behavior of the script, but this does not add much to the total run time of the script. The longest wait comes from the socket polling timeout multiplied by the number of attempts.

(In reply to Ravindra Kumar from comment #25)
> Closing all possible handles is another weird behavior of the script, but
> this does not add much to the total run time of the script. Longest wait is
> coming from the socket polling timeout multiplied by the number of attempts.

Thank you very much for your detailed analysis of this issue; we will pass this on to the customer for further debugging.

Hi, I can now reproduce the observed behaviour (delay when running freeze/thaw scripts). It is due to the use of "su" or "runuser" inside the scripts. Because of that, systemd-logind gets called (through pam_systemd) and sends a DBus message to systemd, but this fails because the SELinux target context is not appropriate:

type=USER_AVC msg=audit(XXX): pid=715 uid=81 auid=4294967295 ses=4294967295 subj=system_u:system_r:system_dbusd_t:s0-s0:c0.c1023 msg='avc: denied { send_msg } for msgtype=method_return dest=:1.46 spid=723 tpid=2105 scontext=system_u:system_r:systemd_logind_t:s0 tcontext=system_u:system_r:vmtools_t:s0 tclass=dbus exe="/usr/bin/dbus-daemon" sauid=81 hostname=? addr=? terminal=?'

A workaround is to let systemd-logind run as a permissive domain:

# semanage permissive -a systemd_logind_t

REPRODUCER:

1. Install a RHEL 7.4 VMware system.

2. Create the test script below:

# mkdir /etc/vmware-tools/backupScripts.d
# cat > /etc/vmware-tools/backupScripts.d/test.sh << EOF
#!/bin/bash
logger -p warning "Executing $0 $*"
su - localuser -c "ls -lrt /"
logger -p warning "Done executing $0 $*"
EOF

3. Create local user "localuser":

# useradd localuser

4. Monitor the journal:

# journalctl --follow

5. From vSphere, create a VM snapshot (VM -> Snapshots -> Take Snapshots), then:
- Deselect "Snapshot the virtual machine's memory"
- Select "Quiesce guest file system (...)"

ACTUAL RESULT: In the journal we can see a 25s delay for both "freeze" and "thaw":

Oct 10 10:31:32 vm-openvmtools logger: Executing /etc/vmware-tools/backupScripts.d/test.sh freeze
Oct 10 10:31:32 vm-openvmtools su: (to localuser) root on none
Oct 10 10:31:32 vm-openvmtools systemd: Created slice User Slice of localuser.
Oct 10 10:31:32 vm-openvmtools systemd: Starting User Slice of localuser.
Oct 10 10:31:32 vm-openvmtools systemd: Started Session c11 of user localuser.
Oct 10 10:31:32 vm-openvmtools systemd: Starting Session c11 of user localuser.
Oct 10 10:31:57 vm-openvmtools vmtoolsd: total 22
Oct 10 10:31:57 vm-openvmtools vmtoolsd: dr-xr-xr-x. 2 root root 2048 Feb 17 2016 mnt
...
Oct 10 10:31:57 vm-openvmtools logger: Done executing /etc/vmware-tools/backupScripts.d/test.sh freeze

(In reply to Renaud Métrich from comment #30)
> REPRODUCER:

Hi Renaud, following your steps I did not see "freeze" in the journal log; please help take a look at the log:

[root@rhel7 backupScripts.d]# journalctl --follow
-- Logs begin at Thu 2017-09-28 19:31:52 CST. --
Oct 10 17:22:49 rhel7 vmtoolsd[891]: drwxrwxrwt. 3 root root 25 Sep 28 11:31 tmp
Oct 10 17:22:49 rhel7 vmtoolsd[891]: drwxr-xr-x. 19 root root 3160 Sep 28 11:31 dev
Oct 10 17:22:49 rhel7 vmtoolsd[891]: drwxr-xr-x. 37 root root 1200 Sep 28 11:32 run
Oct 10 17:22:49 rhel7 vmtoolsd[891]: dr-xr-xr-x. 186 root root 0 Sep 28 19:31 proc
Oct 10 17:22:49 rhel7 vmtoolsd[891]: drwxr-xr-x. 147 root root 12288 Oct 10 17:19 etc
Oct 10 17:22:49 rhel7 vmtoolsd[891]: drwxr-xr-x. 3 root root 23 Oct 10 17:19 home
Oct 10 17:22:49 rhel7 su[27187]: pam_unix(su-l:session): session closed for user localuser
Oct 10 17:22:49 rhel7 systemd[1]: Removed slice User Slice of localuser.
Oct 10 17:22:49 rhel7 systemd[1]: Stopping User Slice of localuser.
Oct 10 17:22:49 rhel7 logger[27212]: Done executing -bash
Oct 10 17:25:01 rhel7 systemd[1]: Created slice User Slice of pcp.
Oct 10 17:25:01 rhel7 systemd[1]: Starting User Slice of pcp.
Oct 10 17:25:01 rhel7 systemd[1]: Started Session 3272 of user pcp.
Oct 10 17:25:01 rhel7 systemd[1]: Starting Session 3272 of user pcp.
Oct 10 17:25:01 rhel7 CROND[27235]: (pcp) CMD ( /usr/libexec/pcp/bin/pmlogger_check -C)
Oct 10 17:25:01 rhel7 systemd[1]: Removed slice User Slice of pcp.
Oct 10 17:25:01 rhel7 systemd[1]: Stopping User Slice of pcp.
Oct 10 17:25:12 rhel7 logger[27274]: Executing -bash
Oct 10 17:25:12 rhel7 su[27275]: (to localuser) root on none
Oct 10 17:25:12 rhel7 systemd[1]: Created slice User Slice of localuser.
Oct 10 17:25:12 rhel7 systemd[1]: Starting User Slice of localuser.
Oct 10 17:25:12 rhel7 systemd[1]: Started Session c3 of user localuser.
Oct 10 17:25:12 rhel7 systemd[1]: Starting Session c3 of user localuser.
Oct 10 17:25:37 rhel7 su[27275]: pam_systemd(su-l:session): Failed to create session: Connection timed out
Oct 10 17:25:37 rhel7 su[27275]: pam_unix(su-l:session): session opened for user localuser by (uid=0)
Oct 10 17:25:37 rhel7 vmtoolsd[891]: total 24
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 2 root root 6 Mar 10 2016 srv
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 2 root root 6 Mar 10 2016 mnt
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 2 root root 6 Mar 10 2016 media
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 21 root root 4096 Jul 12 06:49 var
Oct 10 17:25:37 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 7 Jul 12 14:39 bin -> usr/bin
Oct 10 17:25:37 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 9 Jul 12 14:39 lib64 -> usr/lib64
Oct 10 17:25:37 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 7 Jul 12 14:39 lib -> usr/lib
Oct 10 17:25:37 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 8 Jul 12 14:39 sbin -> usr/sbin
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 13 root root 155 Jul 12 14:39 usr
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 3 root root 16 Jul 12 14:44 opt
Oct 10 17:25:37 rhel7 vmtoolsd[891]: dr-xr-xr-x. 5 root root 4096 Jul 12 14:50 boot
Oct 10 17:25:37 rhel7 vmtoolsd[891]: dr-xr-x---. 5 root root 239 Aug 24 18:54 root
Oct 10 17:25:37 rhel7 vmtoolsd[891]: dr-xr-xr-x. 13 root root 0 Sep 28 11:31 sys
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxrwxrwt. 3 root root 25 Sep 28 11:31 tmp
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 19 root root 3160 Sep 28 11:31 dev
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 37 root root 1200 Sep 28 11:32 run
Oct 10 17:25:37 rhel7 vmtoolsd[891]: dr-xr-xr-x. 187 root root 0 Sep 28 19:31 proc
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 147 root root 12288 Oct 10 17:19 etc
Oct 10 17:25:37 rhel7 vmtoolsd[891]: drwxr-xr-x. 3 root root 23 Oct 10 17:19 home
Oct 10 17:25:37 rhel7 su[27275]: pam_unix(su-l:session): session closed for user localuser
Oct 10 17:25:37 rhel7 systemd[1]: Removed slice User Slice of localuser.
Oct 10 17:25:37 rhel7 systemd[1]: Stopping User Slice of localuser.
Oct 10 17:25:37 rhel7 logger[27300]: Done executing -bash
Oct 10 17:25:40 rhel7 logger[27306]: Executing -bash
Oct 10 17:25:40 rhel7 su[27307]: (to localuser) root on none
Oct 10 17:25:40 rhel7 systemd[1]: Created slice User Slice of localuser.
Oct 10 17:25:40 rhel7 systemd[1]: Starting User Slice of localuser.
Oct 10 17:25:40 rhel7 systemd[1]: Started Session c4 of user localuser.
Oct 10 17:25:40 rhel7 systemd[1]: Starting Session c4 of user localuser.
Oct 10 17:26:05 rhel7 su[27307]: pam_systemd(su-l:session): Failed to create session: Connection timed out
Oct 10 17:26:05 rhel7 su[27307]: pam_unix(su-l:session): session opened for user localuser by (uid=0)
Oct 10 17:26:05 rhel7 vmtoolsd[891]: total 24
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 2 root root 6 Mar 10 2016 srv
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 2 root root 6 Mar 10 2016 mnt
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 2 root root 6 Mar 10 2016 media
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 21 root root 4096 Jul 12 06:49 var
Oct 10 17:26:05 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 7 Jul 12 14:39 bin -> usr/bin
Oct 10 17:26:05 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 9 Jul 12 14:39 lib64 -> usr/lib64
Oct 10 17:26:05 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 7 Jul 12 14:39 lib -> usr/lib
Oct 10 17:26:05 rhel7 vmtoolsd[891]: lrwxrwxrwx. 1 root root 8 Jul 12 14:39 sbin -> usr/sbin
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 13 root root 155 Jul 12 14:39 usr
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 3 root root 16 Jul 12 14:44 opt
Oct 10 17:26:05 rhel7 vmtoolsd[891]: dr-xr-xr-x. 5 root root 4096 Jul 12 14:50 boot
Oct 10 17:26:05 rhel7 vmtoolsd[891]: dr-xr-x---. 5 root root 239 Aug 24 18:54 root
Oct 10 17:26:05 rhel7 vmtoolsd[891]: dr-xr-xr-x. 13 root root 0 Sep 28 11:31 sys
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxrwxrwt. 3 root root 25 Sep 28 11:31 tmp
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 19 root root 3160 Sep 28 11:31 dev
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 37 root root 1200 Sep 28 11:32 run
Oct 10 17:26:05 rhel7 vmtoolsd[891]: dr-xr-xr-x. 187 root root 0 Sep 28 19:31 proc
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 147 root root 12288 Oct 10 17:19 etc
Oct 10 17:26:05 rhel7 vmtoolsd[891]: drwxr-xr-x. 3 root root 23 Oct 10 17:19 home
Oct 10 17:26:05 rhel7 su[27307]: pam_unix(su-l:session): session closed for user localuser
Oct 10 17:26:05 rhel7 systemd[1]: Removed slice User Slice of localuser.
Oct 10 17:26:05 rhel7 systemd[1]: Stopping User Slice of localuser.
Oct 10 17:26:05 rhel7 logger[27324]: Done executing -bash
Oct 10 17:27:36 rhel7 chronyd[942]: Source 216.187.142.202 replaced with 193.228.143.22
Oct 10 17:28:01 rhel7 systemd[1]: Created slice User Slice of pcp.
Oct 10 17:28:01 rhel7 systemd[1]: Starting User Slice of pcp.
Oct 10 17:28:01 rhel7 systemd[1]: Started Session 3273 of user pcp.
Oct 10 17:28:01 rhel7 systemd[1]: Starting Session 3273 of user pcp.
Oct 10 17:28:01 rhel7 CROND[27346]: (pcp) CMD ( /usr/libexec/pcp/bin/pmie_check -C)
Oct 10 17:28:01 rhel7 systemd[1]: Removed slice User Slice of pcp.
Oct 10 17:28:01 rhel7 systemd[1]: Stopping User Slice of pcp.
Oct 10 17:30:01 rhel7 systemd[1]: Started Session 3274 of user root.
Oct 10 17:30:01 rhel7 systemd[1]: Starting Session 3274 of user root.
Oct 10 17:30:01 rhel7 CROND[27401]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Oct 10 17:34:52 rhel7 rhnsd[1312]: /etc/sysconfig/rhn/systemid does not exist or is unreadable
Oct 10 17:40:01 rhel7 systemd[1]: Started Session 3275 of user root.
Oct 10 17:40:01 rhel7 systemd[1]: Starting Session 3275 of user root.
Oct 10 17:40:01 rhel7 CROND[27488]: (root) CMD (/usr/lib64/sa/sa1 1 1)

I missed the "chmod +x /etc/vmware-tools/backupScripts.d/test.sh", which may be the cause.

Anyway, from the log, you can see the issue.

Start of the script:

Oct 10 17:25:12 rhel7 su[27275]: (to localuser) root on none
Oct 10 17:25:12 rhel7 systemd[1]: Created slice User Slice of localuser.
Oct 10 17:25:12 rhel7 systemd[1]: Starting User Slice of localuser.
Oct 10 17:25:12 rhel7 systemd[1]: Started Session c3 of user localuser.
Oct 10 17:25:12 rhel7 systemd[1]: Starting Session c3 of user localuser.
pam_systemd failing with a timeout:

Oct 10 17:25:37 rhel7 su[27275]: pam_systemd(su-l:session): Failed to create session: Connection timed out
Oct 10 17:25:37 rhel7 su[27275]: pam_unix(su-l:session): session opened for user localuser by (uid=0)

Then the script runs normally.

I suggest closing this as "Not a bug" because the delay appears to be caused by the user's scripts.
Thanks, John

I'm sorry, but I completely disagree; I consider this a bug, or at least a very heavy limitation. Even though the user script may run "su" or "runuser" internally, there shouldn't be such a limitation: user scripts should be able to run these commands to switch user context.

The real issue here is that the scripts spawned by "vmtoolsd" inherit the SELinux context of "vmtoolsd", causing these denials when using "su" or "runuser".

The fix is to have "vmtoolsd" run in another SELinux context that would enable the scripts to run in the appropriate SELinux context, likely the same as what systemd does, or to set up the transitions upon script execution to make that possible.

(In reply to Renaud Métrich from comment #34)
> The real issue here is that the scripts spawned by "vmtoolsd" inherit the
> SELinux context of "vmtoolsd", causing these denials when using "su" or
> "runuser".
>
> The fix is to have "vmtoolsd" run in another SELinux context that would
> enable the scripts to run in the appropriate SELinux context, likely the
> same as what systemd does, or to set up the transitions upon script
> execution to make that possible.

This could all be true, but I have no idea how to implement it. Does it need changes to the code or to the spec file? Could you outline what kind of changes need to be made? It would be great if we could avoid a code change for this, because vmtoolsd may be running on non-SELinux systems too. Based on https://selinuxproject.org/page/NB_Domain_and_Object_Transitions, it looks like even transitions can be dealt with through configuration alone?
But one problem with transitions would be that we may not be aware upfront of the target domains to transition to. I'm not sure what systemd does, but if it does not require awareness of target domains, that might be a more reliable approach.

Just to help set some requirements: do we have a good idea of what these scripts may try to do? Or do we simply need to expect that these scripts will run any arbitrary command?

These scripts are customer scripts that can do various things, for example quiescing a database. I believe the best thing to do is to have them run as unconfined_t. I've written a module for that but wasn't able to test it yet because of issues with the lab. From my knowledge of SELinux, the following should do it:

# Already done by the default policy
# allow vmtools_t shell_exec_t : file { ioctl read getattr lock execute execute_no_trans open };

# Already done by the default policy
# allow unconfined_t shell_exec_t : file { entrypoint };

allow vmtools_t unconfined_t : process transition;
type_transition vmtools_t shell_exec_t : process unconfined_t;

Re-adding the NEEDINFO flag for Lukas; he maintains the SELinux policy for Fedora/RHEL, and we should hear what he has to say on this issue.

We (VMware) do not put any restrictions on these scripts. Customers are free to do whatever they like before and after quiescing the guest OS. However, we do have a maximum limit on the overall quiescing time: the whole operation has to complete within 15 minutes (scripts before quiescing, quiescing, resuming, and scripts after resuming).

Hi Guys,

Paul, thanks for the heads-up. It took me some time to read the whole thread and understand the issue here. However, I'm still not sure who or what is executing these kinds of scripts. It looks like vmtools executes them from /etc/vmware-tools/backupScripts.d, and it is expected that they inherit the vmtools_t SELinux domain; this is how it works.

The right solution here looks like this: we should create a new SELinux domain, e.g.
vmtools_script_t, label all scripts with vmtools_script_exec_t, and create the following SELinux transition rule:

type_transition vmtools_t vmtools_script_exec_t:process vmtools_script_t;

This new type vmtools_script_t will be benevolent from the SELinux policy point of view. This can be fixed in the distribution policy, but only in 7.6; for 7.5 it is too late.

Lukas.

Lukas, we had a question from VMware about whether they would need to make any changes to the vmtoolsd code or the spec file, or whether the change would be entirely confined to the selinux-policy* package?

Richard,
Everything will be done in the SELinux policy vmtools sources, so no action from VMware is needed.
Lukas.

Thanks Rich for letting us see some of the comments.

Thanks Renaud for root-causing this issue, providing a reproducer, and also finding a workaround. Great effort!

(In reply to Lukas Vrabec from comment #44)
> The right solution here looks like this: we should create a new SELinux
> domain, e.g. vmtools_script_t, label all scripts with
> vmtools_script_exec_t, and create the following SELinux transition rule:
>
> type_transition vmtools_t vmtools_script_exec_t:process vmtools_script_t;
>
> This new type vmtools_script_t will be benevolent from the SELinux policy
> point of view.

It looks like a good idea. However, I'm not very clear on how the file labeling process works. Please note that the scripts we are talking about here are not part of the open-vm-tools package; these are customer-provided (environment-specific) scripts. Maybe this can be taken care of automatically by labeling the /etc/vmware-tools/backupScripts.d directory appropriately, so that files inherit their SELinux type from the parent directory? I'm asking because I'm not sure of the customer impact of the proposed solution, and it will be good to avoid any manual steps if possible.

FWIW, the /etc/vmware-tools/backupScripts.d directory is also not pre-created, because we don't ship any built-in scripts.
So, we may need to pre-create the directory if we are going to rely on scripts inheriting their SELinux type from the parent dir.

Also, I'm not sure how this will work in the following case:

1. Disable SELinux
2. Create /etc/vmware-tools/backupScripts.d with scripts
3. Enable SELinux

What SELinux type will get assigned to the scripts?

I'm also worried about this problem in two other cases:

1. We ship some power-ops scripts and also support customer-provided power-ops scripts under the /etc/vmware-tools/scripts directory. These scripts are run during different power operations on the VM.
2. We support the guestOps feature, which allows an authenticated user to launch a script/program in the guest from any path accessible to the user, just like a user could do over an SSH session.

Considering all this, would it be better if we had a higher-level solution of some kind? Something similar to what systemd does, as Renaud mentioned in his update #34?

What happens is that a rule will be added to /etc/selinux/targeted/contexts/files/file_contexts which will cause any files created in a particular directory (e.g. /etc/vmware-tools/backupScripts.d) to acquire a particular label automatically when the file is created. So you don't need to do anything.

If the user disables SELinux, then that's their own problem; we advise people not to do that. It is possible to relabel the whole filesystem if you mess things up. You can also label a directory using ‘restorecon -rv /etc/vmware-tools/backupScripts.d’. But not turning off SELinux in the first place is the best advice.

Re comment 34, it seems as if vmtoolsd is or will be run in its own context (vmtools_t). This is done by modifying the policy to add an execute transition. The basic idea is outlined here: https://wiki.gentoo.org/wiki/SELinux/Tutorials/How_does_a_process_get_into_a_certain_context

It also doesn't require changes to the program.
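For reference, the labeling mechanism described above could be driven manually roughly as follows. This is only a sketch: vmtools_script_exec_t is the type proposed earlier in this thread and does not exist in the shipped policy yet, and the commands require root plus the semanage tool from policycoreutils-python.

```shell
# Sketch only: vmtools_script_exec_t is the type proposed in this thread,
# not (yet) part of the shipped selinux-policy package. Run as root.

# Record a local file-context rule covering the scripts directory:
semanage fcontext -a -t vmtools_script_exec_t '/etc/vmware-tools/backupScripts\.d(/.*)?'

# Relabel anything already present; files created afterwards inherit
# the label automatically from the file-context rule:
restorecon -rv /etc/vmware-tools/backupScripts.d

# Inspect the resulting labels:
ls -Z /etc/vmware-tools/backupScripts.d
```

Once a fixed selinux-policy package ships the file-context rule itself, only the restorecon step would be needed for scripts that already existed before the update.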
(In reply to Ravindra Kumar from comment #50)
> Thanks Rich for letting us see some of the comments.
>
> Thanks Renaud for root-causing this issue, providing a reproducer, and also
> finding a workaround. Great effort!
>
> (In reply to Lukas Vrabec from comment #44)
> > The right solution here looks like this: we should create a new SELinux
> > domain, e.g. vmtools_script_t, label all scripts with
> > vmtools_script_exec_t, and create the following SELinux transition rule:
> >
> > type_transition vmtools_t vmtools_script_exec_t:process vmtools_script_t;
> >
> > This new type vmtools_script_t will be benevolent from the SELinux policy
> > point of view.
>
> It looks like a good idea. However, I'm not very clear on how the file
> labeling process works. Please note that the scripts we are talking about
> here are not part of the open-vm-tools package; these are customer-provided
> (environment-specific) scripts. Maybe this can be taken care of
> automatically by labeling the /etc/vmware-tools/backupScripts.d directory
> appropriately, so that files inherit their SELinux type from the parent
> directory? I'm asking because I'm not sure of the customer impact of the
> proposed solution, and it will be good to avoid any manual steps if
> possible.

I'll label the whole /etc/vmware-tools/backupScripts.d directory, so all files in this dir will inherit the label, which will be vmtools_script_exec_t.

(In reply to Ravindra Kumar from comment #51)
> FWIW, the /etc/vmware-tools/backupScripts.d directory is also not
> pre-created, because we don't ship any built-in scripts. So, we may need to
> pre-create the directory if we are going to rely on scripts inheriting
> their SELinux type from the parent dir.
>
> Also, I'm not sure how this will work in the following case:
>
> 1. Disable SELinux
> 2. Create /etc/vmware-tools/backupScripts.d with scripts
> 3. Enable SELinux
>
> What SELinux type will get assigned to the scripts?
>
> I'm also worried about this problem in two other cases:
>
> 1. We ship some power-ops scripts and also support customer-provided
> power-ops scripts under the /etc/vmware-tools/scripts directory. These
> scripts are run during different power operations on the VM.

I don't think this will affect the proper running of these scripts. The policy will be really benevolent.

> 2. We support the guestOps feature, which allows an authenticated user to
> launch a script/program in the guest from any path accessible to the user,
> just like a user could do over an SSH session.

How is this working? Are you using SSH for this?

> Considering all this, would it be better if we had a higher-level solution
> of some kind? Something similar to what systemd does, as Renaud mentioned
> in his update #34?

Yeah, we will create a new SELinux type, and there will be a SELinux domain transition to vmtools_script_t, which will be unconfined.

Thanks Rich and Lukas.

(In reply to Lukas Vrabec from comment #54)
> I don't think this will affect the proper running of these scripts. The
> policy will be really benevolent.

I was not worried about the effect of the new policy on these other scripts. Instead, I meant to say that we also need a solution for these additional scripts.

> How is this working? Are you using SSH for this?

We don't use SSH. We authenticate users ourselves and impersonate them. The impersonation is done in vmtoolsd's context.

Hi Lukas and Robert,
As this issue affects many customer systems, the customer urgently needs a hot fix. How about applying a z-stream fix on RHEL 7.4? I can help do some regression testing for the fix. Or maybe we can offer a workaround to the customer so they can wait for the RHEL 7.6 release.

If something does not work as expected and you assume that SELinux is affecting the scenario, then the following command will collect SELinux denials on your machine:

# ausearch -m avc -m user_avc -m selinux_err -m user_selinux_err -i -ts today

Collected SELinux denials will help us immensely to understand where the problem is.
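As a stopgap while waiting for the policy update, denials collected as above can be turned into a local policy module with audit2allow. This is a sketch of a common workaround, not the final fix; the module name vmtoolsd_local is made up for illustration, and the commands require root plus policycoreutils-python (for audit2allow and semodule).

```shell
# Workaround sketch (not the final fix): build and load a local policy
# module from the recorded AVC denials. Run as root.
ausearch -m avc -ts today | audit2allow -M vmtoolsd_local  # writes vmtoolsd_local.te and vmtoolsd_local.pp
semodule -i vmtoolsd_local.pp                              # install and load the module
semodule -l | grep vmtoolsd_local                          # confirm it is loaded
```

Review the generated vmtoolsd_local.te before loading it: audit2allow permits exactly what was denied, which may be broader than intended.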
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3111