Bug 1661625
| Summary: | rsync-3.1.2-4.el7 hangs when run with -vvv to sync a large repository | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Thiago Rafael Becker <tbecker> | |
| Component: | rsync | Assignee: | Michal Ruprich <mruprich> | |
| Status: | CLOSED WONTFIX | QA Contact: | qe-baseos-daemons | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 7.6 | CC: | bwelterl, fkrska, mruprich, pandrade, rmetrich, thozza, yferszt | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1775561 (view as bug list) | Environment: | ||
| Last Closed: | 2019-12-18 11:07:23 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1757052 | |||
rsync-3.0.9-18 uses protocol 30 whereas rsync-3.1.2-4 uses protocol 31. After digging into Upstream repo, it appears the issue started to happen with v3.1.0pre1 (didn't find an older tag to test further). This should probably be reported Upstream. Issue happens only with "-vvv", not "-vv". Issue was introduced by: d8587b4 - Change the msg pipe to use a real multiplexed IO mode for the data that goes from the receiver to the generator. (9 years ago) <Wayne Davison> The issue is within read_a_msg() which is used by 2nd child process:
- first process (main process) is the sender
- second (child) process (PID 11453) is a receiver that also sends messages (maybe it's due to verbosity, I don't know yet ...)
- third (child) process is a receiver
# pstack 11453
#0 0x00007f6b2384df53 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000427fdf in writefd_unbuffered (fd=fd@entry=1, buf=buf@entry=0x7ffe731183c0 "\211", len=141) at io.c:1434
#2 0x00000000004284af in mplex_write (fd=1, code=code@entry=MSG_INFO, buf=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep 6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=137, convert=<optimized out>) at io.c:378
#3 0x0000000000426fe8 in io_multiplex_write (code=MSG_INFO, buf=<optimized out>, len=len@entry=137, convert=<optimized out>) at io.c:1833
#4 0x000000000042709c in send_msg (code=<optimized out>, buf=buf@entry=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep 6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=len@entry=137, convert=<optimized out>) at io.c:392
#5 0x000000000041ea5e in rwrite (code=<optimized out>, code@entry=FINFO, buf=buf@entry=0x7ffe7311a310 "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep 6 08:43:06 2018\nl7.x86_64/.gitattributes\n", len=len@entry=137, is_utf8=0) at log.c:313
#6 0x0000000000427395 in read_a_msg (fd=3) at io.c:1044
#7 0x0000000000428075 in writefd_unbuffered (fd=fd@entry=1, buf=buf@entry=0x7ffe7311bd90 "p", len=116) at io.c:1448
...
In read_a_msg(), the "msg_fd_in" variable is temporarily disabled:
874 /* Read a message from a multiplexed source. */
875 static void read_a_msg(int fd)
876 {
...
879 int tag, flist_parent = -1, save_msg_fd_in = msg_fd_in;
...
881 /* Temporarily disable msg_fd_in. This is needed to avoid looping back
882 * to this routine from writefd_unbuffered(). */
883 msg_fd_in = -1;
...
Due to verbosity (?? unsure ??), we enter the following code:
1031 case MSG_INFO:
1032 case MSG_ERROR:
1033 case MSG_ERROR_XFER:
1034 case MSG_WARNING:
1035 if (msg_bytes >= sizeof line) {
1036 overflow:
1037 rprintf(FERROR,
1038 "multiplexing overflow %d:%lu [%s%s]\n",
1039 tag, (unsigned long)msg_bytes, who_am_i(),
1040 inc_recurse ? "/inc" : "");
1041 exit_cleanup(RERR_STREAMIO);
1042 }
1043 read_loop(fd, line, msg_bytes);
1044 rwrite((enum logcode)tag, line, msg_bytes, !am_generator);
We have here: tag == MSG_INFO, line == "set modtime of sdc/kernel-3.10.0-862.17.1.el7/linux-3.10.0-862.17.1.el7.x86_64/..mailmap.5xvc9E to (1536216186) Thu Sep 6 08:43:06 2018\nl7.x86_64/.gitattributes\n"
On line 1044, we try writing this, but cannot because buffer is probably full.
Since msg_fd_in has been disabled temporarily, there is probably something on the FD but we never handle it, causing a deadlock to occur between the various sides.
This rwrite() call is indeed due to verbosity. In some rare cases, if there are tons of errors, we should be able to hit the same issue without "-vvv". Red Hat Enterprise Linux version 7 entered the Maintenance Support 1 Phase in August 2019. In this phase only qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available. Other errata advisories may be delivered as appropriate. This bug has been reviewed by Support and Engineering representative and does not meet the inclusion criteria for Maintenance Support 1 Phase. The issue has been cloned to RHEL-8 as Bug #1775561. For more information about Red Hat Enterprise Linux Lifecycle, please see https://access.redhat.com/support/policy/updates/errata/ Development Management has reviewed and declined this request. You may appeal this decision by using your Red Hat support channels, who will make certain the issue receives the proper prioritization with product and development management. https://www.redhat.com/support/process/production/#howto |
Description of problem: The following command hangs after some time. # rsync -avvv /<large_directory> /dst Version-Release number of selected component (if applicable): rsync-3.1.2-4.el7 How reproducible: 100% Steps to Reproduce: 1. Install a new machine with two extra disk formatted as xfs, mount them as /git1 and /git2 The file system is actually not important. I've seen it happening with nfs as the source file system. Also seen with mvfs. 2. Clone the linux kernel repository into /git1 3. Start rsync from /git1 to /git2 Actual results: The rsync command stops after some time. Expected results: rsync to complete. Additional info: Tested with kernel-3.10.0-862.11.6.el7.x86_64 and rsync-3.1.2-4.el7.x86_64, hangs. Tested with kernel-3.10.0-957.1.3.el7.x86_64 and rsync-3.1.2-4.el7.x86_64, hangs. Tested with kernel-3.10.0-957.1.3.el7 and rsync-3.0.9-18.el7.x86_64, doesn't hang. Tested with kernel-3.10.0-862.11.6.el7.x86_64 and rsync-3.0.9-18.el7.x86_64, doesn't hang. The rsync processes appear to be stuck waiting for a select to complete. root 13624 9812 6 17:35 pts/0 00:00:04 rsync -avvv /git1 /git2 root 13625 13624 0 17:35 pts/0 00:00:00 rsync -avvv /git1 /git2 root 13626 13625 7 17:35 pts/0 00:00:04 rsync -avvv /git1 /git2 # cat /proc/{13624,13625,13626}/stack [<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0 [<ffffffff9f635711>] do_select+0x6d1/0x7c0 [<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0 [<ffffffff9f635b9a>] SyS_select+0xba/0x110 [<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0 [<ffffffff9f635711>] do_select+0x6d1/0x7c0 [<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0 [<ffffffff9f635b9a>] SyS_select+0xba/0x110 [<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27 [<ffffffffffffffff>] 0xffffffffffffffff [<ffffffff9f634d95>] poll_schedule_timeout+0x55/0xb0 [<ffffffff9f635711>] do_select+0x6d1/0x7c0 [<ffffffff9f6359b6>] core_sys_select+0x1b6/0x2e0 [<ffffffff9f635b9a>] SyS_select+0xba/0x110 [<ffffffff9fb2579b>] system_call_fastpath+0x22/0x27 [<ffffffffffffffff>] 0xffffffffffffffff