Bug 1455451
Summary: VM with pty based serial console hangs if PTS buffer is full

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm
Version: 7.3
Status: CLOSED ERRATA
Severity: urgent
Priority: medium
Reporter: Sergio Lopez <slopezpa>
Assignee: pagupta
QA Contact: Sitong Liu <siliu>
Docs Contact: Yehuda Zimmerman <yzimmerm>
CC: agurenko, chayang, cww, dvacek, ehabkost, hhuang, juzhang, mdeng, michen, miturria, pagupta, pbonzini, qzhang, rbalakri, slopezpa, virt-bugs, virt-maint, xfu
Target Milestone: rc
Target Release: ---
Keywords: TestOnly
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-1.5.3-141.el7
Doc Type: Bug Fix
Doc Text:
Guests accessed using a serial console no longer become unresponsive
Previously, if a client opened a host-side pseudoterminal device (pty) of a KVM guest's pty serial console and did not read from it, the guest in some cases became unresponsive because of blocking read/write calls. With this update, the host-side pty open mode was set to non-blocking. As a result, the guest machine does not become unresponsive in the described scenario.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-10 14:32:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1452067
Bug Blocks: 1420851, 1469590
Attachments:
Description
Sergio Lopez
2017-05-25 08:53:48 UTC
Created attachment 1282164 [details]
PoC for making the serial device lossy
PoC patch which changes two things:
1) Adds the O_NONBLOCK flag to the PTS file descriptor.
2) Changes the retry logic in serial_xmit to retry immediately when qemu_chr_fe_write fails and, once MAX_XMIT_RETRY is reached, to set the UART_LSR_THRE flag in lsr and stop trying.
With this, the Guest no longer hangs regardless of the state of the PTS on the Host, at the cost of potentially losing some characters if the FIFO is full and cannot be written to the PTS.
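For context on item 1, the host-side behaviour the PoC works around can be reproduced outside QEMU. The sketch below is not part of the attached patch; it only illustrates that write() on a pty master blocks once the kernel pty buffer fills while nobody reads the slave side, and fails with EAGAIN instead once O_NONBLOCK is set:

#!/usr/bin/env python3
# Illustration only (not the attached PoC): writes to a pty master block once
# the kernel pty buffer is full and nobody reads the slave, unless the fd is
# switched to non-blocking mode, in which case they fail with EAGAIN.
import errno
import fcntl
import os
import pty

master, slave = pty.openpty()   # slave stays open but is never read

# Switch the master to non-blocking so the loop below terminates instead of
# hanging the way a blocking write from QEMU's chardev would.
flags = fcntl.fcntl(master, fcntl.F_GETFL)
fcntl.fcntl(master, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
try:
    while True:
        written += os.write(master, b"x" * 1024)
except OSError as e:
    if e.errno != errno.EAGAIN:
        raise
    print("pty buffer filled after ~%d bytes; a blocking write would now hang" % written)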
Created attachment 1282166 [details]
Python script to keep the PTS open without reading from it
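The attached script itself is not reproduced here; a minimal stand-in with the same effect might look like the following (the /dev/pts/2 path is only an example; use the pty reported for the domain, e.g. by `virsh ttyconsole <domain>`):

#!/usr/bin/env python3
# Hypothetical stand-in for attachment 1282166: open the guest's host-side
# pty and keep it open without ever reading from it, so the pty buffer
# eventually fills and QEMU's serial output can no longer be drained.
import os
import sys
import time

# Example path only; take the real one from the domain XML or
# `virsh ttyconsole <domain>`.
pts_path = sys.argv[1] if len(sys.argv) > 1 else "/dev/pts/2"

fd = os.open(pts_path, os.O_RDWR | os.O_NOCTTY)
print("Holding %s open without reading; Ctrl+C to release." % pts_path)
try:
    while True:
        time.sleep(60)          # deliberately never read(fd, ...)
except KeyboardInterrupt:
    pass
finally:
    os.close(fd)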
Hello Sergio,

Just looking into this. I want to ask you a question based on the 'bt' you provided in #comment 0. It looks like it's a blocking call?

#0 0x00007f5edf4ac43d in write () at -------------> here
#1 0x00007f5ee0757efc in g_io_unix_write () from /lib64/libglib-2.0.so.0
#2 0x00007f5ee070a1ed in g_io_channel_write_chars () from /lib64/libglib-2.0.so.0
#3 0x00007f5ee22ea44e in io_channel_send (fd=0x7f5ee4a7c3f0, buf=0x7f5ee4f2c894, len=1) at qemu-char.c:787
#4 0x00007f5ee224c153 in serial_xmit (chan=chan@entry=0x0, cond=cond@entry=G_IO_OUT,

So, there could be two reasons for that:
1] It's a blocking call.
2] As spotted by you, the retry logic could be wrong.

For 2] I went through the code. If we compare with the 'bt' above, the 'qemu_chr_fe_write' function is called only once by 'serial_xmit' to check if we can write len '1'.

    } else if (qemu_chr_fe_write(&s->chr, &s->tsr, 1) != 1 &&

If it fails (could not write, or no one is reading data from the serial device) we add a callback. This single character write might be stuck because of the blocking call?

For 1] I went through your patch as well. I can see you are returning without setting a callback; I am not sure whether this is the correct thing to do here. Also, you have changed the blocking fd to non-blocking. I want to know: have you only tried with just changing from blocking to non-blocking fd? In BZ 1451470 I could see Fam has mentioned the below commit is already upstream.

fac6688a18574b6f2caa8c699a936e729ed53ece

Thanks,
Pankaj
(In reply to pagupta from comment #5)
> I want to know: have you only tried with just changing from blocking to
> non-blocking fd? In BZ 1451470 I could see Fam has mentioned the below
> commit is already upstream.

Hi Pankaj,

Yes, I've first tried only with adding the O_NONBLOCK flag. With it, the write() does indeed return, but the VM still manifested similar symptoms (no network response to SSH nor ping), until the console is flushed using "virsh console".

Looking a bit into this, this is the stack trace generated by the guest kernel:

<snip>
[ 2120.755147] INFO: rcu_sched self-detected stall on CPU { 6} (t=74571 jiffies g=157201 c=157200 q=2275)
[ 2120.755148] Task dump for CPU 6:
[ 2120.755150] swapper/6 R running task 0 0 1 0x00000008
[ 2120.755153] ffff88007bc8edd0 7d015a79700ada28 ffff88007d383738 ffffffff810c47e8
[ 2120.755154] 0000000000000006 ffffffff81a1e6c0 ffff88007d383750 ffffffff810c80c9
[ 2120.755155] 0000000000000007 ffff88007d383780 ffffffff81137960 ffff88007d3901c0
[ 2120.755156] Call Trace:
[ 2120.755165] <IRQ> [<ffffffff810c47e8>] sched_show_task+0xa8/0x110
[ 2120.755167] [<ffffffff810c80c9>] dump_cpu_task+0x39/0x70
[ 2120.755171] [<ffffffff81137960>] rcu_dump_cpu_stacks+0x90/0xd0
[ 2120.755173] [<ffffffff8113b0b2>] rcu_check_callbacks+0x442/0x720
[ 2120.755178] [<ffffffff810eb9ac>] ? update_wall_time+0x26c/0x6c0
[ 2120.755180] [<ffffffff810f3610>] ? tick_sched_handle.isra.13+0x60/0x60
[ 2120.755184] [<ffffffff81099697>] update_process_times+0x47/0x80
[ 2120.755185] [<ffffffff810f35d5>] tick_sched_handle.isra.13+0x25/0x60
[ 2120.755186] [<ffffffff810f3651>] tick_sched_timer+0x41/0x70
[ 2120.755189] [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260
[ 2120.755191] [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0
[ 2120.755195] [<ffffffff81050ff7>] local_apic_timer_interrupt+0x37/0x60
[ 2120.755200] [<ffffffff81699bcf>] smp_apic_timer_interrupt+0x3f/0x60
[ 2120.755201] [<ffffffff8169811d>] apic_timer_interrupt+0x6d/0x80
[ 2120.755203] [<ffffffff810889ee>] ? console_unlock+0x46e/0x4b0
[ 2120.755205] [<ffffffff81088df4>] vprintk_emit+0x3c4/0x510
[ 2120.755206] [<ffffffff81089169>] vprintk_default+0x29/0x40
[ 2120.755208] [<ffffffff816803bb>] printk+0x5e/0x75
[ 2120.755211] [<ffffffff8131ee43>] ? ___ratelimit+0x93/0x100
[ 2120.755214] [<ffffffff815ac1bf>] ip_handle_martian_source.isra.35+0x6f/0xd0
[ 2120.755216] [<ffffffff815f3c44>] ? fib_validate_source+0x64/0xe0
[ 2120.755217] [<ffffffff815ae1ba>] ip_route_input_slow+0x75a/0xca0
[ 2120.755240] [<ffffffff815ae746>] ip_route_input_noref+0x46/0x320
[ 2120.755243] [<ffffffff815b102c>] ip_rcv_finish+0xbc/0x350
[ 2120.755244] [<ffffffff815b1986>] ip_rcv+0x2b6/0x410
[ 2120.755245] [<ffffffff815b0f70>] ? inet_del_offload+0x40/0x40
[ 2120.755249] [<ffffffff8156fe52>] __netif_receive_skb_core+0x582/0x800
[ 2120.755253] [<ffffffff815dc974>] ? tcp4_gro_receive+0x134/0x1b0
[ 2120.755255] [<ffffffff815700e8>] __netif_receive_skb+0x18/0x60
[ 2120.755256] [<ffffffff81570170>] netif_receive_skb_internal+0x40/0xc0
[ 2120.755258] [<ffffffff81571288>] napi_gro_receive+0xd8/0x130
[ 2120.755276] [<ffffffffa00c79d5>] virtnet_poll+0x265/0x750 [virtio_net]
[ 2120.755277] [<ffffffff81570910>] net_rx_action+0x170/0x380
[ 2120.755281] [<ffffffff8108f63f>] __do_softirq+0xef/0x280
[ 2120.755282] [<ffffffff81698f5c>] call_softirq+0x1c/0x30
[ 2120.755285] [<ffffffff8102d365>] do_softirq+0x65/0xa0
[ 2120.755286] [<ffffffff8108f9d5>] irq_exit+0x115/0x120
[ 2120.755287] [<ffffffff81699bd5>] smp_apic_timer_interrupt+0x45/0x60
[ 2120.755288] [<ffffffff8169811d>] apic_timer_interrupt+0x6d/0x80
[ 2120.755293] <EOI> [<ffffffff81060fe6>] ? native_safe_halt+0x6/0x10
[ 2120.755297] [<ffffffff810347bf>] default_idle+0x1f/0xc0
[ 2120.755298] [<ffffffff81035106>] arch_cpu_idle+0x26/0x30
[ 2120.755300] [<ffffffff810e82f5>] cpu_startup_entry+0x245/0x290
[ 2120.755302] [<ffffffff8104f09a>] start_secondary+0x1ba/0x230
</snip>

This stack trace is definitely different from the one it generates running on a QEMU instance without the O_NONBLOCK patch, but as I said above, from a final user perspective the issue still persists.

I think the problem comes from the retry logic:

<snip from hw/char/serial.c:234>
    do {
        assert(!(s->lsr & UART_LSR_TEMT));
        if (s->tsr_retry == 0) {
            assert(!(s->lsr & UART_LSR_THRE));

            if (s->fcr & UART_FCR_FE) {
                assert(!fifo8_is_empty(&s->xmit_fifo));
                s->tsr = fifo8_pop(&s->xmit_fifo);
                if (!s->xmit_fifo.num) {
                    s->lsr |= UART_LSR_THRE;
                }
            } else {
                s->tsr = s->thr;
                s->lsr |= UART_LSR_THRE;
            }
            if ((s->lsr & UART_LSR_THRE) && !s->thr_ipending) {
                s->thr_ipending = 1;
                serial_update_irq(s);
            }
        }

        if (s->mcr & UART_MCR_LOOP) {
            /* in loopback mode, say that we just received a char */
            serial_receive1(s, &s->tsr, 1);
        } else if (qemu_chr_fe_write(s->chr, &s->tsr, 1) != 1 &&
                   s->tsr_retry < MAX_XMIT_RETRY) {
            assert(s->watch_tag == 0);
            s->watch_tag = qemu_chr_fe_add_watch(s->chr, G_IO_OUT|G_IO_HUP,
                                                 serial_watch_cb, s);
            if (s->watch_tag > 0) {
                s->tsr_retry++;
                return;
            }
        }
        s->tsr_retry = 0;

        /* Transmit another byte if it is already available. It is only
           possible when FIFO is enabled and not empty. */
    } while (!(s->lsr & UART_LSR_THRE));

    s->last_xmit_ts = qemu_get_clock_ns(vm_clock);
    s->lsr |= UART_LSR_TEMT;
}
</snip>

Combined with this behavior from the VM kernel:

<snip from drivers/tty/serial/8250/8250_core.c>
static void serial8250_console_putchar(struct uart_port *port, int ch)
{
    struct uart_8250_port *up =
        container_of(port, struct uart_8250_port, port);

    wait_for_xmitr(up, UART_LSR_THRE);
    serial_port_out(port, UART_TX, ch);
}
</snip>

My hypothesis for what's happening is:

- From QEMU's perspective: "qemu_chr_fe_add_watch" starts failing when the PTS is full from the Host's perspective. Each time it fails, tsr_retry is incremented, installing the callback and returning. When MAX_XMIT_RETRY is reached, tsr_retry is set to 0, returning to the beginning of the loop, which will discard the current character and take a new one from xmit_fifo. But, and this is especially important, UART_LSR_THRE will not be set unless xmit_fifo is empty.

- From the VM kernel's perspective: it writes to the I/O port, and waits a bit for UART_LSR_THRE. As it doesn't arrive in time, the wait times out and it writes another character to the I/O port, increasing xmit_fifo.

I think the combination of both perspectives means that writes to the serial device succeed, although _very_ slowly. So slowly that from outside the VM it looks exactly the same as when it was blocked on write().

The PoC attached to the case just makes writes to the serial device fail or succeed quickly, to avoid triggering the above situation.

Sergio.
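To make the hypothesis above easier to follow, here is a small toy model (plain Python, not QEMU code) of the quoted serial_xmit()/serial_ioport_write() interplay, under the stated assumption that qemu_chr_fe_write() never accepts a byte (host pty full) and the installed G_IO_OUT watch never fires (nobody drains the pty). It shows the transmit FIFO growing while THRE stays clear, which is exactly the condition the guest's wait_for_xmitr() keeps timing out on:

# Toy model, illustrative only.
MAX_XMIT_RETRY = 4

class ToySerial:
    def __init__(self):
        self.xmit_fifo = []      # stands in for s->xmit_fifo
        self.tsr = None          # stands in for s->tsr
        self.tsr_retry = 0
        self.thre = False        # the UART_LSR_THRE bit the guest polls

    def backend_write(self, ch):
        return 0                 # pty buffer full: the 1-byte write never succeeds

    def guest_writes_char(self, ch):
        # mirrors the THR write path in serial_ioport_write()
        self.xmit_fifo.append(ch)
        self.thre = False
        if self.tsr_retry == 0:
            self.serial_xmit()

    def serial_xmit(self):
        while True:
            if self.tsr_retry == 0:
                self.tsr = self.xmit_fifo.pop(0)
                if not self.xmit_fifo:
                    self.thre = True
            if self.backend_write(self.tsr) != 1 and self.tsr_retry < MAX_XMIT_RETRY:
                self.tsr_retry += 1      # watch installed; give up until it fires
                return
            self.tsr_retry = 0           # MAX_XMIT_RETRY reached: current char is dropped
            if self.thre:
                break

s = ToySerial()
for ch in "a kernel log line written to the serial console":
    s.guest_writes_char(ch)
print("queued: %d chars, THRE: %s" % (len(s.xmit_fifo), s.thre))
# The FIFO keeps growing and THRE stays False, so in the real system the
# guest's wait_for_xmitr() times out on every subsequent character.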
(In reply to Sergio Lopez from comment #6)
> [...]
> My hypothesis for what's happening is:
>
> - From QEMU's perspective: "qemu_chr_fe_add_watch" starts failing when the
> PTS is full from the Host's perspective. Each time it fails, tsr_retry is
> incremented, installing the callback and returning. When MAX_XMIT_RETRY is
> reached, tsr_retry is set to 0, returning to the beginning of the loop,
> which will discard the current character and take a new one from xmit_fifo.
> But, and this is especially important, UART_LSR_THRE will not be set unless
> xmit_fifo is empty.

Yes. As we are not flushing, we just add a new callback if one is not already set up. Suppose the *first* char write from the guest fails: we add a watch and callback, and we also increment 'tsr_retry'. As there is no activity (nobody reading/writing) on the file handle after this, the callback won't be called?

serial_xmit()
    ...
    else if (qemu_chr_fe_write(&s->chr, &s->tsr, 1) != 1 &&
             s->tsr_retry < MAX_XMIT_RETRY) {
        assert(s->watch_tag == 0);
        s->watch_tag = qemu_chr_fe_add_watch(&s->chr, G_IO_OUT | G_IO_HUP,
                                             serial_watch_cb, s); -------->add callback
        if (s->watch_tag > 0) {
            s->tsr_retry++;   // increment & return
            return;
        }
    ....
    }

In function 'serial_ioport_write', we are checking for tsr_retry==0:

    if (s->tsr_retry == 0) {
        serial_xmit(s);
    }

So, if there is no activity on the file FD we are not flushing the data at all.

> - From the VM kernel's perspective: it writes to the I/O port, and waits a
> bit for UART_LSR_THRE. As it doesn't arrive in time, the wait times out and
> it writes another character to the I/O port, increasing xmit_fifo.
>
> I think the combination of both perspectives means that writes to the
> serial device succeed, although _very_ slowly. So slowly that from outside
> the VM it looks exactly the same as when it was blocked on write().

...which results in RCU stalls and other bad behaviour.

> The PoC attached to the case just makes writes to the serial device fail or
> succeed quickly, to avoid triggering the above situation.

Your fix is working because you have removed the 'callback' and are calling 'serial_xmit' every time, which might not be the correct thing to do, as callbacks are handled by the main thread when there is activity on the file FD.

> Sergio.

Thanks,
Pankaj
Is the issue reproducible using "virsh console" or the virDomainOpenConsole() libvirt API? I'm not sure it is valid usage to open the pty device directly and not read data as soon as it's available.

>> - From QEMU's perspective: "qemu_chr_fe_add_watch" starts failing when the
>> PTS is full from the Host's perspective. Each time it fails, tsr_retry is
>> incremented, installing the callback and returning. When MAX_XMIT_RETRY is
>> reached, tsr_retry is set to 0, returning to the beginning of the loop,
>> which will discard the current character and take a new one from xmit_fifo.

I don't think this is correct. Why would qemu_chr_fe_add_watch fail?

>> But, and this is especially important, UART_LSR_THRE will not be set unless
>> xmit_fifo is empty.

Right, THRE is "transmitter hold register empty".

Commit fac6688a should definitely be backported (z-stream too).

To some extent this could be considered also a kernel issue; printk inside rcu_read_lock doesn't seem to be a good idea. It is probably very widespread, especially for pr_debug and pr_info messages, but martian source does it at pr_warn.

But I agree with Eduardo: if you open the pty device, you should be reading it.

(In reply to Paolo Bonzini from comment #9)
> To some extent this could be considered also a kernel issue; printk inside
> rcu_read_lock doesn't seem to be a good idea. It is probably very
> widespread, especially for pr_debug and pr_info messages, but martian source
> does it at pr_warn.

Not at all. printk is perfectly valid in any context, even in NMI or interrupts-disabled sections. And if you check the code, the martian source printk is rate-limited to twice a second, IIRC.

(In reply to Eduardo Habkost from comment #8)
> Is the issue reproducible using "virsh console" or the
> virDomainOpenConsole() libvirt API? I'm not sure it is valid usage to open
> the pty device directly and not read data as soon as it's available.

Yes, it is valid to not read data immediately. For example, when using virDomainOpenConsole, libvirtd will open the PTY, read data from it, and transmit that to the client. If the outgoing TCP buffer for the client is full, libvirtd can stop reading further data from the PTY until there is more space to transmit, rather than queuing an arbitrary amount of data in memory waiting for transmission.
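For reference, the flow described above (a client consuming console data at its own pace) corresponds to reading the console through libvirt rather than opening the raw pty. A minimal sketch using the libvirt Python bindings; the domain name "guest" and the 1024-byte read size are placeholders:

#!/usr/bin/env python3
# Sketch: read the serial console through virDomainOpenConsole(), letting
# libvirtd manage the host-side pty, instead of opening /dev/pts/N directly.
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("guest")          # placeholder domain name

stream = conn.newStream(0)
dom.openConsole(None, stream, 0)          # None = first console device

try:
    while True:
        data = stream.recv(1024)          # read at the client's own pace
        if not data:
            break
        print(data.decode(errors="replace"), end="")
finally:
    stream.abort()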
Note I'm going to pick commit 5c2b22f5bf ("Do not hang on full PTY", 2017-06-06) for bug 1452067, because it is necessary for QE to test that scenario easily.

Wrong commit id in comment 14, the right one is fac6688a185.

QE tried to verify this bug with libvirt.

- Packages in test on host:
qemu-kvm-1.5.3-149.el7.x86_64
kernel 3.10.0-799.el7.x86_64
libvirt-3.9.0-4.el7.x86_64

- Snippet of libvirt xml:
...
<console type='pty' tty='/dev/pts/2'>
  <source path='/dev/pts/2'/>
  <target type='serial' port='0'/>
  <alias name='serial0'/>
</console>
...

- Test steps according to bug 1451470 comment 26 (without steps 3) and 6)):

1. Create and launch a VM with a pty serial console.

2. On the Host, log into the guest with 'virsh console'.

3. On the Guest, add the 'quiet' option to the kernel command line and reboot the guest.

[root@guest ~]# cat /proc/sys/kernel/printk
4 4 1 7

4. On the Guest, enable logging of martian packets.

[root@localhost ~]# sysctl net.ipv4.conf.all.log_martians=1
net.ipv4.conf.all.log_martians = 1

5. On the Host, open an SSH connection to the Guest, and run date in a loop. Leave this running in another terminal:

[root@localhost ~]# while sleep 1; do date ; done

6. On the Host, use rawsock.py to generate martian packets.

[root@dhcp-8-163 home]# python rawsock.py 192.168.122.155 192.168.122.255

- Actual Result:

1. No guest hang was hit after around 10 minutes.

2. If the 'quiet' option was not set,

[root@guest ~]# cat /proc/sys/kernel/printk
7 4 1 7

no hang was hit either after around 10 minutes.

3. If 'virsh console' is quit while it is printing the log, no hang was hit after around 10 minutes.

-------

Hi Pankaj, Sergio Lopez,
Do you think this bug can be verified like above? If I missed anything or did something wrong, please let me know. Thanks.

Best regards,
Sitong
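The rawsock.py script used in step 6 above is not attached to this bug. A stand-in that provokes the same "martian source" log messages could spoof an address the guest considers impossible as the source of broadcast packets; the sketch below uses Scapy and assumes the two command-line arguments are the source to spoof (the guest's own address) and the broadcast destination, which is only an interpretation of how the original script is invoked:

#!/usr/bin/env python3
# Hypothetical stand-in for rawsock.py: flood the guest with ICMP packets
# whose source address is the guest's own IP, which the guest logs as
# martian sources once net.ipv4.conf.all.log_martians=1 is set.
# Requires scapy and root privileges.
from scapy.all import IP, ICMP, send

guest_ip = "192.168.122.155"      # source to spoof (assumed: the guest's own address)
broadcast = "192.168.122.255"     # destination used in the test

# loop=1 keeps sending until interrupted; inter throttles the rate slightly.
send(IP(src=guest_ip, dst=broadcast) / ICMP(), loop=1, inter=0.01, verbose=False)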
Hi Sitong,

(In reply to Sitong Liu from comment #25)
> QE tried to verify this bug with libvirt.
> [...]
> - Actual Result:
>
> 1. No guest hang was hit after around 10 minutes.
>
> 2. If the 'quiet' option was not set, no hang was hit either after around
> 10 minutes.
>
> 3. If 'virsh console' is quit while it is printing the log, no hang was hit
> after around 10 minutes.

If you are not seeing a hang when you close virsh console, or when you don't open it at all, it confirms the bug is fixed and that libvirt takes care of the client socket handling.

> Hi Pankaj, Sergio Lopez,
> Do you think this bug can be verified like above? If I missed anything or
> did something wrong, please let me know. Thanks.
>
> Best regards,
> Sitong

Thanks Pankaj.

Verified this bug with the following packages:
qemu-kvm-1.5.3-149.el7.x86_64
kernel 3.10.0-799.el7.x86_64
libvirt-3.9.0-4.el7.x86_64

Test scenarios:
1. open virsh console -- no hang
2. open virsh console + 'quiet' option -- no hang
3. don't open virsh console -- no hang
4. don't open virsh console + 'quiet' option -- no hang
5. close virsh console when printing log -- no hang
6. close virsh console when printing + 'quiet' option -- no hang

According to these results, this issue has been fixed; setting this bug as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0816