Bug 2143884 - non-stop polling of removable media storage
Summary: non-stop polling of removable media storage
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.7
Hardware: Unspecified
OS: Unspecified
low
unspecified
Target Milestone: rc
: ---
Assignee: Ewan D. Milne
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-11-18 08:43 UTC by GV
Modified: 2023-02-17 20:32 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
dmesg (73.09 KB, text/plain)
2022-12-08 13:24 UTC, GV
no flags Details
journalctl (138.40 KB, text/plain)
2022-12-08 13:25 UTC, GV
no flags Details
dmesg (504 bytes, text/plain)
2022-12-08 13:25 UTC, GV
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-139892 0 None None None 2022-11-18 09:13:06 UTC

Description GV 2022-11-18 08:43:40 UTC
Starting with kernel 4.18.0-425.3.1.el8.x86_64 I get a lot of the following messages that spam my logs:

# dmesg | grep "Media removed, stopped polling"
[    2.382995] sd 0:0:0:0: [sda] Media removed, stopped polling
[    2.389493] sd 0:0:0:1: [sdb] Media removed, stopped polling
[    2.391864] sd 0:0:0:2: [sdc] Media removed, stopped polling
[    2.395737] sd 0:0:0:3: [sdd] Media removed, stopped polling
[    2.427609] sd 0:0:0:4: [sde] Media removed, stopped polling
[   54.732128] sd 0:0:0:4: [sde] Media removed, stopped polling
[  108.088275] sd 0:0:0:3: [sdd] Media removed, stopped polling
[ 1333.664108] sd 0:0:0:0: [sda] Media removed, stopped polling
[ 1758.637806] sd 0:0:0:0: [sda] Media removed, stopped polling
[ 1758.649352] sd 0:0:0:4: [sde] Media removed, stopped polling
[ 1918.772851] sd 0:0:0:0: [sda] Media removed, stopped polling
[ 2239.034554] sd 0:0:0:1: [sdb] Media removed, stopped polling
[ 2239.040563] sd 0:0:0:3: [sdd] Media removed, stopped polling
[ 3091.095781] sd 0:0:0:0: [sda] Media removed, stopped polling
[ 5433.569760] sd 0:0:0:0: [sda] Media removed, stopped polling
[ 6870.670406] sd 0:0:0:2: [sdc] Media removed, stopped polling
[ 8628.029277] sd 0:0:0:4: [sde] Media removed, stopped polling
[ 9055.047826] sd 0:0:0:4: [sde] Media removed, stopped polling
[ 9055.053926] sd 0:0:0:3: [sdd] Media removed, stopped polling
[ 9907.045308] sd 0:0:0:2: [sdc] Media removed, stopped polling
...

# dmesg | grep "Media removed, stopped polling" | wc -l
414

# uptime
 10:38:18 up 2 days, 13:54,  1 user,  load average: 0.06, 0.03, 0.00

It's really necessary to have to have a non-stop polling?

Thank you!

Comment 1 Ewan D. Milne 2022-12-07 19:44:52 UTC
There were changes made in kernel-4.18.0-396.el8 which included a log
message for a previously unreported condition, followed by a change to
reduce the amount of the resulting logging.  The patch that introduced
the additional logging was:

commit 848ade90ba9c1e2bc2f7869fbe5d08bb32f3db09
Author: Christian Loehle <cloehle>
Date:   Mon Aug 16 09:37:51 2021 +0000

    scsi: sd: Do not exit sd_spinup_disk() quietly
    
    The sd_spinup_disk() function logs what is happening. Unfortunately this
    output stops if the media was marked as removed in the meantime. Add a
    print for this case too.
    
    Link: https://lore.kernel.org/r/CWXP265MB26803209FD08A64222EEEA02C4FD9@CWXP265MB2680.GBRP265.PROD.OUTLOOK.COM
    Reviewed-by: Bart Van Assche <bvanassche>
    Signed-off-by: Christian Loehle <cloehle>
    Signed-off-by: Martin K. Petersen <martin.petersen>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ac431b0477da..920df3a04a7b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2136,8 +2136,10 @@ sd_spinup_disk(struct scsi_disk *sdkp)
                         * doesn't have any media in it, don't bother
                         * with any more polling.
                         */
-                       if (media_not_present(sdkp, &sshdr))
+                       if (media_not_present(sdkp, &sshdr)) {
+                               sd_printk(KERN_NOTICE, sdkp, "Media removed, stopped polling\n");
                                return;
+                       }
 
                        if (the_result)
                                sense_valid = scsi_sense_valid(&sshdr);

but this was changed to suppress the message if the media was already
marked as not present:

Author: Heiner Kallweit <hkallweit1>
Date:   Sat Sep 11 14:11:59 2021 +0200

    scsi: sd: Make sd_spinup_disk() less noisy
    
    sd_spinup_disk() is a little bit noisy after commit 848ade90ba9c ("scsi:
    sd: Do not exit sd_spinup_disk() quietly"):
    
    scsi 0:0:0:0: Direct-Access     Multiple Card  Reader     1.00 PQ: 0 ANSI: 0
    sd 0:0:0:0: Attached scsi generic sg0 type 0
    sd 0:0:0:0: [sda] Media removed, stopped polling
    sd 0:0:0:0: [sda] Media removed, stopped polling
    sd 0:0:0:0: [sda] Attached SCSI removable disk
    sd 0:0:0:0: [sda] Media removed, stopped polling
    
    There's not really a benefit in printing the same message multiple
    times. Therefore print it only if media_present was previously set.
    
    Link: https://lore.kernel.org/r/a2d0a249-6035-9697-626a-e14ec50ef6ee@gmail.com
    Reviewed-by: Bart Van Assche <bvanassche>
    Signed-off-by: Heiner Kallweit <hkallweit1>
    Signed-off-by: Martin K. Petersen <martin.petersen>

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a8039beb5a02..523bf2fdc253 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2124,6 +2124,8 @@ sd_spinup_disk(struct scsi_disk *sdkp)
                retries = 0;
 
                do {
+                       bool media_was_present = sdkp->media_present;
+
                        cmd[0] = TEST_UNIT_READY;
                        memset((void *) &cmd[1], 0, 9);
 
@@ -2138,7 +2140,8 @@ sd_spinup_disk(struct scsi_disk *sdkp)
                         * with any more polling.
                         */
                        if (media_not_present(sdkp, &sshdr)) {
-                               sd_printk(KERN_NOTICE, sdkp, "Media removed, stopped polling\n");
+                               if (media_was_present)
+                                       sd_printk(KERN_NOTICE, sdkp, "Media removed, stopped polling\n");
                                return;
                        }

Since you are seeing the messages recorr every few hundred seconds it would
appear as if your devices are being re-inserted in the intervening period,
unless there is something else marking them present again.

Could you please attach a log from your system, beginning from when the machine
was booted (e.g. dmesg, if it shows all the messages, or journalctl -b 0) to this
Bugzilla (using "Add an attachment" above in the BZ user interface so we can see
what kind of devices these are any what status changes are taking place?

Comment 2 GV 2022-12-08 13:24:51 UTC
Created attachment 1931054 [details]
dmesg

Comment 3 GV 2022-12-08 13:25:18 UTC
Created attachment 1931055 [details]
journalctl

Comment 4 GV 2022-12-08 13:25:33 UTC
Created attachment 1931056 [details]
dmesg

Comment 5 GV 2022-12-08 13:31:09 UTC
Hi!

(In reply to Ewan D. Milne from comment #1)
> Could you please attach a log from your system, beginning from when the
> machine
> was booted (e.g. dmesg, if it shows all the messages, or journalctl -b 0) to
> this
> Bugzilla (using "Add an attachment" above in the BZ user interface so we can
> see
> what kind of devices these are any what status changes are taking place?
dmesg, journalctl and lsscsi output attached.

Please, make this issue public. I _really_ do not understand why by default is to make-it private and why RH bugzilla keep the private mode even if I choose to be public when the issue was created.
Thank you!


Sincerely,
Gabriel

Comment 6 Ewan D. Milne 2022-12-08 21:30:53 UTC
Something is making your devices get alternately marked as
media not present vs. media present

Can you install the sg3_utils package and run the commands:

sg_readcap <device>
sg_readcap --16 <device>

where the <device> is each of sda, sdb, sdc, sdd

One possibility is that you are getting a NOT READY response to
a READ CAPACITY(16) command to your USB attached devices instead
of some other sense code (your USB bridge may be doing that).
The Linux kernel code will set the removable media as not present
in this case, but the periodic event checking would set it back
as being present if it responded to TEST UNIT READY.

There are other ways this could happen but let's try this first.

I don't seem to have permissions to unlock this BZ but I will look into
whether someone else here can do that.

Comment 7 GV 2022-12-09 15:21:39 UTC
Hi!

Thank you for spending time on this issue.

> Can you install the sg3_utils package and run the commands:
# for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do echo "[[sg_readcap $dev]]"; sg_readcap $dev; echo; echo "[[sg_readcap --16 $dev]]"; sg_readcap --16 $dev; echo; done
[[sg_readcap /dev/sda]]
READ CAPACITY (10) failed: Device not ready
sg_readcap failed: Device not ready

[[sg_readcap --16 /dev/sda]]
bad field in READ CAPACITY (16) cdb including unsupported service action
sg_readcap failed: Illegal request

[[sg_readcap /dev/sdb]]
READ CAPACITY (10) failed: Device not ready
sg_readcap failed: Device not ready

[[sg_readcap --16 /dev/sdb]]
bad field in READ CAPACITY (16) cdb including unsupported service action
sg_readcap failed: Illegal request

[[sg_readcap /dev/sdc]]
READ CAPACITY (10) failed: Device not ready
sg_readcap failed: Device not ready

[[sg_readcap --16 /dev/sdc]]
bad field in READ CAPACITY (16) cdb including unsupported service action
sg_readcap failed: Illegal request

[[sg_readcap /dev/sdd]]
READ CAPACITY (10) failed: Device not ready
sg_readcap failed: Device not ready

[[sg_readcap --16 /dev/sdd]]
bad field in READ CAPACITY (16) cdb including unsupported service action
sg_readcap failed: Illegal request


> I don't seem to have permissions to unlock this BZ but I will look into whether someone else here can do that.
Thank you!


Sincerely,
Gabriel

Comment 8 Ewan D. Milne 2022-12-13 16:00:41 UTC
I am still looking at how we might be able to handle differently in the
upstream kernel the way your hardware seems to be working.  I take it that
if you ran

  sg_tur <device>

it would respond ready even though other commands do not.  This seems to
be why the messages are being generated.

I think it is possible to adjust the polling interval, if this would help
in the time being.  There is a kernel parameter in the block layer:

block.disk_events_dfl_poll_msecs

that defaults to 2000 (i.e. 2 seconds), perhaps if you increase it you
would see less logging.  The tradeoff is that if you insert a device,
it would take longer to be found via the polling.

The individual devices also have a "poll_msecs" attribute that can be
set via sysfs.  The default value is -1 which means to use the global
default above, but it can be set separately.  For example, on my system:

[root@rhel-storage-10 ~]# find /sys -name "*poll_msecs*"
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/events_poll_msecs
/sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sdb/events_poll_msecs
/sys/devices/pci0000:00/0000:00:1f.2/ata5/host4/target4:0:0/4:0:0:0/block/sr0/events_poll_msecs
/sys/module/block/parameters/events_dfl_poll_msecs
[root@rhel-storage-10 ~]# for i in `find /sys -name "*poll_msecs*"`
> do
> echo $i
> cat $i
> done
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/events_poll_msecs
-1
/sys/devices/pci0000:00/0000:00:1f.2/ata2/host1/target1:0:0/1:0:0:0/block/sdb/events_poll_msecs
-1
/sys/devices/pci0000:00/0000:00:1f.2/ata5/host4/target4:0:0/4:0:0:0/block/sr0/events_poll_msecs
-1
/sys/module/block/parameters/events_dfl_poll_msecs
2000



Meanwhile I will continue to look into this but it may take a while as
we are entering our end-of-year shutdown.

I think I have set the BZ to be unlocked as you asked.

Comment 9 GV 2022-12-14 08:00:07 UTC
Hi!

(In reply to Ewan D. Milne from comment #8)
> I take it that if you ran
> 
>   sg_tur <device>
> 
> it would respond ready even though other commands do not.  This seems to
> be why the messages are being generated.
# for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do echo "[[sg_turs $dev]]"; sg_turs $dev; done
[[sg_turs /dev/sda]]
test unit ready: transport: Host_status=0x07 [DID_ERROR]
Driver_status=0x00 [DRIVER_OK]

[[sg_turs /dev/sdb]]
test unit ready: transport: Host_status=0x07 [DID_ERROR]
Driver_status=0x00 [DRIVER_OK]

[[sg_turs /dev/sdc]]
test unit ready: transport: Host_status=0x07 [DID_ERROR]
Driver_status=0x00 [DRIVER_OK]

[[sg_turs /dev/sdd]]
test unit ready: transport: Host_status=0x07 [DID_ERROR]
Driver_status=0x00 [DRIVER_OK]

> I think it is possible to adjust the polling interval, if this would help
> in the time being.
Yes, it would help a lot! Thank you.

Probably the fastest solution is to disconnect the device from MB. But it's a 4 hour drive to get to the location where PC is located. And are others issues then I get there... So, it's kind of inconvenient...

> Meanwhile I will continue to look into this but it may take a while as
> we are entering our end-of-year shutdown.
Take your time. Thank you.

> I think I have set the BZ to be unlocked as you asked.
Thank you.


Sincerely,
Gabriel

Comment 10 GV 2022-12-14 08:14:33 UTC
/var/log/messages when running sg_turs:

Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:0: [sda] tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:1: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:1: [sdb] tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:2: [sdc] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:2: [sdc] tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:3: [sdd] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 14 09:52:50 xxxxx kernel: sd 0:0:0:3: [sdd] tag#0 CDB: Test Unit Ready 00 00 00 00 00 00


Note You need to log in before you can comment on or make changes to this bug.