1818952 – [BISECTED] built-in laptop webcam no longer found on Sony Vaio on Fedora 31

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	31
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	low
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-03-30 19:04 UTC by William Bader
Modified:	2020-11-24 17:16 UTC
CC List:	20 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2020-11-24 17:16:30 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
kernel log from journalctl --no-hostname -k (80.07 KB, text/plain) 2020-03-30 19:04 UTC, William Bader
kernel 5.5.13 log with journalctl --no-hostname -k (78.73 KB, text/plain) 2020-03-31 00:15 UTC, William Bader
kernel 5.5.13 log with mitigation and no vbox with journalctl --no-hostname -k (79.31 KB, text/plain) 2020-03-31 07:17 UTC, William Bader
photo of boot from Fedora 31 live CD with the webcam working (80.71 KB, image/jpeg) 2020-04-03 06:43 UTC, William Bader
good and bad dmesg and lsusb output (25.82 KB, application/x-bzip) 2020-04-04 08:46 UTC, William Bader
git-log-oneline-v5.4.10-v5.4.11.txt (10.63 KB, text/plain) 2020-04-04 13:09 UTC, Steve
shell script to enable dyndbg for USB testing (commands are from Mathias Nyman) (569 bytes, application/x-shellscript) 2020-04-11 18:52 UTC, Steve
shell script to enable dyndbg for USB testing (commands are from Mathias Nyman) (624 bytes, application/x-shellscript) 2020-04-11 19:27 UTC, Steve
shell script to enable dyndbg for USB testing (based on commands from Mathias Nyman) (634 bytes, application/x-shellscript) 2020-04-12 03:32 UTC, Steve
journalctl logs (2.32 KB, application/x-bzip) 2020-04-12 05:53 UTC, William Bader
lsusb-v-5.4.10-200-good-Ricoh-only.txt ("lsusb -v" output for the Ricoh USB camera) (17.31 KB, text/plain) 2020-04-12 15:24 UTC, Steve
*grep -B15 -n 'Code:' dmesg-all.txt** (11.50 KB, text/plain) 2020-04-12 18:09 UTC, William Bader

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	207219	0	None	None	None	2020-04-13 05:24:09 UTC

Description William Bader 2020-03-30 19:04:52 UTC

Created attachment 1674785 [details]
kernel log from journalctl --no-hostname -k

1. Please describe the problem:

The kernel does not find the built-in webcam.
My laptop is a Sony Vaio, product VPCCB4Q1E, model PCG-71D14M.

2. What is the Version-Release number of the kernel:

5.5.11-200.fc31.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It worked fairly recently because I remember the webcam showing as an option in xsane. I don't normally use the webcam, but with the virus, I have had to use Zoom, and I noticed that Zoom doesn't find it. Other applications like cheese don't find it either. Going deeper, there is no /dev/video.

cheese gives the messages below:
** Message: 19:50:31.949: cheese-application.vala:214: Error during camera setup: No device found
(cheese:15772): cheese-CRITICAL **: 19:50:31.963: cheese_camera_device_get_name: assertion 'CHEESE_IS_CAMERA_DEVICE (device)' failed
(cheese:15772): GLib-CRITICAL **: 19:50:31.963: g_variant_new_string: assertion 'string != NULL' failed
(cheese:15772): GLib-CRITICAL **: 19:50:31.963: g_variant_ref_sink: assertion 'value != NULL' failed
(cheese:15772): GLib-GIO-CRITICAL **: 19:50:31.963: g_settings_schema_key_type_check: assertion 'value != NULL' failed
(cheese:15772): GLib-CRITICAL **: 19:50:31.963: g_variant_get_type_string: assertion 'value != NULL' failed
(cheese:15772): GLib-GIO-CRITICAL **: 19:50:31.963: g_settings_set_value: key 'camera' in 'org.gnome.Cheese' expects type 's', but a GVariant of type '(null)' was given
(cheese:15772): GLib-CRITICAL **: 19:50:31.963: g_variant_unref: assertion 'value != NULL' failed
** (cheese:15772): CRITICAL **: 19:50:31.963: cheese_preferences_dialog_setup_resolutions_for_device: assertion 'device != NULL' failed


4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

The kernel never finds the webcam. It happens every time the system boots.

I think that these are the relevant lines from /var/log/messages:

Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: new high-speed USB device number 4 using ehci-pci
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: config 247 has too many interfaces: 120, using maximum allowed: 32
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: config 247 descriptor has 1 excess byte, ignoring
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: config 247 has 0 interfaces, different from the descriptor's value: 120
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: New USB device found, idVendor=05ca, idProduct=18c0, bcdDevice= 7.32
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: Product: USB2.0 Camera
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: Manufacturer: Ricoh Company Ltd.
Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: can't set config #247, error -32

lsusb shows
Bus 002 Device 003: ID 093a:2510 Pixart Imaging, Inc. Optical Mouse
Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 004: ID 05ca:18c0 Ricoh Co., Ltd 
Bus 001 Device 005: ID 0489:e036 Foxconn / Hon Hai 
Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
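
A minimal sketch for checking whether a given USB ID appears in lsusb output, so the check can be repeated after each kernel change. The helper name usb_id_present is hypothetical (it is not part of usbutils); the ID 05ca:18c0 is the Ricoh camera from the log lines above.

```shell
# Report whether a USB vendor:product ID appears in lsusb-style text on stdin.
# usb_id_present is a hypothetical helper name, not an existing tool.
usb_id_present() {
    # lsusb prints lines of the form "Bus BBB Device DDD: ID vvvv:pppp Description".
    if grep -i "ID $1" >/dev/null; then
        echo "present"
    else
        echo "missing"
    fi
}

# On a live system (not run here):
#   lsusb | usb_id_present 05ca:18c0
#   ls /dev/video* 2>/dev/null || echo "no /dev/video* nodes"
```

A device can be listed by lsusb and still have no /dev/video* node, as in this report, so both checks are worth running.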


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

I haven't tried, but I can try if someone requests.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No, not as far as I know.

Did Fedora 31 change something so I now need to load firmware like http://vaio-utils.org/camera/ ?

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

ok

Comment 1 Steve 2020-03-30 20:22:14 UTC

Thanks for your report. There is a newer kernel in updates-testing:

kernel-5.5.13-200.fc31
https://bodhi.fedoraproject.org/updates/FEDORA-2020-809ff0b166

# dnf update kernel --enablerepo=updates-testing

There is also a new F32 kernel here:

kernel-5.6.0-300.fc32 and kernel-headers-5.6.0-300.fc32
https://bodhi.fedoraproject.org/updates/FEDORA-2020-e8b6474ee5

The full list is here:
https://bodhi.fedoraproject.org/updates/?packages=kernel

Comment 2 William Bader 2020-03-30 21:02:49 UTC

Updating from updates-testing would take the kernel from 5.5.11 to 5.5.13. The changelog for 5.5.12 doesn't show anything about cameras, and the changelog for 5.5.13 is very short. Is there any specific change in either one that you suspect would help?
Is there an easy way to test the 5.6 Fedora 32 kernel without updating to Fedora 32 beta?
The webcam is not critical, and I don't want to risk breaking Linux on my laptop.
Will the 5.6 kernel come to Fedora 31 soon?
Regards, William

Comment 3 Steve 2020-03-30 21:35:49 UTC

(In reply to William Bader from comment #2)
> Updating from updates-testing would take the kernel from 5.5.11 to 5.5.13.
> The changelog for 5.5.12 doesn't show anything about cameras, and the changelog for 5.5.13 is very short.
> Is there any specific change in either one that you suspect would help?

Testing whether a newer kernel fixes a problem is fairly standard operating procedure, but if you prefer to wait until 5.5.13 reaches "stable" that is fine.

> Is there an easy way to test the 5.6 Fedora 32 kernel without updating to Fedora 32 beta?

Yes, very easy, but I would suggest trying 5.5.13 first.

> The webcam is not critical, and I don't want to risk breaking Linux on my laptop.

The "karma" reports on Bodhi for 5.5.13 are all positive:
https://bodhi.fedoraproject.org/updates/FEDORA-2020-809ff0b166

And if there is a problem, you can still boot 5.5.11 from the grub2 menu. You have "rhgb quiet" on your kernel command-line, so you may need to press and hold, or repeatedly tap, the "Esc" key to see the grub2 menu. Please post back if you have trouble getting to the grub2 menu. (You can try that without installing a new kernel.)

> Will the 5.6 kernel come to Fedora 31 soon?

I don't know. The best that I can suggest is to monitor the kernel builds on Koji:
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

> Regards, William

Comment 4 William Bader 2020-03-31 00:15:47 UTC

Created attachment 1674891 [details]
kernel 5.5.13 log with journalctl --no-hostname -k

Updating to the 5.5.13-200.fc31.x86_64 kernel from updates-testing didn't help. It still has messages like 'usb 1-1.3: device descriptor read/64, error -32', which I think is the built-in webcam. I attached a new boot log.

I updated the kernel with the command 'dnf update kernel --enablerepo=updates-testing'
Do I have to do anything to eventually get back to the stable kernel?
Will dnfdragora switch to the stable kernel once the stable kernel has a higher version than 5.5.13?

I get a list of kernels for a few seconds when I reboot, so I don't have a pressing need to remove the updates-testing kernel.

Regards, William

Comment 5 Steve 2020-03-31 01:18:39 UTC

(In reply to William Bader from comment #4)
> Created attachment 1674891 [details]
> kernel 5.5.13 log with journalctl --no-hostname -k
> 
> Updating to the 5.5.13-200.fc31.x86_64 kernel from updates-testing didn't
> help. It still has messages like 'usb 1-1.3: device descriptor read/64,
> error -32', which I think is the built-in webcam. I attached a new boot log.

Thanks for testing and for attaching the log.

> I updated the kernel with the command 'dnf update kernel --enablerepo=updates-testing'
> Do I have to do anything to eventually get back to the stable kernel?
> Will dnfdragora switch to the stable kernel once the stable kernel has a higher version than 5.5.13?

Actually, 5.5.13 will become the stable kernel, unless there are serious problems with it, which there don't appear to be.

If you would prefer to go back to using 5.5.11, you can keep choosing it in the grub2 menu, or you can remove 5.5.13 and wait for the update.

To remove 5.5.13, you can use a very nice feature of dnf -- the "history" command:

# dnf history info last # This will show you the last thing done by dnf.

# dnf history undo last # This will undo the last thing done by dnf.

Just as with updates, dnf will ASK before doing anything, so you can always answer "N".

NB: Boot from 5.5.11 before undoing, because dnf won't let you remove the currently running kernel.

Documentation: "man dnf".

> I get a list of kernels for a few seconds when I reboot, so I don't have a pressing need to remove the updates-testing kernel.

Good. The default timer is for 5 seconds, but if you tap any key, the timer will stop counting down, so you can look at the grub2 menu for as long as you like. :-)

> Regards, William

Comment 6 Steve 2020-03-31 01:26:04 UTC

You can also use a transaction ID to undo a particular dnf transaction:

# dnf history | head

# dnf history undo NNN # Where NNN is a transaction ID from the above listing.

Comment 7 Steve 2020-03-31 01:37:08 UTC

(In reply to Steve from comment #3)
...
> > Is there an easy way to test the 5.6 Fedora 32 kernel without updating to Fedora 32 beta?
> 
> Yes, very easy, but I would suggest trying 5.5.13 first.
...

Actually, you could test F32-Beta as a Live image by downloading the ISO file and installing it on a USB flash drive or a DVD.

This has an F32-Beta download link and instructions for putting the Live image on a bootable device:
https://getfedora.org/en/workstation/download/

The F32-Beta Live ISO image file name is:

Fedora-Workstation-Live-x86_64-32_Beta-1.2.iso.

The kernel is:

$ uname -r
5.6.0-0.rc5.git0.2.fc32.x86_64

Comment 8 Steve 2020-03-31 02:57:06 UTC

Well, 5.5.13 seems to be even worse than 5.5.11. The "Manufacturer:" isn't even reported.

If you still have 5.5.13 installed, could you post the output from "lsusb" for comparison with Comment 0?

$ egrep -n 'Command line:|usb 1-1.3' dmesg-1.txt 
4:Mar 30 15:00:58 kernel: Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.11-200.fc31.x86_64 root=UUID=01ea3428-c96d-4f4c-af30-2072ce724031 ro rhgb quiet elevator=noop LANG=en_US.UTF-8 mitigations=off
749:Mar 30 15:00:58 kernel: usb 1-1.3: new high-speed USB device number 4 using ehci-pci
750:Mar 30 15:00:58 kernel: usb 1-1.3: config 247 has too many interfaces: 120, using maximum allowed: 32
751:Mar 30 15:00:58 kernel: usb 1-1.3: config 247 descriptor has 1 excess byte, ignoring
752:Mar 30 15:00:58 kernel: usb 1-1.3: config 247 has 0 interfaces, different from the descriptor's value: 120
753:Mar 30 15:00:58 kernel: usb 1-1.3: New USB device found, idVendor=05ca, idProduct=18c0, bcdDevice= 7.32
754:Mar 30 15:00:58 kernel: usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
755:Mar 30 15:00:58 kernel: usb 1-1.3: Product: USB2.0 Camera
756:Mar 30 15:00:58 kernel: usb 1-1.3: Manufacturer: Ricoh Company Ltd.
757:Mar 30 15:00:58 kernel: usb 1-1.3: can't set config #247, error -32

$ egrep -n 'Command line:|usb 1-1.3' dmesg-2.txt 
4:Mar 31 00:54:18 kernel: Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.13-200.fc31.x86_64 root=UUID=01ea3428-c96d-4f4c-af30-2072ce724031 ro rhgb quiet elevator=noop LANG=en_US.UTF-8 mitigations=off
750:Mar 31 00:54:18 kernel: usb 1-1.3: new full-speed USB device number 4 using ehci-pci
751:Mar 31 00:54:18 kernel: usb 1-1.3: device descriptor read/64, error -32
755:Mar 31 00:54:18 kernel: usb 1-1.3: device descriptor read/64, error -32
761:Mar 31 00:54:18 kernel: usb 1-1.3: new full-speed USB device number 5 using ehci-pci
763:Mar 31 00:54:18 kernel: usb 1-1.3: device descriptor read/64, error -32
764:Mar 31 00:54:18 kernel: usb 1-1.3: device descriptor read/64, error -32
777:Mar 31 00:54:19 kernel: usb 1-1.3: new full-speed USB device number 6 using ehci-pci
818:Mar 31 00:54:19 kernel: usb 1-1.3: device not accepting address 6, error -32
823:Mar 31 00:54:19 kernel: usb 1-1.3: new full-speed USB device number 7 using ehci-pci
824:Mar 31 00:54:20 kernel: usb 1-1.3: device not accepting address 7, error -32
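
The two logs above differ in the negotiated speed for port 1-1.3: "high-speed" under 5.5.11 and "full-speed" under 5.5.13. A minimal sketch for pulling that out of a saved log follows. The helper name port_speeds is hypothetical, and the sketch assumes the "new <speed> USB device" wording seen in these logs.

```shell
# List the distinct negotiated speeds logged for one USB port.
# port_speeds is a hypothetical helper; it reads kernel log text on stdin.
port_speeds() {
    # Matches lines such as:
    #   usb 1-1.3: new high-speed USB device number 4 using ehci-pci
    grep -F "usb $1: new " \
        | sed -E 's/.*: new ([a-z-]+) USB device.*/\1/' \
        | sort -u
}

# Usage with the attached logs (file names as used above):
#   port_speeds 1-1.3 < dmesg-1.txt
#   port_speeds 1-1.3 < dmesg-2.txt
```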

Comment 9 Steve 2020-03-31 03:18:28 UTC

The kernel command-line has two non-standard options:

elevator=noop     # What is this for?
mitigations=off   # Disable all optional CPU mitigations.

Could you try removing them from the kernel command-line in grub2 and booting without them? (Press "e" while the grub2 menu is displayed to edit the kernel command-line. The change is not permanent.)

Also, let's make sure the kernel isn't tainted:

$ cat /proc/sys/kernel/tainted

("0" means untainted.)

Documentation:

The kernel’s command-line parameters
https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

Tainted kernels
https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
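
A non-zero taint value is a bit field. The sketch below decodes three of the bits listed in the "Tainted kernels" document (bit 0 = P, bit 12 = O, bit 13 = E). The helper name decode_taint is hypothetical, and the remaining bits are deliberately left out.

```shell
# Decode a few bits of the value read from /proc/sys/kernel/tainted.
# decode_taint is a hypothetical helper; only three documented bits are covered.
decode_taint() {
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "untainted"
        return 0
    fi
    [ $(( ${value} & 1 )) -ne 0 ] && echo "P: proprietary module loaded"
    [ $(( ${value} & 4096 )) -ne 0 ] && echo "O: out-of-tree module loaded"
    [ $(( ${value} & 8192 )) -ne 0 ] && echo "E: unsigned module loaded"
    return 0
}

# Usage on a live system (not run here):
#   decode_taint "$(cat /proc/sys/kernel/tainted)"
```

For example, an out-of-tree module that also fails signature verification would set both O and E, giving 4096 + 8192 = 12288.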

Comment 10 Steve 2020-03-31 04:20:02 UTC

(In reply to Steve from comment #9)
...
> Also, let's make sure the kernel isn't tainted:
> 
> $ cat /proc/sys/kernel/tainted
> 
> ("0" means untainted.)
...

Could you disable or remove the vboxdrv kernel module for all future testing:

$ fgrep -in taint dmesg*
dmesg-1.txt:864:Mar 30 15:01:02 kernel: vboxdrv: loading out-of-tree module taints kernel.
dmesg-1.txt:865:Mar 30 15:01:02 kernel: vboxdrv: module verification failed: signature and/or required key missing - tainting kernel

Comment 11 William Bader 2020-03-31 06:31:41 UTC

Thanks for the reply. I still have the 5.5.13 kernel booted.
$ lsusb
Bus 002 Device 003: ID 093a:2510 Pixart Imaging, Inc. Optical Mouse
Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 003: ID 0489:e036 Foxconn / Hon Hai 
Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

That is worse than before. It doesn't even identify the webcam.

$ cat /proc/sys/kernel/tainted
0

The new kernel didn't load the vboxdrv module. Maybe the 5.5.11 stable module isn't compatible with 5.5.13, but shouldn't dnf have warned about that?

Years ago when I got this laptop, I found a shop that let me boot a Fedora live CD to check that it would work. I have had other laptops that needed blobs with proprietary drivers for the video and the wifi.

elevator=noop     # What is this for?

It is for a simple FIFO I/O scheduler. When I got the laptop, I replaced the Windows drive with an SSD, and at that time, the default Linux I/O scheduler did a lot of work to optimize for hard disks that was unnecessary on an SSD. It made a difference then, but it is probably not needed now.

mitigations=off   # Disable all optional CPU mitigations.

It is a little risky, but my laptop lives on a home LAN. I have disabled all servers listening on outside ports. I have the firewall set to block every port that I don't need.
The laptop has an i5-2450M CPU, which has all of the flaws, and enabling mitigations makes it run a few percent slower and slightly hotter.

I made a file /etc/modprobe.d/blacklist-vboxdrv.conf with the line "blacklist vboxdrv".
I used VirtualBox for a project a few years ago, and I have no immediate need for it.

>Actually, 5.5.13 will become the stable kernel, unless there are serious problems with it, which there don't appear to be.

So I can just leave it installed, and the stable will eventually catch up.

>Actually, you could test F32-Beta as a Live image by downloading the ISO file and installing it on a USB flash drive or a DVD.

I can't because I am in an area with coronavirus lockdown, and I don't have a blank DVD and I don't have a flash drive large enough. That is why I want to be very cautious about what I do.

I'll reboot and see what happens.

Comment 12 William Bader 2020-03-31 07:17:40 UTC

Created attachment 1674987 [details]
kernel 5.5.13 log with mitigation and no vbox with journalctl --no-hostname -k

I rebooted 5.5.13 with the vbox module blacklisted and with the boot command edited to remove the options that you mentioned. It still didn't find the webcam.

Comment 13 Steve 2020-03-31 11:37:18 UTC

(In reply to William Bader from comment #12)
> Created attachment 1674987 [details]
> kernel 5.5.13 log with mitigation and no vbox with journalctl --no-hostname
> -k
> 
> I rebooted 5.5.13 with the vbox module blacklisted and with the boot command
> edited to remove the options that you mentioned. It still didn't find the
> webcam.

Thanks for testing that configuration, and for the "lsusb" output.

Let's try an older kernel. This is the F31 release kernel:

$ dnf -q repoquery kernel --repo=fedora
kernel-0:5.3.7-301.fc31.x86_64

If that is not listed in the grub2 menu, you can boot the "rescue" kernel. It's the one with the long number in the name. You can check the kernel version with:

$ file /boot/vmlinuz-0-rescue-*

Please do not post the full name with the long number. It is your machine id and is considered confidential: "man machine-id", "/etc/machine-id".
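
One way to avoid posting the machine id by accident is to filter command output before pasting it. This is a minimal sketch, assuming the id is the 32-character lowercase hexadecimal string described in "man machine-id". The helper name redact_machine_id and the "<machine-id>" placeholder are hypothetical.

```shell
# Replace any 32-character lowercase hex string on stdin with a placeholder.
# redact_machine_id is a hypothetical helper name.
redact_machine_id() {
    sed -E 's/[0-9a-f]{32}/<machine-id>/g'
}

# Usage (not run here):
#   file /boot/vmlinuz-0-rescue-* | redact_machine_id
```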

Comment 14 Steve 2020-03-31 14:09:59 UTC

(In reply to William Bader from comment #11)
...
> Years ago when I got this laptop, I found a shop that let me boot a Fedora live CD to check that it would work.
...

Do you have any older Live CDs or DVDs that you could try?

Also, there is no Zoom app in Fedora. Where is that from?

The idea would be to install Zoom into a Live image environment, assuming the Live image gives you a working camera.

Although it would not survive a reboot, if the Zoom app is compatible with the Live image, using a Live image might serve as a workaround.

Comment 15 William Bader 2020-04-03 06:43:38 UTC

Created attachment 1675918 [details]
photo of boot from Fedora 31 live CD with the webcam working

The 5.3.7-301.fc31.x86_64 kernel on the Fedora 31 live CD works.
It creates /dev/video0 and applications like 'cheese' work.

So the problem is a kernel regression somewhere between 5.3.7 and 5.5.11.
I have attached a photo as proof, and the photo has the kernel version numbers.

I am running zoom by browsing to the meeting URL on chrome. It then uses xdg to run an executable, I think from the package zoom-3.5.372466.0322-1.x86_64. The underlying problem, though, is a kernel regression that fails to create /dev/video. Theoretically I could install zoom on the live CD, but my laptop has only 8GB RAM, there isn't much space left on the live filesystem, and I would rather not mount my system disk from the live CD.

I can use my phone for a zoom meeting work-around, but audio-only on my laptop is fine.

Is there a way to install kernel binaries to try to bisect the version with the webcam regression?

Is it OK to install only the kernel package, or do the corresponding kernel-core, kernel-headers, and kernel-modules also have to be installed?

I am on virus lockdown, and I don't have a blank DVD or a spare pen drive, or an easy way to get one. I've always tested the live system on a DVD because I have never been successful getting my laptop to boot from a pen drive. While I'm thinking about it, shutting down the live CD is a pain. Once my laptop powers down, it doesn't listen to the eject button, and if I reboot, it starts reading the CD before checking the eject button. The live CD should have a short boot delay to make time to eject the CD and boot from the system disk. I end up either ejecting the CD while the live system is running and using the power button to turn off the laptop or else shutting down and then poking the drive with a paperclip.

Regards, William

Comment 16 Steve 2020-04-03 10:06:34 UTC

(In reply to William Bader from comment #15)
> Created attachment 1675918 [details]
> photo of boot from Fedora 31 live CD with the webcam working
> 
> The 5.3.7-301.fc31.x86_64 kernel on the Fedora 31 live CD works.
> It creates /dev/video0 and applications like 'cheese' work.
> 
> So the problem is a kernel regression somewhere between 5.3.7 and 5.5.11.
> I have attached a photo as proof, and the photo has the kernel version numbers.

Thanks for testing with the Live CD and for the screenshot.

> I am running zoom by browsing to the meeting URL on chrome. It then uses xdg
> to run an executable, I think from the package zoom-3.5.372466.0322-1.x86_64
> but the underlying problem is a kernel regression that fails to create
> /dev/video. Theoretically I could install zoom on the live CD, but my laptop
> has only 8GB RAM, and there isn't much space left on the live filesystem,
> and I would rather not mount my system disk from the live CD.

You could install directly into the Live image without mounting your system disk.

However, "zoom" doesn't seem to be a Fedora package. Do you have some non-Fedora repos configured?

$ dnf repolist

> I can use my phone for a zoom meeting work-around, but audio-only on my laptop is fine.
> 
> Is there a way to install kernel binaries to try to bisect the version with the webcam regression?

Yes. I will post a followup comment on that subject.

> Is it OK to install only the kernel package, or do the corresponding
> kernel-core, kernel-headers, and kernel-modules also have to be installed?
> 
> I am on virus lockdown, and I don't have a blank DVD or a spare pen drive,
> or an easy way to get one. I've always tested the live system on a DVD
> because I have never been successful getting my laptop to boot from a pen
> drive. While I'm thinking about it, shutting down the live CD is a pain.
> Once my laptop powers down, it doesn't listen to the eject button, and if I
> reboot, it starts reading the CD before checking the eject button. The live
> CD should have a short boot delay to make time to eject the CD and boot from
> the system disk. I end up either ejecting the CD while the live system is
> running and using the power button to turn off the laptop or else shutting
> down and then poking the drive with a paperclip.

There is usually a special key that you can press to get a list of boot devices.

> Regards, William

Comment 17 Steve 2020-04-03 10:27:29 UTC

(In reply to William Bader from comment #15)
...
> Is there a way to install kernel binaries to try to bisect the version with the webcam regression?
> 
> Is it OK to install only the kernel package, or do the corresponding kernel-core, kernel-headers, and kernel-modules also have to be installed?
...

All of the kernel builds are here:
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

You probably only need kernel-core and kernel-modules. You might need kernel-modules-extra (it's hard to say without testing). The "kernel" package is a meta-package that pulls in other packages. It doesn't have any files in it and you don't need to install it for testing purposes:

$ rpm -ql kernel-5.5.13-200.fc31.x86_64
(contains no files)

However, dnf will try to remove old kernels unless you make this change:

$ grep installonly_limit /etc/dnf/dnf.conf
#installonly_limit=3
installonly_limit=0

Download into an empty directory and install with:

# dnf install kernel*.rpm

Documentation: "man dnf.conf"
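
A minimal sketch for confirming which installonly_limit value is actually active, since a commented-out line is easy to mistake for the live setting. The helper name active_installonly_limit is hypothetical; the file layout is assumed to match the grep output above.

```shell
# Print the active (uncommented) installonly_limit value from a dnf.conf-style file.
# active_installonly_limit is a hypothetical helper name.
active_installonly_limit() {
    # Lines beginning with '#' do not match because the pattern is anchored
    # at the start of the line. If the key appears twice, the last one is printed.
    sed -n -E 's/^installonly_limit[[:space:]]*=[[:space:]]*([0-9]+).*/\1/p' "$1" \
        | tail -n 1
}

# Usage (not run here):
#   active_installonly_limit /etc/dnf/dnf.conf
```

No output means the key is not set in the file, in which case dnf falls back to its default.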

Comment 18 Steve 2020-04-03 10:51:49 UTC

(In reply to Steve from comment #16)
...
> There is usually a special key that you can press to get a list of boot devices.
...

Your manual would probably say which key to press. This Sony document says to repeatedly press the F11 key while booting (in the section titled "To recover from Recovery Media"):

https://www.sony.com/electronics/support/res/manuals/Z019/Z019650111.PDF

Comment 19 Steve 2020-04-03 14:12:38 UTC

(In reply to William Bader from comment #15)
...
> I am running zoom by browsing to the meeting URL on chrome. It then uses xdg
> to run an executable, I think from the package zoom-3.5.372466.0322-1.x86_64
> but the underlying problem is a kernel regression that fails to create
> /dev/video. Theoretically I could install zoom on the live CD, but my laptop
> has only 8GB RAM, and there isn't much space left on the live filesystem,
> and I would rather not mount my system disk from the live CD.
...

If "zoom" is a standalone package, you could put it on a USB flash drive and mount that in the Live environment.

Comment 20 William Bader 2020-04-03 20:30:57 UTC

Steve, thanks for the replies.

If I can find the kernel with the regression, is there any chance that someone would try to look into what happened?

A long time ago I wrote MSDOS programs that used C and masm to write to video ram. If I can narrow the regression to a few commits, I might get lucky and see what caused the problem.

The messages starting 'kernel: usb 1-1.3: config 247' come from https://github.com/torvalds/linux/blob/v5.4/drivers/usb/core/config.c#L582 or from the later https://github.com/torvalds/linux/blame/v5.5/drivers/usb/core/config.c#L631

Could the webcam hardware be doing something bad? https://github.com/torvalds/linux/commit/3dd550a2d36596a1b0ee7955da3b611c031d3873 (but this is for 5.3.0 and 5.3.7 works)
https://github.com/torvalds/linux/blame/v5.5/drivers/usb/core/config.c#L645 (but this was in since 2.6.12. Is it worth booting again with the live CD to see if it gets the 'config 247 descriptor has 1 excess byte, ignoring'?)

Do you think that any of the issues below could be the same as mine:
https://bugzilla.kernel.org/show_bug.cgi?id=111291 (uvcvideo 1-1.4:1.0: Entity type for entity Extension 4 was not initialized! / modified 2020-01-19)
https://bugzilla.kernel.org/show_bug.cgi?id=199715 (hp_accel: probe of HPQ6007:00 failed with error -22 (HP Envy x360) / modified 2020-04-01)
https://bugzilla.kernel.org/show_bug.cgi?id=205271 (Internal Webcams of Samsung Galaxy Tab (W728N) not working / reported 2019-10-20)
https://bugzilla.kernel.org/show_bug.cgi?id=206357 (Linux Kernel 5.4.7 - vgacon_invert_region use-after-free / reported 2020-01-30)

If I want to try building a kernel from source, is the best way https://fedoraproject.org/wiki/Building_a_custom_kernel#Building_a_Kernel_from_the_Fedora_source_tree ?

Regards, William

Comment 21 Steve 2020-04-03 21:16:30 UTC

(In reply to William Bader from comment #20)

Since you already know about git, and have a reliable reproducer, I would suggest doing a kernel bisection. That involves repeatedly building kernels with various git commits included or excluded:

Bisecting a bug
https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

Artem did that and identified the commit that caused an early boot hang:
Bug 1790115 - [CRITICAL REGRESSION] Fedora's configuration of kernel >= 5.4 is not bootable 

He also opened a bug upstream:
Bug 206175 - Fedora >= 5.4 kernels instantly freeze on boot without producing any display output 
https://bugzilla.kernel.org/show_bug.cgi?id=206175

But before doing any of that, I would suggest trying newer kernels from Bodhi:
https://bodhi.fedoraproject.org/updates/?packages=kernel

This one is in the pipeline:

kernel-5.5.15-200.fc31, kernel-headers-5.5.15-200.fc31, & 1 more 
https://bodhi.fedoraproject.org/updates/FEDORA-2020-666f3b1ac3

I would also suggest doing a "lite" bisection by trying various Fedora builds, including snapshot builds, from Koji:
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

Snapshot builds have "gitN" in the version, where "N" is a number starting with "0".
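
The "lite" bisection can be sketched as two small helpers: one that recognizes snapshot builds by the "gitN" marker, and one that picks the middle entry of an ordered list of untested builds as the next one to try. Both helper names are hypothetical, and the list is assumed to hold one build name per line.

```shell
# Classify a build name as a snapshot (contains ".gitN") or a regular build.
# is_snapshot_build is a hypothetical helper name.
is_snapshot_build() {
    case "$1" in
        *.git[0-9]*) echo "snapshot" ;;
        *)           echo "regular" ;;
    esac
}

# Print the middle line of an ordered list of builds read from stdin.
# next_build_to_test is a hypothetical helper name.
next_build_to_test() {
    awk '{ line[NR] = $0 } END { if (NR > 0) print line[int((NR + 1) / 2)] }'
}

# Usage: keep the builds between the last good and first bad kernel in a file,
# test the middle one, then drop the half that is ruled out and repeat:
#   next_build_to_test < untested-builds.txt
```

Each round roughly halves the list, so a few dozen builds need only five or six test boots.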

Comment 22 Steve 2020-04-03 22:02:00 UTC

(In reply to William Bader from comment #20)
...
> Is it worth booting again with the live CD to see if it gets the 'config 247 descriptor has 1 excess byte, ignoring'?
...

Attaching a log from 5.3.7 is a good idea. If you don't want to mount your system disk from the Live session, you could save the log to a USB flash drive instead.

The USB flash drive will probably be automounted. You can find the mountpoint with:

$ lsblk -f

$ cd /run/media/liveuser/XXX-YYY

$ ls

$ journalctl --no-hostname -k > dmesg-5.3.7.txt

$ cd

Unmount the USB flash drive with the "Files" app.

BTW, this is also a useful command:

$ findmnt /dev/sda1
TARGET                      SOURCE    FSTYPE OPTIONS
/run/media/liveuser/XXX-YYY /dev/sda1 vfat   rw,nosuid,nodev,relatime,uid=1000,gid=1000,fmask=0022,dmask=0022,codepage=437,ioc

NB: "XXX-YYY" is the obfuscated vfat file system label on my USB flash drive.

Tested in a VM with Fedora-Workstation-Live-x86_64-31-1.9.iso.

Comment 23 Steve 2020-04-04 02:03:20 UTC

(In reply to Steve from comment #22)
... 
> Attaching a log from 5.3.7 is a good idea. If you don't want to mount your
> system disk from the Live session, you could save the log to a USB flash
> instead.
...
> $ journalctl --no-hostname -k > dmesg-5.3.7.txt
...

While you are in the Live session, it might be a good idea to do a full USB dump too:

# lsusb -v > lsusb-v-5.3.7.txt # That's run as root.

Comment 24 Steve 2020-04-04 05:26:04 UTC

(In reply to Steve from comment #17)
...
> All of the kernel builds are here:
> https://koji.fedoraproject.org/koji/packageinfo?packageID=8
...

The "koji" command can be used to list kernel builds and to download specific packages:

# dnf install koji

The "--after" date in the following koji command is the build date for 5.3.7 taken from "uname -a" when booted from the F31 Live image:

$ uname -a
Linux localhost-live 5.3.7-301.fc31.x86_64 #1 SMP Mon Oct 21 19:18:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Get a list of kernel builds for "fc31".

$ koji list-builds --package=kernel --state=COMPLETE --after=2019-10-21 --reverse --quiet | fgrep '.fc31'
kernel-5.5.9-200.fc31                                    jforbes           COMPLETE
kernel-5.5.8-200.fc31                                    jforbes           COMPLETE
...
kernel-5.3.11-300.fc31                                   jforbes           COMPLETE
kernel-5.3.10-300.fc31                                   labbott           COMPLETE

Specific kernel packages can then be downloaded as desired. I would suggest starting with the earliest 5.4 build, which appears to be 5.4.2:

$ koji download-build --rpm kernel-core-5.4.2-300.fc31.x86_64.rpm
Downloading: kernel-core-5.4.2-300.fc31.x86_64.rpm
[====================================] 100%  31.12 MiB

$ koji download-build --rpm kernel-modules-5.4.2-300.fc31.x86_64.rpm
Downloading: kernel-modules-5.4.2-300.fc31.x86_64.rpm
[====================================] 100%  28.41 MiB
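
When several builds have to be tried, the two download commands can be generated from the version string. This is only a sketch: koji_download_cmds is a hypothetical helper name, the x86_64 default is an assumption, and the function prints the commands instead of running koji.

```shell
# Print (do not run) the koji commands that fetch the two packages
# downloaded above for a given Fedora kernel build.
# $1 = version-release (e.g. 5.4.2-300.fc31), $2 = arch (assumed default: x86_64)
koji_download_cmds() {
    vr=$1
    arch=${2:-x86_64}
    for pkg in kernel-core kernel-modules; do
        printf 'koji download-build --rpm %s-%s.%s.rpm\n' "$pkg" "$vr" "$arch"
    done
}

koji_download_cmds 5.4.2-300.fc31
```

The output can be reviewed and then pasted into a terminal, or piped to "sh" once it looks right.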

Comment 25 Steve 2020-04-04 05:47:36 UTC

NB: The packages are unsigned. However, they should pass this check:

$ rpmkeys --checksig kernel*.rpm
kernel-core-5.4.2-300.fc31.x86_64.rpm: digests OK
kernel-modules-5.4.2-300.fc31.x86_64.rpm: digests OK

Comment 26 Steve 2020-04-04 06:47:03 UTC

This will show you some snapshot (gitN) builds:

$ koji list-builds --package=kernel --state=COMPLETE --after=2019-09-01 --reverse --quiet | fgrep '.fc32'
                                                             ^^^^^^^^^^                               ^^

I moved the date back to get this build:

kernel-5.4.0-0.rc0.git1.1.fc32                           jcline            COMPLETE

Which is here:

kernel-5.4.0-0.rc0.git1.1.fc32
https://koji.fedoraproject.org/koji/buildinfo?buildID=1379422

Note, in particular, what the changelog says:

* Tue Sep 17 2019 Jeremy Cline <jcline> - 5.4.0-0.rc0.git1.1
- Linux v5.3-2061-gad062195731b
        ^^^^^^^^^^^^^^^^^^^^^^^

That is a commit ID that can be passed into a git command:

$ git show --oneline v5.3-2061-gad062195731b
ad06219573 Merge tag 'platform-drivers-x86-v5.4-1' of git://git.infradead.org/linux-platform-drivers-x86

The "git describe" command generates those strings:

$ git describe --abbrev=12 ad06219573
v5.3-2069-gad062195731b

The commit IDs in the changelog are very important because they are what you would use to start a kernel bisection.
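
For illustration, the three parts of such a string can be separated with shell parameter expansion. describe_parts is a made-up helper name. Only the part after "-g" is the commit ID, and git accepts the whole string as it is, so this is just to show how the string is put together.

```shell
# Split a "git describe" string such as v5.3-2061-gad062195731b into
# its tag, the number of commits since that tag, and the abbreviated commit ID.
describe_parts() {
    desc=$1
    commit=${desc##*-g}    # text after the last "-g"
    rest=${desc%-g*}       # the string without "-g<commit>"
    count=${rest##*-}      # text after the last remaining "-"
    tag=${rest%-*}         # everything before the count
    printf 'tag=%s count=%s commit=%s\n' "$tag" "$count" "$commit"
}

describe_parts v5.3-2061-gad062195731b
```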

Comment 27 William Bader 2020-04-04 08:46:51 UTC

Created attachment 1676176 [details]
good and bad dmesg and lsusb output

I used koji to try a few kernels back to 5.3.7-300, and none of them found the webcam.
Then I tried the live cd, and it didn't work either, which was strange.
I had been selecting 'restart' to reboot.
I did a shutdown and a cold boot, and then the live cd found the webcam.
Then I retried some of the kernels. 5.4.10-200 is good. 5.4.11-200 is bad.
5.3.7-300 good
5.3.9-300 good
5.4.2-300 good
5.4.10-200 good
5.4.11-200 bad
5.4.11-201 bad
5.4.11-202 bad
5.4.13-200 bad
5.4.20-200 bad
5.5.2-200 bad
So there is a regression in 5.4.11-200, and in addition there might be a second bug not resetting something during warm boots.
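
The boundary can also be read off a results list mechanically. A small sketch, assuming one "version verdict" pair per line and GNU sort for the -V option; find_boundary is a hypothetical helper, not an existing tool.

```shell
# Read "version verdict" lines on stdin, version-sort them, and print
# the kernels on either side of the first good -> bad transition.
find_boundary() {
    sort -V | awk '
        $2 == "good" { last_good = $1 }
        $2 == "bad" && last_good != "" && !found {
            print "last good: " last_good
            print "first bad: " $1
            found = 1
        }'
}

find_boundary <<'EOF'
5.5.2-200 bad
5.4.11-200 bad
5.3.7-300 good
5.4.10-200 good
5.4.2-300 good
5.4.13-200 bad
EOF
```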

I attached a tar with an example of 'journalctl --no-hostname -k' and 'lsusb -v' on good and bad kernel.

I tried F12 on reboot after shutting down the live cd.
The console eventually got a "No operating system" message, which was a bit scary, but it let me eject the CD with the button, and then I could reboot from my system disk. Maybe F12 looks for a hidden recovery area on the original Sony OEM Windows hard drive.

I added sort -Vr to your koji list-builds command
koji list-builds --package=kernel --state=COMPLETE --after=2019-10-21 --reverse --quiet | fgrep '.fc31' | sort -Vr

What is the next step?
I want to avoid building a lot of kernels because this laptop runs hot, and my last laptop burnt up building gcc.

It seems reproducible that a cold boot with 5.4.10-200 or earlier finds the webcam, while 5.4.11-200 or higher does not.
There was a 5.4.11-201 and 5.4.11-202, which could be a sign that 5.4.11 had some big changes that broke some things.

I set installonly_limit=0 in dnf.conf. If I put it back to 3, will it purge the kernel-core and kernel-modules that I downloaded from koji and installed? I want to keep the good 5.4.10-200 kernel but I need to let dnf purge 5.5 kernels because /boot will fill up. My /boot is 465MB, which is enough for normal updates but not big enough to keep every test kernel.

Regards, William

Comment 28 William Bader 2020-04-04 09:07:01 UTC

After all of that, I booted back to 5.5.13-200.fc31.x86_64, and the webcam is working.
I think that I did a warm boot from 5.4.10-200 (which I rechecked after 5.4.11-200 didn't work), so it seems like warm boots depend on whether the webcam was working before, while cold boots depend on the kernel that is booting. So 5.4.11-200 still has a regression, but there is a second bug, possibly hardware, that doesn't reinitialize something during warm boots.
Regards, William

Comment 29 Steve 2020-04-04 12:22:14 UTC

Very nice work investigating this problem, but first:

> My /boot is 465MB, which is enough for normal updates but not big enough to keep every test kernel.

You can use dnf to manually remove kernels you don't need:

Get a list of installed kernels:

# dnf list --installed kernel-core\*

Remove packages for a specific kernel version (use copy and paste to form the command-line from the list):

# dnf remove kernel\*-5.5.10-200.fc31
                   ^^

As usual, dnf will ask you before doing anything. The wildcard has to be in that exact position.

That's my procedure when I have "installonly_limit=0" in dnf.conf.
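
A sketch for keeping specific kernels while clearing out the rest, assuming the installed version-release strings are fed in one per line. kernel_remove_cmds is a hypothetical helper; it only prints dnf commands for review and removes nothing itself.

```shell
# Read installed kernel version-release strings on stdin (one per line)
# and print a "dnf remove" command for every one that is NOT listed
# as an argument. Nothing is removed; the commands are only printed.
kernel_remove_cmds() {
    while read -r vr; do
        [ -n "$vr" ] || continue
        keep=no
        for k in "$@"; do
            if [ "$vr" = "$k" ]; then keep=yes; fi
        done
        if [ "$keep" = no ]; then
            printf 'dnf remove kernel\\*-%s\n' "$vr"
        fi
    done
}

# Sample input; keep 5.4.10 (known good) and 5.5.13 (current).
printf '%s\n' 5.4.10-200.fc31 5.5.10-200.fc31 5.5.13-200.fc31 |
    kernel_remove_cmds 5.4.10-200.fc31 5.5.13-200.fc31
```

On a real system the input list could come from "rpm -q --qf '%{VERSION}-%{RELEASE}\n' kernel-core" instead of the sample printf.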

> I added sort -Vr to your koji list-builds command
> koji list-builds --package=kernel --state=COMPLETE --after=2019-10-21 --reverse --quiet | fgrep '.fc31' | sort -Vr
                                                                        ^^^^^^^^^                                  ^
Thanks! I didn't know about the "-V" option, even though it is in the man page. I don't know why the "koji" command doesn't do a version sort.

Your command-line reverses the "--reverse", so, if you prefer increasing order, both reversals can be removed:

$ koji list-builds --package=kernel --state=COMPLETE --after=2019-10-21 --quiet | fgrep '.fc31' | sort -V
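
To see what "-V" changes, here is a self-contained comparison on three build names from this bug. It assumes GNU sort; LC_ALL=C is set on the plain sort only to make its byte-order result predictable.

```shell
# Compare a plain sort with a version sort on kernel build names.
builds='kernel-5.4.10-200.fc31
kernel-5.4.2-300.fc31
kernel-5.3.11-300.fc31'

echo "plain sort:"
printf '%s\n' "$builds" | LC_ALL=C sort      # compares character by character
echo "version sort:"
printf '%s\n' "$builds" | sort -V            # compares embedded numbers as numbers
```

The plain sort puts 5.4.10 before 5.4.2 because "1" sorts before "2"; the version sort compares 10 and 2 as numbers.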

Comment 30 Steve 2020-04-04 12:58:56 UTC

> 5.4.10-200 good
> 5.4.11-200 bad

Excellent. There are 165 commits in that range (the two "tag" commits don't count):

$ git log --oneline v5.4.10^..v5.4.11 | wc -l
167

These explicitly mention "usb":

$ git log --oneline --grep usb v5.4.10^..v5.4.11
7cbdf96cda usb: missing parentheses in USE_NEW_SCHEME
578289f847 USB: core: fix check for duplicate endpoints
158cbd970b usb: dwc3: gadget: Fix request complete check
72cd84ea52 net: usb: lan78xx: fix possible skb leak
e36491f117 usb: typec: fusb302: Fix an undefined reference to 'extcon_get_state'
27fc4a9e4a net: usb: lan78xx: Fix error message format specifier
61e861528e USB: dummy-hcd: use usb_urb_dir_in instead of usb_pipein

And these are the commit IDs for the two tags:

$ fig-tags.sh 'v5.4.1[01]'
fd74b603ed 2020-01-12 12:23:15 +0100 v5.4.11
7622136b11 2020-01-09 10:25:55 +0100 v5.4.10

fig-tags.sh is my shell script with this git command-line:

git tag --list --format='%(objectname:short) %(creatordate:iso) %(refname:short)' --sort='-creatordate' -- "$@"

("fig" is not "git", so I don't get confused ... :-))

Comment 31 Steve 2020-04-04 13:09:48 UTC

Created attachment 1676213 [details]
git-log-oneline-v5.4.10-v5.4.11.txt

$ git log --oneline v5.4.10^..v5.4.11 > /tmp/git-log-oneline-v5.4.10-v5.4.11.txt

Comment 32 Steve 2020-04-04 13:30:57 UTC

> I want to avoid building a lot of kernels because this laptop runs hot, and my last laptop burnt up building gcc.

Do you have a desktop system that you could use for doing kernel builds?

For kernel bisection, Artem used a desktop system to build the kernels and then transferred them to his laptop over the network. (Bug 1790115, Comment 55)

A USB flash drive could also be used to transfer kernels between systems.

Comment 33 William Bader 2020-04-04 19:05:11 UTC

I have remote access to CentOS 6 and 7 VMs on a server in the office.
Do I have to do anything special if I do a build on a more advanced CPU than my laptop?
The CentOS 6 VM has 15GB free, gcc 4.4.7 20120313, and 4 virtual cpus.
The CentOS 7 VM has 40GB free, gcc 4.8.5 20150623 (plus I have installed devtoolset-8), and 1 virtual cpu.
How do I gather the results to copy them? A long time ago, I built custom kernels with slackware, and it was something like 'make menuconfig;make;make modules_install;make install', which would be messy to copy. How can I build something similar to the kernel-core and kernel-modules rpms that I downloaded with koji?
I don't want to risk copying bzImage files and leaving my laptop unbootable, although I have the live cd and a usb drive with recent backup.
Regards, William

Comment 34 Steve 2020-04-04 20:10:12 UTC

(In reply to William Bader from comment #33)
> I have remote access to CentOS 6 and 7 VMs on a server in the office.

That's a good idea, but none of the Fedora tools have been tested on CentOS. Can anyone provision a Fedora 31 VM?

> Do I have to do anything special if I do a build on a more advanced CPU than my laptop?

If anything is to be configured, it would be in the kernel ".config" file. You will also need the tools and packages required for the build, and network access to install them.

> The CentOS 6 VM has 15GB free, gcc 4.4.7 20120313, and 4 virtual cpus.
> The CentOS 7 VM has 40GB free, gcc 4.8.5 20150623 (plus I have installed devtoolset-8), and 1 virtual cpu.

How much physical memory is on the server and how much memory has been allocated to the VMs? I have multiple VMs configured on my desktop system, which has 8GB of memory, but I never try to run more than one VM at a time, since each VM is configured with 4GB of memory.

I'm not sure how well this would work, but could you create an F31 VM inside the CentOS VM?

> How do I gather the results to copy them? A long time ago, I built custom kernels with slackware, and
> it was something like 'make menuconfig;make;make modules_install;make install', which would be messy to copy.

If you are asking how to control your session, you could login with "ssh". Or you could use remote desktop sharing.

If you are asking how to transfer the kernel files from the VM to your laptop, you could use "scp". That is what I have done to transfer files from a VM on a Fedora host to the host: VM -> Host. I believe something like that would work for you too. Alternatively, you could set up an https or sftp server on the VM. (That, I don't have much experience with, though. And your sysadmin might have some concerns.) A third way would be to send the files up to the "cloud" and download them from there to your laptop.

> How can I build something similar to the kernel-core and kernel-modules rpms that I downloaded with koji?

That's a good question and I don't have a complete answer. I believe you would use a Fedora kernel config file. However, Fedora kernels are also built with Fedora patches, so I would suggest ignoring them and seeing if the problem occurs with a "vanilla" kernel build.

> I don't want to risk copying bzImage files and leaving my laptop unbootable, although I have the live cd and a usb drive with recent backup.

As long as you have grub2 configured and known-working kernels in /boot, you should be OK. I will have to research exactly how you configure grub2 to boot the test kernel, but I believe it would be as easy as adding a ".conf" file to /boot/loader/entries/. (Assuming you have BLS enabled.)

> Regards, William

Comment 35 Steve 2020-04-04 20:21:26 UTC

"Can anyone provision a Fedora 31 VM?"

BTW, there is a Fedora Server edition, which is bare bones compared to the Fedora Workstation edition, but that might work for a terminal/command-line only session just for building kernels:

Fedora Server.
https://getfedora.org/en/server/

Comment 36 Steve 2020-04-04 20:29:36 UTC

(In reply to Steve from comment #35)
> "Can anyone provision a Fedora 31 VM?"

Some companies offer free servers and, for a fee, you can get more memory and disk space. For example:

Amazon Lightsail
Virtual servers, storage, databases, and networking for a low, predictable price.
https://aws.amazon.com/lightsail/

Comment 37 Steve 2020-04-04 21:14:40 UTC

See, also:

Linux virtual machines in Azure
https://azure.microsoft.com/en-us/services/virtual-machines/linux/

Both AWS and Azure support Red Hat Enterprise Linux.

Comment 38 William Bader 2020-04-04 22:22:53 UTC

Thanks, I will ask on Monday if someone in the office can provision a Fedora 31 VM.
I have VPN access and about 20Mbps effective throughput, so copying 100MB files isn't a problem.
When I asked about transferring kernel files, I meant whether it is just one vmlinuz file or a lot of little files with kernel modules, maps, ramfs images, modules, headers, that I have to locate, gather up, copy, and then install in the correct place on my laptop.
I would do the builds on off hours, so I might only be able to do one per day.
If I can build Fedora kernels, can I do bisections? If the rpmbuild script applies patches after pulling a git commit, will that confuse bisections?
You said earlier that I could get some snapshots from https://koji.fedoraproject.org/koji/packageinfo?packageID=8 Could I use those snapshots to narrow the range of commits to search? It looks like they went from 5.4.10 to 5.4.11 without any git snapshots in between, so maybe it won't help.

Comment 39 Steve 2020-04-04 23:44:13 UTC

(In reply to William Bader from comment #38)
> Thanks, I will ask on Monday if someone in the office can provision a Fedora 31 VM.

You might be able to remotely complete the install yourself after the VM is configured and the installer image is booted:

Installing Using VNC
https://docs.fedoraproject.org/en-US/fedora/f31/install-guide/advanced/VNC_Installations/

> I have VPN access and about 20Mbps effective throughput, so copying 100MB files isn't a problem.

OK.

> When I asked about transferring kernel files, I meant whether it is just one vmlinuz file or a lot of little files with kernel modules, maps, ramfs images, modules, headers, that I have to locate, gather up, copy, and then install in the correct place on my laptop.

OK, you are asking about how to install the kernel on the target system.

> I would do the builds on off hours, so I might only be able to do one per day.

Artem did a complete bisection in about an hour and a half. (Bug 1790115, Comment 55)

> If I can build Fedora kernels, can I do bisections? If the rpmbuild script applies patches after pulling a git commit, will that confuse bisections?

The Fedora build tools can't do bisections, because they simply download a kernel tarball. You need a clone of the kernel git repo. The "git bisect" command chooses which commits are built in each iteration.

> You said earlier that I could get some snapshots from https://koji.fedoraproject.org/koji/packageinfo?packageID=8
> Could I use those snapshots to narrow the range of commits to search?
> It looks like they went from 5.4.10 to 5.4.11 without any git snapshots in between, so maybe it won't help.

That's plenty narrow. Bisection is very efficient: log2(165 commits) is about 8 builds. (Comment 30)
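
As a worked check of that estimate, the sketch below counts how many halvings it takes to get from n candidate commits down to one. bisect_steps is a hypothetical helper, and git's own estimate can differ by a step depending on the shape of the history.

```shell
# Smallest k with 2^k >= n: an upper bound on the number of
# test builds needed to bisect n candidate commits.
bisect_steps() {
    n=$1
    k=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        k=$((k + 1))
    done
    echo "$k"
}

bisect_steps 165
```

For 165 commits this gives 8, since 2^7 = 128 is too small and 2^8 = 256 is enough.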

Documentation:

$ git bisect --help

Bisecting a bug
https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

Subject: Re: git pull on Linux/ACPI release tree
From: Linus Torvalds <...>
Date: Tue, 10 Jan 2006 11:28:58 -0800 (PST)
https://lore.kernel.org/git/Pine.LNX.4.64.0601101111110.4939@g5.osdl.org/

Comment 40 Steve 2020-04-05 00:31:46 UTC

(In reply to Steve from comment #39)
> You need a clone of the kernel git repo. The "git bisect" command chooses which commits are built in each iteration.

This can be done on your laptop:

$ git clone --shallow-exclude=linux-5.3.y --branch linux-5.4.y https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4
Cloning into 'linux-5.4'...
remote: Enumerating objects: 256511, done.
remote: Counting objects: 100% (256511/256511), done.
remote: Compressing objects: 100% (122936/122936), done.
remote: Total 256511 (delta 166877), reused 181371 (delta 131845)
Receiving objects: 100% (256511/256511), 219.49 MiB | 1.80 MiB/s, done.
Resolving deltas: 100% (166877/166877), done.
Checking out files: 100% (65704/65704), done.

$ cd linux-5.4

$ git branch
* linux-5.4.y

$ git tag --list --sort=-version:refname

This is a simulated partial bisection. Note that git estimates the number of steps needed:

$ git bisect start
$ git bisect bad v5.4.11
$ git bisect good v5.4.10
Bisecting: 82 revisions left to test after this (roughly 6 steps)
[97d9e8620f57f28f415b23ad88b97c87b6d53390] bnx2x: Do not handle requests from VFs after parity

$ git branch
* (no branch, bisect started on linux-5.4.y)
  linux-5.4.y

$ git status
HEAD detached at 97d9e8620
You are currently bisecting, started from branch 'linux-5.4.y'.
  (use "git bisect reset" to get back to the original branch)

nothing to commit, working tree clean

At this point, you would do the first build.

Comment 41 Steve 2020-04-05 01:38:34 UTC

(In reply to William Bader from comment #38)
...
> You said earlier that I could get some snapshots from https://koji.fedoraproject.org/koji/packageinfo?packageID=8
...

Thanks for bringing that up. The short answer is that there are no snapshot builds for stable releases. Stable releases have a three-part version, such as 5.4.y, where y is greater than 0.

The snapshot builds are for pre-release mainline builds. Grepping for "git" shows that:

$ koji list-builds --package=kernel --state=COMPLETE --after=2019-10-21 --quiet | fgrep git | sort -Vr | less
...
kernel-5.5.0-0.rc0.git6.1.fc32                           jforbes           COMPLETE
kernel-5.4.0-0.rc8.git1.2.fc32                           labbott           COMPLETE
...

Comment 42 William Bader 2020-04-05 05:43:20 UTC

Thanks for the reply. If it could be done in an hour, I could try it on my laptop.
My laptop is an i5-2450M CPU 2.50GHz with 4 cores (or more precisely, I think 2 cores with 2 threads per core), 8GB RAM, and a 512MB SSD.
I think that if I limit 'make' to one job, it rotates the core to control the temperature.
Do you have an estimate for how long the build would take? 1 hour? 4 hours?
Long builds using all of the cores can bring the cpu up to 95C according to gkrellm. I don't know if my laptop will withstand an hour of that.
I have /tmp as a ram disk that can grow up to 4MB.
I have ccache set to use /tmp. I suppose that a kernel build would require moving ccache to use the SSD.
I have done git bisections to find bugs in ghostscript and poppler. With ccache, the builds go very fast near the end if headers haven't changed.
Thanks for the commands to start the bisection.
The issue with provisioning a VM is that I don't have permission. I have root to the VMs I use but not to the host. Opening a GUI is not a problem.
Is the procedure to build kernels at https://www.kernel.org/doc/html/latest/admin-guide/README.html ?
When I do 'make config', do I need to select or unselect any options? Can I use 'make localmodconfig'?
My /boot has config-5.*.fc31.x86_64 files. Is there a way to use /boot/config-5.4.10-200.fc31.x86_64 as the initial source for 'make oldconfig'?
The kernel.org page doesn't say anything about initramfs. Will the 'make' build it, and I just have to find it and copy it to /boot with bzImage and then create a new file in /boot/loader/entries ?
Regards, William

Comment 43 Steve 2020-04-05 07:21:24 UTC

(In reply to William Bader from comment #42)
> Thanks for the reply. If it could be done in an hour, I could try it on my laptop.
> My laptop is an i5-2450M CPU 2.50GHz with 4 cores (or more precisely, I think 2 cores with 2 threads per core), 8GB RAM, and a 512MB SSD.
> I think that if I limit 'make' to one job, it rotates the core to control the temperature.

You read the man page again. :-) I now see that "make" has a "-j" option to limit the number of jobs run simultaneously. When doing a kernel build on my 2 core/4 thread CPU, without "-j", "top" shows up to *four* CPUs in use. In "top", press "1" to see individual CPU use.

> Do you have an estimate for how long the build would take? 1 hour? 4 hours?

Closer to 1 hour.

> Long builds using all of the cores can bring the cpu up to 95C according to gkrellm. I don't know if my laptop will withstand an hour of that.

On my desktop system, the temp. gets up to ~50C, and I manually turn up the case fans to full speed. Can you configure any settings in the BIOS setup to change how the fan speed changes with load?

> I have /tmp as a ram disk that can grow up to 4MB.
> I have ccache set to use /tmp. I suppose that a kernel build would require moving ccache to use the SSD.
> I have done git bisections to find bugs in ghostscript and poppler. With ccache, the builds go very fast near the end if headers haven't changed.

OK.

> Thanks for the commands to start the bisection.

Evidently you know more about doing bisections than I do. :-)

> The issue with provisioning a VM is that I don't have permission. I have root to the VMs I use but not to the host. Opening a GUI is not a problem.

OK.

> Is the procedure to build kernels at https://www.kernel.org/doc/html/latest/admin-guide/README.html ?

That looks good.

> When I do 'make config', do I need to select or unselect any options? Can I use 'make localmodconfig'?

Try to use the same settings as Fedora.

> My /boot has config-5.*.fc31.x86_64 files. Is there a way to use /boot/config-5.4.10-200.fc31.x86_64 as the initial source for 'make oldconfig'?

Yes, but I have to research the details. I did a diff on the Fedora config files for 5.4.10 and 5.4.11, and they appear to be the same.

> The kernel.org page doesn't say anything about initramfs. Will the 'make' build it, and I just have to find it and copy it to /boot with bzImage and

Excellent question. Again, that is something I have to research. In Fedora, the "dracut" command is used to build an initramfs during a kernel update. The dates on the initramfs files in /boot should correspond to when you installed the kernels:

$ ls -l /boot/initramfs-5.5.8-200.fc31.x86_64.img

$ rpm -qi kernel-5.5.8-200.fc31.x86_64 | grep 'Install'

> then create a new file in /boot/loader/entries ?

If you always use "bzImage" as the kernel name, you will only need to create one ".conf" file that you can use repeatedly for each test boot. I have not tried this, but you might be able to use a link from "bzImage" to a uniquely named "bzImage" file, such as "bzImage-1".
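
For reference, a BLS entry is a short text file of "key value" lines. The sketch below prints one; it is untested on real hardware. bls_entry is a hypothetical helper, the file names /bzImage-LABEL and /initramfs-LABEL.img are assumptions (paths are relative to the /boot partition), the initramfs still has to be created separately, and the options string should be copied from a known-working entry in /boot/loader/entries/.

```shell
# Print a Boot Loader Specification entry for a test kernel.
# $1 = label used in the (assumed) file names and in the menu title
# $2 = kernel command line, copied from a working entry
bls_entry() {
    label=$1
    opts=$2
    printf 'title Bisection test kernel (%s)\n' "$label"
    printf 'version %s\n' "$label"
    printf 'linux /bzImage-%s\n' "$label"
    printf 'initrd /initramfs-%s.img\n' "$label"
    printf 'options %s\n' "$opts"
}

# "root=/dev/EXAMPLE ro" is a placeholder, not a real command line.
bls_entry test1 'root=/dev/EXAMPLE ro'
```

Redirecting the output to a new ".conf" file under /boot/loader/entries/ (as root) would add the menu entry; deleting that file removes it again.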

> Regards, William

Comment 44 Steve 2020-04-05 07:38:09 UTC

Before starting the bisection, it would be a very good idea to do test builds of 5.4.10 and 5.4.11 to verify that they are indeed "good" and "bad", respectively, when you build them.

So that's two more builds than what "git bisect" says. (Comment 40)

Comment 45 Steve 2020-04-05 08:19:21 UTC

There is one difference in the config files for 5.4.10 and 5.4.11. They have different values for "CONFIG_BUILD_SALT". That seems to be an insignificant difference. However, for future reference here is the procedure:

This procedure does not require installing the kernel packages.

Download with "koji download-build --rpm" and extract with "rpmdev-extract" to get:

$ ls -1F
kernel-core-5.4.10-200.fc31.x86_64/
kernel-core-5.4.10-200.fc31.x86_64.rpm
kernel-core-5.4.11-202.fc31.x86_64/
kernel-core-5.4.11-202.fc31.x86_64.rpm

$ find . -name config
./kernel-core-5.4.10-200.fc31.x86_64/lib/modules/5.4.10-200.fc31.x86_64/config
./kernel-core-5.4.11-202.fc31.x86_64/lib/modules/5.4.11-202.fc31.x86_64/config

$ diff -u0 ./kernel-core-5.4.10-200.fc31.x86_64/lib/modules/5.4.10-200.fc31.x86_64/config ./kernel-core-5.4.11-202.fc31.x86_64/lib/modules/5.4.11-202.fc31.x86_64/config
...
@@ -3 +3 @@
-# Linux/x86_64 5.4.10-200.fc31.x86_64 Kernel Configuration
+# Linux/x86_64 5.4.11-202.fc31.x86_64 Kernel Configuration
@@ -30 +30 @@
-CONFIG_BUILD_SALT="5.4.10-200.fc31.x86_64"
+CONFIG_BUILD_SALT="5.4.11-202.fc31.x86_64"
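
Since the header comment and CONFIG_BUILD_SALT always differ between builds, they can be filtered out before diffing so that only real differences show up. A minimal sketch; config_diff is a hypothetical helper name, and the two patterns are taken from the diff output above.

```shell
# Diff two kernel config files while ignoring the lines that always
# differ between builds: the "# Linux/..." header and CONFIG_BUILD_SALT.
# Returns 0 if the configs are otherwise identical, 1 if they differ.
# The diff header shows temporary file names, not the original ones.
config_diff() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    grep -v -e '^# Linux/' -e '^CONFIG_BUILD_SALT=' "$1" > "$a" || true
    grep -v -e '^# Linux/' -e '^CONFIG_BUILD_SALT=' "$2" > "$b" || true
    rc=0
    diff -U 0 "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Usage: config_diff path/to/config-A path/to/config-B
```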

Comment 46 William Bader 2020-04-05 09:17:21 UTC

Thanks. I kept the 5.4.10-200 and 5.4.11-200 kernels installed, so I can diff their config files in /boot. Sorry for not mentioning that.
My laptop idles at nearly 60C. powertop shows about 150 wakeups/sec with a few tabs open in chrome, including this one. Can I use Linux power management to control the fans? I tried pwmconfig but it said "There are no pwm-capable sensor modules installed." Every once in a while a new kernel makes it run hotter: https://bugzilla.redhat.com/show_bug.cgi?id=1329101
I'm not sure how powertop is calibrated, but when it says the cpu is 90C, the vent on the side of the laptop is too hot to touch.
I'll see if I can get a VM for builds on Monday.
Regards, William

Comment 47 Steve 2020-04-05 09:37:38 UTC

(In reply to William Bader from comment #42)
...
> My /boot has config-5.*.fc31.x86_64 files.
> Is there a way to use /boot/config-5.4.10-200.fc31.x86_64 as the initial source for 'make oldconfig'?
...

Since you have 5.4.10 and 5.4.11 installed, you can simply copy one of the corresponding config files from /boot to ".config" in the kernel build directory, and then run "make oldconfig".

Here is the procedure for 5.4.10:

$ git checkout v5.4.10
Note: checking out 'v5.4.10'.
...
HEAD is now at 7a02c1932 Linux 5.4.10

$ git branch
* (HEAD detached at v5.4.10)
  linux-5.4.y

$ cp -ip config-5.4.10 .config  # config-5.4.10 is a link to the config file extracted from the kernel-core rpm, as in Comment 45.

$ make oldconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX     scripts/kconfig/lexer.lex.c
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTLD  scripts/kconfig/conf
scripts/kconfig/conf  --oldconfig Kconfig
#
# configuration written to .config
#

$ diff -u0 config-5.4.10 .config
...
@@ -3 +3 @@
-# Linux/x86_64 5.4.10-200.fc31.x86_64 Kernel Configuration
+# Linux/x86 5.4.10 Kernel Configuration
@@ -8315 +8314,0 @@
-CONFIG_LOCK_DOWN_IN_EFI_SECURE_BOOT=y
@@ -8319 +8317,0 @@
-CONFIG_ALLOW_LOCKDOWN_LIFT_BY_SYSRQ=y

Comment 48 Steve 2020-04-05 09:50:51 UTC

There doesn't seem to be any difference with v5.4.11:

$ git checkout v5.4.11

$ cp -ip config-5.4.11 .config
cp: overwrite '.config'? y

$ make oldconfig

$ diff -u0 config-5.4.11 .config
...
@@ -3 +3 @@
-# Linux/x86_64 5.4.11-202.fc31.x86_64 Kernel Configuration
+# Linux/x86 5.4.11 Kernel Configuration
@@ -8315 +8314,0 @@
-CONFIG_LOCK_DOWN_IN_EFI_SECURE_BOOT=y
@@ -8319 +8317,0 @@
-CONFIG_ALLOW_LOCKDOWN_LIFT_BY_SYSRQ=y

So the question is why are those config options in the Fedora configs, but not in the vanilla configs?

Comment 49 Steve 2020-04-05 10:13:43 UTC

(In reply to Steve from comment #48)
...
> So the question is why are those config options in the Fedora configs, but not in the vanilla configs?

Fedora kernels have some patches that are not in the vanilla kernel, and those config options are in such patches:

$ grep CONFIG_LOCK_DOWN_IN_EFI_SECURE_BOOT .*.patch
.efi-secureboot.patch:+#ifdef CONFIG_LOCK_DOWN_IN_EFI_SECURE_BOOT

$ grep CONFIG_ALLOW_LOCKDOWN_LIFT_BY_SYSRQ .*.patch
.lift-lockdown-sysrq.patch:+#ifdef CONFIG_ALLOW_LOCKDOWN_LIFT_BY_SYSRQ
.lift-lockdown-sysrq.patch:+#endif /* CONFIG_ALLOW_LOCKDOWN_LIFT_BY_SYSRQ */

Those patches were extracted from kernel-5.4.10-200.fc31.src.rpm in which there are 37 patches:

$ ls -1 .*.patch | wc -l
37

To authentically rebuild the Fedora kernel, those patches would also need to be applied. My suggestion is to see if you can reproduce the problem after building *without* those patches.

Comment 50 Steve 2020-04-05 10:41:50 UTC

I am now running a kernel build with:

$ time make -j 1

"top" shows only one "cc1" process running at a time, and there is, at most, high CPU utilization on only one CPU at a time (out of 4).

Further, the system is running as cool as a cucumber: ~40C to ~45C.

Comment 51 Steve 2020-04-05 14:00:13 UTC

(In reply to Steve from comment #50)
> I am now running a kernel build with:
> 
> $ time make -j 1

Build time is about 2 hours, 15 minutes with an Intel i3 desktop CPU (3.40GHz max, 2 cores/4 threads; per /proc/cpuinfo; no other loads):

real	134m39.631s
user	115m36.398s
sys	15m6.718s
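
The conversion from the "real" line to hours and minutes, as a small sketch. real_to_hm is a hypothetical helper; it assumes the NNNmSS.sss format shown above and simply drops the fractional seconds.

```shell
# Convert a "real" value printed by time (e.g. 134m39.631s) to
# whole seconds and to hours and minutes, dropping the fraction.
real_to_hm() {
    t=${1%s}          # strip the trailing "s"
    min=${t%%m*}      # minutes: text before the "m"
    sec=${t#*m}       # seconds: text after the "m"
    sec=${sec%%.*}    # drop the fractional part
    total=$((min * 60 + sec))
    printf '%ds = %dh %dm\n' "$total" $((total / 3600)) $((total % 3600 / 60))
}

real_to_hm 134m39.631s
```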

> "top" shows only one "cc1" process running at a time, and there is, at most, high CPU utilization on only one CPU at a time (out of 4).
> 
> Further, the system is running as cool as a cucumber: ~40C to ~45C.

Perhaps that is a bit too optimistic. Idle temps are ~25C to ~30C. Case fans are set to ~1000 RPM with manual controllers (dual Zalman Fan Mates).

Comment 52 Steve 2020-04-05 14:26:16 UTC

Build results:

$ ls -1sh vmlinux*
698M vmlinux*
810M vmlinux.o

$ wc -l modules.builtin modules.order Module.symvers 
    258 modules.builtin
   3514 modules.order
  21317 Module.symvers
...

The value of CONFIG_BUILD_SALT appears to get embedded in the output files:

$ grep -l '5.4.11-202.fc31.x86_64' .config vmlinux*
.config
vmlinux
vmlinux.o

So it might be a good idea to change it, because the actual build was for:

$ git branch
* (HEAD detached at v5.4.10)
  linux-5.4.y

And possibly set:

$ grep -A1 'CONFIG_LOCALVERSION' .config
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT="5.4.11-202.fc31.x86_64"

Comment 53 Steve 2020-04-05 18:35:09 UTC

(In reply to Steve from comment #52)
...
> And possibly set:
> 
> $ grep -A1 'CONFIG_LOCALVERSION' .config
> CONFIG_LOCALVERSION=""
> # CONFIG_LOCALVERSION_AUTO is not set
> CONFIG_BUILD_SALT="5.4.11-202.fc31.x86_64"

To change those, I used:

$ make nconfig

That gives you a nice ncurses user interface. Press "F2" to see what configuration option you are actually setting.

Probably the only important setting is:

CONFIG_LOCALVERSION_AUTO=y

That inserts a version string into vmlinux and vmlinux.o:

$ fgrep -a -m1 'Linux version' vmlinux-5.4.10.localversion2
Linux version 5.4.10.localversion2 (xxx@yyy) (gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC)) #5 SMP ...
              ^^^^^^^^^^^^^^^^^^^^  ^^^^^^^

My user name and hostname were also being inserted. They are obfuscated here as "xxx@yyy". It turns out that environment variables are also used to configure the build. Specifically:

export KBUILD_BUILD_USER="test-user-1"
export KBUILD_BUILD_HOST="test-host-1"

Those strings ultimately end up in this file: include/generated/compile.h.

Here are all the changes I made and the changes made by "make oldconfig":

$ diff -u0 config-5.4.10 .config.EXP3
...
@@ -3 +3 @@
-# Linux/x86_64 5.4.10-200.fc31.x86_64 Kernel Configuration
+# Linux/x86 5.4.10 Kernel Configuration
@@ -28,3 +28,3 @@
-CONFIG_LOCALVERSION=""
-# CONFIG_LOCALVERSION_AUTO is not set
-CONFIG_BUILD_SALT="5.4.10-200.fc31.x86_64"
+CONFIG_LOCALVERSION=".localversion2"
+CONFIG_LOCALVERSION_AUTO=y
+CONFIG_BUILD_SALT="buildidsalt2"
@@ -43 +43 @@
-CONFIG_DEFAULT_HOSTNAME="(none)"
+CONFIG_DEFAULT_HOSTNAME="bisection-hostname"
@@ -8315 +8314,0 @@
-CONFIG_LOCK_DOWN_IN_EFI_SECURE_BOOT=y
@@ -8319 +8317,0 @@
-CONFIG_ALLOW_LOCKDOWN_LIFT_BY_SYSRQ=y

Comment 54 Steve 2020-04-05 19:34:37 UTC

This is a summary of the first simulated bisection build:

I removed the "-j 1" option, but caching in .ccache could explain some of the speedup:

$ time make
...
real	73m11.885s
user	60m20.050s
sys	8m2.752s

They have slightly different sizes:

$ ls -1s vmlinux*
713764 vmlinux*
713756 vmlinux-5.4.10.localversion2*
828660 vmlinux.o
828652 vmlinux.o-5.4.10.localversion2

$ grep -a 'Linux version' vmlinux | head
Linux version 5.4.10.localversion2-00083-g97d9e8620 (test-user-1@test-host-1) (gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC)) #6 SMP ...

That version string can be used directly in git commands:

$ git log --oneline -n1 5.4.10.localversion2-00083-g97d9e8620
97d9e8620 (HEAD) bnx2x: Do not handle requests from VFs after parity

Which matches:

$ git describe 97d9e8620
v5.4.10-83-g97d9e8620
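
Going the other way, the abbreviated commit ID can be pulled out of a version string (for example, from "uname -r" on the test machine) with plain shell parameter expansion. This is a minimal sketch: kver_commit is a hypothetical helper name, and it assumes the "-g<hash>" suffix seen above, which is absent when the build is exactly at a tag.

```shell
# kver_commit VERSION  (hypothetical helper)
# Print the abbreviated commit ID from a version string such as
# "5.4.10.localversion2-00083-g97d9e8620". Returns 1 if there is none.
kver_commit() {
    case $1 in
        *-g*) ;;
        *) return 1 ;;          # no "-g<hash>" suffix at all
    esac
    commit=${1##*-g}            # keep what follows the last "-g"
    commit=${commit%-dirty}     # a trailing "-dirty" marks a modified tree
    case $commit in
        ''|*[!0-9a-f]*) return 1 ;;   # not a hex string, so not a commit ID
    esac
    printf '%s\n' "$commit"
}

# Example:
#   git log --oneline -n1 "$(kver_commit "$(uname -r)")"
```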

The commit ID here:

$ git branch
* (no branch, bisect started on 9d61432ef)
  linux-5.4.y

Is:

$ git log --oneline -n1 9d61432ef
9d61432ef (tag: v5.4.11, refs/bisect/bad) Linux 5.4.11

Comment 55 Steve 2020-04-05 22:08:57 UTC

(In reply to Steve from comment #39)
...
> > When I asked about transferring kernel files, I meant whether it is just one vmlinuz file or a lot of little files with kernel modules, maps, ramfs images, modules, headers, that I have to locate, gather up, copy, and then install in the correct place on my laptop.
> 
> OK, you are asking about how to install the kernel on the target system.
...

"make help" shows a list of packaging targets, including "tarxz-pkg":

$ make help
...
Kernel packaging:
...
  tarxz-pkg           - Build the kernel as a xz compressed tarball
...

Since xz compression is what is used at kernel.org, that is what I chose. However, xz compression consumes nearly 100% of one CPU and takes a long time. There is also an uncompressed make target:

  tar-pkg             - Build the kernel as an uncompressed tarball

$ time make tarxz-pkg
...
  DEPMOD  5.4.10.localversion2-00083-g97d9e8620
'./System.map' -> './tar-install/boot/System.map-5.4.10.localversion2-00083-g97d9e8620'
'.config' -> './tar-install/boot/config-5.4.10.localversion2-00083-g97d9e8620'
'./vmlinux' -> './tar-install/boot/vmlinux-5.4.10.localversion2-00083-g97d9e8620'
'./arch/x86/boot/bzImage' -> './tar-install/boot/vmlinuz-5.4.10.localversion2-00083-g97d9e8620'
Tarball successfully created in ./linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar.xz

real	48m51.941s
user	43m21.183s
sys	1m48.391s

Compression reduced the size significantly:

$ ls -1sh *5.4.10.localversion2-00083-g97d9e8620*
750M linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar.xz
698M vmlinux-5.4.10.localversion2-00083-g97d9e8620*
810M vmlinux.o-5.4.10.localversion2-00083-g97d9e8620

Comment 56 William Bader 2020-04-05 22:46:58 UTC

Thanks again for the information. 

>My suggestion is to see if you can reproduce the problem after building *without* those patches.

I'll try that first. Also, building with patches might complicate the bisection.

>Build time is about 2 hours, 15 minutes

That might fry my laptop. I'll wait to see if I can get a VM on a server.

>tar-pkg - Build the kernel as an uncompressed tarball

Thanks, that looks useful. The tar is fine because scp -C uses gzip, so for a one-time transfer, the time to compress the tar with xz costs more than the transfer time it saves.

>$ time make tarxz-pkg
>...
>  DEPMOD  5.4.10.localversion2-00083-g97d9e8620
>'./System.map' -> './tar-install/boot/System.map-5.4.10.localversion2-00083-g97d9e8620'
>'.config' -> './tar-install/boot/config-5.4.10.localversion2-00083-g97d9e8620'
>'./vmlinux' -> './tar-install/boot/vmlinux-5.4.10.localversion2-00083-g97d9e8620'
>'./arch/x86/boot/bzImage' -> './tar-install/boot/vmlinuz-5.4.10.localversion2-00083-g97d9e8620'
>Tarball successfully created in ./linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar.xz

Do I need both vmlinux and vmlinuz?
If I don't need vmlinux, I could remove it from the tarball before downloading it.

Regards, William

Comment 57 Steve 2020-04-05 22:54:57 UTC

I don't know why you would need vmlinux, and there is another quirk about what is in the tar file:

Uncompress so we can see what is in the tar file:

$ time unxz linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar.xz

real	1m41.307s
user	1m8.632s
sys	0m3.018s

$ ls -sh linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar
5.2G linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar

The tar file includes two links that are absolute paths to the build and source directories:

$ tar -tvf linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar \*/build \*source
lrwxrwxrwx root/root         0 2020-04-05 14:03 lib/modules/5.4.10.localversion2-00083-g97d9e8620/build -> /home/[removed]/linux-5.4
lrwxrwxrwx root/root         0 2020-04-05 14:03 lib/modules/5.4.10.localversion2-00083-g97d9e8620/source -> /home/[removed]/linux-5.4

Both vmlinux and vmlinuz are included:

$ tar -tvf linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar boot
drwxrwxr-x root/root         0 2020-04-05 14:06 boot/
-rw-rw-r-- root/root  10623360 2020-04-05 14:07 boot/vmlinuz-5.4.10.localversion2-00083-g97d9e8620
-rw-rw-r-- root/root   4444488 2020-04-05 14:06 boot/System.map-5.4.10.localversion2-00083-g97d9e8620
-rwxrwxr-x root/root 738610672 2020-04-05 14:06 boot/vmlinux-5.4.10.localversion2-00083-g97d9e8620
-rw-rw-r-- root/root    213361 2020-04-05 14:06 boot/config-5.4.10.localversion2-00083-g97d9e8620
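
If vmlinux turns out not to be needed on the target, it can be removed from the uncompressed tarball before the transfer. This is a minimal sketch, assuming GNU tar (whose --delete option works only on uncompressed archives) and member names without a leading "./", as in the listing above. strip_vmlinux is a hypothetical helper name.

```shell
# strip_vmlinux TARBALL KVER  (hypothetical helper)
# Delete the large uncompressed kernel image boot/vmlinux-KVER from an
# uncompressed tar archive in place. The bootable boot/vmlinuz-KVER is kept.
strip_vmlinux() {
    tarball=$1
    kver=$2
    tar --delete -f "$tarball" "boot/vmlinux-$kver"
}

# Example:
#   strip_vmlinux linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar \
#                 5.4.10.localversion2-00083-g97d9e8620
```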

Modules are also included:

$ tar -tf linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar lib | wc -l
4353

Comment 58 Steve 2020-04-05 23:10:13 UTC

Since the tools create fully qualified names, "dracut" can be easily used to build an initramfs. Adapting the example in the "dracut" man page:

# dracut --kver 5.4.10.localversion2-00083-g97d9e8620

Disclaimer: Although I have used "dracut" to rebuild the initramfs for Fedora kernels, I have not tested it with self-built kernels.

NB: "Dracut" has a lot of options to control what goes into the initramfs.

Comment 59 Steve 2020-04-05 23:25:54 UTC

I believe "grubby" can build the initramfs and write the ".conf" file for the grub2 menu item. Grubby has a lot of options, and I have never tried to use it, so it may take several test runs to get a suitable command-line:

# dnf install grubby

$ rpm -q grubby
grubby-8.40-36.fc31.x86_64

$ man grubby

Comment 60 Steve 2020-04-06 05:09:16 UTC

I transferred my test kernel tar file to a VM (with "scp") and attempted to install with:

# tar -xvf linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar -C /

That filled up my VM's 12GB disk and somehow, while attempting to recover, all my Fedora kernel modules were lost.*

After resizing the VM's disk to 32GB with "qemu-img resize" and resizing the file system with "gparted" (from the F31 Live image**), I was able to successfully install the test kernel. But there is a big problem:

$ du -s /lib/modules/* | sort -n
79108	/lib/modules/5.3.7-301.fc31.x86_64
82548	/lib/modules/5.5.13-200.fc31.x86_64
4760148	/lib/modules/5.4.10.localversion2-00083-g97d9e8620

After drilling down into the directory, it appears that all of the kernel modules were built with debug info, which makes them huge. For example:

$ ls -lh rtl8723ae.ko 
-rw-rw-r--. 1 root root 7.2M Apr  5 14:05 rtl8723ae.ko

$ file rtl8723ae.ko
rtl8723ae.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=9f15845d60f2df736064391cd75878bcf17ca516, with debug_info, not stripped

* I recovered by reinstalling the packages.
** Which can be selected in the VM from the "SeaBIOS" boot device menu.

Comment 61 William Bader 2020-04-06 05:51:30 UTC

>it appears that all of the kernel modules were built with debug info

Is there a way to build with no debug info and also with a low optimization level like -O0 or -O1 to make the builds faster?
I didn't realize that it would be so complicated. Do kernel developers really have to do so much by hand, or is this because I can't use 'make install'?
Regards, William

Comment 62 Steve 2020-04-06 13:24:01 UTC

(In reply to William Bader from comment #61)
> >it appears that all of the kernel modules were built with debug info
> 
> Is there a way to build with no debug info and also with a low optimization level like -O0 or -O1 to make the builds faster?

Yes. Unset CONFIG_DEBUG_INFO in .config. See below.

> I didn't realize that it would be so complicated. Do kernel developers really have to do so much by hand, or is this because I can't use 'make install'?

It IS complicated. Fedora kernels are built for FIVE architectures and in non-debug and debug versions:

Information for build kernel-5.4.10-200.fc31
https://koji.fedoraproject.org/koji/buildinfo?buildID=1427976

See, in particular, the x86_64 "build.log" on that web page.

> Regards, William

Unsetting CONFIG_DEBUG_INFO (with "make nconfig") reduces the file sizes, but that is not what is in the kernel-core config file:

$ diff -u0 .config.EXP3 .config.EXP4
...
@@ -28 +28 @@
-CONFIG_LOCALVERSION=".localversion2"
+CONFIG_LOCALVERSION=".localversion3"
@@ -8744,6 +8744 @@
-CONFIG_DEBUG_INFO=y
-# CONFIG_DEBUG_INFO_REDUCED is not set
-# CONFIG_DEBUG_INFO_SPLIT is not set
-# CONFIG_DEBUG_INFO_DWARF4 is not set
-CONFIG_DEBUG_INFO_BTF=y
-# CONFIG_GDB_SCRIPTS is not set
+# CONFIG_DEBUG_INFO is not set

For the record:
$ time make
...
real	108m13.920s
user	95m13.409s
sys	12m27.289s

Comment 63 Steve 2020-04-06 13:51:29 UTC

The size is MUCH reduced with CONFIG_DEBUG_INFO unset:

$ ls -1sh linux-5.4.10.localversion*.tar
4.6G linux-5.4.10.localversion2-00083-g97d9e8620-x86.tar
332M linux-5.4.10.localversion3-00083-g97d9e8620-x86.tar

And building the uncompressed tar file is very fast:

$ time make tar-pkg
...
Tarball successfully created in ./linux-5.4.10.localversion3-00083-g97d9e8620-x86.tar

real	3m31.590s
user	2m50.464s
sys	0m55.519s

Comment 64 Steve 2020-04-06 19:08:16 UTC

(In reply to William Bader from comment #61)
...
> Do kernel developers really have to do so much by hand, or is this because I can't use 'make install'?

They use scripts. And I suspect that there is one for Fedora installs, but I haven't looked for it yet.

In the meantime, I wrote a shell script to complete the install and successfully booted with my kernel.

Since the shell script must be run as root, I am only going to post the essentials:

KVER is the kernel version, for example: "5.4.10.localversion3-00083-g97d9e8620"

MKINITRD is the "mkinitrd" command. "dracut" could be used instead.
GRUBBY is the "grubby" command.

BOOT="/boot"
VMLINUZ="$BOOT/vmlinuz-$KVER"
INITRAMFS="$BOOT/initramfs-$KVER.img"

# mkinitrd won't overwrite an existing initramfs file; see the "--force" option.

$MKINITRD "$INITRAMFS" "$KVER"

# grubby creates new ".conf" files even if some are already present, so the grub2 menu may have duplicate entries.
# That should be harmless, but see "man grubby" for options to handle various sorts of updates.

$GRUBBY --add-kernel="$VMLINUZ" --initrd="$INITRAMFS" --title="$VMLINUZ" --copy-default --make-default

Comment 65 Steve 2020-04-06 19:39:01 UTC

It sounds like you have only one computer. If so, I suggest borrowing, renting, or buying a second computer, so your system-under-test is not also your support/infrastructure/backup/what-do-I-do-now computer.

Comment 66 Steve 2020-04-06 20:38:26 UTC

(In reply to Steve from comment #64)
> (In reply to William Bader from comment #61)
> ...
> > Do kernel developers really have to do so much by hand, or is this because I can't use 'make install'?
> 
> They use scripts. And I suspect that there is one for Fedora installs, but I haven't looked for it yet.
...

Here it is:

$ rpm -q --scripts kernel-core-5.5.15-200.fc31.x86_64
postinstall scriptlet (using /bin/sh):

if [ `uname -i` == "x86_64" -o `uname -i` == "i386" ] &&
   [ -f /etc/sysconfig/kernel ]; then
  /bin/sed -r -i -e 's/^DEFAULTKERNEL=kernel-smp$/DEFAULTKERNEL=kernel/' /etc/sysconfig/kernel || exit $?
fi
preuninstall scriptlet (using /bin/sh):
/bin/kernel-install remove 5.5.15-200.fc31.x86_64 /lib/modules/5.5.15-200.fc31.x86_64/vmlinuz || exit $?
posttrans scriptlet (using /bin/sh):
/bin/kernel-install add 5.5.15-200.fc31.x86_64 /lib/modules/5.5.15-200.fc31.x86_64/vmlinuz || exit $?

$ rpm -qf /bin/kernel-install
systemd-udev-243.8-1.fc31.x86_64

$ man kernel-install

"kernel-install is used to install and remove kernel and initramfs images to and from the boot loader partition, ..."

Comment 67 Steve 2020-04-06 21:28:11 UTC

Here is another reason for a difference in the sizes -- the Fedora kernel modules are compressed:

$ find 5.5.15-200.fc31.x86_64 -name uvcvideo\* | xargs ls -l
-rw-r--r--. 1 root root 46184 Apr  2 12:50 5.5.15-200.fc31.x86_64/kernel/drivers/media/usb/uvc/uvcvideo.ko.xz
                                                                                                          ^^^
$ find 5.4.10.localversion3-00083-g97d9e8620 -name uvcvideo\* | xargs ls -l
-rw-rw-r--. 1 root root 213417 Apr  6 06:45 5.4.10.localversion3-00083-g97d9e8620/kernel/drivers/media/usb/uvc/uvcvideo.ko

Comment 68 William Bader 2020-04-06 21:54:40 UTC

I got the Fedora 31 VM.

The first step is building vanilla 5.4.10 to test the build procedure and to confirm that the webcam works with the vanilla kernel.

$ scp laptop:/boot/config-5.4.10-200.fc31.x86_64 .
$ git clone --shallow-exclude=linux-5.3.y --branch linux-5.4.y https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4
$ cd linux-5.4/
$ git checkout v5.4.10
$ cp ../config-5.4.10-200.fc31.x86_64 .config
edit .config a bit
$ make oldconfig
$ diff -u0 ../config-5.4.10-200.fc31.x86_64 .config
--- ../config-5.4.10-200.fc31.x86_64    2020-01-09 15:12:02.000000000 -0500
+++ .config     2020-04-06 14:57:03.554427129 -0400
@@ -3 +3 @@
-# Linux/x86_64 5.4.10-200.fc31.x86_64 Kernel Configuration
+# Linux/x86 5.4.10 Kernel Configuration
@@ -7 +7 @@
-# Compiler: gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)
+# Compiler: gcc (GCC) 9.3.1 20200317 (Red Hat 9.3.1-1)
@@ -10 +10 @@
-CONFIG_GCC_VERSION=90201
+CONFIG_GCC_VERSION=90301
@@ -28,3 +28,3 @@
-CONFIG_LOCALVERSION=""
-# CONFIG_LOCALVERSION_AUTO is not set
-CONFIG_BUILD_SALT="5.4.10-200.fc31.x86_64"
+CONFIG_LOCALVERSION=".localversion1"
+CONFIG_LOCALVERSION_AUTO=y
+CONFIG_BUILD_SALT="buildidsalt1"
@@ -43 +43 @@
-CONFIG_DEFAULT_HOSTNAME="(none)"
+CONFIG_DEFAULT_HOSTNAME="dev-william-1"
@@ -8315 +8314,0 @@
-CONFIG_LOCK_DOWN_IN_EFI_SECURE_BOOT=y
@@ -8319 +8317,0 @@
-CONFIG_ALLOW_LOCKDOWN_LIFT_BY_SYSRQ=y
@@ -8746,6 +8744 @@
-CONFIG_DEBUG_INFO=y
-# CONFIG_DEBUG_INFO_REDUCED is not set
-# CONFIG_DEBUG_INFO_SPLIT is not set
-# CONFIG_DEBUG_INFO_DWARF4 is not set
-CONFIG_DEBUG_INFO_BTF=y
-# CONFIG_GDB_SCRIPTS is not set
+# CONFIG_DEBUG_INFO is not set

$ make # 75 minutes
$ make targz-pkg # 2.5 minutes
$ scp vm:linux-5.4.10.localversion1-x86.tar.gz laptop: # 1.5 minutes

Now I have the tar on my laptop with
boot/System.map-5.4.10.localversion1
boot/config-5.4.10.localversion1
boot/vmlinuz-5.4.10.localversion1
lib/modules/5.4.10.localversion1/...

Is this the next step?
Unpack the tarball as root in /
KVER=5.4.10.localversion1
BOOT="/boot"
VMLINUZ="$BOOT/vmlinuz-$KVER"
INITRAMFS="$BOOT/initramfs-$KVER.img"
mkinitrd "$INITRAMFS" "$KVER"
grubby --add-kernel="$VMLINUZ" --initrd="$INITRAMFS" --title="$VMLINUZ" --copy-default # --make-default

What does '/bin/kernel-install add "$KVER" "$VMLINUZ"' do?

and then reboot.

>It sounds like you have only one computer. If so, I suggest borrowing, renting, or buying a second computer, so your system-under-test is not also your support/infrastructure/backup/what-do-I-do-now, computer.

It is a bit risky, but I did a backup to an external drive over the weekend, and I have the Fedora 31 Live CD. I think that if I mess up /boot, I can boot the live CD, mount the system disk, copy files from the live CD, and then reboot from the system disk and use rpm to restore the system files.
Back in the old days, I used to patch SCO Xenix kernels with microemacs.

Regards, William

Comment 69 Steve 2020-04-06 22:26:12 UTC

(In reply to William Bader from comment #68)
> I got the Fedora 31 VM.

Awesome!

> The first step is building vanilla 5.4.10 to test the build procedure and to
> confirm that the webcam works with the vanilla kernel.
> 
> $ scp laptop:/boot/config-5.4.10-200.fc31.x86_64 .
> $ git clone --shallow-exclude=linux-5.3.y --branch linux-5.4.y
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4
> $ cd linux-5.4/
...
> $ make # 75 minutes
> $ make targz-pkg # 2.5 minutes
> $ scp vm:linux-5.4.10.localversion1-x86.tar.gz laptop: # 1.5 minutes

OK, on all that.

> Now I have the tar on my laptop with
> boot/System.map-5.4.10.localversion1
> boot/config-5.4.10.localversion1
> boot/vmlinuz-5.4.10.localversion1
> lib/modules/5.4.10.localversion1/...

Looks good.

> Is this the next step?
> Unpack the tarball as root in /

# tar -xvf linux-5.4.10.localversion1-x86.tar.gz -C /
                                                 ^^^^-- With "-C", you can run the tar command from your current directory. Tested by me.

> KVER=5.4.10.localversion1
...

Don't do any of that. It's now "background" info. :-)

> What does '/bin/kernel-install add "$KVER" "$VMLINUZ"' do?

You are catching up with me. :-)

I just completed a test with /bin/kernel-install and will post an exact command-line separately.

> and then reboot.

Yes.

> >It sounds like you have only one computer. If so, I suggest borrowing, renting, or buying a second computer, so your system-under-test is not also your support/infrastructure/backup/what-do-I-do-now, computer.
> 
> It is a bit risky, but I did a backup to an external drive over the weekend,
> and I have the Fedora 31 Live CD. I think that if I mess up /boot, I can
> boot the live CD, mount the system disk, copy files from the live CD, and
> then reboot from the system disk and use rpm to restore the system files.
> Back in the old days, I used to patch SCO Xenix kernels with microemacs.

OK, you have taken all the precautions that I would. When I accidentally removed my kernel modules (in a VM, admittedly), that was a wake-up call.

> Regards, William

Comment 70 Steve 2020-04-06 22:40:41 UTC

After unpacking the tar file, run the following command as root (with your own kernel version, of course):

# /bin/kernel-install add 5.4.10.localversion3-00083-g97d9e8620 /boot/vmlinuz-5.4.10.localversion3-00083-g97d9e8620 
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(That's copied right from my root shell history in my F31 test VM. The kernel version is in the line twice, so it isn't as complicated as it looks.)

I didn't add the "--verbose" option, but that might be a good idea.

I got some error messages about missing shell scripts. There might be a package that needs to be installed.

After that, check that you have a new initramfs in /boot and a new grub2 ".conf" file in /boot/loader/entries/.

# ls -lt /boot/initramfs-*
# ls -lt /boot/loader/entries/
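
Those two checks, plus a check for the kernel image itself, can be wrapped in a small function. This is only a sketch: check_kernel_install is a hypothetical name, and the file-name patterns (initramfs-<version>.img, and loader entries ending in -<version>.conf) are assumed from the Fedora 31 layout used in this report.

```shell
# check_kernel_install BOOTDIR KVER  (hypothetical helper)
# Report which of the files needed to boot kernel KVER are missing.
# Returns 0 if all are present, 1 otherwise.
check_kernel_install() {
    boot=$1
    kver=$2
    status=0
    if [ ! -f "$boot/vmlinuz-$kver" ]; then
        echo "missing: $boot/vmlinuz-$kver"
        status=1
    fi
    if [ ! -f "$boot/initramfs-$kver.img" ]; then
        echo "missing: $boot/initramfs-$kver.img"
        status=1
    fi
    # Expand the glob; if nothing matches, $1 is the unexpanded pattern,
    # which is not an existing file.
    set -- "$boot"/loader/entries/*-"$kver".conf
    if [ ! -f "$1" ]; then
        echo "missing: loader entry for $kver in $boot/loader/entries/"
        status=1
    fi
    return "$status"
}

# Example:
#   check_kernel_install /boot 5.4.10.localversion3-00083-g97d9e8620
```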

Reboot and test: "lsusb".

Comment 71 Steve 2020-04-06 23:48:17 UTC

(In reply to William Bader from comment #68)
> I got the Fedora 31 VM.
...

Could you go into more detail about that? What software is used to configure and manage VMs on CentOS? Who decided how it would be provisioned (disk, memory)? How did you do the F31 install? On site? Or remotely? What software did you have to enable or configure on the installed F31 VM to use it remotely?

My experience with VMs is on a desktop system with Fedora software:

$ rpm -q virt-manager qemu-kvm
virt-manager-2.1.0-2.fc30.noarch
qemu-kvm-3.1.1-2.fc30.x86_64

And those are fantastic tools, I must say.

Comment 72 William Bader 2020-04-07 04:20:08 UTC

>Could you go into more detail about that?

The office is on coronavirus lockdown. I am on lockdown also, stranded far from the office.
I showed this bug report to the person who manages the VMs, and he connected remotely and created a Fedora 31 VM with 8GB RAM, 44GB disk (11GB currently used by the OS and the kernel build), and one virtual cpu that shows as an Intel Skylake Processor.
Ten years ago, we had a computer room full of headless desktops and towers. We got a single big server and migrated everything to VMs on the big server.
We do daily backups of the important VMs, but it would still be a pain to lose one, so only a few people have access, and I am not on that list. It is better that way, so I don't get the blame if something breaks. I don't know what tool he uses or how he installed Fedora. He installed gcc, but I had to install make, flex, bison, and a few libraries.
Before I did an in-place update from Fedora 30 to 31 on my laptop, he made a Fedora 30 VM and tested the update, and I think that he has already set up a Fedora 31 VM for another project that needed an OS more recent than CentOS 7.
I have some VMs on my laptop under VirtualBox, but my laptop supports only 8GB RAM, so I can't do much in the VMs.
I suppose that since the webcam is a hardware issue, if it doesn't work on the OS on the bare metal, it won't work inside a VM.

Comment 73 William Bader 2020-04-07 05:27:55 UTC

I need help...
I did 'tar -xvf linux-5.4.10.localversion1-x86.tar.gz -C /'
Since /boot was almost full and I had a bunch of kernels from koji, I did 'sudo dnf remove kernel-modules-5.4.11-200.fc31.x86_64 kernel-core-5.4.11-200.fc31.x86_64'
It got some errors, and when I checked, /lib/modules had only 5.4.10.localversion1

Before starting, I made a tar of all of /boot and of /lib/modules/5.5.13-200.fc31.x86_64 so I put the 5.5.13 modules back.
To be safe, I downloaded the 5.5.14 kernel core and modules from koji and installed them.

What went wrong? Every file in the tarball has 5.4.10 in its name.

I also noticed that I can't run 32 bit executables. It wiped out all of /lib.

Something seems to have broken dnf, so when I removed kernel-modules-5.4.11-200.fc31.x86_64, it removed all of /lib.

Comment 74 William Bader 2020-04-07 05:48:39 UTC

Is there a way to validate that all of the files from all of the installed rpms are present?
I've probably lost some information by restoring a /lib that is a few days old.
After doing in-place Fedora updates, I sometimes run 'rpm --rebuilddb' and 'dnf distro-sync', but those probably won't help now.
I restored the files by going to /lib on my backup drive and running 'sudo tar cf - . | (cd /lib ; sudo tar xf -)'
I use rsync to make a few cycles of backups. I know that there are more efficient ways to make backups, but this makes it easy to copy files back as needed.

Comment 75 William Bader 2020-04-07 06:05:58 UTC

Since I did the backup before installing koji modules, my backup didn't restore the modules for 5.4.10-200.fc31 (the last kernel that supported the webcam).
dnf wouldn't let me install it because it thought that it was already installed.
So I ran 'sudo dnf reinstall kernel-modules-5.4.10*.rpm' and it came back with some errors. Are those bad or normal for a reinstall?
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                                                         1/1 
  Reinstalling     : kernel-modules-5.4.10-200.fc31.x86_64                                                                                                                   1/2 
  Running scriptlet: kernel-modules-5.4.10-200.fc31.x86_64                                                                                                                   1/2 
depmod: WARNING: could not open modules.order at /lib/modules/5.4.10-200.fc31.x86_64: No such file or directory
depmod: WARNING: could not open modules.builtin at /lib/modules/5.4.10-200.fc31.x86_64: No such file or directory

  Cleanup          : kernel-modules-5.4.10-200.fc31.x86_64                                                                                                                   2/2 
  Running scriptlet: kernel-modules-5.4.10-200.fc31.x86_64                                                                                                                   2/2 
depmod: WARNING: could not open modules.order at /lib/modules/5.4.10-200.fc31.x86_64: No such file or directory
depmod: WARNING: could not open modules.builtin at /lib/modules/5.4.10-200.fc31.x86_64: No such file or directory

  Verifying        : kernel-modules-5.4.10-200.fc31.x86_64                                                                                                                   1/2 
  Verifying        : kernel-modules-5.4.10-200.fc31.x86_64                                                                                                                   2/2 

Reinstalled:
  kernel-modules-5.4.10-200.fc31.x86_64                                                                                                                                          

Complete!


The modules.order and modules.builtin files are missing in /lib/modules/5.4.10-200.fc31.x86_64 but present in other module directories.
I can't reinstall kernel-core-5.4.10-200.fc31.x86_64 because I don't have enough space in /boot, and I don't want to try removing any other kernels until I hear back from you and have an answer about why dnf removed my /lib.

Comment 76 Steve 2020-04-07 06:28:51 UTC

(In reply to William Bader from comment #74)

There was a BZ mid-air collision.

> Is there a way to validate that all of the files from all of the installed rpms are present?

Yes. I would suggest verifying your kernel packages first, because that produced a lot of error messages when I did that after I thought I had reinstalled all of the kernel packages:

# rpm -Va kernel\* # You don't have to run that as root, but a few files will be flagged with "?", meaning they are unreadable (per "man rpm").

You can then verify all packages with:

# rpm -Va

I am running that right now and seeing a lot of missing files in /lib/. For example, this shows every file is missing:

$ rpm -V kbd-misc

What concerns me is whether running out of disk space caused the problem or if there is something wrong with the tar command being run as root.

> I've probably lost some information by restoring a /lib that is a few days old.
> After doing in-place Fedora updates, I sometimes run 'rpm --rebuilddb' and 'dnf distro-sync', but those probably won't help now.
> I restored the files by going to /lib on my backup drive and running 'sudo tar cf - . | (cd /lib ; sudo tar xf -)'
> I use rsync to make a few cycles of backups. I know that there are more efficient ways to make backups, but this makes it easy to copy files back as needed.

That sounds like a good system.

Comment 77 Steve 2020-04-07 06:41:47 UTC

This could be the problem:

On my F30 primary system, /lib is a link:

$ ls -lF -d /lib
lrwxrwxrwx. 1 root root 7 Feb 11  2019 /lib -> usr/lib/

But in my F31 test VM, /lib is a directory:

$ ls -lF -d /lib
drwxrwxr-x. 3 root root 4096 Apr  6 06:44 /lib/

And here is where all the modules went:

$ ls -F /usr/lib/modules
5.3.7-301.fc31.x86_64/   5.5.11-200.fc31.x86_64/  5.5.8-200.fc31.x86_64/
5.5.10-200.fc31.x86_64/  5.5.13-200.fc31.x86_64/  5.5.9-200.fc31.x86_64/

The "/lib" link needs to be restored!

And that points the finger at the tar command.

Comment 78 Steve 2020-04-07 06:58:27 UTC

This is a bit convoluted, but I wanted to make sure the "ln" command produced the correct result:

# cd /
# ln -s usr/lib lib1
# mv -i lib lib2
# mv -i lib1 lib

Now, this verifies as expected:

# rpm -V kbd-misc

Comment 79 William Bader 2020-04-07 07:11:40 UTC

Thanks. I did cd /; mv lib lib-; ln -s usr/lib lib
If I don't have problems after a day or two, I'll remove lib-
I saw that the tarball had /lib, but I didn't realize that /lib on Fedora is a symlink.

Comment 80 Steve 2020-04-07 07:19:25 UTC

(In reply to William Bader from comment #79)
> Thanks. I did cd /; mv lib lib-; ln -s usr/lib lib
> If I don't have problems after a day or two, I'll remove lib-
> I saw that the tarball had /lib, but I didn't realize that /lib on Fedora is a symlink.

I didn't either. And there is a tar option to preserve the link:

$ man tar
...
   Overwrite control
       These options control tar actions when extracting a file over an existing copy on disk.
...
       --keep-directory-symlink
              Don't replace existing symlinks to directories when extracting.
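
The effect is easy to reproduce in a scratch directory without touching the real /lib. This is a minimal sketch, assuming a GNU tar recent enough to have --keep-directory-symlink. extract_over_symlink is a hypothetical helper name.

```shell
# extract_over_symlink default|keep  (hypothetical helper)
# Unpack an archive containing a real "lib" directory over a scratch root
# in which "lib" is a symlink to usr/lib, then report what "lib" is afterwards.
extract_over_symlink() {
    mode=$1
    work=$(mktemp -d) || return 1
    # A tiny archive that, like the kernel tarball, has "lib" as a directory.
    mkdir -p "$work/src/lib/modules/test"
    : > "$work/src/lib/modules/test/demo.ko"
    tar -C "$work/src" -cf "$work/demo.tar" lib
    # A scratch root laid out like Fedora: lib -> usr/lib.
    mkdir -p "$work/root/usr/lib"
    ln -s usr/lib "$work/root/lib"
    if [ "$mode" = keep ]; then
        tar -xf "$work/demo.tar" -C "$work/root" --keep-directory-symlink
    else
        tar -xf "$work/demo.tar" -C "$work/root"
    fi
    # Print "symlink" if lib survived as a symlink, otherwise "directory".
    if [ -L "$work/root/lib" ]; then
        echo symlink
    else
        echo directory
    fi
    rm -rf "$work"
}

# Example:
#   extract_over_symlink default
#   extract_over_symlink keep
```

With "keep", the contents of the archive's lib/ are extracted through the link into usr/lib, which is the layout Fedora expects.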

Comment 81 William Bader 2020-04-07 07:36:33 UTC

I rebooted to 5.5.14-200.fc31.x86_64 so it seems as if nothing got messed up from having /lib changed.
I moved the modules created by the tarball in the saved /lib- to the correct place.

$ cd /
$ sudo /bin/kernel-install --verbose add 5.4.10.localversion1-x86 /boot/vmlinuz-5.4.10.localversion1-x86
Kernel image argument /boot/vmlinuz-5.4.10.localversion1-x86 not a file
$ sudo /bin/kernel-install --verbose add 5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/00-entry-directory.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/10-devicetree.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/20-grub.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/20-grubby.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/50-depmod.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
Running depmod -a 5.4.10.localversion1
+/usr/lib/kernel/install.d/50-dracut.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/51-dracut-rescue.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/90-loaderentry.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/95-akmodsposttrans.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/95-kernel-hooks.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 
+/usr/lib/kernel/install.d/99-grub-mkconfig.install add 5.4.10.localversion1 /boot/9f668c5979cb49f8ad387216c22a4693/5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1 

It made the initramfs.

A directory list of /boot shows
-rwxrwxr-x  1 root root 76420472 Apr  6 21:57 vmlinux-5.4.10.localversion1 <- do I need this?
-rwxr-xr-x. 1 root root  5977368 Nov 18  2015 vmlinuz-0-rescue-9f668c5979cb49f8ad387216c22a4693
-rwxr-xr-x  1 root root 10285768 Jan  9 20:12 vmlinuz-5.4.10-200.fc31.x86_64
-rw-rw-r--  1 root root  9468288 Apr  6 21:57 vmlinuz-5.4.10.localversion1 <- why isn't this executable?
-rwxr-xr-x  1 root root 10846920 Mar 23 17:45 vmlinuz-5.5.11-200.fc31.x86_64
-rwxr-xr-x  1 root root 10842824 Mar 25 22:09 vmlinuz-5.5.13-200.fc31.x86_64
-rwxr-xr-x  1 root root 10842824 Apr  1 17:50 vmlinuz-5.5.14-200.fc31.x86_64

Can I remove vmlinux-5.4.10.localversion1?

Why isn't vmlinuz-5.4.10.localversion1 executable? (I just made it match the permissions of the other vmlinuz files.)

/boot/loader/entries/9f668c5979cb49f8ad387216c22a4693-5.4.10.localversion1.conf
is
title Fedora (5.4.10.localversion1) 31 (Workstation Edition)
version 5.4.10.localversion1
linux /vmlinuz-5.4.10.localversion1
initrd /initramfs-5.4.10.localversion1.img
options $kernelopts
id fedora-20200407072107-5.4.10.localversion1
grub_users $grub_users
grub_arg --unrestricted
grub_class kernel

so it looks like it will use vmlinuz.

It looks like the tarball files are in place:

# tar df /u/william/linux-5.4.10.localversion1-x86.tar.gz 
boot/vmlinuz-5.4.10.localversion1: Mode differs <- because I made it executable
lib: File type differs <- because I restored the symlink
lib/modules: Mode differs
lib/modules/5.4.10.localversion1/modules.alias: Mod time differs
lib/modules/5.4.10.localversion1/modules.softdep: Mod time differs
lib/modules/5.4.10.localversion1/modules.devname: Mod time differs
lib/modules/5.4.10.localversion1/modules.dep.bin: Mod time differs
lib/modules/5.4.10.localversion1/modules.symbols.bin: Mod time differs
lib/modules/5.4.10.localversion1/modules.dep: Mod time differs
lib/modules/5.4.10.localversion1/modules.alias.bin: Mod time differs
lib/modules/5.4.10.localversion1/modules.builtin.bin: Mod time differs
lib/modules/5.4.10.localversion1/modules.symbols: Mod time differs

Comment 82 William Bader 2020-04-07 08:01:12 UTC

The 5.4.10 new kernel booted, and the webcam works.

$ uname -a
Linux laptop37 5.4.10.localversion1 #1 SMP Mon Apr 6 14:59:42 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
$ l /dev/video*
crw-rw----+ 1 root video 81, 0 Apr  7 08:37 /dev/video0
crw-rw----+ 1 root video 81, 1 Apr  7 08:37 /dev/video1

So next, I'll build 5.4.11 to confirm that it doesn't work with the webcam, and then start the bisections.

Questions:

To clean generated files before a build, should I 'make distclean' or is clean or mrproper better?

I suspect that I need 'make distclean' before each 'git checkout' or 'git bisect', and then I have to remake the config file.

How do I remove the old kernels from my laptop?
sudo /bin/kernel-install --verbose remove 5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1
and then hunt down the files in the tarball?

Comment 83 Steve 2020-04-07 12:20:49 UTC

(In reply to William Bader from comment #81)

> [partial session transcript]

$ cd /

It would probably be OK to continue from the directory with the tar file.

$ sudo /bin/kernel-install --verbose add 5.4.10.localversion1-x86 /boot/vmlinuz-5.4.10.localversion1-x86
Kernel image argument /boot/vmlinuz-5.4.10.localversion1-x86 not a file
                                                        ^^^^
That is in the tar file name, but not in the kernel file name. Presumably, builds for different architectures could be done on one platform, so the tar files would need to have distinct names, but the kernel files would not.

> Can I remove vmlinux-5.4.10.localversion1?

AFAIK, that isn't needed. However, I would suggest leaving files in place until you need to do a cleanup to recover disk space, at which point you can remove all of the files with something like:

# rm -i *localversion*

> Why isn't vmlinuz-5.4.10.localversion1 executable? (I just made it match the permissions of the vmlinuz files.)

Good catch. That doesn't seem to affect the boot process, but it is not clear where those permissions are set.

> [annotated session transcript]

# tar df /u/william/linux-5.4.10.localversion1-x86.tar.gz 
boot/vmlinuz-5.4.10.localversion1: Mode differs <- because I made it executable
lib: File type differs <- because I restored the symlink
lib/modules: Mode differs

OK. I didn't know about the "d" option for verifying a tar install. That sounds very useful.

There doesn't seem to be a tar option to uninstall ...

Comment 84 Steve 2020-04-07 12:52:04 UTC

(In reply to William Bader from comment #82)
> The 5.4.10 new kernel booted, and the webcam works.

Fantastic!

> $ uname -a
> Linux laptop37 5.4.10.localversion1 #1 SMP Mon Apr 6 14:59:42 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
> $ l /dev/video*
> crw-rw----+ 1 root video 81, 0 Apr  7 08:37 /dev/video0
> crw-rw----+ 1 root video 81, 1 Apr  7 08:37 /dev/video1

Nice.

> So next, I'll build 5.4.11 to confirm that it doesn't work with the webcam, and then start the bisections.
> 
> Questions:
> 
> To clean generated files before a build, should I 'make distclean' or is clean or mrproper better?
> 
> I suspect that I need 'make distclean' before each 'git checkout' or 'git bisect', and then I have to remake the config file.

I will post separately on that subject.

> How do I remove the old kernels from my laptop?

# cd /boot
# rm -i *localversion*

> sudo /bin/kernel-install --verbose remove 5.4.10.localversion1 /boot/vmlinuz-5.4.10.localversion1
> and then hunt down the files in the tarball?

That would probably work, but the files are in only two places: /boot/*localversion* and /lib/modules/*localversion*/, so suitable "rm" commands should suffice.

If the removal process were for a released product, it could be more automated. However, I suggest leaving as many files in place for as long as possible until you need to recover disk space. From what you said before, that would probably only be in /boot.

As for the grub2 ".conf" files, they can be left in place even if there is no kernel to run. Grub2 will simply send you back to the menu. If you want to save them, I believe they can be moved to a subdirectory and grub2 will ignore them.

However, after the disappearing "/lib" scare, I believe it would be a good idea to write a shell script for the install process, so that no options are forgotten or mistyped. In outline:

Accept one argument: the kernel version string.
Run the tar command.
Run the kernel-install command.

It should be possible to run it from the directory with the tar file.
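
A minimal sketch of the outline above. It only prints the two commands, so they can be reviewed before anything runs as root. The function name is hypothetical, and the tar file name pattern (linux-<version>-x86.tar in the current directory) is an assumption copied from the file names in this bug. The tar option is the one that keeps the /lib symlink intact.

```shell
#!/bin/sh
# Print the two install steps for one kernel version string.
# Assumption: the tar file is linux-<version>-x86.tar in the current directory.
kernel_install_steps() {
    ver=$1
    if [ -z "$ver" ]; then
        echo "usage: kernel_install_steps KERNEL-VERSION" >&2
        return 1
    fi
    # --keep-directory-symlink stops tar from replacing the /lib symlink.
    echo "tar --keep-directory-symlink -xvf linux-$ver-x86.tar -C /"
    # The "-x86" suffix belongs to the tar file name only, not to the
    # kernel version or the vmlinuz file name.
    echo "/bin/kernel-install --verbose add $ver /boot/vmlinuz-$ver"
}
```

After checking the printed commands, they could be run as root by piping them to a shell, e.g. "kernel_install_steps 5.4.10.localversion1 | sh -ex".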

If this were for a released product, a removal option could be added. But since this script must be run as root, testing and debugging a removal option would be risky and time-consuming.

This is starting to sound like a software development project with specifications. :-)

Comment 85 Steve 2020-04-07 13:29:22 UTC

> How do I remove the old kernels from my laptop?

# cd /boot
# rm -i *localversion*

The kernel file names should all be *distinct* because:

CONFIG_LOCALVERSION_AUTO=y

So, it should be possible to *save* all the bisection kernels and module directories in one directory on your backup drive. (Or the tar files.)

NB: The number N in ".localversionN" doesn't need to be incremented unless the config file itself changes.

$ less init/Kconfig
...
config LOCALVERSION_AUTO
        bool "Automatically append version information to the version string"
...
          This will try to automatically determine if the current tree is a
          release tree by looking for git tags that belong to the current
          top of tree revision.

          A string of the format -gxxxxxxxx will be added to the localversion [Comment 70 has an example.]
          if a git-based tree is found.  The string generated by this will be
          appended after any matching localversion* files, and after the value
          set in CONFIG_LOCALVERSION.

          (The actual string used here is the first eight characters produced
          by running the command:

            $ git rev-parse --verify HEAD

          which is done within the script "scripts/setlocalversion".)
...

Comment 86 Steve 2020-04-07 13:47:57 UTC

(In reply to Steve from comment #85)
...
>           A string of the format -gxxxxxxxx will be added to the localversion [Comment 70 has an example.]
...

As a side note, the Fedora kernel snapshot builds use "gitN", where "N" is a small integer. Those are easier to read and they can be sorted numerically. The actual commit ID is in the changelog (Comment 26).

"git bisect" saves a log, so it should be possible to figure out which build a specific kernel is for by looking at the log:

$ git bisect log
git bisect start
# bad: [9d61432efb21c224b710f397809f3a4fef281f9c] Linux 5.4.11
git bisect bad 9d61432efb21c224b710f397809f3a4fef281f9c
# good: [7a02c193298ec15f2ba1344b6bcd5d578a41b2e0] Linux 5.4.10
git bisect good 7a02c193298ec15f2ba1344b6bcd5d578a41b2e0

Comment 87 Steve 2020-04-07 14:54:09 UTC

(In reply to William Bader from comment #82)
...
> To clean generated files before a build, should I 'make distclean' or is clean or mrproper better?
...

If you did some scratch builds while experimenting with the config file and such-like, starting with "make mrproper" would be a good idea.*

Since "make mrproper" removes the ".config" file, you would need to ensure that you have a "master" copy of your ".config" file. I use a crude form of version control. If I were starting over, I would make sure that the 'N' in '.localversionN' matches the 'N' in '.EXPN':

$ fgrep 'CONFIG_LOCALVERSION=' .config.EXP*
.config.EXP1:CONFIG_LOCALVERSION="localversion1"
.config.EXP2:CONFIG_LOCALVERSION=".localversion2"
.config.EXP3:CONFIG_LOCALVERSION=".localversion2"
.config.EXP4:CONFIG_LOCALVERSION=".localversion3"
.config.EXP4.old:CONFIG_LOCALVERSION=".localversion3"

NB: I believe that the ".old" file was generated by one of the kernel config commands.

As for the actual bisection builds, don't clean anything. "Make" is supposed to figure out what needs to be rebuilt. And that should save a lot of time on subsequent bisection builds (depending, of course, on what actually changed).

For the second simulated bisection build, the build time was only 14 minutes:

$ git bisect bad
Bisecting: 41 revisions left to test after this (roughly 5 steps)
[110440a0eb4e340a0f353f9df86783aa4365f899] ARM: exynos_defconfig: Restore debugfs support

$ time make
...
real	14m0.614s
...

$ fgrep -a 'Linux version' vmlinux
Linux version 5.4.10.localversion3-00041-g110440a0e (test-user-1@test-host-1) (gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC)) #10 SMP ...
                                          ^^^^^^^^^--This matches the beginning of the commit ID above.
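
Going the other way, from a kernel version string back to the commit, can be sketched with plain shell parameter expansion. The function name is hypothetical, and the sketch assumes the "-g<commit>" suffix format described in the Kconfig help text.

```shell
# Extract the abbreviated commit ID from a kernel version string.
# Assumes the "-g<commit>" suffix added by CONFIG_LOCALVERSION_AUTO.
commit_from_version() {
    case $1 in
        *-g*) printf '%s\n' "${1##*-g}" ;;  # strip everything up to the last "-g"
        *)    return 1 ;;                   # no commit suffix in this string
    esac
}
```

In the kernel git tree, something like git rev-parse --verify "$(commit_from_version "$(uname -r)")" should then expand the abbreviation to the full commit ID.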

* Re "make distclean": We shouldn't have any patch files. Depending on your editor, there could be some editor backup files, but those shouldn't affect the builds. This will show you what it actually does:

$ grep -A7 'distclean: mrproper' Makefile

Comment 88 Steve 2020-04-07 23:06:41 UTC

(In reply to Steve from comment #85)
...
> So, it should be possible to *save* all the bisection kernels and module directories in one directory on your backup drive. (Or the tar files.)
...

Here is a non-destructive process:

# cd /boot
# mkdir ARCHIVE
# mv -i *localversion* ARCHIVE

# cd loader/entries/
# mkdir ARCHIVE
# mv -i *localversion* ARCHIVE

If you need to offload the kernel and initramfs from your /boot partition, you could make /boot/ARCHIVE a link to a directory on your backup drive.

And grub2 nicely ignores the ARCHIVE directory. :-)

NB: I am ignoring the modules on the assumption that they don't cause any problems in /lib/modules/.
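
Since the same three commands are repeated for both directories, they could be wrapped in one function. This is only a sketch: the function name is hypothetical, and it uses "mv -n" (never overwrite) in place of the interactive "mv -i" so that it can run unattended.

```shell
# Move every *localversion* entry in DIR into DIR/ARCHIVE.
archive_localversion() {
    dir=$1
    mkdir -p "$dir/ARCHIVE" || return 1
    for f in "$dir"/*localversion*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        mv -n "$f" "$dir/ARCHIVE/"   # -n: never overwrite an archived copy
    done
}
```

As root, it would be called once for /boot and once for /boot/loader/entries.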

Comment 89 Steve 2020-04-07 23:50:17 UTC

# tar --keep-directory-symlink -xvf linux-5.4.10.localversion3-00083-g97d9e8620-x86.tar -C /
      ^^^^^^^^^^^^^^^^^^^^^^^^

That worked as expected -- /lib was a link before running the command and the link was still there after running the command.

If you want to be more cautious, you could unpack the tar file into a local directory and then manually move the files and directories to /boot and /lib/modules/:

# mkdir linux-5.4.10.localversion3-00083-g97d9e8620
# tar -xvf linux-5.4.10.localversion3-00083-g97d9e8620-x86.tar -C ./linux-5.4.10.localversion3-00083-g97d9e8620/

Comment 90 Steve 2020-04-08 13:39:10 UTC

(In reply to William Bader from comment #82)
...
> How do I remove the old kernels from my laptop?
...

The kernel can be built as an rpm package and installed with dnf:

$ make binrpm-pkg

$ ls -1sh ~/rpmbuild/RPMS/x86_64/
total 74M
 73M kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64.rpm
1.3M kernel-headers-5.4.10.localversion3_00041_g110440a0e-10.x86_64.rpm

[scp to the target system]

On the target system:

# dnf install kernel*.rpm

$ rpm -qi kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64
Name        : kernel
Version     : 5.4.10.localversion3_00041_g110440a0e
...

The initramfs file is built, the grub2 ".conf" file is written, and it boots.

After rebooting with another kernel:

# dnf list --installed kernel\*localversion\*
Installed Packages
kernel.x86_64                              5.4.10.localversion3_00041_g110440a0e-10                              @@commandline

The remove command succeeds, although there are several error messages about missing files:

# dnf remove kernel-5.4.10.localversion3_00041_g110440a0e-10
Dependencies resolved.
==============================================================================================================================
 Package            Architecture       Version                                                Repository                 Size
==============================================================================================================================
Removing:
 kernel             x86_64             5.4.10.localversion3_00041_g110440a0e-10               @@commandline             270 M

Transaction Summary
==============================================================================================================================
Remove  1 Package

Freed space: 270 M
Is this ok [y/N]: y
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                      1/1 
  Running scriptlet: kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64                                               1/1 
  Erasing          : kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64                                               1/1 
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.symbols.bin: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.symbols: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.softdep: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.devname: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.dep.bin: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.dep: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.builtin.bin: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.alias.bin: remove failed: No such file or directory
warning: file /lib/modules/5.4.10.localversion3-00041-g110440a0e/modules.alias: remove failed: No such file or directory
warning: file /boot/vmlinuz-5.4.10.localversion3-00041-g110440a0e: remove failed: No such file or directory
warning: file /boot/config-5.4.10.localversion3-00041-g110440a0e: remove failed: No such file or directory
warning: file /boot/System.map-5.4.10.localversion3-00041-g110440a0e: remove failed: No such file or directory

  Running scriptlet: kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64                                               1/1 
  Verifying        : kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64                                               1/1 

Removed:
  kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64                                                                      

Complete!

Comment 91 Steve 2020-04-08 14:00:31 UTC

(In reply to Steve from comment #90)
...
> $ ls -1sh ~/rpmbuild/RPMS/x86_64/
...

"~/rpmbuild" was a pre-existing directory from previous experiments with building rpm packages.

As a test, I renamed it and ran "make binrpm-pkg" again. The "~/rpmbuild" directory was recreated.

Comment 92 Steve 2020-04-08 15:27:36 UTC

(In reply to Steve from comment #67)
> Here is another reason for a difference in the sizes -- the Fedora kernel modules are compressed:
> 
> $ find 5.5.15-200.fc31.x86_64 -name uvcvideo\* | xargs ls -l
> -rw-r--r--. 1 root root 46184 Apr  2 12:50 5.5.15-200.fc31.x86_64/kernel/drivers/media/usb/uvc/uvcvideo.ko.xz
...

There appears to be another discrepancy between the packaged config file and what is actually built:

$ fgrep 'CONFIG_MODULE_COMPRESS' config-5.4.10
# CONFIG_MODULE_COMPRESS is not set

So, try a build with:

$ diff -u0 .config.EXP4 .config.EXP5
...
@@ -28 +28 @@
-CONFIG_LOCALVERSION=".localversion3"
+CONFIG_LOCALVERSION=".localversion5" # I bumped the number to match the config file version number.
@@ -854 +854,3 @@
-# CONFIG_MODULE_COMPRESS is not set
+CONFIG_MODULE_COMPRESS=y
+# CONFIG_MODULE_COMPRESS_GZIP is not set
+CONFIG_MODULE_COMPRESS_XZ=y

$ time make
...
real	12m7.546s
...

$ time make binrpm-pkg
...
real	5m9.922s
...

Unfortunately, module compression seems to have made the package bigger:

$ ls -1sh kernel-5.4.10.localversion*_00041_g110440a0e-*.x86_64.rpm
73M kernel-5.4.10.localversion3_00041_g110440a0e-10.x86_64.rpm
75M kernel-5.4.10.localversion5_00041_g110440a0e-11.x86_64.rpm

Comment 93 Steve 2020-04-08 16:25:34 UTC

The version with compressed modules boots:

$ uname -r
5.4.10.localversion5-00041-g110440a0e

The compressed module directory is smaller than the Fedora kernel module directories:

$ du -s /lib/modules/* | sort -n
65412	/lib/modules/5.4.10.localversion5-00041-g110440a0e
79108	/lib/modules/5.3.7-301.fc31.x86_64
...
82600	/lib/modules/5.5.15-200.fc31.x86_64
258152	/lib/modules/5.4.10.localversion3-00083-g97d9e8620

That could be because the Fedora kernel module directories have additional files, including a copy of "vmlinuz":

$ diff -q ./5.3.7-301.fc31.x86_64/ ./5.4.10.localversion5-00041-g110440a0e/ | cat -n
     1	Only in ./5.3.7-301.fc31.x86_64/: bls.conf
     2	Only in ./5.3.7-301.fc31.x86_64/: build
...
    26	Only in ./5.3.7-301.fc31.x86_64/: vmlinuz
    27	Only in ./5.3.7-301.fc31.x86_64/: .vmlinuz.hmac

$ ls -sh ./5.3.7-301.fc31.x86_64/vmlinuz 
8.9M ./5.3.7-301.fc31.x86_64/vmlinuz

$ file ./5.3.7-301.fc31.x86_64/vmlinuz
./5.3.7-301.fc31.x86_64/vmlinuz: Linux kernel x86 boot executable bzImage, version 5.3.7-301.fc31.x86_64 ...

Comment 94 William Bader 2020-04-08 18:19:46 UTC

Thanks for the research.

>That is in the tar file name, but not in the kernel file name.

I typed it wrong and accidentally included it when I did a cut and paste from my shell window.

>I didn't know about the "d" option for verifying a tar install.

I used to use the old single letter options a lot.
A long time ago, I modified a version of pdtar (the predecessor to gtar) to build on MSDOS and read and write SCO Xenix-compatible tar files to floppies using BIOS calls for IO.

>I would make sure that the 'N' in '.localversionN' matches the 'N' in '.EXPN'

I saved my config with
cp -p .config ../config-`grep Linux .config | head -1 | awk '{print $3}'`-`grep -i CONFIG_LOCALVERSION= .config | sed -e 's/.*=".//' -e 's/"//g'`

>rm -i *localversion*

I removed the old kernel with the commands below.
I've started deleting files by moving them to /tmp instead of deleting them.
It makes typos easier to fix.
I reboot my laptop every day, which clears out /tmp.

mv -iv /boot/*localversion1 /boot/*localversion1.img /boot/loader/entries/*localversion1.conf /tmp/
mv -i /lib/modules/*localversion1/ /tmp/

My laptop is booted from 5.5.14-200.fc31.x86_64 from koji.
Even idle, the cpu is 80C. I don't see anything unusual in top or gkrellm, and powertop shows 100 wakeups/second and 1.1% CPU use.

[later] I updated to 5.5.15, and it seems ok. The cpu is below 60C.

>I don't know why you would need vmlinux, and there is another quirk about what is in the tar file:
>Both vmlinux and vmlinuz are included:

Removing the large vmlinux kernel from /boot works.
Just the vmlinuz kernel referenced in the /boot/loader/entries/ config file is enough.
The vmlinuz kernel does not need to be executable.

My /boot is only about 400MB, and the vmlinux file is about 100MB, so being able to remove it helps a lot.

>The compressed module directory is smaller than the Fedora kernel module directories:

I noticed the size difference, but as long as it boots, I didn't look into it.
I only need it running enough to boot and test the webcam video.

>make binrpm-pkg

I can try that on the next bisection.

I am making progress with the bisections.
5.4.10 good
5.4.11 bad
[97d9e8620f57f28f415b23ad88b97c87b6d53390] bnx2x: Do not handle requests from VFs after parity / good

Regards, William

Comment 95 Steve 2020-04-08 21:39:30 UTC

> I typed it wrong and accidentally included it when I did a cut and paste from my shell window.

Actually, that was very helpful, because it showed that the architecture is needed in some file names, but not in others.

> I used to use the old single letter options a lot.

The "ps" man page shows a lot of those "x" options, along with "-x" and "--xyz" options. That should make everyone happy, until they get confused about whether "x" means the same thing as "-x" or "--xyz". :-)

> A long time ago, I modified a version of pdtar (the predecessor to gtar) to build on MSDOS and read and write SCO Xenix-compatible tar files to floppies using BIOS calls for IO.

"tar" has been useful for a long time. Now, with various standards, compatibility problems don't seem to be as common.

> I saved my config with
> cp -p .config ../config-`grep Linux .config | head -1 | awk '{print $3}'`-`grep -i CONFIG_LOCALVERSION= .config | sed -e 's/.*=".//' -e 's/"//g'`

Automating that is a good idea, although debugging something like that would take me a "few" tries. :-)

I used to write big awk programs, so here is how I got the local version number with just awk:

$ cat /tmp/foo/.config | awk '/CONFIG_LOCALVERSION=/ { print "\nDEBUG:", $0; gsub("[=\"]+", " "); sub(".localversion", ""); lver=$2 }; END { print ".config.EXP" lver }'

DEBUG: CONFIG_LOCALVERSION=".localversion5"
.config.EXP5

In a regular expression, "." matches any character, so the second regular expression ("sub") is not very robust.
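
A stricter variant, as a sketch: the pattern is anchored at both ends, the dot is escaped and optional, and only digits are accepted after "localversion". The function name is hypothetical; the "localversionN" naming is the convention used in this bug, not something the kernel requires.

```shell
# Print N from CONFIG_LOCALVERSION=".localversionN" (leading dot optional).
# Prints nothing if the value does not follow that naming convention.
localversion_number() {
    sed -n 's/^CONFIG_LOCALVERSION="\.\{0,1\}localversion\([0-9][0-9]*\)"$/\1/p' "$1"
}
```

The config file name could then be built with: echo ".config.EXP$(localversion_number .config)"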

> I removed the old kernel with the commands below.
> I've started deleting files by moving them to /tmp instead of deleting them.
> It makes typos easier to fix.
> I reboot my laptop every day, which clears out /tmp.

> mv -iv /boot/*localversion1 /boot/*localversion1.img /boot/loader/entries/*localversion1.conf /tmp/
> mv -i /lib/modules/*localversion1/ /tmp/

Excellent idea. You are using /tmp something like the trash can in desktop environments. I use /tmp a lot, but I never thought of using it that way.

> My laptop is booted from 5.5.14-200.fc31.x86_64 from koji.
> Even idle, the cpu is 80C. I don't see anything unusual in top or gkrellm, and powertop shows 100 wakeups/second and 1.1% CPU use.

There could be a mechanical problem -- have you tried vacuuming the vents?

> [later] I updated to 5.5.15, and it seems ok. The cpu is below 60C.

I've seen bug reports about fans running continuously, so it could happen the other way. Can you monitor fan speeds with gkrellm?

I have a shell script for monitoring temperatures and fan speeds. Here is a snippet:

$ egrep -n 'temp1|fan1' ~/bin/mbmon3
14:    TEMP1=$(cat /sys/class/hwmon/hwmon1/temp1_input)
26:    FAN1=$(cat /sys/class/hwmon/hwmon2/fan1_input)
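
The hwmon temp*_input files report millidegrees Celsius, so the raw value needs a conversion before it is comparable to the 80C and 60C figures above. A sketch with hypothetical function names; which hwmonN directory holds which sensor differs between machines, so the glob below is an assumption.

```shell
# Convert a hwmon temp*_input reading (millidegrees Celsius) to degrees
# with one decimal place. Intended for non-negative readings.
millideg_to_c() {
    printf '%d.%d\n' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 100 ))"
}

# Print temp1 for every hwmon device that exposes a readable value.
show_temps() {
    for f in /sys/class/hwmon/hwmon*/temp1_input; do
        v=$(cat "$f" 2>/dev/null) || continue   # missing or unreadable sensor
        [ -n "$v" ] || continue
        echo "$f: $(millideg_to_c "$v") C"
    done
}
```

Running show_temps in a loop with sleep gives a crude monitor. A fan1_input file, where the hardware exposes one, reports RPM directly and needs no conversion.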

> Removing the large vmlinux kernel from /boot works.
> Just the vmlinuz kernel referenced in the /boot/loader/entries/ config file is enough.
> The vmlinuz kernel does not need to be executable.

The kernel has an odd status: it is certainly executable code, but the boot loader is the only program that starts it.

> My /boot is only about 400MB, and the vmlinux file is about 100MB, so being able to remove it helps a lot.

~500MB used to be the recommended size for /boot. Now I standardize on ~1000MB.

> I noticed the size difference, but as long as it boots, I didn't look into it.
> I only need it running enough to boot and test the webcam video.

Agreed.

>>make binrpm-pkg

> I can try that on the next bisection.

OK.

> I am making progress with the bisections.
> 5.4.10 good
> 5.4.11 bad
> [97d9e8620f57f28f415b23ad88b97c87b6d53390] bnx2x: Do not handle requests from VFs after parity / good

Excellent. That is the same commit that I got for the first bisection build, except that I called it "bad". (per "git bisect log")

The "git bisect" man page has a section on how to fix "a mistake in specifying the status of a revision" by using the "git bisect replay" subcommand.

> Regards, William

Comment 96 Steve 2020-04-08 22:10:36 UTC

(In reply to William Bader from comment #94)
...
> >I would make sure that the 'N' in '.localversionN' matches the 'N' in '.EXPN'
> 
> I saved my config with
> cp -p .config ../config-`grep Linux .config | head -1 | awk '{print $3}'`-`grep -i CONFIG_LOCALVERSION= .config | sed -e 's/.*=".//' -e 's/"//g'`
...

As a side note, I was actually doing just the opposite:

.config.EXP4 -> "make nconfig" -> .config.EXP5 -> "cp" -> .config.

The problem with my procedure is that I could not remember the various numbers, so I had to background "make nconfig" to check the file names and then foreground it again to save the updated .config.EXP[N+1] file.

Your approach is more like doing a git commit: Change the file and then save a copy with the new version number.

Comment 97 William Bader 2020-04-10 03:41:11 UTC

Making rpms, combined with a virus stay-at-home order for Apr 9-13, helped finish the bisection.
The first bad commit is 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430 usb: missing parentheses in USE_NEW_SCHEME
The bad commit changes one line in drivers/usb/core/hub.c:
-#define USE_NEW_SCHEME(i, scheme)      ((i) / 2 == (int)scheme)
+#define USE_NEW_SCHEME(i, scheme)      ((i) / 2 == (int)(scheme))
That seems to be a reasonable change, because the macro is used only once in use_new_scheme(struct usb_device *udev, int retry, struct usb_port *port_dev)
  return USE_NEW_SCHEME(retry, old_scheme_first_port || old_scheme_first || quick_enumeration);
hub_port_init(struct usb_hub *hub, struct usb_device *udev, int port1, int retry_counter) has
                if (use_new_scheme(udev, retry_counter, port_dev)) {
hub_port_connect(struct usb_hub *hub, int port1, u16 portstatus, u16 portchange) has
        for (i = 0; i < SET_CONFIG_TRIES; i++) {
                status = hub_port_init(hub, udev, port1, i);
#define SET_CONFIG_TRIES        (2 * (use_both_schemes + 1))

My guess is that either old_scheme_first or quick_enumeration is set, which makes the old USE_NEW_SCHEME true for i = 0 and the new USE_NEW_SCHEME false.
Maybe the USB errors in my first comment happen on the initial try if USE_NEW_SCHEME is false, but the errors aren't hard enough to make it retry the new scheme or happen too late to be able to retry.
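
The precedence effect can be checked without building a kernel. Shell arithmetic has no casts, but "/", "==" and "||" rank the same way as in C, so the two functions below mirror the old and new macro expansions. This is a sketch: the function names are hypothetical, the flag values (only old_scheme_first set) are an assumption, and the loop bound of 4 assumes use_both_schemes = 1.

```shell
# Arguments: retry, old_scheme_first_port, old_scheme_first, quick_enumeration
# old: ((i) / 2 == (int)scheme)   -> the cast and "==" bind to the first flag only
# new: ((i) / 2 == (int)(scheme)) -> the whole "a || b || c" is compared
old_macro() { echo $(( $1 / 2 == $2 || $3 || $4 )); }
new_macro() { echo $(( $1 / 2 == ($2 || $3 || $4) )); }

# Assumed flags: old_scheme_first_port=0, old_scheme_first=1, quick_enumeration=0
for i in 0 1 2 3; do
    echo "retry=$i old=$(old_macro "$i" 0 1 0) new=$(new_macro "$i" 0 1 0)"
done
```

With these assumed values, the old expansion yields 1 for every retry, while the new one yields 0 for retries 0 and 1 and 1 for retries 2 and 3. So the old macro never selected the old scheme when one of the later flags was set, which is consistent with the guess above.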


Notes since the last comment:

>Unfortunately, module compression seems to have made the package bigger:

Probably the archive is compressed, and compressing compressed files doesn't usually have much gain.
At least for now, the transfer is fast enough, and my system disk (with /lib/modules) is large enough, that I don't need to worry about compressing modules.
My main problem was the uncompressed vmlinux filling up my /boot.
The RPM does not have the vmlinux, plus installing and removing the RPM is safer and easier than typing tar and rm commands.

I think you were right when you suggested that the laptop could be dusty inside.
The last time I opened it up was a few years ago to install a larger SSD, and I blew out a lot of dust.

I hear the fans spinning but something from the other side sounds like a hard disk, and my laptop doesn't have a hard disk.
[later] It didn't stop when I shut down my laptop. It might be a hot water pipe for a radiator on the other side of a wall.
[more later] I called a serviceman because the bathroom had no hot water. The serviceman couldn't fix it, and now the heat for the radiators is broken also.

As far as I know, the hardware on my laptop does not expose the fan speed, and gkrellm does not show anything under 'Fans'.

On 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430 , I accidentally downloaded and installed the headers rpm instead of the kernel.
It replaced the 5.5.15-200 headers, and when I tried removing kernel-headers-5.4.10.localversion8_00165_g7cbdf96cd-1.x86_64.rpm , it wanted to remove a lot of packages.
I reinstalled the 5.5.15-200 headers, and then I could remove the 5.4.10 headers. I was one 'y' away from doing some damage, although 'dnf history rollback #' would probably have fixed it.

After I downloaded the rpm, I cleared the rpms from rpmbuild/RPMS/x86_64/ , and when I rebuilt the rpms with 'binrpm-pkg', the header rpm came out a slightly different size.
It is a little worrying that repeating 'make binrpm-pkg' on the same kernel build produces rpms with different sizes. Does it embed timestamps or logs?

Here are my bisections:
5.4.10 good
5.4.11 bad
2 good [97d9e8620f57f28f415b23ad88b97c87b6d53390] bnx2x: Do not handle requests from VFs after parity
3 good 5.4.10.localversion3-00124-g43b0b3300-rpm
4 good [72cd84ea52407323b241571691b2426fb25c41ef] net: usb: lan78xx: fix possible skb leak / 5.4.10.localversion4-00145-g72cd84ea5
5 good [f479506e5164cb9eff4c60531bd48026dd433e4a] macb: Don't unregister clks unconditionally
6 good [caef8a716245726ede87417113db03f045fc1989] net/mlx5e: Fix hairpin RSS table size
7 good [578289f8476c3044f73ff15e138bfca555567ffe] USB: core: fix check for duplicate endpoints
8 bad [7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430] usb: missing parentheses in USE_NEW_SCHEME
9 good [093d658a06cd1831c629ceeee207572895c1a872] USB: serial: option: add Telit ME910G1 0x110a composition

I had to remove the installed kernels from /boot, but I saved the rpms.

Here is the message from git on reaching the end:

7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430 is the first bad commit
commit 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430
Author: Qi Zhou <atmgnd>
Date:   Sat Jan 4 11:02:01 2020 +0000

    usb: missing parentheses in USE_NEW_SCHEME

    commit 1530f6f5f5806b2abbf2a9276c0db313ae9a0e09 upstream.

    According to bd0e6c9614b9 ("usb: hub: try old enumeration scheme first
    for high speed devices") the kernel will try the old enumeration scheme
    first for high speed devices.  This can happen when a high speed device
    is plugged in.

    But due to missing parentheses in the USE_NEW_SCHEME define, this logic
    can get messed up and the incorrect result happens.

    Acked-by: Alan Stern <stern.edu>
    Signed-off-by: Qi Zhou <atmgnd>
    Link: https://lore.kernel.org/r/ht4mtag8ZP-HKEhD0KkJhcFnVlOFV8N8eNjJVRD9pDkkLUNhmEo8_cL_sl7xy9mdajdH-T8J3TFQsjvoYQT61NFjQXy469Ed_BbBw_x4S1E=@protonmail.com
    [ fixup changelog text - gregkh]
    Cc: stable <stable.org>
    Fixes: bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices")
    Signed-off-by: Greg Kroah-Hartman <gregkh>

 drivers/usb/core/hub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 98 Steve 2020-04-10 09:25:39 UTC

> Making rpms and combined with a virus stay-at-home order for Apr 9-13 helped finish the bisection.
> The first bad commit is 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430 usb: missing parentheses in USE_NEW_SCHEME

Awesome! Please put "[BISECTED]" at the beginning of the bug summary:

"[BISECTED] built-in laptop webcam no longer found on Sony Vaio on Fedora 31"

Comment 99 Steve 2020-04-10 10:10:53 UTC

> when I tried removing kernel-headers-5.4.10.localversion8_00165_g7cbdf96cd-1.x86_64.rpm , it wanted to remove a lot of packages.

The "--noautoremove" option to dnf can sometimes reduce the number of packages that dnf wants to remove.

BTW, I also accidentally transferred the headers package to my F31 test VM, but I didn't install it. My mistake was in creating a link in /tmp/ that made the path passed to "scp" simpler. From my shell history:

  963  ln -s `pwd`/kernel-headers-5.4.10.localversion5_00124_g43b0b3300-12.x86_64.rpm /tmp/  # WRONG
  968  ln -s `pwd`/kernel-5.4.10.localversion5_00124_g43b0b3300-12.x86_64.rpm /tmp/          # RIGHT

Those commands were run from "~/rpmbuild/RPMS/x86_64".

I didn't install the headers package because I ran some validation checks after transferring the rpm package. From the shell history in my F31 test VM:

  960  rpmkeys --checksig -v kernel-headers-5.4.10.localversion5_00124_g43b0b3300-12.x86_64.rpm
  963  rpm -qi -p kernel-headers-5.4.10.localversion5_00124_g43b0b3300-12.x86_64.rpm

IIRC, the description from the second command caught my attention as being wrong. Interestingly, the "headers" in "kernel-headers ..." never caught my attention. I believe that is due to the file names being so long and tab-completion being so easy.
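
A file-name guard could catch this kind of mix-up before the transfer. This is only a sketch: the function name is_kernel_core_rpm is hypothetical, and it inspects nothing but the file name, so it complements the "rpm -qi -p" check rather than replacing it.

```sh
#!/bin/sh
# Hypothetical helper: succeed only if the file name looks like the main
# kernel package ("kernel-<version>...rpm") and not a subpackage such as
# kernel-headers.
is_kernel_core_rpm() {
    case "${1##*/}" in
        kernel-[0-9]*.rpm) return 0 ;;
        *) return 1 ;;
    esac
}

for f in \
    kernel-headers-5.4.10.localversion5_00124_g43b0b3300-12.x86_64.rpm \
    kernel-5.4.10.localversion5_00124_g43b0b3300-12.x86_64.rpm
do
    if is_kernel_core_rpm "$f"; then
        echo "OK:    $f"
    else
        echo "WRONG: $f"
    fi
done
```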

Comment 100 Steve 2020-04-10 10:22:57 UTC

(In reply to Steve from comment #99)
> IIRC, the description from the second command caught my attention as being wrong.

It was because the "scp" transfer seemed unusually fast. That provoked me to investigate further.

Comment 101 Steve 2020-04-10 10:32:01 UTC

> My guess is that either old_scheme_first or quick_enumeration is set, which makes the old USE_NEW_SCHEME true for i = 0 and the new USE_NEW_SCHEME false.
> Maybe the USB errors in my first comment happen on the initial try if USE_NEW_SCHEME is false, but the errors aren't hard enough to make it retry the new scheme or happen too late to be able to retry.

You could be right about that. The whole approach may need to be rethought. But guessing in software about how long some otherwise unknown piece of hardware should take to do something is an insoluble problem. I believe that some hardware can export expected timings for various commands -- but first you have to talk to the hardware.
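
The precedence effect guessed at in the quoted text can be illustrated with shell arithmetic, which follows C precedence for "/", "==" and "||". This is a sketch only: the flag names are taken from the quote, and the assumption that the macro is called with an "a || b" style argument is not confirmed anywhere in this report.

```sh
#!/bin/sh
# Arguments to both functions: retry counter i, old_scheme_first,
# quick_enumeration (names from the quoted comment; the exact kernel
# expression is an assumption).

# Old macro: the argument is pasted in without parentheses, so
# "i / 2 == a || b" groups as "(i / 2 == a) || b".
use_new_scheme_old() {
    echo $(( $1 / 2 == $2 || $3 ))
}

# Fixed macro: the whole argument is parenthesized before the comparison.
use_new_scheme_new() {
    echo $(( $1 / 2 == ($2 || $3) ))
}

for i in 0 1 2 3; do
    echo "i=$i old=$(use_new_scheme_old "$i" 0 1) new=$(use_new_scheme_new "$i" 0 1)"
done
```

With quick_enumeration set to 1, the old form yields 1 for every i, while the fixed form yields 0 for i = 0 and 1 and yields 1 for i = 2 and 3.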

Anyway, you have enough to open an upstream bug if you want to:

https://bugzilla.kernel.org/

Comment 102 Steve 2020-04-10 11:02:52 UTC

(In reply to Steve from comment #101)
> Anyway, you have enough to open an upstream bug if you want to:
> 
> https://bugzilla.kernel.org/

I guess you have to register to open a new bug. A search of bugs under Drivers/USB found some instructions for collecting debug info:

https://bugzilla.kernel.org/show_bug.cgi?id=203419#c21

I'm not sure if they are exactly applicable when the problem occurs while booting and the device is built in, but some of those options could probably be put on the kernel command-line.

You probably don't need to do the first step because:

$ mount -t debugfs
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime,seclabel)

Tested in an F31 VM with:

$ uname -r
5.5.16-200.fc31.x86_64

Comment 103 Steve 2020-04-10 15:04:35 UTC

Thanks for updating the bug summary with "[BISECTED]".

(In reply to Steve from comment #102)
...
> You probably don't need to do the first step because:
> 
> $ mount -t debugfs
> debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime,seclabel)
...

No problem:

$ grep CONFIG_DYNAMIC_DEBUG .config
CONFIG_DYNAMIC_DEBUG=y

Dynamic debug
https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html

See, in particular, the section titled "Debug messages during Boot Process" and the examples at the end under "Kernel command line:".

To set up "dyndbg" from the kernel command-line, this appears to work (Mathias adds some additional options that I haven't worked out yet):

$ grep '^GRUB_CMDLINE_LINUX' /etc/default/grub
GRUB_CMDLINE_LINUX="dyndbg='module xhci_hcd =p ; module usbcore =p'"

Although this looks wrong (the first quote should be after "dyndbg="):

$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.16-200.fc31.x86_64 root=UUID=54f79645-f858-46e0-af7a-97aecc88ff87 ro "dyndbg=module xhci_hcd =p ; module usbcore =p"

Verify with:

# egrep 'xhci_hcd|usbcore' /sys/kernel/debug/dynamic_debug/control | less  # Look for "=p" in the third field.
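
Instead of paging through the control file, the third field can be summarized per module. The following is a rough sketch: the function name dyndbg_summary is made up here, and it assumes the "file:line [module]function flags format" layout shown in the control-file lines quoted in this report.

```sh
#!/bin/sh
# Count dyndbg call sites per module and how many have the "p" flag set.
# Reads control-file lines on stdin; field 2 is "[module]function" and
# field 3 is the flags ("=_" means nothing is enabled).
dyndbg_summary() {
    awk '
        match($2, /^\[[^]]+\]/) {
            mod = substr($2, 2, RLENGTH - 2)
            total[mod]++
            if (index($3, "p") > 0) enabled[mod]++
        }
        END {
            for (mod in total)
                printf "%s %d/%d\n", mod, enabled[mod] + 0, total[mod]
        }
    ' | sort
}

# Sample input with made-up selection of lines in the control-file format.
dyndbg_summary <<'EOF'
drivers/usb/core/hub.c:5595 [usbcore]hub_event =p "over-current change\012"
drivers/usb/core/hub.c:5583 [usbcore]hub_event =p "power change\012"
drivers/usb/storage/uas.c:126 [uas]uas_scan_work =_ "scan complete\012"
EOF
```

On a live system the input would come from the real file, run as root: dyndbg_summary < /sys/kernel/debug/dynamic_debug/control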

Comment 104 Steve 2020-04-10 15:40:29 UTC

(In reply to Steve from comment #103)
...
$ grep '^GRUB_CMDLINE_LINUX' /etc/default/grub
GRUB_CMDLINE_LINUX="dyndbg='module xhci_hcd =p ; module usbcore =p'"
...

I repeatedly used the following process as I worked out that kernel command-line:

$ sudo su

Edit GRUB_CMDLINE_LINUX in /etc/default/grub.

# cd /boot/grub2

# grub2-mkconfig -o grub.cfg  # Rebuild grub.cfg. (Make a backup before starting.)

# grep kernelopts grub.cfg    # See what grub2-mkconfig actually generated.

# reboot

Press "e" when the grub2 menu is displayed and verify that the kernel command-line looks correct. I had various problems with nested quotes and what appeared to be incorrect interpretation of the ";" in the command-line.

Press "ctrl-x" to boot.

Login as usual, start a terminal session, and see what is actually on the kernel command-line:

$ cat /proc/cmdline  # As noted in Comment 103, the quotes end up like this "dyndbg=xxx" instead of like this dyndbg="xxx".
                                                                            ^          ^                             ^   ^

$ sudo su

# egrep 'xhci_hcd|usbcore' /sys/kernel/debug/dynamic_debug/control | less  # Look for "=p" in the third field.

Comment 105 William Bader 2020-04-10 15:50:03 UTC

Thanks for the reply.

>CONFIG_DYNAMIC_DEBUG=y

That was already set in my .config, which came from a Fedora config. I would have expected that it would be off by default for performance reasons.

>grep '^GRUB_CMDLINE_LINUX' /etc/default/grub

Is there a way to set it other than /etc/default/grub ? I am worried that if I mess up that file, I could leave my laptop unbootable.

Comment 106 Steve 2020-04-10 16:48:03 UTC

(In reply to William Bader from comment #105)
> Thanks for the reply.
> 
> >CONFIG_DYNAMIC_DEBUG=y
> 
> That was already set in my .config, which came from a Fedora config. I would have expected that it would be off by default for performance reasons.

Good point. Although, based on previous observations, the packaged config file does not match what is in the actual kernel. However, this much is certain:

$ mount -t debugfs
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime,seclabel)

> >grep '^GRUB_CMDLINE_LINUX' /etc/default/grub
> 
> Is there a way to set it other than /etc/default/grub ? I am worried that if I mess up that file, I could leave my laptop unbootable.

You could type it all in by hand and "mess" that up instead. :-) If the command-line were simple, typing it in by hand would definitely be the best way to change the kernel options.

As far as making your system unbootable, that is possible, but the kernel is very lenient about bogus command-line options -- it ignores them and doesn't even tell you. :-) And grub2 is helpful too -- it will drop you into its rescue shell. See below.

However, there are two recovery strategies:

1. Boot from the F31 Live image, mount /boot, and replace the bad grub.cfg with the backup grub.cfg.

2. Use the grub2 "normal" command. (NB: That works great but it would be a good idea to read the grub2 docs first and do a few practice runs before having to use it "in extremis". That is a perfect application for a VM -- practice breaking things and then fixing them. :-))
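
The backup-and-replace part of strategy 1 might look like the sketch below. The helper names and the ".BAK" suffix are assumptions for illustration; on a real system the path would be the grub.cfg under the mounted /boot, and the commands would be run as root.

```sh
#!/bin/sh
# Hypothetical helpers; the ".BAK" suffix is only a naming assumption.

# Copy a config file to "<file>.BAK", refusing to overwrite an older backup.
backup_cfg() {
    if [ -e "$1.BAK" ]; then
        echo "backup already exists: $1.BAK" >&2
        return 1
    fi
    cp -p "$1" "$1.BAK"
}

# Put the backup back in place of a broken file.
restore_cfg() {
    if [ ! -r "$1.BAK" ]; then
        echo "no backup found: $1.BAK" >&2
        return 1
    fi
    cp -p "$1.BAK" "$1"
}

# Demonstration on a throwaway file, not on the real grub.cfg.
demo_dir=$(mktemp -d)
printf 'good\n' > "$demo_dir/grub.cfg"
backup_cfg "$demo_dir/grub.cfg"
printf 'broken\n' > "$demo_dir/grub.cfg"
restore_cfg "$demo_dir/grub.cfg"
cat "$demo_dir/grub.cfg"
rm -rf "$demo_dir"
```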

Comment 107 Steve 2020-04-10 16:52:04 UTC

Further testing of the dyndbg configuration. NB: This is in a VM, so it may not be entirely realistic.

# tail -f /sys/kernel/debug/dynamic_debug/control

Plug in a known-good USB flash drive (The one used for this test has a Fedora 31 Live image on it).

There is no change in the "tail -f" display.

So "ctrl-c" "tail -f" and run:

# tail -14 /sys/kernel/debug/dynamic_debug/control
net/netfilter/xt_MASQUERADE.c:28 [xt_MASQUERADE]masquerade_tg_check =_ "bad rangesize %u\012"
net/netfilter/xt_MASQUERADE.c:24 [xt_MASQUERADE]masquerade_tg_check =_ "bad MAP_IPS.\012"
drivers/usb/storage/usb.c:1127 [usb_storage]storage_probe =_ "Use Bulk-Only transport with the Transparent SCSI protocol for dynamic id: 0x%04x 0x%04x\012"
drivers/usb/storage/usb.c:1064 [usb_storage]usb_stor_probe2 =_ "waiting for device to settle before scanning\012"
drivers/usb/storage/usb.c:914 [usb_storage]usb_stor_scan_dwork =_ "scan complete\012"
drivers/usb/storage/usb.c:896 [usb_storage]usb_stor_scan_dwork =_ "starting scan\012"
drivers/usb/storage/sierra_ms.c:110 [usb_storage]truinst_show =_ "SWIMS: failed SWoC query\012"
drivers/usb/storage/sierra_ms.c:89 [usb_storage]debug_swoc =_ "SWIMS: Linux Version: %04X\012"
drivers/usb/storage/sierra_ms.c:88 [usb_storage]debug_swoc =_ "SWIMS: Linux SKU: %04X\012"
drivers/usb/storage/sierra_ms.c:87 [usb_storage]debug_swoc =_ "SWIMS: SWoC Rev: %02d\012"
drivers/usb/storage/sierra_ms.c:69 [usb_storage]sierra_get_swoc_info =_ "SWIMS: Attempting to get TRU-Install info\012"
drivers/usb/storage/sierra_ms.c:51 [usb_storage]sierra_set_ms_mode =_ "SWIMS: %s"
drivers/usb/storage/uas.c:126 [uas]uas_scan_work =_ "scan complete\012"
drivers/usb/storage/uas.c:124 [uas]uas_scan_work =_ "starting scan\012"

None of those are for "usbcore", so get nasty:

# grep 'usbcore' /sys/kernel/debug/dynamic_debug/control | wc -l
152

# grep -A8 -n 'deviceremovable' /sys/kernel/debug/dynamic_debug/control
1322:drivers/usb/core/hub.c:6073 [usbcore]usb_hub_adjust_deviceremovable =p "DeviceRemovable is changed to 1 according to platform information.\012"
1323:drivers/usb/core/hub.c:6057 [usbcore]usb_hub_adjust_deviceremovable =p "DeviceRemovable is changed to 1 according to platform information.\012"
1324-drivers/usb/core/hub.c:5906 [usbcore]usb_reset_device =p "%s for root hub!\012"
1325-drivers/usb/core/hub.c:5899 [usbcore]usb_reset_device =p "device reset not allowed in state %d\012"
1326-drivers/usb/core/hub.c:5736 [usbcore]usb_reset_and_verify_device =p "device reset not allowed in state %d\012"
1327-drivers/usb/core/hub.c:5595 [usbcore]hub_event =p "over-current change\012"
1328-drivers/usb/core/hub.c:5583 [usbcore]hub_event =p "power change\012"
1329-drivers/usb/core/hub.c:5543 [usbcore]hub_event =p "error resetting hub: %d\012"
1330-drivers/usb/core/hub.c:5539 [usbcore]hub_event =p "resetting for error %d\012"
1331-drivers/usb/core/hub.c:5530 [usbcore]hub_event =p "Can't autoresume: %d\012"

There is more than that, but my main observation is that there is no clear indication as to which device those messages apply to.

Notes:

1. qemu automatically forwards USB device detection from the host to the VM, which is a VERY nice feature, but it may not reflect the behavior of bare metal.
2. "14" and "8" are empirically determined numbers for demonstrative purposes only.

Comment 108 Steve 2020-04-10 17:20:33 UTC

On second thought, it might not be so bad to type this in:

dyndbg="module xhci_hcd =p ; module usbcore =p"

Disclaimer: I have not tested that by hand-typing it. 

NB: The "=" sign is a flag change operator, of which there are three: "-", "+", "=". The doc explains what they do in the section that starts:

"The flags specification comprises a change operation followed by one or more flag characters."

Dynamic debug
https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html
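
To make the three operators concrete, here is a small sketch that assembles query strings. The function name dyndbg_query is hypothetical; only the "module <name> <op><flags>" syntax comes from the doc.

```sh
#!/bin/sh
# Build one dyndbg query: "module <name> <op><flags>".
# op is one of:
#   -  remove the given flags
#   +  add the given flags
#   =  set the flags to exactly the given flags
dyndbg_query() {
    case "$2" in
        -|+|=) printf 'module %s %s%s\n' "$1" "$2" "$3" ;;
        *) echo "unknown flag operator: $2" >&2; return 1 ;;
    esac
}

# Assemble the kernel command-line parameter discussed above.
echo "dyndbg=\"$(dyndbg_query xhci_hcd = p) ; $(dyndbg_query usbcore = p)\""
```

After booting, a single query can also be written to the control file as root, for example: echo 'module usbcore +p' > /sys/kernel/debug/dynamic_debug/control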

Comment 109 Steve 2020-04-10 17:53:04 UTC

OK, NOW I tested:  :-)

dyndbg="module usbcore =p"

As before, the quote ends up in the wrong place:

$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.16-200.fc31.x86_64 root=UUID=54f79645-f858-46e0-af7a-97aecc88ff87 ro "dyndbg=module usbcore =p"

However, the "=p" here means that it worked:

# grep -m1 'usbcore' /sys/kernel/debug/dynamic_debug/control
drivers/usb/core/hub.c:6073 [usbcore]usb_hub_adjust_deviceremovable =p "DeviceRemovable is changed to 1 according to platform information.\012"
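
To check what actually reached the kernel without reading the whole line by eye, the quoted parameter can be pulled out with sed. This is a sketch with a made-up function name; it handles only the outer-quoted "dyndbg=..." form seen in /proc/cmdline above.

```sh
#!/bin/sh
# Print the value of a quoted "dyndbg=..." parameter from a kernel
# command line read on stdin. Prints nothing if the parameter is absent
# or not in the outer-quoted form.
dyndbg_from_cmdline() {
    sed -n 's/.*"dyndbg=\([^"]*\)".*/\1/p'
}

# Sample line copied from the /proc/cmdline output above.
echo 'BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.16-200.fc31.x86_64 root=UUID=54f79645-f858-46e0-af7a-97aecc88ff87 ro "dyndbg=module usbcore =p"' |
    dyndbg_from_cmdline
```

On the running system: dyndbg_from_cmdline < /proc/cmdline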

Comment 110 Steve 2020-04-10 18:08:57 UTC

(In reply to Steve from comment #106)
... 
> 2. Use the grub2 "normal" command. (NB: That works great but it would be a good idea to read the grub2 docs first and do a few practice runs before having to use it "in extremis".
> That is a perfect application for a VM -- practice breaking things and then fixing them. :-))

Getting out of this "fix" is a perfect exercise in recovering from a grub2 boot failure. I made the following changes *intentionally*, but it was still a bit alarming to get the grub2 rescue prompt:

# diff -u0 grub.cfg.BAK2 grub.cfg.EXP-UNBOOTABLE-1
...
@@ -128,2 +128,2 @@
-insmod blscfg
-blscfg
+#insmod blscfg
+#blscfg

In a VM, however ...

Comment 111 Steve 2020-04-10 21:21:38 UTC

I just noticed that Mathias posted a patch *yesterday*:

testpatch that doesn't clear TT buffer after protocol STALL
https://bugzilla.kernel.org/show_bug.cgi?id=203419#c27

It applies to two files:

$ grep diff 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch 
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c

There is a git command to apply the patch:

$ git apply --help

And there is an option to "--reverse" the patch.
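
A minimal sketch of backing a patch out again, assuming it was applied with "git apply" and the working tree has not changed since (the function name revert_patch is made up for illustration):

```sh
#!/bin/sh
# Hypothetical helper: undo a patch previously applied with "git apply".
# The --check pass verifies that the reverse patch applies cleanly before
# anything in the working tree is touched.
revert_patch() {
    patch_file=$1
    git apply --check --reverse "$patch_file" &&
        git apply --reverse -v "$patch_file"
}

# Usage, from the top of the kernel source tree:
#   revert_patch 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch
#   git status    # the two xhci files should no longer show as modified
```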

Comment 112 Steve 2020-04-10 21:39:39 UTC

I'm doing a build now with the patch from Mathias:

Go back to 5.4.11:

$ git bisect reset
Previous HEAD position was 43b0b3300 arm64: cpu_errata: Add Hisilicon TSV110 to spectre-v2 safe list
HEAD is now at 9d61432ef Linux 5.4.11

Check that the patch applies cleanly to 5.4.11:

$ git apply --check -v 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch

Apply the patch:

$ git apply -v 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch

Verify that git knows that the code has been modified:

$ git status
HEAD detached at 9d61432ef
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   drivers/usb/host/xhci-ring.c
	modified:   drivers/usb/host/xhci.c
...

$ time make
...

Comment 113 Steve 2020-04-11 00:56:09 UTC

(In reply to Steve from comment #112)
...
> $ git apply -v 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch
...

There is a typo in that patch that causes a build failure, so a second patch is needed to patch the first patch:

$ git apply -v 0002-Fix-a-typo-in-prevoius-patch.patch

That second patch is linked from Vincenzo's comment:
https://bugzilla.kernel.org/show_bug.cgi?id=203419#c28

After building the rpm, we are informed that our build is "dirty", presumably because the patch left uncommitted changes in the working tree:

$ ls -1sh kernel-5.4.11.localversion6_dirty-14.x86_64.rpm
75M kernel-5.4.11.localversion6_dirty-14.x86_64.rpm

After booting and enabling all of Mathias's debug settings*, insert and remove USB devices. Or see if the video works. :-)

Mathias's debug messages should be visible:

# grep 'TT' /sys/kernel/debug/dynamic_debug/control
# grep 'xhci_hcd' /sys/kernel/debug/dynamic_debug/control | less

Tested in a VM and on bare metal (a laptop).

* From https://bugzilla.kernel.org/show_bug.cgi?id=203419#c21.
(Actually, the "usbcore" logging doesn't seem to be needed.)

Comment 114 Steve 2020-04-11 04:33:42 UTC

> As far as I know, the hardware on my laptop does not expose the fan speed, and gkrellm does not show anything under 'Fans'.

The "lm_sensors" package has a "sensors-detect" command that can search for hardware monitoring chips and then configure the appropriate kernel modules to be loaded. Indeed, there is a whole "hwmon" directory for such kernel modules:

$ find /lib/modules/`uname -r`/kernel/drivers/hwmon | sort

This will find info supplied by modules that have already been loaded:

$ find -L /sys/class/hwmon -maxdepth 2 2>/dev/null | xargs grep -s '' | sort | egrep 'temp|fan|label|name'
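
The temp*_input files under hwmon hold millidegrees Celsius, so a short loop can turn them into readable values. A sketch, with a made-up function name and an optional directory argument so it can be tried on a copy of the tree:

```sh
#!/bin/sh
# Print "<chip name> <file> <degrees> C" for every temp*_input file.
# Values in sysfs are in millidegrees Celsius (58000 means 58.0 C).
# Non-numeric or unreadable values (including negative ones) are skipped.
show_temps() {
    root=${1:-/sys/class/hwmon}
    for f in "$root"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        milli=$(cat "$f" 2>/dev/null) || continue
        case $milli in
            ''|*[!0-9]*) continue ;;
        esac
        name=$(cat "${f%/*}/name" 2>/dev/null)
        printf '%s %s %d.%d C\n' "$name" "${f##*/}" \
            $((milli / 1000)) $((milli % 1000 / 100))
    done
}
```

Running show_temps with no argument reads the live /sys/class/hwmon tree.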

Comment 115 William Bader 2020-04-11 06:02:42 UTC

>grub.cfg

I didn't do this today because I have been up all night the last few nights doing bisections, and I don't want to do it when I am tired and risk making a mistake.

>I'm doing a build now with the patch from Mathias:

Since the 5.5.15 kernel has the problem, and since the patch is only a few days old, would it be better if I tried applying the patch to that kernel?
If it fixes the problem on the old kernel, I would still have to retest it on the current kernel, so trying the new kernel first could save a step.

>After booting and enabling all of Mathias's debug settings, insert and remove USB devices. Or see if the video works. :-)

The webcam is integrated with the laptop.
Wouldn't I need to enable the debug settings in /etc/default/grub?
It is currently
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rhgb quiet elevator=noop LANG=en_US.UTF-8 mitigations=off"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

Did you want me to change GRUB_CMDLINE_LINUX to
GRUB_CMDLINE_LINUX="rhgb quiet elevator=noop LANG=en_US.UTF-8 mitigations=off dyndbg='module xhci_hcd =p ; module usbcore =p'"
Do I still have an option to boot and edit the command line at boot time and add dyndbg='module xhci_hcd =p ; module usbcore =p' ?
You said before that I could press 'e' from the grub menu to edit the command line. I think that it is safer because if I type it wrong, hopefully the worst that can happen is that it won't boot, and I can just power off and reboot.

I have in my notes that after I added mitigations=off to /etc/default/grub, I ran "grub2-mkconfig -o /boot/grub2/grub.cfg".
I still have the tar of /boot I made on Apr 6.

>The "lm_sensors" package has a "sensors-detect" command

I have a log below. Should I try YES for any of the risky probes?
Is the main risk of the probe a crash or could it also do damage or write on the SSD?

----------
$ sudo /usr/sbin/sensors-detect 
# sensors-detect revision $Revision$ <-- wrong RCS option to do the build?
# System: Sony Corporation VPCCB4Q1E [C60A58AK] (laptop)
# Board: Sony Corporation VAIO
# Kernel: 5.5.15-200.fc31.x86_64 x86_64
# Processor: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (6/42/7)

This program will help you determine which kernel modules you need
to load to use lm_sensors most effectively. It is generally safe
and recommended to accept the default answers to all questions,
unless you know what you're doing.

Some south bridges, CPUs or memory controllers contain embedded sensors.
Do you want to scan for them? This is totally safe. (YES/no): YES
Silicon Integrated Systems SIS5595...                       No
VIA VT82C686 Integrated Sensors...                          No
VIA VT8231 Integrated Sensors...                            No
AMD K8 thermal sensors...                                   No
AMD Family 10h thermal sensors...                           No
AMD Family 11h thermal sensors...                           No
AMD Family 12h and 14h thermal sensors...                   No
AMD Family 15h thermal sensors...                           No
AMD Family 16h thermal sensors...                           No
AMD Family 17h thermal sensors...                           No
AMD Family 15h power sensors...                             No
AMD Family 16h power sensors...                             No
Intel digital thermal sensor...                             Success!
    (driver `coretemp')
Intel AMB FB-DIMM thermal sensor...                         No
Intel 5500/5520/X58 thermal sensor...                       No
VIA C7 thermal sensor...                                    No
VIA Nano thermal sensor...                                  No

Some Super I/O chips contain embedded sensors. We have to write to
standard I/O ports to probe them. This is usually safe.
Do you want to scan for Super I/O sensors? (YES/no): no

Some hardware monitoring chips are accessible through the ISA I/O ports.
We have to write to arbitrary I/O ports to probe them. This is usually
safe though. Yes, you do have ISA I/O ports even if you do not have any
ISA slots! Do you want to scan the ISA I/O ports? (YES/no): no

Lastly, we can probe the I2C/SMBus adapters for connected hardware
monitoring devices. This is the most risky part, and while it works
reasonably well on most systems, it has been reported to cause trouble
on some systems.
Do you want to probe the I2C/SMBus adapters now? (YES/no): no

Now follows a summary of the probes I have just done.
Just press ENTER to continue: 
Driver `coretemp':
  * Chip `Intel digital thermal sensor' (confidence: 9)

Do you want to overwrite /etc/sysconfig/lm_sensors? (YES/no): no
To load everything that is needed, add this to one of the system
initialization scripts (e.g. /etc/rc.d/rc.local):

#----cut here----
# Chip drivers
modprobe coretemp
/usr/bin/sensors -s
#----cut here----

You really should try these commands right now to make sure everything
is working properly. Monitoring programs won't work until the needed
modules are loaded.
----------

I did "modprobe coretemp" and "/usr/bin/sensors -s" manually.

$ find -L /sys/class/hwmon -maxdepth 2 2>/dev/null | xargs grep -s '' | sort | egrep 'temp|fan|label|name'
/sys/class/hwmon/hwmon0/name:ADP1
/sys/class/hwmon/hwmon0/temp1_label:temp ambient
/sys/class/hwmon/hwmon0/temp2_label:temp
/sys/class/hwmon/hwmon1/name:acpitz
/sys/class/hwmon/hwmon1/temp1_crit:96000
/sys/class/hwmon/hwmon1/temp1_input:58000
/sys/class/hwmon/hwmon1/temp2_crit:96000
/sys/class/hwmon/hwmon1/temp2_input:58000
/sys/class/hwmon/hwmon2/name:BAT0
/sys/class/hwmon/hwmon2/temp1_label:temp ambient
/sys/class/hwmon/hwmon2/temp2_label:temp
/sys/class/hwmon/hwmon3/name:radeon
/sys/class/hwmon/hwmon3/temp1_crit:120000
/sys/class/hwmon/hwmon3/temp1_crit_hyst:90000
/sys/class/hwmon/hwmon4/name:coretemp
/sys/class/hwmon/hwmon4/temp1_crit:100000
/sys/class/hwmon/hwmon4/temp1_crit_alarm:0
/sys/class/hwmon/hwmon4/temp1_input:60000
/sys/class/hwmon/hwmon4/temp1_label:Package id 0
/sys/class/hwmon/hwmon4/temp1_max:86000
/sys/class/hwmon/hwmon4/temp2_crit:100000
/sys/class/hwmon/hwmon4/temp2_crit_alarm:0
/sys/class/hwmon/hwmon4/temp2_input:59000
/sys/class/hwmon/hwmon4/temp2_label:Core 0
/sys/class/hwmon/hwmon4/temp2_max:86000
/sys/class/hwmon/hwmon4/temp3_crit:100000
/sys/class/hwmon/hwmon4/temp3_crit_alarm:0
/sys/class/hwmon/hwmon4/temp3_input:59000
/sys/class/hwmon/hwmon4/temp3_label:Core 1
/sys/class/hwmon/hwmon4/temp3_max:86000

hwmon has a lot of files:
$ sudo find /lib/modules/`uname -r`/kernel/drivers/hwmon | wc -l
162

None of them say 'fan'. Is it aspeed-pwm-tacho?
$ lc -R /lib/modules/5.5.15-200.fc31.x86_64/kernel/drivers/hwmon/
/lib/modules/5.5.15-200.fc31.x86_64/kernel/drivers/hwmon/:
abituguru.ko.xz         adt7462.ko.xz           f71882fg.ko.xz      k10temp.ko.xz      ltc2945.ko.xz       max6642.ko.xz          sht15.ko.xz        vt1211.ko.xz
abituguru3.ko.xz        adt7470.ko.xz           f75375s.ko.xz       k8temp.ko.xz       ltc2947-core.ko.xz  max6650.ko.xz          sht21.ko.xz        vt8231.ko.xz
acpi_power_meter.ko.xz  adt7475.ko.xz           fam15h_power.ko.xz  lineage-pem.ko.xz  ltc2947-i2c.ko.xz   max6697.ko.xz          sht3x.ko.xz        w83627ehf.ko.xz
ad7314.ko.xz            adt7x10.ko.xz           fschmd.ko.xz        lm63.ko.xz         ltc2947-spi.ko.xz   mcp3021.ko.xz          shtc1.ko.xz        w83627hf.ko.xz
ad7414.ko.xz            amc6821.ko.xz           ftsteutates.ko.xz   lm70.ko.xz         ltc2990.ko.xz       mlxreg-fan.ko.xz       sis5595.ko.xz      w83773g.ko.xz
ad7418.ko.xz            applesmc.ko.xz          g760a.ko.xz         lm73.ko.xz         ltc4151.ko.xz       nct6683.ko.xz          smsc47b397.ko.xz   w83781d.ko.xz
adc128d818.ko.xz        asb100.ko.xz            g762.ko.xz          lm75.ko.xz         ltc4215.ko.xz       nct6775.ko.xz          smsc47m1.ko.xz     w83791d.ko.xz
adcxx.ko.xz             asc7621.ko.xz           gl518sm.ko.xz       lm77.ko.xz         ltc4222.ko.xz       nct7802.ko.xz          smsc47m192.ko.xz   w83792d.ko.xz
adm1021.ko.xz           aspeed-pwm-tacho.ko.xz  gl520sm.ko.xz       lm78.ko.xz         ltc4245.ko.xz       nct7904.ko.xz          tc654.ko.xz        w83793.ko.xz
adm1025.ko.xz           asus_atk0110.ko.xz      hwmon-vid.ko.xz     lm80.ko.xz         ltc4260.ko.xz       npcm750-pwm-fan.ko.xz  tc74.ko.xz         w83795.ko.xz
adm1026.ko.xz           atxp1.ko.xz             i5500_temp.ko.xz    lm83.ko.xz         ltc4261.ko.xz       ntc_thermistor.ko.xz   thmc50.ko.xz       w83l785ts.ko.xz
adm1029.ko.xz           coretemp.ko.xz          i5k_amb.ko.xz       lm85.ko.xz         max1111.ko.xz       pc87360.ko.xz          tmp102.ko.xz       w83l786ng.ko.xz
adm1031.ko.xz           dell-smm-hwmon.ko.xz    ibmaem.ko.xz        lm87.ko.xz         max16065.ko.xz      pc87427.ko.xz          tmp103.ko.xz
adm9240.ko.xz           dme1737.ko.xz           ibmpex.ko.xz        lm90.ko.xz         max1619.ko.xz       pcf8591.ko.xz          tmp108.ko.xz
ads7828.ko.xz           ds1621.ko.xz            ina209.ko.xz        lm92.ko.xz         max1668.ko.xz       pmbus                  tmp401.ko.xz
ads7871.ko.xz           ds620.ko.xz             ina2xx.ko.xz        lm93.ko.xz         max197.ko.xz        powr1220.ko.xz         tmp421.ko.xz
adt7310.ko.xz           emc1403.ko.xz           ina3221.ko.xz       lm95234.ko.xz      max31722.ko.xz      sch5627.ko.xz          tmp513.ko.xz
adt7410.ko.xz           emc6w201.ko.xz          it87.ko.xz          lm95241.ko.xz      max31790.ko.xz      sch5636.ko.xz          via-cputemp.ko.xz
adt7411.ko.xz           f71805f.ko.xz           jc42.ko.xz          lm95245.ko.xz      max6639.ko.xz       sch56xx-common.ko.xz   via686a.ko.xz

/lib/modules/5.5.15-200.fc31.x86_64/kernel/drivers/hwmon/pmbus:
adm1275.ko.xz  lm25066.ko.xz  ltc3815.ko.xz   max20751.ko.xz  max8688.ko.xz  pmbus_core.ko.xz  tps53679.ko.xz  ucd9200.ko.xz
bel-pfe.ko.xz  ltc2978.ko.xz  max16064.ko.xz  max34440.ko.xz  pmbus.ko.xz    tps40422.ko.xz    ucd9000.ko.xz   zl6100.ko.xz

Comment 116 Steve 2020-04-11 12:06:32 UTC

>>I'm doing a build now with the patch from Mathias:

> Since the 5.5.15 kernel has the problem, and since the patch is only a few days old, would it be better if I tried applying the patch to that kernel?

I looked at the upstream bug reports, and it wasn't clear what kernel version they were using to test the patch, so I did what was easiest:

$ git bisect reset

And then applied the two patches, ran "make clean", rebuilt, and tested.

> If it fixes the problem on the old kernel, I would still have to retest it on the current kernel, so trying the new kernel first could save a step.

You are on the upstream 5.4.y branch, which is up to 5.4.31 (per kernel.org). To test against 5.5.15, you would need to clone the 5.5.y branch. You would also need to update the ".config" file.

Anyway, I would go with testing against 5.4.11, since you already know it is "bad". Doing that takes the incremental approach -- make one change at a time.

Clarification: You don't need to make any of the dyndbg kernel command-line changes to test the patch. Just test the same way you tested when doing the bisection. I apologize for not making that clear.

And please take your time. You already did a great job getting the VM set up and doing the bisection, which confirms that the problem is indeed in the USB code.

NB: I don't know for certain that the patch will fix the problem you are seeing. Since the patch applies to USB code, and it fixes problems with some other USB devices (per upstream bug reports), it seems like the patch is worth testing.

Specifically, the patch applies to these two files in the USB code:

$ grep diff 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch 
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c

Comment 117 Steve 2020-04-11 12:24:30 UTC

(In reply to Steve from comment #116)
> ... it fixes problems with some other USB devices (per upstream bug reports) ...

Important note: The patch applies to ALL USB devices, it is not device-specific.

Indeed, Alan explicitly refers to "the USB-2.0 specification" here:

https://bugzilla.kernel.org/show_bug.cgi?id=207065#c6

NB: Mathias posted his patch in TWO bug reports:

This one, which was opened 2020-04-02:

Bug_207065 - C-media USB audio device stops working from 5.2.0-rc3 onwards 
https://bugzilla.kernel.org/show_bug.cgi?id=207065

And this one, which was opened 2019-04-25:

Bug_203419 - Logitech Group USB audio stopped working in 5.1-rc6 
https://bugzilla.kernel.org/show_bug.cgi?id=203419

Comment 118 Steve 2020-04-11 14:40:54 UTC

>>The "lm_sensors" package has a "sensors-detect" command

>I have a log below. Should I try YES for any of the risky probes?

I don't know, although I answered "yes" to everything when I ran "sensors-detect" on my systems (desktop, laptop).

>Is the main risk of the probe a crash or could it also do damage or write on the SSD?

According to the "sensors-detect" man page:

"sensors-detect needs to access the hardware for most of the chip detections. By definition, it doesn't know which chips are there before it manages to identify them. This means that it can access chips in a way these chips do not like, causing problems ranging from SMBus lockup to permanent hardware damage (a rare case, thankfully.)"

I don't think "sensors-detect" would access the SSD, because the SSD is a SATA device on a SATA controller on the PCI bus, which wouldn't be a likely candidate for a sensors device:

$ lspci | grep SATA

In the past, after I ran "memtest86+", I discovered that my BIOS settings were slightly changed.

There is a separate program for monitoring HDD temperatures, and it uses SMART commands:

$ rpm -qi hddtemp

>I did "modprobe coretemp" and "/usr/bin/sensors -s" manually.

If you run "sensors" without any arguments, it should report some sensors data, depending on what kernel modules are loaded. On my desktop system:

$ sensors
nct6791-isa-0290
Adapter: ISA adapter
Vcore:                  +0.87 V  (min =  +0.00 V, max =  +1.74 V)
in1:                    +1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
...
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +29.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +28.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:        +27.0°C  (high = +80.0°C, crit = +100.0°C)

NB: I posted the "find" command so you could see where programs actually get sensors data.

>hwmon has a lot of files [that are kernel modules]:

There are a lot of sensors devices out there, so there are a lot of kernel modules to support them. :-)

>None of them say 'fan'. Is it aspeed-pwm-tacho?

Unless you know exactly what sensors chips are in your system, you are better off running "sensors-detect". However, if you want to know more about a specific kernel module:

$ modinfo aspeed-pwm-tacho

Comment 119 Steve 2020-04-11 15:04:03 UTC

Sometimes there are vendor-specific kernel modules, but they don't always work too well. This found some modules:

$ find /lib/modules/`uname -r`/ -name \*sony\*

Some of them might already be loaded:

$ lsmod | grep sony

Comment 120 Steve 2020-04-11 17:59:24 UTC

> Wouldn't I need to enable the debug settings in /etc/default/grub?

Yes, to enable dyndbg logging for the camera while booting.

However, dyndbg logging can be *tested* after booting: run the commands given by Mathias in a terminal window, then run this in a full-screen terminal window:

$ journalctl --no-hostname -k -f

And then inserting and removing a USB device. That won't test your camera, but it will ensure that dyndbg is correctly enabled.

> It is currently
> ...
> GRUB_CMDLINE_LINUX="rhgb quiet elevator=noop LANG=en_US.UTF-8 mitigations=off"
> ...
> Did you want me to change GRUB_CMDLINE_LINUX to
> GRUB_CMDLINE_LINUX="rhgb quiet elevator=noop LANG=en_US.UTF-8 mitigations=off dyndbg='module xhci_hcd =p ; module usbcore =p'"

Yes, except that I would remove "rhgb quiet". Those options just make debugging harder. I ALWAYS remove "rhgb quiet" after installing a new system.

What I usually do is make a "back-up" copy of the line I am about to modify and put a "#" in front so that it becomes a comment:

#GRUB_CMDLINE_LINUX="list of options"          <<<< This is the "back-up" copy preserved as a comment.
GRUB_CMDLINE_LINUX="modified list of options"

> Do I still have an option to boot and edit the command line at boot time and add dyndbg='module xhci_hcd =p ; module usbcore =p' ?

Yes, and you can remove them if they are already there.

> You said before that I could press 'e' from the grub menu to edit the command line. I think that it is safer because if I type it wrong, hopefully the worst that can happen is that it won't boot, and I can just power off and reboot.

OK, and you can check your typing with:

$ cat /proc/cmdline

The main concern is reliable reproducibility during testing.

> I have in my notes that after I added mitigations=off to /etc/default/grub, I ran "grub2-mkconfig -o /boot/grub2/grub.cfg".
> I still have the tar of /boot I made on Apr 6.

If you modify /etc/default/grub, you need to run grub2-mkconfig to update grub.cfg. It's a cumbersome process, admittedly.
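
The "back-up copy as a comment" habit described above can be scripted. This is only a sketch under stated assumptions: set_grub_cmdline is a hypothetical helper, it assumes a single uncommented GRUB_CMDLINE_LINUX= line, and the demo runs on a scratch file rather than on /etc/default/grub.

```shell
#!/bin/sh
# Hypothetical helper: comment out the current GRUB_CMDLINE_LINUX line and
# write the new value right below it. It edits whatever file it is given,
# so it can be tried on a scratch copy first.
set_grub_cmdline() {
    file=$1
    tmp=$(mktemp) || return 1
    NEW_CMDLINE=$2 awk '
        /^GRUB_CMDLINE_LINUX=/ {
            print "#" $0                                   # back-up copy
            print "GRUB_CMDLINE_LINUX=\"" ENVIRON["NEW_CMDLINE"] "\""
            next
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the owner and mode of $file
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Demo on a scratch file, using the option string quoted above minus "rhgb quiet".
scratch=$(mktemp)
printf '%s\n' 'GRUB_CMDLINE_LINUX="rhgb quiet elevator=noop LANG=en_US.UTF-8 mitigations=off"' > "$scratch"
set_grub_cmdline "$scratch" "elevator=noop LANG=en_US.UTF-8 mitigations=off dyndbg='module xhci_hcd =p ; module usbcore =p'"
cat "$scratch"
rm -f "$scratch"
```

On the real system this would be run as root against /etc/default/grub, followed by "grub2-mkconfig -o /boot/grub2/grub.cfg" and, after rebooting, "cat /proc/cmdline" to confirm the result.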

Comment 121 Steve 2020-04-11 18:52:40 UTC

Created attachment 1678085 [details]
shell script to enable dyndbg for USB testing (commands are from Mathias Nyman)

# Enable dyndbg logging for USB testing.
#
# This command must be run as root.
#
# Usage:
#
# sudo dyndbg-1.sh
#
# < connect the USB device >
#
# Attach to the bug report the output from:
#
# journalctl --no-hostname -k > /tmp/dmesg-1.txt
# sudo cat /sys/kernel/debug/tracing/trace > /tmp/trace-1.txt

Comment 122 Steve 2020-04-11 18:55:18 UTC

(In reply to Steve from comment #121)
> Created attachment 1678085 [details]
> shell script to enable dyndbg for USB testing

Those commands are from:

Mathias Nyman 2020-04-02 13:34:44 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=203419#c21

Comment 123 Steve 2020-04-11 19:27:53 UTC

Created attachment 1678098 [details]
shell script to enable dyndbg for USB testing (commands are from Mathias Nyman)

# Enable dyndbg logging for USB testing.
#
# Commands are from Mathias Nyman
# https://bugzilla.kernel.org/show_bug.cgi?id=203419#c21
#
# Usage:
#
# sudo dyndbg-1.sh
#
# < connect the USB device >
#
# Attach to the bug report the output from:
#
# journalctl --no-hostname -k > /tmp/dmesg-1.txt
# sudo cat /sys/kernel/debug/tracing/trace > /tmp/trace-1.txt

Comment 124 Steve 2020-04-11 20:53:38 UTC

I completed a test of enabling dyndbg from the kernel command-line on *bare metal* (a laptop) with the built-in web cam enabled.

I have this in /etc/default/grub:

$ grep '^GRUB_CMDLINE_LINUX' /etc/default/grub
GRUB_CMDLINE_LINUX="dyndbg='module xhci_hcd =p ; module usbcore =p'"

After running grub2-mkconfig and rebooting, the kernel command-line shows:

$ cat /proc/cmdline
BOOT_IMAGE=(hd1,msdos6)/vmlinuz-5.5.16-100.fc30.x86_64 root=/dev/mapper/[removed] ro "dyndbg=module xhci_hcd =p ; module usbcore =p"

And this is what is in the log for the built-in web cam:

$ journalctl --no-hostname -k | grep -v audit | grep -C1 -n 'usb 1-1.2' > /tmp/dmesg-usb-dyndbg-1.txt

871-Apr 11 06:24:29 kernel: usb 2-1-port2: status 0101, change 0000, 12 Mb/s
872:Apr 11 06:24:29 kernel: usb 1-1.2: new high-speed USB device number 3 using ehci-pci
873-Apr 11 06:24:29 kernel: random: systemd: uninitialized urandom read (16 bytes read)
--
882-Apr 11 06:24:29 kernel: usb 2-1.2: new high-speed USB device number 3 using ehci-pci
883:Apr 11 06:24:29 kernel: usb 1-1.2: skipped 1 descriptor after configuration
884:Apr 11 06:24:29 kernel: usb 1-1.2: skipped 6 descriptors after interface
885:Apr 11 06:24:29 kernel: usb 1-1.2: skipped 1 descriptor after endpoint
886:Apr 11 06:24:29 kernel: usb 1-1.2: skipped 10 descriptors after interface
887:Apr 11 06:24:29 kernel: usb 1-1.2: default language 0x0409
888:Apr 11 06:24:29 kernel: usb 1-1.2: udev 3, busnum 1, minor = 2
889:Apr 11 06:24:29 kernel: usb 1-1.2: New USB device found, idVendor=13d3, idProduct=5710, bcdDevice=11.30
890:Apr 11 06:24:29 kernel: usb 1-1.2: New USB device strings: Mfr=3, Product=1, SerialNumber=2
891:Apr 11 06:24:29 kernel: usb 1-1.2: Product: USB 2.0 UVC VGA WebCam
892:Apr 11 06:24:29 kernel: usb 1-1.2: Manufacturer: Azurewave
893:Apr 11 06:24:29 kernel: usb 1-1.2: SerialNumber: 0x0001
894:Apr 11 06:24:29 kernel: usb 1-1.2: usb_probe_device
895:Apr 11 06:24:29 kernel: usb 1-1.2: configuration #1 chosen from 1 choice
896:Apr 11 06:24:29 kernel: usb 1-1.2: adding 1-1.2:1.0 (config #1, interface 0)
897:Apr 11 06:24:29 kernel: usb 1-1.2: adding 1-1.2:1.1 (config #1, interface 1)
898-Apr 11 06:24:29 kernel: usb 2-1.2: USB quirks for this device: 400
--
1066-Apr 11 13:24:59 kernel: intel_rapl_common: Found RAPL domain uncore
1067:Apr 11 13:24:59 kernel: usb 1-1.2: usb auto-suspend, wakeup 0
1068-Apr 11 13:24:59 kernel: hub 1-1:1.0: hub_suspend

Comment 125 Steve 2020-04-11 21:46:31 UTC

Let's amend the command-line by adding the "m" and "f" flags to show the module and function names (per doc, Comment 108):

GRUB_CMDLINE_LINUX="dyndbg='module xhci_hcd =pmf ; module usbcore =pmf'"
                                              ^^                    ^^

The output is MUCH more informative and allows for more precise grepping, if needed. Note that module ("usbcore") and function names (e.g. "usb_parse_configuration") are now logged:

$ cat /proc/cmdline
BOOT_IMAGE=(hd1,msdos6)/vmlinuz-5.5.16-100.fc30.x86_64 root=/dev/mapper/[removed] ro "dyndbg=module xhci_hcd =pmf ; module usbcore =pmf"

$ journalctl --no-hostname -k | grep -v audit | grep -C1 -n 'usb 1-1.2' > /tmp/dmesg-usb-dyndbg-2.txt

847-Apr 11 07:12:14 systemd[1]: Running in initial RAM disk.
848:Apr 11 07:12:14 kernel: usb 1-1.2: new high-speed USB device number 3 using ehci-pci
849-Apr 11 07:12:14 systemd[1]: Set hostname to <removed>.
850:Apr 11 07:12:14 kernel: usbcore:usb_parse_configuration: usb 1-1.2: skipped 1 descriptor after configuration
851:Apr 11 07:12:14 kernel: usbcore:usb_parse_interface: usb 1-1.2: skipped 6 descriptors after interface
852:Apr 11 07:12:14 kernel: usbcore:usb_parse_endpoint: usb 1-1.2: skipped 1 descriptor after endpoint
853:Apr 11 07:12:14 kernel: usbcore:usb_parse_interface: usb 1-1.2: skipped 10 descriptors after interface
854:Apr 11 07:12:14 kernel: usbcore:usb_get_langid: usb 1-1.2: default language 0x0409
855:Apr 11 07:12:14 kernel: usbcore:usb_new_device: usb 1-1.2: udev 3, busnum 1, minor = 2
856:Apr 11 07:12:14 kernel: usb 1-1.2: New USB device found, idVendor=13d3, idProduct=5710, bcdDevice=11.30
857:Apr 11 07:12:14 kernel: usb 1-1.2: New USB device strings: Mfr=3, Product=1, SerialNumber=2
858:Apr 11 07:12:14 kernel: usb 1-1.2: Product: USB 2.0 UVC VGA WebCam
859:Apr 11 07:12:14 kernel: usb 1-1.2: Manufacturer: Azurewave
860:Apr 11 07:12:14 kernel: usb 1-1.2: SerialNumber: 0x0001
861:Apr 11 07:12:14 kernel: usbcore:usb_probe_device: usb 1-1.2: usb_probe_device
862:Apr 11 07:12:14 kernel: usbcore:usb_choose_configuration: usb 1-1.2: configuration #1 chosen from 1 choice
863:Apr 11 07:12:14 kernel: usbcore:usb_set_configuration: usb 1-1.2: adding 1-1.2:1.0 (config #1, interface 0)
864:Apr 11 07:12:14 kernel: usbcore:usb_set_configuration: usb 1-1.2: adding 1-1.2:1.1 (config #1, interface 1)
865-Apr 11 07:12:14 kernel: random: systemd: uninitialized urandom read (16 bytes read)
--
1067-Apr 11 14:12:32 kernel: input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1b.0/sound/card0/input18
1068:Apr 11 14:12:34 kernel: usbcore:usb_port_suspend: usb 1-1.2: usb auto-suspend, wakeup 0
1069-Apr 11 14:12:34 kernel: usbcore:hub_suspend: hub 1-1:1.0: hub_suspend

Comment 126 Steve 2020-04-11 22:33:23 UTC

Here is a command-line for cloning 5.5.y:

$ time git clone --shallow-exclude=linux-5.4.y --branch linux-5.5.y https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.5
Cloning into 'linux-5.5'...
...
real	2m28.679s
...

$ cd linux-5.5

$ git branch
* linux-5.5.y

$ git tag --list | sort -Vr
v5.5.16
v5.5.15
...
v5.5-rc1
v5.5

Comment 127 Steve 2020-04-11 23:47:42 UTC

># sensors-detect revision $Revision$ <-- wrong RCS option to do the build?

I'm glad you pointed that out. I was trying to figure out a way to get the kernel ".config" file under version control without intermingling it with the kernel git repo. And RCS seemed like it might work. And it does:

$ cd linux-5.4
$ mkdir RCS

$ ci -l .config
RCS/.config,v  <--  .config
enter description, terminated with single '.' or end of file:
NOTE: This is NOT the log message!
>> .
initial revision: 1.1
done

$ make nconfig # change localversion to 8

$ rcsdiff -u0 .config
===================================================================
RCS file: RCS/.config,v
retrieving revision 1.1
diff -u0 -r1.1 .config
...
@@ -26 +26 @@
-CONFIG_LOCALVERSION=".localversion7"
+CONFIG_LOCALVERSION=".localversion8"

And for the real test:

$ git status

That doesn't even mention the RCS/ directory.

NB: I've been using RCS for years for important files, such as system config files, and for notes, some of which have nothing to do with software.

$ rpm -q rcs
rcs-5.9.4-11.fc30.x86_64

Comment 128 Steve 2020-04-12 03:32:40 UTC

Created attachment 1678187 [details]
shell script to enable dyndbg for USB testing (based on commands from Mathias Nyman)

# Enable dyndbg logging for USB testing.
#
# Based on commands from Mathias Nyman:
# https://bugzilla.kernel.org/show_bug.cgi?id=203419#c21
#
# Usage:
#
# sudo dyndbg-1.sh
#
# < connect the USB device >
#
# Attach to the bug report the output from:
#
# journalctl --no-hostname -k > /tmp/dmesg-1.txt
# sudo cat /sys/kernel/debug/tracing/trace > /tmp/trace-1.txt

This version adds the "m" and "f" flags to show modules and functions:

-echo 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
-echo 'module usbcore =p'  > /sys/kernel/debug/dynamic_debug/control
+echo 'module xhci_hcd =pmf' > /sys/kernel/debug/dynamic_debug/control
+echo 'module usbcore =pmf'  > /sys/kernel/debug/dynamic_debug/control

Comment 129 William Bader 2020-04-12 05:53:40 UTC

Created attachment 1678188 [details]
journalctl logs

The patches didn't seem to help.

I attached the results of journalctl --no-hostname -k | grep -v audit | grep -C1 -n 'usb 1-1.3' > dmesg-usb-dyndbg-`uname -r`-`date '+%Y%m%d-%H%M%S'`.txt

I saved but did not attach the entire journalctl log without the grep, so I can do other searches without rebooting. I wasn't sure if the full log had anything that needed to be redacted.

The bad logs are all similar to each other, and the good logs are all similar to each other.

dmesg-usb-nodyndbg-5.4.10.localversion10-00164-g093d658a0-dirty-old-good-after-patch-20200412-000716.txt <- final good kernel from the bisection + the patch

dmesg-usb-dyndbg-5.4.10.localversion9-00164-g093d658a0-old-good-before-patch-20200412-001250.txt <- final good kernel from the bisection, no patch

dmesg-usb-dyndbg-5.5.15-200.fc31.x86_64-bad-nopatch-20200412-003043.txt <- Fedora 5.5.15 stable kernel built by Fedora, bad

dmesg-usb-dyndbg-5.4.10.localversion8-00165-g7cbdf96cd-old-bad-nopatch-20200412-003754.txt <- the kernel with the 'usb: missing parentheses in USE_NEW_SCHEME' commit that broke the webcam

dmesg-usb-dyndbg-5.4.11.localversion11-dirty-bad-patch-20200412-062831.txt <- 5.4.11 kernel built from 'git checkout v5.4.11' with the two usb patches. (I did not reapply the patches. 'git status' and 'git diff' show that git maintained them through the 'git bisect reset' and the checkout.)

I still have the patched 5.4.11 kernel in my build area, and I can do checks on it if you want to be sure that I applied the patches correctly. I suppose that I can also put in some debug code.

Comment 130 William Bader 2020-04-12 05:59:30 UTC

Here are some of the sensor tests.
I tried answering yes to everything in sensors-detect.

$ sudo sensors-detect
# sensors-detect revision $Revision$
# System: Sony Corporation VPCCB4Q1E [C60A58AK] (laptop)
# Board: Sony Corporation VAIO
# Kernel: 5.5.15-200.fc31.x86_64 x86_64
# Processor: Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (6/42/7)

This program will help you determine which kernel modules you need
to load to use lm_sensors most effectively. It is generally safe
and recommended to accept the default answers to all questions,
unless you know what you're doing.

Some south bridges, CPUs or memory controllers contain embedded sensors.
Do you want to scan for them? This is totally safe. (YES/no): YES
Silicon Integrated Systems SIS5595...                       No
VIA VT82C686 Integrated Sensors...                          No
VIA VT8231 Integrated Sensors...                            No
AMD K8 thermal sensors...                                   No
AMD Family 10h thermal sensors...                           No
AMD Family 11h thermal sensors...                           No
AMD Family 12h and 14h thermal sensors...                   No
AMD Family 15h thermal sensors...                           No
AMD Family 16h thermal sensors...                           No
AMD Family 17h thermal sensors...                           No
AMD Family 15h power sensors...                             No
AMD Family 16h power sensors...                             No
Intel digital thermal sensor...                             Success!
    (driver `coretemp')
Intel AMB FB-DIMM thermal sensor...                         No
Intel 5500/5520/X58 thermal sensor...                       No
VIA C7 thermal sensor...                                    No
VIA Nano thermal sensor...                                  No

Some Super I/O chips contain embedded sensors. We have to write to
standard I/O ports to probe them. This is usually safe.
Do you want to scan for Super I/O sensors? (YES/no): YES
Probing for Super-I/O at 0x2e/0x2f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'...                                      No
Probing for Super-I/O at 0x4e/0x4f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'...                                      No

Some hardware monitoring chips are accessible through the ISA I/O ports.
We have to write to arbitrary I/O ports to probe them. This is usually
safe though. Yes, you do have ISA I/O ports even if you do not have any
ISA slots! Do you want to scan the ISA I/O ports? (YES/no): YES
Probing for `National Semiconductor LM78' at 0x290...       No
Probing for `National Semiconductor LM79' at 0x290...       No
Probing for `Winbond W83781D' at 0x290...                   No
Probing for `Winbond W83782D' at 0x290...                   No

Lastly, we can probe the I2C/SMBus adapters for connected hardware
monitoring devices. This is the most risky part, and while it works
reasonably well on most systems, it has been reported to cause trouble
on some systems.
Do you want to probe the I2C/SMBus adapters now? (YES/no): YES
Using driver `i2c-i801' for device 0000:00:1f.3: Intel Cougar Point (PCH)
Module i2c-dev loaded successfully.

Next adapter: i915 gmbus ssc (i2c-0)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: i915 gmbus vga (i2c-1)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: i915 gmbus panel (i2c-2)
Do you want to scan it? (yes/NO/selectively): yes
Client found at address 0x28
Probing for `National Semiconductor LM78'...                No
Probing for `National Semiconductor LM79'...                No
Probing for `National Semiconductor LM80'...                No
Probing for `National Semiconductor LM96080'...             No
Probing for `Winbond W83781D'...                            No
Probing for `Winbond W83782D'...                            No
Probing for `Nuvoton NCT7802Y'...                           No
Probing for `Winbond W83627HF'...                           No
Probing for `Winbond W83627EHF'...                          No
Probing for `Winbond W83627DHG/W83667HG/W83677HG'...        No
Probing for `Asus AS99127F (rev.1)'...                      No
Probing for `Asus AS99127F (rev.2)'...                      No
Probing for `Asus ASB100 Bach'...                           No
Probing for `Analog Devices ADM1029'...                     No
Probing for `ITE IT8712F'...                                No

Next adapter: i915 gmbus dpc (i2c-3)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: i915 gmbus dpb (i2c-4)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: i915 gmbus dpd (i2c-5)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x90 (i2c-6)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x91 (i2c-7)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x92 (i2c-8)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x93 (i2c-9)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x94 (i2c-10)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x95 (i2c-11)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x96 (i2c-12)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: Radeon i2c bit bus 0x97 (i2c-13)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: DPDDC-C (i2c-14)
Do you want to scan it? (yes/NO/selectively): yes

Next adapter: SMBus I801 adapter at e040 (i2c-15)
Do you want to scan it? (YES/no/selectively): yes
Client found at address 0x50
Probing for `Analog Devices ADM1033'...                     No
Probing for `Analog Devices ADM1034'...                     No
Probing for `SPD EEPROM'...                                 Yes
    (confidence 8, not a hardware monitoring chip)
Probing for `EDID EEPROM'...                                No
Client found at address 0x52
Probing for `Analog Devices ADM1033'...                     No
Probing for `Analog Devices ADM1034'...                     No
Probing for `SPD EEPROM'...                                 Yes
    (confidence 8, not a hardware monitoring chip)

Now follows a summary of the probes I have just done.
Just press ENTER to continue: 

Driver `coretemp':
  * Chip `Intel digital thermal sensor' (confidence: 9)

Do you want to overwrite /etc/sysconfig/lm_sensors? (YES/no): YES
Unloading i2c-dev... OK

$ sudo cat /etc/sysconfig/lm_sensors
# Generated by sensors-detect on Sat Apr 11 22:14:25 2020
# This file is sourced by /etc/init.d/lm_sensors and defines the modules to
# be loaded/unloaded.
#
# The format of this file is a shell script that simply defines variables:
# HWMON_MODULES for hardware monitoring driver modules, and optionally
# BUS_MODULES for any required bus driver module (for example for I2C or SPI).

HWMON_MODULES="coretemp"

$ lspci | grep SATA
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Mobile SATA AHCI Controller (rev 04)

>There is a separate program for monitoring HDD temperatures, and it uses SMART commands:

$ rpm -qi hddtemp
package hddtemp is not installed

I am more worried about the temperature of the CPU than the SSD.

>If you run "sensors" without any arguments, it should report some sensors data, depending on what kernel modules are loaded.

$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +54.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:        +54.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:        +52.0°C  (high = +86.0°C, crit = +100.0°C)

BAT0-acpi-0
Adapter: ACPI interface
in0:              N/A  

radeon-pci-0100
Adapter: PCI adapter
temp1:            N/A  (crit = +120.0°C, hyst = +90.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +54.0°C  (crit = +96.0°C)
temp2:        +54.0°C  (crit = +96.0°C)

>Sometimes there are vendor-specific kernel modules, but they don't always work too well.

$ find /lib/modules/`uname -r`/ -name \*sony\*
/lib/modules/5.5.15-200.fc31.x86_64/kernel/drivers/hid/hid-sony.ko.xz <- support for the Sony PS3 BD Remote Control
/lib/modules/5.5.15-200.fc31.x86_64/kernel/drivers/media/rc/ir-sony-decoder.ko.xz <- IR port and decoder for Sony IR Pulse/Space protocol
/lib/modules/5.5.15-200.fc31.x86_64/kernel/drivers/media/i2c/sony-btf-mpx.ko.xz <- support for the internal mux of the Sony BTF-PG472Z video tuner
/lib/modules/5.5.15-200.fc31.x86_64/kernel/drivers/platform/x86/sony-laptop.ko.xz

I got this laptop on short notice in 2012 after my previous laptop (a 2008 Lenovo ThinkPad T61p) burned up building gcc while I was traveling.
It had reasonable specs for the time and a good keyboard, and the shop let me test it with a Fedora live CD before buying it.
It has some multimedia options that I have never used, including an IR port.

>Some of them might already be loaded

$ lsmod | grep sony
sony_laptop            65536  0
rfkill                 28672  7 bluetooth,cfg80211,sony_laptop
video                  53248  2 i915,sony_laptop

Comment 131 Steve 2020-04-12 12:36:00 UTC

> Created attachment 1678188 [details]
> journalctl logs

Thanks. "xarchiver" makes it very easy to open that (dmesg-12apr20.tar.bz2) from Bugzilla:

$ rpm -q xarchiver
xarchiver-0.5.4.14-1.fc30.x86_64

> The patches didn't seem to help.

That's disappointing, but you did a very nice job collecting those logs.

> I attached the results of journalctl --no-hostname -k | grep -v audit | grep -C1 -n 'usb 1-1.3' > dmesg-usb-dyndbg-`uname -r`-`date '+%Y%m%d-%H%M%S'`.txt

Great idea to put a timestamp in the file names.
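
For repeated test runs, the naming scheme can be wrapped in a small function. This is just a sketch of the pattern quoted above; log_name is a hypothetical helper name.

```shell
#!/bin/sh
# Build a log file name from the running kernel release and a timestamp,
# following the naming pattern quoted above. log_name is a hypothetical helper.
log_name() {
    printf 'dmesg-usb-dyndbg-%s-%s.txt\n' "$(uname -r)" "$(date '+%Y%m%d-%H%M%S')"
}

log_name
```

It would be used as the redirect target, for example: journalctl --no-hostname -k | grep -v audit | grep -C1 -n 'usb 1-1.3' > "$(log_name)"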

> I saved but did not attach the entire journalctl log without the grep, so I can do other searches without rebooting. I wasn't sure if the full log had anything that needed to be redacted.

Thanks for saving the entire logs. Attaching them probably won't be necessary, but the "Code:" line in three of the attached snippets looks very peculiar. The line is the same in all three cases:

$ grep -h -n 'Code:' *.txt  # The "-h" option suppresses the file names, so that the lines all align for easy comparison.

Could you look at the preceding log entries to see where they are coming from:

$ grep -B10 -n 'Code:' *.txt

"10" is a complete guess. Please adjust as you see fit.

> The bad logs are all similar to each other, and the good logs are all similar to each other.

Thanks for the annotated list:

dmesg-usb-nodyndbg-5.4.10.localversion10-00164-g093d658a0-dirty-old-good-after-patch-20200412-000716.txt <- final good kernel from the bisection + the patch

dmesg-usb-dyndbg-5.4.10.localversion9-00164-g093d658a0-old-good-before-patch-20200412-001250.txt <- final good kernel from the bisection, no patch

dmesg-usb-dyndbg-5.5.15-200.fc31.x86_64-bad-nopatch-20200412-003043.txt <- Fedora 5.5.15 stable kernel built by Fedora, bad

dmesg-usb-dyndbg-5.4.10.localversion8-00165-g7cbdf96cd-old-bad-nopatch-20200412-003754.txt <- the kernel with the 'usb: missing parentheses in USE_NEW_SCHEME' commit that broke the webcam

dmesg-usb-dyndbg-5.4.11.localversion11-dirty-bad-patch-20200412-062831.txt <- 5.4.11 kernel built from 'git checkout v5.4.11' with the two usb patches. (I did not reapply the patches. 'git status' and 'git diff' show that git maintained them through the 'git bisect reset' and the checkout.)

> I still have the patched 5.4.11 kernel in my build area, and I can do checks on it if you want to be sure that I applied the patches correctly.
> I suppose that I can also put in some debug code.

I know very little about the USB protocol, but based on these lines, I would say that the driver is reading corrupt or unexpected data from the USB device:

$ less dmesg-usb-dyndbg-5.5.15-200.fc31.x86_64-bad-nopatch-20200412-003043.txt
...
943:Apr 12 00:29:30 kernel: usbcore:usb_probe_device: usb 1-1.3: usb_probe_device
944:Apr 12 00:29:30 kernel: usbcore:usb_choose_configuration: usb 1-1.3: configuration #247 chosen from 1 choice
945:Apr 12 00:29:30 kernel: usb 1-1.3: can't set config #247, error -32
...

IMO, this should be taken upstream. The only further thing we could try is to enable dyndbg tracing from the kernel command-line, but I haven't figured out how to do that yet.

Comment 132 Steve 2020-04-12 12:49:54 UTC

> IMO, this should be taken upstream.

Here is a proposed bug summary:

[BISECTED] usb 1-1.3: can't set config #247, error -32

Here is a list of kernel bugs under Component: USB, Product: Drivers:

https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=USB&order=bug_id%20DESC&product=Drivers&query_format=advanced

Comment 133 Steve 2020-04-12 13:20:50 UTC

> ... I would say that the driver is reading corrupt or unexpected data from the USB device:

$ less dmesg-usb-dyndbg-5.5.15-200.fc31.x86_64-bad-nopatch-20200412-003043.txt
...
943:Apr 12 00:29:30 kernel: usbcore:usb_probe_device: usb 1-1.3: usb_probe_device
944:Apr 12 00:29:30 kernel: usbcore:usb_choose_configuration: usb 1-1.3: configuration #247 chosen from 1 choice
945:Apr 12 00:29:30 kernel: usb 1-1.3: can't set config #247, error -32
...

Compare that with the "good" case (unpatched):

$ less dmesg-usb-dyndbg-5.4.10.localversion9-00164-g093d658a0-old-good-before-patch-20200412-001250.txt
...
956:Apr 12 00:10:37 kernel: usbcore:usb_probe_device: usb 1-1.3: usb_probe_device
957:Apr 12 00:10:37 kernel: usbcore:usb_choose_configuration: usb 1-1.3: configuration #1 chosen from 1 choice
958:Apr 12 00:10:37 kernel: usbcore:usb_set_configuration: usb 1-1.3: adding 1-1.3:1.0 (config #1, interface 0)
959:Apr 12 00:10:37 kernel: usbcore:usb_set_configuration: usb 1-1.3: adding 1-1.3:1.1 (config #1, interface 1)
...
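
With the "m" and "f" flags in the logs, the chosen configuration can be pulled out mechanically, which makes good and bad runs easy to compare. A minimal sketch; chosen_configs is a hypothetical helper name, and it only matches lines that carry the usb_choose_configuration function name.

```shell
#!/bin/sh
# Print "<device> <config>" for every usb_choose_configuration message on
# stdin. This relies on the "mf" dyndbg flags being set, so that the
# function name appears in each line. chosen_configs is a hypothetical
# helper name.
chosen_configs() {
    sed -n 's/.*usb_choose_configuration: usb \([0-9.-]*\): configuration #\([0-9]*\) chosen.*/\1 \2/p'
}

# The two lines below are copied from the excerpts above.
chosen_configs <<'EOF'
944:Apr 12 00:29:30 kernel: usbcore:usb_choose_configuration: usb 1-1.3: configuration #247 chosen from 1 choice
957:Apr 12 00:10:37 kernel: usbcore:usb_choose_configuration: usb 1-1.3: configuration #1 chosen from 1 choice
EOF
```

On a saved log this would be run as: chosen_configs < dmesg-usb-dyndbg-....txt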

BTW, "#247" could have significance:

$ python3 -c 'print(hex(247), bin(247))'
0xf7 0b11110111

Comment 134 Steve 2020-04-12 13:53:39 UTC

> I suppose that I can also put in some debug code.

Feel free. It's open source, so you can modify it any way you want. :-)

Unfortunately, the USB 2.0 specification doesn't appear to be "free" -- I can't see a link to it here:

USB 2.0 Specification
https://usb.org/document-library/usb-20-specification

Comment 135 Steve 2020-04-12 14:13:01 UTC

I extracted the "lsusb -v" output* for the Ricoh camera and found this warning:

$ grep -C7 -n short lsusb-v-5.4.10-200-good-Ricoh-only.txt
414-        bmControls           0x00040004
415-          Auto-Exposure Priority
416-          Privacy
417-      VideoControl Interface Descriptor:
418-        bLength                11
419-        bDescriptorType        36
420-        bDescriptorSubtype      5 (PROCESSING_UNIT)
421:      Warning: Descriptor too short <<<<<<<<<<<<<<<<<<<<<<<<<<<<
422-        bUnitID                 2
423-        bSourceID               1
424-        wMaxMultiplier          0
425-        bControlSize            2
426-        bmControls     0x0000177f
427-          Brightness
428-          Contrast

* lsusb-v-5.4.10-200-good.txt
Attachment 1676176 [details]

Comment 136 Steve 2020-04-12 15:24:12 UTC

Created attachment 1678267 [details]
lsusb-v-5.4.10-200-good-Ricoh-only.txt ("lsusb -v" output for the Ricoh USB camera)

This is the "lsusb -v" output for the Ricoh USB camera extracted from:

lsusb-v-5.4.10-200-good.txt
dmesg-lsusb.tar.bz2
Attachment 1676176 [details]

$ wc -l lsusb-v-5.4.10-200-good-Ricoh-only.txt
427 lsusb-v-5.4.10-200-good-Ricoh-only.txt

$ grep -A10 -n 'Configuration Descriptor' lsusb-v-5.4.10-200-good-Ricoh-only.txt
17:  Configuration Descriptor:
18-    bLength                 9
19-    bDescriptorType         2
20-    wTotalLength       0x0265
21-    bNumInterfaces          2
22-    bConfigurationValue     1 <<<< Bogus "#247" value is from here.*
23-    iConfiguration          0 
24-    bmAttributes         0x80
25-      (Bus Powered)
26-    MaxPower              200mA
27-    Interface Association:

* per usb_choose_configuration() in drivers/usb/core/generic.c:

Comment 137 William Bader 2020-04-12 18:09:25 UTC

Created attachment 1678293 [details]
grep -B15 -n 'Code:' dmesg-all*.txt

>"#247" could have significance:

When I first had the error, I looked for 247 and F7 in the kernel bugzilla.
I was thinking that the 247 came from reading into another data structure because it comes up every time.

$ grep -h 'Code:' dmesg-all*.txt
Apr 12 00:36:41 kernel: Code: 1f 80 00 00 00 00 e8 9b c2 ff ff 48 8d bd 38 ff ff ff be 3d 00 00 00 48 89 85 28 ff ff ff 48 89 85 38 ff ff ff e8 2c f6 ff ff <80> 38 69 0f 85 2b 02 00 00 80 78 01 70 0f 85 21 02 00 00 0f b6 58
Apr 12 00:10:37 kernel: Code: 1f 80 00 00 00 00 e8 9b c2 ff ff 48 8d bd 38 ff ff ff be 3d 00 00 00 48 89 85 28 ff ff ff 48 89 85 38 ff ff ff e8 2c f6 ff ff <80> 38 69 0f 85 2b 02 00 00 80 78 01 70 0f 85 21 02 00 00 0f b6 58
Apr 12 06:27:09 kernel: Code: 1f 80 00 00 00 00 e8 9b c2 ff ff 48 8d bd 38 ff ff ff be 3d 00 00 00 48 89 85 28 ff ff ff 48 89 85 38 ff ff ff e8 2c f6 ff ff <80> 38 69 0f 85 2b 02 00 00 80 78 01 70 0f 85 21 02 00 00 0f b6 58
Apr 12 00:29:30 kernel: Code: 1f 80 00 00 00 00 e8 9b c2 ff ff 48 8d bd 38 ff ff ff be 3d 00 00 00 48 89 85 28 ff ff ff 48 89 85 38 ff ff ff e8 2c f6 ff ff <80> 38 69 0f 85 2b 02 00 00 80 78 01 70 0f 85 21 02 00 00 0f b6 58

The attachment is the result of grep -B15 -n 'Code:' dmesg-all*.txt
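
The "Code:" lines above appear to differ only in their timestamps. A minimal sketch for confirming that across any number of saved logs; distinct_code_lines is a hypothetical helper name.

```shell
#!/bin/sh
# Print how many distinct "Code:" payloads appear in the given log files.
# Everything up to and including "Code: " is stripped first, so differing
# timestamps do not count as differences. distinct_code_lines is a
# hypothetical helper name.
distinct_code_lines() {
    grep -h 'Code:' "$@" | sed 's/^.*Code: //' | sort -u | wc -l | tr -d ' '
}
```

Usage would be "distinct_code_lines dmesg-all*.txt"; a result of 1 means every file logged the same code bytes.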

Comment 138 Steve 2020-04-12 18:52:36 UTC

(In reply to William Bader from comment #137)
> Created attachment 1678293 [details]
> grep -B15 -n 'Code:' dmesg-all*.txt
> 
> >"#247" could have significance:
> 
> When I first had the error, I looked for 247 and F7 in the kernel bugzilla. I was thinking that the 247 came from reading into another data structure because it comes up every time.

Good idea to search for it. Could you explain what you mean by "reading into another data structure"?

> $ grep -h 'Code:' dmesg-all*.txt
...
> The attachment is the result of grep -B15 -n 'Code:' dmesg-all*.txt

Thanks. Your system is overheating:

... kernel: mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)

("mce" probably means "machine check event" -- "man mcelog".)

And nm-initrd-generator is crashing:

... kernel: nm-initrd-gener[353]: segfault at 0 ip 000055f1d5068d24 sp 00007ffe3bab86c0 error 4 in nm-initrd-generator[55f1d5063000+64000]

journalctl can pick out errors and display them in bright red:

$ journalctl --no-hostname -b -p err  # The journalctl man page lists other priorities for the "-p" option.

ABRT might have saved some crash dumps ("Problem Reporting").

Comment 139 Steve 2020-04-12 19:06:19 UTC

> dmesg-usb-dyndbg-5.4.11.localversion11-dirty-bad-patch-20200412-062831.txt <- 5.4.11 kernel built from 'git checkout v5.4.11' with the two usb patches. (I did not reapply the patches. 'git status' and 'git diff' show that git maintained them through the 'git bisect reset' and the checkout.)

Thanks for pointing that out. I may have misunderstood how "git bisect reset" works, because it left me with a detached HEAD:

$ git branch
* (no branch, bisect started on 9d61432ef)
  linux-5.4.y

$ git bisect reset
Previous HEAD position was 97d9e8620 bnx2x: Do not handle requests from VFs after parity
HEAD is now at 9d61432ef Linux 5.4.11

$ git branch
* (HEAD detached at 9d61432ef)
  linux-5.4.y

So to get to the main branch:

$ git checkout linux-5.4.y
Previous HEAD position was 9d61432ef Linux 5.4.11
Switched to branch 'linux-5.4.y'
Your branch is up to date with 'origin/linux-5.4.y'.

$ git branch
* linux-5.4.y

And then check out a tag, apply the patches, and attempt to check out a different commit:

$ git checkout v5.4.11
Note: checking out 'v5.4.11'.

You are in 'detached HEAD' state. ... (This has a longer informational note.)

$ git apply 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch
$ git apply 0002-Fix-a-typo-in-prevoius-patch.patch

Git lets you check out another commit with only an informational note (the "M" lines show that the patched files were carried along):

$ git checkout v5.4.10
M	drivers/usb/host/xhci-ring.c
M	drivers/usb/host/xhci.c
Previous HEAD position was 9d61432ef Linux 5.4.11
HEAD is now at 7a02c1932 Linux 5.4.10

$ git branch
* (HEAD detached at v5.4.10)
  linux-5.4.y

So it seems like there is some potential for mistakes when mixing checkouts and patches.
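
One way to reduce that risk is to refuse to switch commits while tracked files have uncommitted changes. This is only a sketch: "checkout_if_clean" is a made-up helper name, not a git command, and it is stricter than git itself, which carries local changes along whenever the files are identical in both commits.

```shell
#!/bin/bash
# Hypothetical helper (not part of git): switch commits only when no
# tracked file has uncommitted changes, e.g. from "git apply".
checkout_if_clean() {
    local target=$1
    # --porcelain prints one line per changed file; untracked files are ignored here.
    if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
        echo "Local changes present; commit, stash, or discard them first:" >&2
        git status --short --untracked-files=no >&2
        return 1
    fi
    git checkout "$target"
}

# Example:
#   checkout_if_clean v5.4.10
```

With the two USB patches applied, this would stop before the checkout and list xhci-ring.c and xhci.c instead of silently carrying them to the other commit.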

Comment 140 Steve 2020-04-12 19:28:14 UTC

(In reply to Steve from comment #138)
...
> And nm-initrd-generator is crashing:
...

The man page says: "nm-initrd-generator scans the command line for options relevant to network configuration ..."

It's not clear what "command line" that refers to, but if it were the _kernel_ command line, that would mean that nm-initrd-generator can't handle the dyndbg options on the _kernel_ command line.

Later, the man page actually does refer to "the kernel command line".

Now I have to check my logs ... :-)

Comment 141 Steve 2020-04-12 20:23:01 UTC

(In reply to Steve from comment #138)
> And nm-initrd-generator is crashing:

Bug 1823217 - nm-initrd-gener[333]: segfault at 0 ...

Comment 142 William Bader 2020-04-12 21:16:04 UTC

>Could you explain what you mean by "reading into another data structure"?

I was guessing that something passed a bad length because one of the first logs I posted had the lines below.
"Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: config 247 has too many interfaces: 120, using maximum allowed: 32"
"Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: config 247 descriptor has 1 excess byte, ignoring"

>Your system is overheating

I know. The cpu work from shutting down and then booting can make it overheat. It happened when it was new. That is why I have avoided doing kernel builds on it.

>nm-initrd-generator is crashing

I've given up trying to debug NetworkManager.

>https://bugzilla.redhat.com/show_bug.cgi?id=1823217

Thanks for researching it. Does that mean that there is a bug in the nm debug code?

Normal user mode programs shouldn't see any difference when dyndbg is enabled, right?

I was wondering if dyndbg could make the usb code run slightly slower, which might possibly fix errors due to timing issues.

>So it seems like there is some potential for mistakes when mixing checkouts and patches.

I had expected that git checkout would overwrite all of the changed files and drop the patches. I was surprised when it kept the patches. Maybe 'git apply' does more than just running 'patch'.

Is it worth starting fresh with 5.5 and applying the two usb patches?

Comment 143 Steve 2020-04-12 22:22:19 UTC

>>Could you explain what you mean by "reading into another data structure"?

>I was guessing that something passed a bad length because one of the first logs I posted had the lines below.
>"Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: config 247 has too many interfaces: 120, using maximum allowed: 32"
>"Mar 30 15:00:58 scslaptop37 kernel: usb 1-1.3: config 247 descriptor has 1 excess byte, ignoring"

Thanks. Now I see. As can be seen in the "lsusb -v" output, the USB data structures use numerous lengths and counts. If any one of those is corrupt, everything else will be corrupt too. And that makes me wonder whether USB has any built-in support for integrity checking, such as checksums or hashes. That bogus "#247" value would never be interpreted if the integrity check failed.

>>Your system is overheating

>I know. The cpu work from shutting down and then booting can make it overheat. It happened when it was new. That is why I have avoided doing kernel builds on it.

I can see why.

>>nm-initrd-generator is crashing

>I've given up trying to debug NetworkManager.

>>https://bugzilla.redhat.com/show_bug.cgi?id=1823217

>Thanks for researching it. Does that mean that there is a bug in the nm debug code?

Possibly. But the allowed kernel command-line syntax could be inadequately specified, so dyndbg could be using what seems to be legal syntax that "nm" isn't expecting. That would indicate an interface specification bug.

>Normal user mode programs shouldn't see any difference when dyndbg is enabled, right?

Presumably.

>I was wondering if dyndbg could make the usb code run slightly slower, which might possibly fix errors due to timing issues.

I was wondering about that too.

>>So it seems like there is some potential for mistakes when mixing checkouts and patches.

>I had expected that git checkout would overwrite all of the changed files and drop the patches. I was surprised when it kept the patches. Maybe 'git apply' does more than just running 'patch'.

git tries not to trash any working files. While experimenting, I found that git, in some cases, refuses to checkout files that would overwrite working files (and git provides an informative error message saying so).

>Is it worth starting fresh with 5.5 and applying the two usb patches?

No. We need to turn this over to the USB experts upstream. :-) If they have a patch, we are now perfectly positioned to test it.

Comment 144 Steve 2020-04-12 22:37:25 UTC

>HWMON_MODULES="coretemp"

I ran "sensors-detect" on my laptop and, despite answering "yes" for every probe, got exactly the same result. Idle temps are 42C to 48C.

"sensors" shows a "cpu_fan" that always seems to be at 2000 RPM or 2100 RPM, so I don't know if that speed is a credible value.

That section of the output is headed "asus-isa-0000", so I believe it is coming from an asus-specific kernel module.

Comment 145 Steve 2020-04-12 23:03:14 UTC

>While experimenting, I found that git, in some cases, refuses to checkout files that would overwrite working files (and git provides an informative error message saying so).

Here is an example:

$ git branch
* linux-5.4.y

$ git apply 0001-xhci-testpatch-Don-t-clear-TT-buffer-on-ep0-protocol-1.patch
$ git apply 0002-Fix-a-typo-in-prevoius-patch.patch

$ git checkout v5.4.11
error: Your local changes to the following files would be overwritten by checkout:
	drivers/usb/host/xhci-ring.c
Please commit your changes or stash them before you switch branches.
Aborting
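
What the error message suggests can be sketched as a stash, checkout, and unstash sequence. "checkout_with_stash" is a made-up helper name, and the final "git stash pop" can still stop with conflicts if the patched lines differ between the two commits:

```shell
#!/bin/bash
# Hypothetical helper (not part of git): set local changes aside,
# switch commits, then reapply the changes on top of the new commit.
checkout_with_stash() {
    local target=$1
    if git diff --quiet HEAD; then
        # No changes to tracked files, so there is nothing to stash.
        git checkout --quiet "$target"
        return
    fi
    git stash push --quiet || return 1
    if git checkout --quiet "$target"; then
        git stash pop --quiet        # may report conflicts
    else
        git stash pop --quiet        # checkout failed: restore the changes
        return 1
    fi
}

# Example:
#   checkout_with_stash v5.4.11
```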

$ git stash --help
...
"Use git stash when you want to record the current state of the working directory and the index, but want to go back to a clean working directory."

Comment 146 William Bader 2020-04-12 23:36:37 UTC

I submitted a report at https://bugzilla.kernel.org/show_bug.cgi?id=207219

Comment 147 Steve 2020-04-13 03:59:41 UTC

(In reply to William Bader from comment #146)
> I submitted a report at https://bugzilla.kernel.org/show_bug.cgi?id=207219

Thanks! Including the log entries for the "good" kernel was a good idea.

You can also include a link to it at the top of the Fedora BZ page in the "Links" section. The pull-down menu has a "Linux Kernel" item that looks like the relevant choice.

Interestingly, the "Links" menu is not visible in this bug report, even when logged in, so I had to look at one of my own bug reports to see the menu.

Comment 148 Steve 2020-04-13 05:05:41 UTC

Replying to William in Comment 72:

>>Could you go into more detail about that [VM configuration]?

>The office is on coronavirus lockdown. I am on lockdown also, stranded far from the office.

Thank goodness for the internet.

>I showed this bug report to the person who manages the VMs, and he connected remotely and created a Fedora 31 VM with 8GB RAM, 44GB disk (11GB currently used by the OS and the kernel build), and one virtual cpu that shows as an Intel Skylake Processor.

Wow! All you had to do is ask, and you got a "desktop" VM. :-)

>Ten years ago, we had a computer room full of headless desktops and towers. We got a single big server and migrated everything to VMs on the big server.

That's a lot more convenient to administer, but it seems like the old-school redundancy has some advantages.

>We do daily backups of the important VMs, but it would still be a pain to lose one, so only a few people have access, and I am not on that list. 

OK.

>It is better that way, so I don't get the blame if something breaks.

That's why I try not to release shell scripts that must be run as root. :-) (Comment 64)

>I don't know what tool he uses or how he installed Fedora.

OK. I was mainly interested in how it is done on "big iron".

>He installed gcc, but I had to install make, flex, bison, and a few libraries.
>Before I did an in-place update from Fedora 30 to 31 on my laptop, he made a Fedora 30 VM and tested the update, and I think that he has already set up a Fedora 31 VM for another project that needed an OS more recent than CentOS 7.

Testing the F30->F31 system upgrade in a VM was a really good idea. In another bug report, the reporter had a boot failure with a kernel panic after a system upgrade. (Bug 1815102)

>I have some VMs on my laptop under VirtualBox, but my laptop supports only 8GB RAM, so I can't do much in the VMs.

My desktop system has 8GB RAM, but I usually run only one VM at a time. The few times I ran more than one, I got completely confused about which was which. :-)

>I suppose that since the webcam is a hardware issue, if it doesn't work on the OS on the bare metal, it won't work inside a VM.

VMs are great for testing, but testing on bare metal is sometimes the only option. Sometimes even seemingly similar hardware doesn't behave the same way. (Bug 1814810, Comment 14)

Comment 149 William Bader 2020-04-13 05:48:27 UTC

>You can also include a link to it at the top of the Fedora BZ page in the "Links" section.

Thanks, I added the link.

The kernel bugzilla doesn't have anyone entered in the mailing list. Are there any chances that someone will look into it?

I could probably submit a patch for the call to the USE_NEW_SCHEME macro, but I suspect that it would not be accepted.

I saw something about quirks. Is there a way to tell the kernel that webcams with USB 05ca:18c0 need special initialization?

>Testing the F30->F31 system upgrade in a VM was a really good idea. In another bug report, the reporter had a boot failure with a kernel panic after a system upgrade.

Our hardware person has another rule that we don't update any operating systems (other than expendable test VMs) until a month after a release so the serious problems are shaken out.

Before doing an in-place Fedora upgrade, I make a backup of my laptop, and then I boot from the Live CD and check that hardware like the wifi, speakers and microphone work.

Comment 150 Steve 2020-04-13 07:04:26 UTC

(In reply to William Bader from comment #149)
> >You can also include a link to it at the top of the Fedora BZ page in the "Links" section.
> 
> Thanks, I added the link.

WFM.

> The kernel bugzilla doesn't have anyone entered in the mailing list. Are there any chances that someone will look into it?

If you are referring to the assignee, I believe that "Default virtual assignee for Drivers/USB" means bugs are posted to a mailing list.

Anyway, having "[BISECTED]" in the bug summary should attract attention because it means that:

1. Maintainers won't have to pull teeth to get essential information.
2. You know what you are doing. :-)

> I could probably submit a patch for the call to the USE_NEW_SCHEME macro, but I suspect that it would not be accepted.

The USB specification is notoriously complicated, so I would suggest letting the USB maintainers figure out what to do.

> I saw something about quirks. Is there a way to tell the kernel that webcams with USB 05ca:18c0 need special initialization?

Again -- leave it to the maintainers.

> >Testing the F30->F31 system upgrade in a VM was a really good idea. In another bug report, the reporter had a boot failure with a kernel panic after a system upgrade.
> 
> Our hardware person has another rule that we don't update any operating systems (other than expendable test VMs) until a month after a release so the serious problems are shaken out.

That's a good rule. And that is why my prime system is F30, not F31. But that didn't save me from a system lockup: Bug 1806747.

> Before doing an in-place Fedora upgrade, I make a backup of my laptop, and then I boot from the Live CD and check that hardware like the wifi, speakers and microphone work.

That's a very good idea.

Comment 151 Steve 2020-04-13 07:14:37 UTC

(In reply to William Bader from comment #149)
...
> The kernel bugzilla doesn't have anyone entered in the mailing list. Are there any chances that someone will look into it?
...

Many Linux patches are posted on Patchwork. You can draw your own conclusions after looking at this list:

Linux USB - Patchwork
https://patchwork.kernel.org/project/linux-usb/list/

Comment 152 Steve 2020-04-13 07:18:56 UTC

Replying to William in Comment 15:

"... I have never been successful getting my laptop to boot from a pen drive."

That could depend on the pen drive. I had one low-end, name-brand pen drive that was unusable with Linux.

Now, for boot drives, I try to get drives for which the manufacturer reports read and write speeds. Based on such specs, the drives with larger capacities seem to be faster too.

This is my current "premium" USB flash drive:

Kingston 64GB DataTraveler Elite G2 Black Metal Casing Fast 180MB/s R, 70MB/s W USB 3.1 Flash Drive with LED light indicator (DTEG2/64GB)

I have also had good luck with these lower-end drives:

SanDisk 16GB 2.0 Flash Cruzer Glide USB Drive

SanDisk 32GB Ultra Fit USB 3.1 Flash Drive
(This is small, so it can be left always plugged into a laptop. I use it as a grub2 boot drive for my laptop: "man grub2-mkrescue".)

Comment 153 Steve 2020-04-13 07:48:56 UTC

If you just want a working camera, you could "git revert" the "bad" commit.

For this test, I already had "v5.4.11" checked out.

Since "git revert" creates new commits, start with a new branch:

$ git branch test-1
$ git checkout test-1

$ git branch
  linux-5.4.y
* test-1

Verify that HEAD is at v5.4.11 in our new branch:

$ git log --oneline -n1 HEAD
9d61432ef (HEAD -> test-1, tag: v5.4.11) Linux 5.4.11

Verify that we have the right commit for the revert:

$ git log --oneline -n1 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430
7cbdf96cd usb: missing parentheses in USE_NEW_SCHEME

This will throw you into an editor. Just write and quit:

$ git revert 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430
[test-1 a23bddd0e] Revert "usb: missing parentheses in USE_NEW_SCHEME"
 1 file changed, 1 insertion(+), 1 deletion(-)

Now we should have a "working" kernel:

$ git log --oneline -n2 HEAD
a23bddd0e (HEAD -> test-1) Revert "usb: missing parentheses in USE_NEW_SCHEME"
9d61432ef (tag: v5.4.11) Linux 5.4.11

$ git show -q HEAD
commit a23bddd0e1e15f92d17ae16e787d768ac8e7b029 (HEAD -> test-1)
Author: No Name <noemail>
Date:   Mon Apr 13 00:42:12 2020 -0700

    Revert "usb: missing parentheses in USE_NEW_SCHEME"
    
    This reverts commit 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430.

Comment 154 Steve 2020-04-13 08:26:12 UTC

(In reply to Steve from comment #153)
...
> Since "git revert" creates new commits, start with a new branch:
...

The "-n" option of "git revert" lets you undo the "bad" commit without creating a new commit -- it just modifies the "working tree and the index".

That is similar to what "git apply" does with patches, except that "git apply" lets you control what combination of changes to files and to the index are made.
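
A minimal sketch of the "-n" variant, assuming you are already on the commit you want to build. "revert_without_commit" is a made-up helper name; "--no-commit" is the long form of "-n":

```shell
#!/bin/bash
# Hypothetical helper: undo one commit in the working tree and the index
# only, without recording a new commit.
revert_without_commit() {
    local commit=$1
    git revert --no-commit "$commit" || return 1
    # Show which files the pending (staged) revert touches.
    git diff --cached --stat
}

# Example, using the commit found by the bisection:
#   revert_without_commit 7cbdf96cda1fbffb17ec26ea65e1fe63c9aed430
```

Because nothing is committed, "git log" still shows v5.4.11 at HEAD, and the change shows up only in "git status" and "git diff --cached".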

Comment 155 Steve 2020-04-13 17:11:55 UTC

BTW, if you want to know more about USB at a technical level, this is a good book:

"USB Complete: The Developer's Guide", Fifth Edition, by Jan Axelson (2015).

(available at a well-known online seller)

And, based on the online index, USB supports some "error-checking". (re my Comment 143)

But even the error checking can have bugs:

USB: core: fix check for duplicate endpoints
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e4f8e21c4f27bcf30a48486b9dcc269512b79ff

USB: fix problems with duplicate endpoint addresses
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a8fd1346254974c3a852338508e4a4cddbb35f1

Comment 156 William Bader 2020-04-14 22:13:32 UTC

Thanks for the additional information.
The kernel people think that the problem was an earlier commit that caused an issue only after the commit that you helped me find by bisection.
They asked me to build a bad kernel without that commit. https://bugzilla.kernel.org/show_bug.cgi?id=207219#c4
I used the procedure that you suggested yesterday in https://bugzilla.redhat.com/show_bug.cgi?id=1818952#c153

For now, when I need the webcam, I am booting from the distributed Fedora 5.4.10-200.fc31.x86_64 kernel, so my main goal is coming up with a solution that eventually gets into the official Fedora kernels so I don't have to keep rebooting or building kernels.

Thanks for the pen drive information. I'll try it after the virus thing.

Comment 157 Steve 2020-04-15 03:34:25 UTC

(In reply to William Bader from comment #156)
> Thanks for the additional information.
> The kernel people think that the problem was an earlier commit that caused an issue only after the commit that you helped me find by bisection. They asked me to build a bad kernel without that commit.
> https://bugzilla.kernel.org/show_bug.cgi?id=207219#c4

> I used the procedure that you suggested yesterday in
> https://bugzilla.redhat.com/show_bug.cgi?id=1818952#c153

I was quite surprised that you were asked to test a revert after having posted Comment 153.

BTW, I noticed that you didn't create a shallow clone:

$ git clone --branch linux-5.4.y https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4

As you must have discovered, commit bd0e6c9614b9 isn't in a repo with a commit history in the range from v5.3.y to v5.4.y.

$ git describe bd0e6c9614b9
fatal: Not a valid object name bd0e6c9614b9

The earliest tag in the shallow clone is:

$ fig-tags.sh | tail -1
24cb7d728 2019-09-30 10:35:53 -0700 v5.4-rc1

The commit is dated 2018-10-02, which is almost a year earlier:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=bd0e6c9614b95352eb31d0207df16dc156c527fa

For the record, I use shallow clones to reduce download time and disk usage, but they are insufficient for some problems. Git will sometimes show a commit as "grafted", which is displayed when there is no history before a particular commit. It took me a long time to realize that when git shows a commit as "grafted", it is telling me that I shouldn't have been so cheap and should have downloaded the whole history instead. :-)
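
A quick way to check for that situation before hunting for a commit is sketched below. "have_commit" and "is_shallow" are made-up helper names; the git options they use ("cat-file -e", "rev-parse --is-shallow-repository", "fetch --unshallow") are standard, although "--is-shallow-repository" needs a reasonably recent git:

```shell
#!/bin/bash
# Hypothetical helpers for working with shallow clones.

# Succeed if the given commit exists in this clone.
have_commit() {
    git cat-file -e "$1^{commit}" 2>/dev/null
}

# Succeed if this clone has truncated history.
is_shallow() {
    [ "$(git rev-parse --is-shallow-repository)" = "true" ]
}

# Example:
#   if ! have_commit bd0e6c9614b9 && is_shallow; then
#       git fetch --unshallow    # download the rest of the history
#   fi
```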

> For now, when I need the webcam, I am booting from the distributed Fedora 5.4.10-200.fc31.x86_64 kernel, so my main goal is coming up with a solution that eventually gets into the official Fedora kernels so I don't have to keep rebooting or building kernels.

Agreed, although booting 5.4.10 sounds like a better workaround than using the F31 Live image because it is faster and you don't have to install apps into a live environment.

> Thanks for the pen drive information. I'll try it after the virus thing.

You're welcome and good health.

Comment 158 Steve 2020-04-15 03:41:43 UTC

(In reply to William Bader from comment #156)
> I used the procedure that you suggested yesterday in https://bugzilla.redhat.com/show_bug.cgi?id=1818952#c153

BZ tip: You can type in the literal string, Comment 153, and BZ will automatically create a link. Test with the "Preview" tab.

2.7.1. Autolinkification
https://bugzilla.redhat.com/docs/en/html/using/tips.html#autolinkification

Comment 159 Steve 2020-04-15 04:18:21 UTC

There may be another workaround that doesn't require rebooting. This kernel parameter can be set at runtime:

/sys/module/usbcore/parameters/old_scheme_first

To test, use two terminal windows:

In the first, run this to monitor what happens:

$ journalctl --no-hostname -k -f

In the second, run:

# cat /sys/module/usbcore/parameters/old_scheme_first # Verify.*
N

# echo 1 > /sys/module/usbcore/parameters/old_scheme_first # Change.

# cat /sys/module/usbcore/parameters/old_scheme_first # Verify.
Y

Next, use Alan's method** to restart the controller:

# echo 0 >/sys/bus/usb/devices/1-1/bConfigurationValue
# echo 1 >/sys/bus/usb/devices/1-1/bConfigurationValue

And then see if your video devices are present:

# ls /dev/video*

NB: "old_scheme_first" is global, so changing it could cause something to break. However, it is not a permanent change, so rebooting should restore it to its original value.

The relevant kernel source code is in drivers/usb/core/hub.c.

Disclaimer: I tested the setting and restarting in a VM, but that may not reflect what happens on bare metal.

* The "cat" and "ls" commands don't need to be run as root; it is just more convenient that way.

** https://bugzilla.kernel.org/show_bug.cgi?id=207219#c1
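
For repeated testing, the steps above could be wrapped in two small functions. This is only a sketch: the function names and the SYSFS variable are made up for illustration (SYSFS defaults to /sys and exists only so the script can be dry-run against a scratch directory), and device "1-1" with configuration 1 are the values from the commands above, which may differ on other systems.

```shell
#!/bin/bash
# Sketch only. SYSFS is an illustrative variable, not a kernel convention.
SYSFS=${SYSFS:-/sys}

# Write a new value to usbcore's old_scheme_first and show before/after.
set_old_scheme_first() {
    local param="$SYSFS/module/usbcore/parameters/old_scheme_first"
    if [ ! -w "$param" ]; then
        echo "cannot write $param (not root?)" >&2
        return 1
    fi
    echo "before: $(cat "$param")"
    echo "$1" > "$param"
    echo "after:  $(cat "$param")"
}

# Unconfigure and reconfigure one USB device, e.g. "1-1".
reconfigure_usb_device() {
    local cfg="$SYSFS/bus/usb/devices/$1/bConfigurationValue"
    if [ ! -w "$cfg" ]; then
        echo "cannot write $cfg (not root, or no such device?)" >&2
        return 1
    fi
    echo 0 > "$cfg"
    echo 1 > "$cfg"    # configuration 1, as in the commands above
}

# Example (as root):
#   set_old_scheme_first 1 && reconfigure_usb_device 1-1 && ls /dev/video*
```

On a real system the kernel reports the parameter as "Y" or "N"; in a scratch directory the file simply holds whatever was written.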

Comment 160 William Bader 2020-04-15 04:33:35 UTC

>As you must have discovered, commit bd0e6c9614b9 isn't in a repo with a commit history in the range from v5.3.y to v5.4.y.

Yes, I saw that it wasn't there, and I wasn't sure how to find it, and they might have asked me to revert other commits even further back, so I cloned everything.

>booting 5.4.10 sounds like a better workaround

I have the kernel from koji installed in /boot, so I don't need the Live CD.
My laptop has an SSD, and I have very few services running, so shutdown and reboot is around a minute.
The biggest pain is that I have 20 workspaces open under Mate Desktop, so I have a lot of applications to close and then reopen.

>BZ will automatically create a link.

Thanks, I'll try that next time.

>Then see if your video devices are present:

It didn't work. Maybe the webcam is permanently messed up after it boots.
It would have been nice if it had worked.

$ uname -r
5.5.15-200.fc31.x86_64

# cat /sys/module/usbcore/parameters/old_scheme_first
N
# echo 1 > /sys/module/usbcore/parameters/old_scheme_first
# cat /sys/module/usbcore/parameters/old_scheme_first
Y
# echo 0 >/sys/bus/usb/devices/1-1/bConfigurationValue
# echo 1 >/sys/bus/usb/devices/1-1/bConfigurationValue
# ls /dev/video*
ls: cannot access '/dev/video*': No such file or directory

$ journalctl --no-hostname -k -f 
-- Logs begin at Sat 2019-07-27 03:29:49 WEST. --
...
Apr 15 05:22:58 kernel: usb 1-1.2: Product: Bluetooth USB Host Controller
Apr 15 05:22:58 kernel: usb 1-1.2: Manufacturer: Atheros Communications
Apr 15 05:22:58 kernel: usb 1-1.2: SerialNumber: Alaska Day 2006
Apr 15 05:22:58 kernel: usb 1-1.3: new full-speed USB device number 7 using ehci-pci
Apr 15 05:22:59 kernel: usb 1-1.3: device not accepting address 7, error -32
Apr 15 05:22:59 kernel: usb 1-1.3: new full-speed USB device number 8 using ehci-pci
Apr 15 05:22:59 kernel: usb 1-1.3: device not accepting address 8, error -32
Apr 15 05:22:59 kernel: usb 1-1-port3: attempt power cycle
Apr 15 05:23:00 kernel: usb 1-1.3: new full-speed USB device number 9 using ehci-pci
Apr 15 05:23:00 kernel: usb 1-1.3: device descriptor read/64, error -32
Apr 15 05:23:00 kernel: usb 1-1.3: device descriptor read/64, error -32
Apr 15 05:23:00 kernel: usb 1-1.3: new full-speed USB device number 10 using ehci-pci
Apr 15 05:23:00 kernel: usb 1-1.3: device descriptor read/64, error -32
Apr 15 05:23:00 kernel: usb 1-1.3: device descriptor read/64, error -32
Apr 15 05:23:00 kernel: usb 1-1-port3: unable to enumerate USB device

Comment 161 Steve 2020-04-15 05:15:00 UTC

> It didn't work. Maybe the webcam is permanently messed up after it boots.

Thanks for testing that. Here is another variant:

Append "usbcore.old_scheme_first=1" to the kernel command-line (without the quotes) from grub2.
Press "ctrl-x" to boot and login.

Verify:

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.16-200.fc31.x86_64 root=UUID=c5f30768-dfaf-4dc8-bb85-9b8f088fb16e ro usbcore.old_scheme_first=1

$ cat /sys/module/usbcore/parameters/old_scheme_first
Y

Test:

$ ls /dev/video*

I tested the command-line setting in a VM with 5.5.16-200.fc31.x86_64.
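
The verification step can be scripted so that it matches the whole parameter rather than a substring. "cmdline_has" is a made-up helper name, and the grubby commands in the comments are how I would expect to make the setting persistent on Fedora (I have not tested them on the affected laptop):

```shell
#!/bin/bash
# Hypothetical helper: succeed if a kernel command line contains the
# given parameter as a whole word. The file defaults to /proc/cmdline.
cmdline_has() {
    local param=$1 file=${2:-/proc/cmdline}
    # Put each parameter on its own line, then match a full line literally.
    tr -s '[:space:]' '\n' < "$file" | grep -qxF -- "$param"
}

# Example:
#   cmdline_has usbcore.old_scheme_first=1 && echo "old scheme requested"
#
# To add or remove the parameter for every installed kernel (as root):
#   grubby --update-kernel=ALL --args="usbcore.old_scheme_first=1"
#   grubby --update-kernel=ALL --remove-args="usbcore.old_scheme_first"
```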

Comment 162 Steve 2020-04-15 11:30:20 UTC

Here is a bit more about finding module parameters.

Module directories are listed here:

$ ls /sys/module/ | wc -l
181

Most of them have a "parameters" subdirectory:

$ ls -d /sys/module/*/parameters | wc -l
113

Even when "modinfo" won't show anything:

$ modinfo usbcore
modinfo: ERROR: Module usbcore not found.

Module parameters can still be listed:

$ grep '' /sys/module/usbcore/parameters/*
/sys/module/usbcore/parameters/authorized_default:-1
/sys/module/usbcore/parameters/autosuspend:2
/sys/module/usbcore/parameters/blinkenlights:N
/sys/module/usbcore/parameters/initial_descriptor_timeout:5000
/sys/module/usbcore/parameters/nousb:N
/sys/module/usbcore/parameters/old_scheme_first:N
/sys/module/usbcore/parameters/quirks:
/sys/module/usbcore/parameters/usbfs_memory_mb:16
/sys/module/usbcore/parameters/usbfs_snoop:N
/sys/module/usbcore/parameters/usbfs_snoop_max:65536
/sys/module/usbcore/parameters/use_both_schemes:Y
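
The same listing can be produced as "name=value" lines, which are easier to diff between boots. This is a sketch; "list_module_params" is a made-up helper name, and the second argument exists only so it can be pointed at a copy of the directory tree:

```shell
#!/bin/bash
# Hypothetical helper: print the parameters of a module (built-in or
# loadable) as name=value lines. The root defaults to /sys/module.
list_module_params() {
    local module=$1 root=${2:-/sys/module} f
    for f in "$root/$module/parameters"/*; do
        # Skip parameters that are unreadable (or a module with none).
        [ -r "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Example:
#   list_module_params usbcore > usbcore-params-before.txt
```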

> ... I have 20 workspaces open under Mate Desktop, so I have a lot of applications to close and then reopen [after rebooting with a kernel that supports the webcam].

MATE has an option to "Automatically remember running applications when logging out". (Under System:Preferences:Personal)

However, it is somewhat inconsistent about what is and what is not restored. And the behavior seems to change between the first restart and later restarts.

Tested in an F31 MATE VM purpose-built for testing that feature.

Comment 163 Steve 2020-04-15 12:25:16 UTC

It's not just your camera. I have a SanDisk USB flash drive installed as a grub2 boot device on my laptop, so it is present at boot-time.

With the default "old_scheme_first" ("N"), there is an error:

$ cat usb-sandisk-new-scheme-first-1.txt
Apr 14 21:52:19 kernel: Command line: BOOT_IMAGE=(hd1,msdos6)/vmlinuz-5.5.16-100.fc30.x86_64 root=/dev/mapper/[removed] ro
Apr 14 21:52:19 kernel: usb 2-1: new high-speed USB device number 2 using ehci-pci
Apr 14 21:52:19 kernel: usb 2-1: device not accepting address 2, error -71 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Apr 14 21:52:19 kernel: usb 2-1: new high-speed USB device number 3 using ehci-pci
Apr 14 21:52:19 kernel: usb 2-1: New USB device found, idVendor=8087, idProduct=0024, bcdDevice= 0.00
Apr 14 21:52:19 kernel: usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Apr 14 21:52:19 kernel: usb 2-1.2: new high-speed USB device number 4 using ehci-pci
Apr 14 21:52:19 kernel: usb 2-1.2: New USB device found, idVendor=0781, idProduct=5583, bcdDevice= 1.00
Apr 14 21:52:19 kernel: usb 2-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Apr 14 21:52:19 kernel: usb 2-1.2: Product: Ultra Fit
Apr 14 21:52:19 kernel: usb 2-1.2: Manufacturer: SanDisk
Apr 14 21:52:19 kernel: usb 2-1.2: SerialNumber: [removed]

However, there is NO ERROR with "old_scheme_first" set to "Y" on the kernel command-line:

$ cat usb-sandisk-old-scheme-first-1.txt 
Apr 14 21:59:06 kernel: Command line: BOOT_IMAGE=(hd1,msdos6)/vmlinuz-5.5.16-100.fc30.x86_64 root=/dev/mapper/[removed] ro usbcore.old_scheme_first=1
Apr 14 21:59:06 kernel: usb 2-1: new high-speed USB device number 2 using ehci-pci
Apr 14 21:59:06 kernel: usb 2-1: New USB device found, idVendor=8087, idProduct=0024, bcdDevice= 0.00
Apr 14 21:59:06 kernel: usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Apr 14 21:59:06 kernel: usb 2-1.2: new high-speed USB device number 3 using ehci-pci
Apr 14 21:59:06 kernel: usb 2-1.2: New USB device found, idVendor=0781, idProduct=5583, bcdDevice= 1.00
Apr 14 21:59:06 kernel: usb 2-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Apr 14 21:59:06 kernel: usb 2-1.2: Product: Ultra Fit
Apr 14 21:59:06 kernel: usb 2-1.2: Manufacturer: SanDisk
Apr 14 21:59:06 kernel: usb 2-1.2: SerialNumber: [removed]

Use this as a quick test after booting:

$ journalctl --no-hostname -k -p err

And to check a prior boot:

$ journalctl --no-hostname -k -p err -b -1

Comment 164 Steve 2020-04-15 12:39:26 UTC

(In reply to Steve from comment #163)

I forgot to document my command-line for collecting those log entries:

$ journalctl --no-hostname -k | egrep 'Command line|usb 2-1'
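
To pick out only the USB enumeration failures seen in this bug report, something like the following filter could be used. "usb_enum_errors" is a made-up name, and the pattern covers only the messages quoted in these comments, so it is not a complete list of USB errors:

```shell
#!/bin/bash
# Hypothetical filter: read a kernel log on stdin and print the USB
# enumeration problems quoted in this bug report.
usb_enum_errors() {
    grep -E 'usb [0-9.-]+(-port[0-9]+)?: .*(error -[0-9]+|unable to enumerate|too many interfaces|excess byte)'
}

# Example:
#   journalctl --no-hostname -k | usb_enum_errors
```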

Comment 165 Steve 2020-04-15 13:47:41 UTC

(In reply to William Bader from comment #28)
> ... there is a second bug, possibly hardware, that doesn't reinitialize something during warm boots.

I think I am seeing something like that with a consistent kernel and without special command-line options:

... Command line: BOOT_IMAGE=(hd1,msdos6)/vmlinuz-5.5.16-100.fc30.x86_64 root=/dev/mapper/[removed] ro

Test procedure:

Configure a laptop with a SanDisk Ultra Fit USB flash drive as the grub2 boot device in a USB 2 port, and configure the BIOS to always boot from it.

$ poweroff

Power on.
...
$ journalctl --no-hostname -k -p err

Repeatedly run:

$ reboot
...
$ journalctl --no-hostname -k -p err

Sometimes I get this error, and sometimes I don't:

... kernel: usb 2-1: device not accepting address 2, error -71

More often I do.

Comment 166 Steve 2020-04-15 14:04:13 UTC

I get the same inconsistent behavior on warm reboots with "usbcore.old_scheme_first=1" on the kernel command-line:

... kernel: Command line: BOOT_IMAGE=(hd1,msdos6)/vmlinuz-5.5.16-100.fc30.x86_64 root=/dev/mapper/[removed] ro usbcore.old_scheme_first=1

So I wonder if that option even works.

Comment 167 Steve 2020-04-15 14:26:48 UTC

Even with cold boots I get inconsistent behavior with:

... Command line: BOOT_IMAGE=(hd1,msdos6)/vmlinuz-5.5.16-100.fc30.x86_64 root=/dev/mapper/[removed] ro

Procedure:

Repeatedly run:
 
$ poweroff

Power on.
...
$ journalctl --no-hostname -k -p err

The only explanation I can think of is that there is a race during USB initialization.
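
To put a number on "sometimes", one could save the kernel log after every boot and count how many of the saved logs contain the error. A sketch, with a made-up function name and made-up file names:

```shell
#!/bin/bash
# Hypothetical helper: given a pattern and a list of saved boot logs,
# report how many of the logs contain at least one matching line.
count_failing_boots() {
    local pattern=$1; shift
    local total=0 bad=0 f
    for f in "$@"; do
        total=$((total + 1))
        if grep -qE -- "$pattern" "$f"; then
            bad=$((bad + 1))
        fi
    done
    echo "$bad of $total boots logged the error"
}

# Example: after each boot, save the log ...
#   journalctl --no-hostname -k -b 0 > "boot-$(date +%Y%m%d-%H%M%S).txt"
# ... and later summarize:
#   count_failing_boots 'device not accepting address' boot-*.txt
```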

Comment 168 William Bader 2020-04-15 14:44:09 UTC

>Here is another variant:

That didn't work. I don't understand why. I think that I set it correctly.
I did a shutdown and reboot:

[before reboot with usbcore.old_scheme_first=1]

$ uname -r
5.5.15-200.fc31.x86_64
$ ls /sys/module/ | wc -l
199
$ ls -d /sys/module/*/parameters | wc -l
121
$ grep '' /sys/module/usbcore/parameters/*
/sys/module/usbcore/parameters/authorized_default:-1
/sys/module/usbcore/parameters/autosuspend:2
/sys/module/usbcore/parameters/blinkenlights:N
/sys/module/usbcore/parameters/initial_descriptor_timeout:5000
/sys/module/usbcore/parameters/nousb:N
/sys/module/usbcore/parameters/old_scheme_first:N <-
/sys/module/usbcore/parameters/quirks:
/sys/module/usbcore/parameters/usbfs_memory_mb:16
/sys/module/usbcore/parameters/usbfs_snoop:N
/sys/module/usbcore/parameters/usbfs_snoop_max:65536
/sys/module/usbcore/parameters/use_both_schemes:Y

[after reboot]

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.15-200.fc31.x86_64 root=UUID=01ea3428-c96d-4f4c-af30-2072ce724031 ro rhgb quiet elevator=noop usbcore.old_scheme_first=1 LANG=en_US.UTF-8 mitigations=off
$ cat /sys/module/usbcore/parameters/old_scheme_first
Y
$ ls /dev/vid*
ls: cannot access '/dev/vid*': No such file or directory

$ grep '' /sys/module/usbcore/parameters/*
/sys/module/usbcore/parameters/authorized_default:-1
/sys/module/usbcore/parameters/autosuspend:2
/sys/module/usbcore/parameters/blinkenlights:N
/sys/module/usbcore/parameters/initial_descriptor_timeout:5000
/sys/module/usbcore/parameters/nousb:N
/sys/module/usbcore/parameters/old_scheme_first:Y <-
/sys/module/usbcore/parameters/quirks:
/sys/module/usbcore/parameters/usbfs_memory_mb:16
/sys/module/usbcore/parameters/usbfs_snoop:N
/sys/module/usbcore/parameters/usbfs_snoop_max:65536
/sys/module/usbcore/parameters/use_both_schemes:Y

$ journalctl --no-hostname -k | egrep 'Command line|usb 1-1.3'
Apr 15 15:16:39 kernel: Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.15-200.fc31.x86_64 root=UUID=01ea3428-c96d-4f4c-af30-2072ce724031 ro rhgb quiet elevator=noop usbcore.old_scheme_first=1 LANG=en_US.UTF-8 mitigations=off
Apr 15 15:16:39 kernel: usb 1-1.3: new high-speed USB device number 4 using ehci-pci
Apr 15 15:16:39 kernel: usb 1-1.3: config 247 has too many interfaces: 120, using maximum allowed: 32
Apr 15 15:16:39 kernel: usb 1-1.3: config 247 descriptor has 1 excess byte, ignoring
Apr 15 15:16:39 kernel: usb 1-1.3: config 247 has 0 interfaces, different from the descriptor's value: 120
Apr 15 15:16:39 kernel: usb 1-1.3: New USB device found, idVendor=05ca, idProduct=18c0, bcdDevice= 7.32
Apr 15 15:16:39 kernel: usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Apr 15 15:16:39 kernel: usb 1-1.3: Product: USB2.0 Camera
Apr 15 15:16:39 kernel: usb 1-1.3: Manufacturer: Ricoh Company Ltd.
Apr 15 15:16:39 kernel: usb 1-1.3: can't set config #247, error -32

>MATE has an option to "Automatically remember running applications when logging out". (Under System:Preferences:Personal)

I set
System -> Preferences -> Startup Applications -> [Options tab] -> [x] Automatically remember running applications when logging out
and then I left a bunch of xterms running in workspaces.
When I clicked 'Shut Down...' on the menu and then the 'Shut Down' button in the dialog, it said that I had applications running. I said to shut down anyway, and then it waited for a while and shut down, and when it restarted, it didn't restart the xterms.
Something I did once a long time ago, maybe testing out switching users, restarted the xterms but placed them all on workspace 1, so I've been cautious about that.
I launch the xterms from an application launcher on the Mate panel that runs a shell script that runs xterm with a long list of flags.

Comment 169 Steve 2020-04-15 17:02:02 UTC

>/sys/module/usbcore/parameters/old_scheme_first:Y <-

Thanks for running that test. That all looks as expected. So, based on your tests and on my tests, setting old_scheme_first=1 has no effect. That's very disappointing.

I guess we will have to wait for Alan to ask for another test ...

>.... it said that I had applications running. I said to shutdown anyway, and then it waited for a while and shutdown, and when it restarted, it didn't restart the xterms.

Try it a second time. When I first tried the feature, some apps did not restart, but on subsequent reboots they did restart. There may be an initialization bug in MATE.

I also noticed that the behavior depends on the app and the state of the app (whether there are unsaved changes in Pluma, for example).

>I launch the xterms from an application launcher on the Mate panel that runs a shell script that runs xterm with a long list of flags.

Good idea. I do something like that to run various shell scripts, such as one that displays the output from "cal -3" in an xterm:

You might also be able to create ".desktop" files. There are examples in ".config/mate-session/saved-session/".

Comment 170 Steve 2020-04-27 14:37:38 UTC

For the record, Alan's revert* is in v5.7-rc3:

Merge tag 'usb-5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.7-rc3&id=e9a61afb69f07b1c5880984d45e5cc232ec1bf6f

A Fedora build is estimated to be completed Mon, 27 Apr 2020 18:40:23 UTC:

Information for build kernel-5.7.0-0.rc3.1.fc33
https://koji.fedoraproject.org/koji/buildinfo?buildID=1498816

* https://bugzilla.kernel.org/show_bug.cgi?id=207219#c9

Comment 171 William Bader 2020-05-01 17:30:41 UTC

Thanks.

>Information for build kernel-5.7.0-0.rc3.1.fc33

Can I use a Fedora 33 kernel with Fedora 31?

I booted with 5.6.8-200.fc31.x86_64 this morning, and it found the webcam. It works maybe 1 out of 10 times.

>MATE has an option to "Automatically remember running applications when logging out". (Under System:Preferences:Personal)
>However, it is somewhat inconsistent about what is and what is not restored. And the behavior seems to change between the first restart and later restarts.

I left that set, but it does not seem to work for my xterm windows. Maybe it works only for applications that do some kind of Mate-specific registration through dbus.

Comment 172 Steve 2020-05-02 01:21:56 UTC

(In reply to William Bader from comment #171)
> Thanks.
> 
> >Information for build kernel-5.7.0-0.rc3.1.fc33
> 
> Can I use a Fedora 33 kernel with Fedora 31?

Yes. It's not any different than building your own kernel and running it. In fact, I have just completed a build of 4.19.119, which is a "longterm" kernel (https://www.kernel.org/). It runs fine, if you start with a Fedora config file.

Starting with the kernel's default config file is very informative, but basically it becomes an exercise in debugging the config. With an early version, after suspending, my laptop would resume briefly and then suspend again. Worse, after I powered it off, it wouldn't power on again. I finally removed the battery and recovered. That's as close to bricking my laptop as I want to get. :-) The fix was to enable certain kernel features in the config, because certain system processes require that kernel functionality.

> I booted with 5.6.8-200.fc31.x86_64 this morning, and it found the webcam.
> It works maybe 1 out of 10 times.

That's not too reliable.

> >MATE has an option to "Automatically remember running applications when logging out". (Under System:Preferences:Personal)
> >However, it is somewhat inconsistent about what is and what is not restored. And the behavior seems to change between the first restart and later restarts.
> 
> I left that set, but it does not seem to work for my xterm windows. Maybe it works only for applications that do some kind of Mate-specific registration through dbus.

The xterm package contains xterm.desktop, which is all that I thought was required. This might give some clues about what the problem is:

Desktop Application Autostart Specification
https://specifications.freedesktop.org/autostart-spec/autostart-spec-latest.html

Comment 173 William Bader 2020-05-02 02:54:31 UTC

I installed kernel-5.7.0-0.rc3.1.fc33 and the webcam worked.
What is strange is that I have been using 5.6.8, and it has been showing the webcam.
uname says
Linux laptop37 5.6.8-200.fc31.x86_64 #1 SMP Wed Apr 29 19:10:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
which is after the Apr 27 date of the change to 5.7 but it doesn't look like it was backported into 5.6.8 https://koji.fedoraproject.org/koji/buildinfo?buildID=1499538

I eventually managed to produce a log that captures a bad initialization. https://bugzilla.kernel.org/show_bug.cgi?id=207219#c20

Comment 174 Steve 2020-05-02 05:21:00 UTC

(In reply to William Bader from comment #173)
> I installed kernel-5.7.0-0.rc3.1.fc33 and the webcam worked.

Congratulations. That took a little over a month from the time you filed your bug report.

> What is strange is that I have been using 5.6.8, and it has been showing the webcam.
> uname says
> Linux laptop37 5.6.8-200.fc31.x86_64 #1 SMP Wed Apr 29 19:10:01 UTC 2020
> x86_64 x86_64 x86_64 GNU/Linux
> which is after the Apr 27 date of the change to 5.7 but it doesn't look like it was backported into 5.6.8
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1499538

Alan's revert* is not in 5.6.8, so there must be something else affecting the way the camera is initialized:

$ git describe 3155f4f40811c5d7e3c686215051acf504e05565
fatal: 3155f4f40811c5d7e3c686215051acf504e05565 is neither a commit nor blob

$ git log --oneline --no-decorate -n1
63c3d49741 Linux 5.6.8

> I eventually managed to produce a log that captures a bad initialization.
> https://bugzilla.kernel.org/show_bug.cgi?id=207219#c20

Thanks for that link. Alan's analysis is very interesting, and it shows the power of tracing as a debugging tool.

However, I would bet that if we had access to the firmware, we could explain that seemingly random data being sent by the camera:

"09027602 78f7e4ff 029e5f02 4675e490 b197f0a3 f07b017a b0790012 28ef7404"

* USB: hub: Revert commit bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3155f4f40811c5d7e3c686215051acf504e05565

Comment 175 William Bader 2020-05-02 07:16:24 UTC

I started browsing https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.6.8 and the revert is there:

commit f8092b0e021762ab73656d6ec87a6c9e90aff4f4
Author: Alan Stern <stern.edu>
Date:   Wed Apr 22 16:13:08 2020 -0400
    USB: hub: Revert commit bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices")
    commit 3155f4f40811c5d7e3c686215051acf504e05565 upstream.

So that explains the mystery.
  
>I would bet that if we had access to the firmware, we could explain that seemingly random data being sent by the camera:

Is there any way to look at it? (Without risking bricking my laptop...)
I remember in MSDOS days, you could do some interesting things reading and writing to some ports.

The mate session issue looks like it has been happening for years:
https://github.com/mate-desktop/mate-session-manager/issues/42
I wrote two of my own MATE panel apps: a mail checker and a screen brightness controller (because for many versions of Fedora the one that came with MATE didn't work). Working with MATE is a pain because the source is divided between a lot of repositories that you have to build in the right order. For the last few versions of Fedora, I've been able to rebuild my panel apps by installing mate-panel-devel, which is easier than before.
Something I've been meaning to look at for years is changing GDM so it remembers that I always select MATE.
Every once in a while I forget to select it, and then when I close the login form, the screen blanks for longer than usual, and I worry that something went wrong, but it was only GNOME starting instead of MATE.

Comment 176 Steve 2020-05-02 12:23:17 UTC

> I started browsing https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.6.8 and the revert is there:

Thanks for checking that.

> commit f8092b0e021762ab73656d6ec87a6c9e90aff4f4
...
>     commit 3155f4f40811c5d7e3c686215051acf504e05565 upstream.

The commit ID changed -- I wasn't expecting that. Now I'm not sure how to reliably check that a commit has been merged from mainline into stable.

$ git log --oneline --no-decorate -n1
63c3d49741 Linux 5.6.8

$ git log --format=fuller --grep 'USB: hub: Revert commit bd0e6c9614b9'
commit f8092b0e021762ab73656d6ec87a6c9e90aff4f4
Author:     Alan Stern <...>
AuthorDate: Wed Apr 22 16:13:08 2020 -0400
Commit:     Greg Kroah-Hartman <...>
CommitDate: Wed Apr 29 16:34:45 2020 +0200

    USB: hub: Revert commit bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices")
    
    commit 3155f4f40811c5d7e3c686215051acf504e05565 upstream.
...

Evidently, mainline is considered "upstream" from stable. This is going to require some mental adjustments. :-)

Comment 177 Steve 2020-05-02 13:53:47 UTC

(In reply to William Bader from comment #171)
> I booted with 5.6.8-200.fc31.x86_64 this morning, and it found the webcam. It works maybe 1 out of 10 times.

Now that we know that Alan's revert is in 5.6.8, we can ask why it doesn't seem to be working reliably.

Can you get another usbmon trace that Alan could look at?

Comment 178 Steve 2020-05-02 14:01:16 UTC

(In reply to William Bader from comment #173)
> I installed kernel-5.7.0-0.rc3.1.fc33 and the webcam worked.

How reliable is the camera initialization with 5.7.0-0.rc3?

Comment 179 Steve 2020-05-02 16:14:02 UTC

(In reply to Steve from comment #176)
> This is going to require some mental adjustments. :-)

Here are the "adjustments": :-)

The dates on the mainline and stable commits differ, which causes the commit IDs to differ:*

mainline: CommitDate: Thu Apr 23 15:22:41 2020 +0200
stable:   CommitDate: Wed Apr 29 16:34:45 2020 +0200

Most of the stable repo commits include the upstream commit ID, so it is possible to search for the mainline commit ID in the stable repo:

(The ID in the command below is the mainline commit ID; Comment 176 shows the matching stable commit ID.)
$ git log --oneline --grep 3155f4f40811c5d7e3c686215051acf504e05565
f8092b0e02 USB: hub: Revert commit bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices")

However, there are a few commits in the stable repo with a line like this:
    [ no upstream commit ]
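
The search can also be turned around: given a stable commit message, print the mainline ID it names. The following is only a sketch. The function name upstream_id is hypothetical, and it assumes the two trailer forms "commit <ID> upstream." (as in Comment 176) and "[ Upstream commit <ID> ]" (another form found in the stable repo). A commit marked "[ no upstream commit ]" prints nothing.

```shell
# Hypothetical helper: read a stable commit message on stdin and print
# the 40-character mainline commit ID it refers to, if any.
upstream_id() {
    sed -n -E \
        -e 's/^[[:space:]]*commit ([0-9a-f]{40}) upstream\.?[[:space:]]*$/\1/p' \
        -e 's/^[[:space:]]*\[ Upstream commit ([0-9a-f]{40}) \][[:space:]]*$/\1/p'
}

# Commit message copied from Comment 176.
msg='USB: hub: Revert commit bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices")

    commit 3155f4f40811c5d7e3c686215051acf504e05565 upstream.'

printf '%s\n' "$msg" | upstream_id
# prints 3155f4f40811c5d7e3c686215051acf504e05565
```

In a checkout of the stable repo, "git log -1 --format=%B f8092b0e02 | upstream_id" should print the same ID.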

* The commit IDs are just SHA-1 hashes of the commit object, prefixed with a short, NUL-terminated header ("commit <size>"):
https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
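
A minimal sketch of that calculation in shell, assuming sha1sum and mktemp are available. The function names git_object_id and make_commit are hypothetical, and the commit text is made up for illustration (it is not the real kernel commit). Only the author date and the two CommitDate values are taken from the commits above, converted to epoch seconds. It shows that two commits differing only in the committer date get different IDs.

```shell
# Hypothetical helper: print the Git object ID of the data read from stdin.
# $1 is the object type (blob, commit, tree or tag).
git_object_id() {
    goi_tmp=$(mktemp) || return 1
    cat > "$goi_tmp"
    goi_size=$(wc -c < "$goi_tmp" | tr -d ' ')
    # The hashed data is "<type> <size>", a NUL byte, then the raw object.
    { printf '%s %s\0' "$1" "$goi_size"; cat "$goi_tmp"; } | sha1sum | cut -d' ' -f1
    rm -f "$goi_tmp"
}

# Made-up commit object; $1 is the committer timestamp in epoch seconds.
make_commit() {
    # 4b825dc6... is the ID of the empty tree.
    printf 'tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n'
    printf 'author Example Author <author@example.com> 1587586388 -0400\n'
    printf 'committer Example Committer <committer@example.com> %s +0200\n' "$1"
    printf '\nExample commit message\n'
}

# 1587648161 is Thu Apr 23 15:22:41 2020 +0200 (mainline CommitDate).
mainline_id=$(make_commit 1587648161 | git_object_id commit)
# 1588170885 is Wed Apr 29 16:34:45 2020 +0200 (stable CommitDate).
stable_id=$(make_commit 1588170885 | git_object_id commit)

echo "mainline-style ID: $mainline_id"
echo "stable-style ID:   $stable_id"
```

For a blob, the result can be checked against git itself: "printf 'hello world\n' | git_object_id blob" prints the same ID as "printf 'hello world\n' | git hash-object --stdin".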

Comment 180 Steve 2020-05-02 17:25:59 UTC

(In reply to William Bader from comment #175)
...
> >I would bet that if we had access to the firmware, we could explain that seemingly random data being sent by the camera:
> 
> Is there any way to look at it? (Without risking bricking my laptop...)
> I remember in MSDOS days, you could do some interesting things reading and writing to some ports.
...

With a USB device, there is a protocol:

Universal Serial Bus Device Class Specification for Device Firmware Upgrade
Version 1.1
Aug 5, 2004
https://usb.org/sites/default/files/DFU_1.1.pdf

A web search for "usb developers kit" found several companies that provide such kits.

> The mate session issue looks like it has been happening for years:
> https://github.com/mate-desktop/mate-session-manager/issues/42

Thanks for that link. It sounds like you are going to have to fix it yourself. :-)

Since you seem to be mainly interested in getting xterm sessions reliably restored, fixing that problem might be easier than trying to solve the problem generally, which could involve getting all possible apps to support session-saving.

There is an "xterm" component in BZ, so you could try opening a bug there and seeing what the maintainer says:
https://bugzilla.redhat.com/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&classification=Fedora&component=xterm

Comment 181 William Bader 2020-05-02 17:35:23 UTC

>Now that we know that Alan's revert is in 5.6.8, we can ask why it doesn't seem to be working reliably.

I think that it is working reliably.
I have been booting a lot of different kernels.
I didn't realize that 5.6.8 had the fix.
I wasn't looking for the webcam, but every once in a while I noticed it was there, so I thought that it was just random. I think what really happened is that on those occasions I had booted 5.6.8 instead of the stable kernel at the time, which was 5.6.7.
I'll pay more attention to see that 5.6.8 works consistently.
Last night 5.6.8 became the stable kernel, and dnfdragora installed some (but not all) of the related packages.

$ rpm -qa | grep 'kernel.*5.6.7' | sort
kernel-5.6.7-200.fc31.x86_64
kernel-core-5.6.7-200.fc31.x86_64
kernel-debug-devel-5.6.7-200.fc31.x86_64
kernel-devel-5.6.7-200.fc31.x86_64
kernel-headers-5.6.7-200.fc31.x86_64
kernel-modules-5.6.7-200.fc31.x86_64
kernel-tools-libs-5.6.7-200.fc31.x86_64
$ rpm -qa | grep 'kernel.*5.6.8' | sort
kernel-5.6.8-200.fc31.x86_64
kernel-core-5.6.8-200.fc31.x86_64
kernel-debug-devel-5.6.8-200.fc31.x86_64
kernel-devel-5.6.8-200.fc31.x86_64
kernel-modules-5.6.8-200.fc31.x86_64
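
The two lists can be compared mechanically. A small sketch, assuming sed, sort, comm and mktemp are available. The function name pkg_names is hypothetical, and the package lists are copied from the output above. It prints the package names that are installed for 5.6.7 but not yet for 5.6.8.

```shell
# Hypothetical helper: read "name-version-release.arch" lines on stdin and
# print the sorted package names. $1 is the version, for example 5.6.7.
pkg_names() {
    # Escape the dots so that they match literally in the sed pattern.
    pn_ver=$(printf '%s' "$1" | sed 's/\./\\./g')
    sed "s/-${pn_ver}-.*\$//" | sort -u
}

old=$(mktemp)
new=$(mktemp)

pkg_names 5.6.7 > "$old" <<'EOF'
kernel-5.6.7-200.fc31.x86_64
kernel-core-5.6.7-200.fc31.x86_64
kernel-debug-devel-5.6.7-200.fc31.x86_64
kernel-devel-5.6.7-200.fc31.x86_64
kernel-headers-5.6.7-200.fc31.x86_64
kernel-modules-5.6.7-200.fc31.x86_64
kernel-tools-libs-5.6.7-200.fc31.x86_64
EOF

pkg_names 5.6.8 > "$new" <<'EOF'
kernel-5.6.8-200.fc31.x86_64
kernel-core-5.6.8-200.fc31.x86_64
kernel-debug-devel-5.6.8-200.fc31.x86_64
kernel-devel-5.6.8-200.fc31.x86_64
kernel-modules-5.6.8-200.fc31.x86_64
EOF

# Lines that appear only in the first (5.6.7) list.
missing=$(comm -23 "$old" "$new")
printf '%s\n' "$missing"
# prints kernel-headers and kernel-tools-libs, one per line
rm -f "$old" "$new"
```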

>There is an "xterm" component in BZ

Thanks.

Comment 182 Steve 2020-05-02 19:22:42 UTC

There seem to be some 5.6.7 packages still in "updates":

# dnf -q repoquery 'kernel*5.6.7*.x86_64' --repo=updates
kernel-cross-headers-0:5.6.7-200.fc31.x86_64
kernel-headers-0:5.6.7-200.fc31.x86_64
kernel-tools-0:5.6.7-200.fc31.x86_64
kernel-tools-libs-0:5.6.7-200.fc31.x86_64
kernel-tools-libs-devel-0:5.6.7-200.fc31.x86_64

See if this will remove the problematic packages: 

# dnf remove kernel-core-5.6.7-200.fc31 --noautoremove

The "--noautoremove" option tells dnf not to remove dependencies that are no longer needed, which sometimes stops it from removing too many packages.

Comment 183 William Bader 2020-05-02 19:34:10 UTC

>dnf remove kernel-core-5.6.7-200.fc31 --noautoremove

Thanks, that looks like it would work, but to be safe, I am not going to remove them until everything comes in for 5.6.8.

$ sudo dnf remove kernel-core-5.6.7-200.fc31 --noautoremove
Dependencies resolved.
=================================================================================================================================================================================
 Package                                                       Architecture                   Version                                Repository                             Size
=================================================================================================================================================================================
Removing:
 kernel-core                                                   x86_64                         5.6.7-200.fc31                         @updates                               72 M
Removing dependent packages:
 kernel                                                        x86_64                         5.6.7-200.fc31                         @updates                                0  
 kernel-modules                                                x86_64                         5.6.7-200.fc31                         @updates                               28 M
 kmod-VirtualBox-5.6.7-200.fc31.x86_64                         x86_64                         6.1.6-1.fc31                           @@commandline                         783 k

Transaction Summary
=================================================================================================================================================================================
Remove  4 Packages

Freed space: 101 M
Is this ok [y/N]: n
Operation aborted.

Comment 184 Steve 2020-05-02 23:59:57 UTC

(In reply to Steve from comment #165)
... 
> Sometimes I get this error, and sometimes I don't:
> 
> ... kernel: usb 2-1: device not accepting address 2, error -71
> 
> More often I do.

I retested with 5.6.8-100.fc30.x86_64 on my laptop with the USB grub2 boot drive, and never once got that error.

Specifically, I ran several iterations of warm boots (~7) and several iterations of cold boots (~7) and checked the log each time with:

$ journalctl --no-hostname -b -p err

BTW, I used Xfce's saved-session feature to restart one instance of xfce4-terminal every time I logged in. That saved some time, because all I had to do was press the up-arrow in the shell to rerun the journalctl and reboot or poweroff commands from the shell's history.

xfce4-terminal started every time. There was only a minor issue: the terminal window would not restart maximized; it would only restart to the size of the desktop.

Comment 185 William Bader 2020-05-03 06:00:23 UTC

The kernel people didn't think that the USB drive issue was related, but maybe a lot of devices are only tested well under Windows. https://bugzilla.kernel.org/show_bug.cgi?id=207219#c8

Back in the days of SCO Xenix 386, before vendors began shipping large numbers of PCs with Windows NT pre-installed, we had a lot of problems with motherboards and disk controller cards that didn't work dependably with a 32-bit OS. We eventually made lists of good and bad revision numbers so we knew what replacement parts to ask for.

I tried xfce4-terminal. It is close to what I want, but it didn't like the 10x20 font that I use with xterm.

Comment 186 Steve 2020-05-03 06:59:24 UTC

(In reply to William Bader from comment #185)
> The kernel people didn't think that the USB drive issue was related, but maybe a lot of devices are only tested well under Windows.
> https://bugzilla.kernel.org/show_bug.cgi?id=207219#c8

I know that ... and that's why I ran my test. :-)

> Back in the days of SCO Xenix 386, before vendors began shipping large numbers of PCs with Windows NT pre-installed, we had a lot of problems with motherboards and disk controller cards that didn't work dependably with a 32-bit OS. We eventually made lists of good and bad revision numbers so we knew what replacement parts to ask for.

It's completely unpredictable what USB controller you are going to get. I have USB flash drives that say brand X on the package, but lsusb says they are brand Y.

> I tried xfce4-terminal. It is close to what I want, but it didn't like the 10x20 font that I use with xterm.

What font do you use? This looks great, IMO:

$ cat .Xresources 
! .Xresources

XTerm*foreground:   green
XTerm*background:   black
XTerm*faceName:     DejaVu Sans Mono Book
XTerm*faceSize:     12

XTerm*scrollBar:        true
XTerm*rightScrollBar:   true

BTW, I installed F32 Mate in a VM, and it seems to work quite well with the following test:

Tile one mate-terminal and one xterm side-by-side on each of the four workspaces. Reboot.

All are restored in their correct positions, but they aren't quite the right sizes, so there is still a bug that needs to be fixed.

$ rpm -q mate-terminal xterm
mate-terminal-1.24.0-2.fc32.x86_64
xterm-351-2.fc32.x86_64

Comment 187 Steve 2020-05-03 19:40:59 UTC

I tested kernel 5.6.8 with every USB device that I have and found no errors like the one reported in Comment 163.

The test procedure is simple. In a full-screen terminal window, run:

$ journalctl --no-hostname -k -f

Insert and remove whatever USB devices are available, trying both USB2 and USB3 ports.

The only possibly-related anomaly I found was with an ASUS-branded Broadcom USB Bluetooth adapter:

May 03 07:49:26 kernel: usb 3-11: new full-speed USB device number 20 using xhci_hcd
May 03 07:49:42 kernel: usb 3-11: device descriptor read/64, error -110
May 03 07:49:43 kernel: usb 3-11: New USB device found, idVendor=0b05, idProduct=17cb, bcdDevice= 1.12
May 03 07:49:43 kernel: usb 3-11: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 07:49:43 kernel: usb 3-11: Product: BCM20702A0
May 03 07:49:43 kernel: usb 3-11: Manufacturer: Broadcom Corp
May 03 07:49:43 kernel: usb 3-11: SerialNumber: [removed]
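
To make such anomalies easier to spot in a long log, the error lines can be filtered out and the error numbers translated. A rough sketch: the function names usb_errors and sample_log are hypothetical, the errno names are the standard Linux values for the three codes seen in this report (-32, -71, -110), and the sample lines are copied from this bug.

```shell
# Hypothetical helper: read kernel log lines on stdin and print one line
# per USB error in the form "<port> <code> (<errno name>): <message>".
usb_errors() {
    sed -n -E 's/.*kernel: usb ([0-9]+-[0-9.]+): (.*), error (-[0-9]+)$/\1 \3 \2/p' |
    while read -r port code text; do
        case $code in
            -32)  name=EPIPE ;;
            -71)  name=EPROTO ;;
            -110) name=ETIMEDOUT ;;
            *)    name=unknown ;;
        esac
        printf '%s %s (%s): %s\n' "$port" "$code" "$name" "$text"
    done
}

# Sample lines copied from this report.
sample_log() {
cat <<'EOF'
Apr 15 15:16:39 kernel: usb 1-1.3: can't set config #247, error -32
May 03 07:49:42 kernel: usb 3-11: device descriptor read/64, error -110
May 03 07:49:43 kernel: usb 3-11: Product: BCM20702A0
EOF
}

out=$(sample_log | usb_errors)
printf '%s\n' "$out"
# prints:
# 1-1.3 -32 (EPIPE): can't set config #247
# 3-11 -110 (ETIMEDOUT): device descriptor read/64
```

On a live system the input would come from the log instead, for example "journalctl --no-hostname -k | usb_errors".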

==
BTW, the window-sizing problem on session-restore appears to be related to "marco", the MATE Desktop window manager. The relevant directories are here:

$ ls .config/marco/sessions/ .config/mate-session/saved-session/

Removing all session files is a simple way to get back to a default state.

"marco" is the name of a component in BZ:
https://bugzilla.redhat.com/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&classification=Fedora&component=marco

Comment 188 Artur Zaprzała 2020-05-03 21:26:32 UTC

I have a different webcam, but exactly the same problem. Kernel 5.6.8 fixed it. Thank you.

kernel-core-5.6.7-200.fc31.x86_64
May  2 13:44:50 desktop kernel: usb 1-6: new high-speed USB device number 3 using ehci-pci
May  2 13:44:50 desktop kernel: usb 1-6: config 247 has too many interfaces: 120, using maximum allowed: 32
May  2 13:44:50 desktop kernel: usb 1-6: config 247 descriptor has 1 excess byte, ignoring
May  2 13:44:50 desktop kernel: usb 1-6: config 247 has 0 interfaces, different from the descriptor's value: 120
May  2 13:44:50 desktop kernel: usb 1-6: New USB device found, idVendor=041e, idProduct=4087, bcdDevice=10.20
May  2 13:44:50 desktop kernel: usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May  2 13:44:50 desktop kernel: usb 1-6: Product: VF0680Live!CamSocialize HD1080
May  2 13:44:50 desktop kernel: usb 1-6: Manufacturer: Creative Technology Ltd.
May  2 13:44:50 desktop kernel: usb 1-6: can't set config #247, error -32

kernel-core-5.6.8-200.fc31.x86_64
May  3 13:19:47 desktop kernel: usb 1-6: new high-speed USB device number 3 using ehci-pci
May  3 13:19:52 desktop kernel: usb 1-6: device descriptor read/all, error -110
May  3 13:19:52 desktop kernel: usb 1-6: new high-speed USB device number 4 using ehci-pci
May  3 13:19:53 desktop kernel: usb 1-6: New USB device found, idVendor=041e, idProduct=4087, bcdDevice=10.20
May  3 13:19:53 desktop kernel: usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May  3 13:19:53 desktop kernel: usb 1-6: Product: VF0680 Live! Cam Socialize HD 1080
May  3 13:19:53 desktop kernel: usb 1-6: Manufacturer: Creative Technology Ltd.
May  3 13:19:53 desktop kernel: uvcvideo: Found UVC 1.00 device VF0680 Live! Cam Socialize HD 1080 (041e:4087)
May  3 13:19:53 desktop kernel: uvcvideo 1-6:1.0: Entity type for entity Extension 4 was not initialized!
May  3 13:19:53 desktop kernel: uvcvideo 1-6:1.0: Entity type for entity Extension 3 was not initialized!
May  3 13:19:53 desktop kernel: uvcvideo 1-6:1.0: Entity type for entity Processing 2 was not initialized!
May  3 13:19:53 desktop kernel: uvcvideo 1-6:1.0: Entity type for entity Camera 1 was not initialized!
May  3 13:19:53 desktop kernel: input: VF0680 Live! Cam Socialize HD 1 as /devices/pci0000:00/0000:00:12.2/usb1/1-6/1-6:1.0/input/input16
May  3 13:19:53 desktop kernel: usbcore: registered new interface driver uvcvideo
May  3 13:19:53 desktop kernel: USB Video Class driver (1.1.1)
May  3 13:19:53 desktop kernel: usb 1-6: Warning! Unlikely big volume range (=5632), cval->res is probably wrong.
May  3 13:19:53 desktop kernel: usb 1-6: [2] FU [Mic Capture Volume] ch = 2, val = -5632/0/1
May  3 13:19:53 desktop kernel: usbcore: registered new interface driver snd-usb-audio

Comment 189 Ben Cotton 2020-11-03 17:16:08 UTC

This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 31 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 190 Ben Cotton 2020-11-24 17:16:30 UTC

Fedora 31 changed to end-of-life (EOL) status on 2020-11-24. Fedora 31 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
