1365661 – Cloud images fail to compose from 20160809.n.0

Bug 1365661 - Cloud images fail to compose from 20160809.n.0

Summary: Cloud images fail to compose from 20160809.n.0

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	distribution
Sub Component:
Version:	25
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Václav Pavlín
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	AcceptedBlocker
Depends On:
Blocks:	F25AlphaBlocker
TreeView+	depends on / blocked

Reported:	2016-08-09 19:07 UTC by Adam Williamson
Modified:	2016-08-23 16:03 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-08-23 16:03:54 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Adam Williamson 2016-08-09 19:07:54 UTC

We lost all Cloud images with the 20160809.n.0 compose. Here's the cloud base x86_64 task:

http://koji.fedoraproject.org/koji/taskinfo?taskID=15185144

This is an automatic Alpha blocker, as per https://fedoraproject.org/wiki/Fedora_Program_Management/ReleaseBlocking/Fedora25 , the cloud base images are release-blocking - https://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers :

"Bugs which entirely prevent the composition of one or more of the release-blocking images required to be built for a currently-pending (pre-)release"

Comment 1 Adam Williamson 2016-08-17 21:19:15 UTC

This still seems to be the case; as far as we can tell, every single cloud image compose attempt for f25 since 20160809.n.0 has failed. All f24 (two-week Atomic) attempts have succeeded. Rawhide failed on 20160809 and 20160810, but succeeded on 20160811 (twice) and 20160812.

We're thinking (or, to be honest, hoping) that this somehow has to do with the issue of unsigned packages in the f25 repos, and that it may work again with the currently-running compose - Fedora-25-20160817.n.1 - which should have the signatures cleaned up. So for now we're just watching that compose and hoping it'll work.

For the record, debugging this is rather hard as imagefactory does not provide us with any useful logs. Basically all the information we have is that anaconda stops at this screen:

https://kojipkgs.fedoraproject.org//work/tasks/5172/15285172/screenshot.ppm

and there's no disk activity for a few minutes. We don't get any of the anaconda logs, so we're really pretty in the dark as to what's going on.

Comment 2 Adam Williamson 2016-08-17 21:58:52 UTC

Nope. 20160817.n.1 failed.

http://koji.fedoraproject.org/koji/taskinfo?taskID=15289597

Comment 3 Adam Williamson 2016-08-18 18:16:52 UTC

So, we got a bit more detailed logging from imagefactory:

https://kojipkgs.fedoraproject.org//work/tasks/5799/15295799/oz-x86_64.log

the anaconda logs are now included, all mushed up together. If you separate 'em out with grep you can make a bit more sense of it. It seems that what happens is anaconda is waiting for udev to settle, and it's not settling. anaconda keeps calling 'udevadm settle --timeout=300', and it keeps hitting the timeout - every five minutes it returns with code 1, and anaconda runs it again. After a while, imagefactory times out due to lack of disk activity (which I guess it uses as an indication that something's gone wrong so it should give up).

It's just a thought, but this *may* be to do with the kernel. Right around the time composes started failing in both F25 and Rawhide, there was a significant kernel version bump, from kernel-4.8.0-0.rc0.git7.1.fc26 and kernel-4.8.0-0.rc0.git3.1.fc25 to kernel-4.8.0-0.rc1.git0.1.fc26 and kernel-4.8.0-0.rc1.git0.1.fc25 - i.e. it went from 'rc0' (which AIUI are early git snapshots) to 'rc1.git0'.

Then F25 was frozen, so that's *still* the kernel that's being used in F25 composes. Meanwhile Rawhide kept getting new kernel builds, including kernel-4.8.0-0.rc1.git3.1.fc26 on 20160811, which is when Rawhide builds started working again.

There was a failed attempt to build the cloud-base images for Rawhide-20160810.n.0 , and a successful attempt for Rawhide-20160811.n.1 (there was no attempt for 20160811.n.0). And looking at the logs:

https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20160810.n.0/logs/x86_64/createrepo-Cloud.x86_64.log
https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20160811.n.1/logs/x86_64/createrepo-Cloud.rpm.x86_64.log

we can see the kernels are indeed different:

20160810.n.0: kernel-4.8.0-0.rc1.git1.1.fc26.x86_64.rpm
20160811.n.1: kernel-4.8.0-0.rc1.git3.1.fc26.x86_64.rpm

so I'm kinda wondering if this would go away on F25 with a newer kernel...

Comment 4 Marian Csontos 2016-08-19 13:47:48 UTC

Any chance these are VMs with cirrus as video? I have seen this backtrace in F25 VMs:

https://repos.fedorapeople.org/thl/kernel-vanilla-fedora/fedora-25/x86_64/dmesg-4.8.0-0.rc1.git0.1.vanilla.knurd.1.fc25.x86_64.log

After switching away from cirrus to vga I have workable F25 installation.

And among other things udev was blocking waiting. This is the last message on the console:

    A start job is running for udev Wai...itialization (...)

And this was there too:

[    1.122327] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[    1.122335] IP: [<ffffffffc017f7d8>] drm_pick_crtcs+0x118/0x290 [drm_kms_helper]
[    1.122336] PGD 0 
[    1.122337] Oops: 0000 [#1] SMP
[    1.122346] Modules linked in: cirrus(+) drm_kms_helper virtio_scsi ttm virtio_net crc32c_intel drm serio_raw ata_generic pata_acpi virtio_pci virtio_ring virtio
[    1.122348] CPU: 1 PID: 269 Comm: systemd-udevd Not tainted 4.8.0-0.rc1.git0.1.vanilla.knurd.1.fc25.x86_64 #1
[    1.122348] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[    1.122349] task: ffff999936091c40 task.stack: ffff999936adc000
[    1.122353] RIP: 0010:[<ffffffffc017f7d8>]  [<ffffffffc017f7d8>] drm_pick_crtcs+0x118/0x290 [drm_kms_helper]
[    1.122354] RSP: 0018:ffff999936adf7d0  EFLAGS: 00010217
[    1.122355] RAX: ffffffffc00d5930 RBX: 0000000000000000 RCX: 0000000000001000
[    1.122355] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9999362d9c00
[    1.122356] RBP: ffff999936adf838 R08: ffff99997fd1c920 R09: ffff99997c97daf0
[    1.122356] R10: ffff99997c97dd80 R11: ffff99997c97d958 R12: ffff999936a94600
[    1.122356] R13: 0000000000001000 R14: ffff9999362d9c00 R15: ffff99997c97d698
[    1.122357] FS:  00007fbefd940640(0000) GS:ffff99997fd00000(0000) knlGS:0000000000000000
[    1.122358] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.122358] CR2: 0000000000000018 CR3: 0000000036013000 CR4: 00000000000006e0
[    1.122361] Stack:
[    1.122362]  ffff99997c97d958 0000000000000000 0000000100000003 ffff99997c97dd80
[    1.122363]  0000000000000000 ffff99997c97d958 0000000000001000 ffff99997c97daf0
[    1.122364]  0000000000000001 ffff999936a94600 ffff999936a94600 ffff99997c97d958
[    1.122365] Call Trace:
[    1.122369]  [<ffffffffc017ff9d>] drm_setup_crtcs+0x39d/0xae0 [drm_kms_helper]
[    1.122373]  [<ffffffffc01809ef>] drm_fb_helper_initial_config+0xcf/0x420 [drm_kms_helper]
[    1.122376]  [<ffffffffc00d6e70>] cirrus_fbdev_init+0x90/0xa0 [cirrus]
[    1.122380]  [<ffffffffc00d64e2>] cirrus_modeset_init+0x192/0x1e0 [cirrus]
[    1.122383]  [<ffffffffc00d730f>] ? cirrus_mm_init+0x11f/0x1a0 [cirrus]
[    1.122384]  [<ffffffffc00d5458>] cirrus_driver_load+0xb8/0x100 [cirrus]
[    1.122400]  [<ffffffffc00f0567>] drm_dev_register+0xa7/0xd0 [drm]
[    1.122406]  [<ffffffffc00f263f>] drm_get_pci_dev+0x8f/0x1d0 [drm]
[    1.122408]  [<ffffffffc00d66a5>] cirrus_pci_probe+0x95/0xb0 [cirrus]
[    1.122413]  [<ffffffffa1432e65>] local_pci_probe+0x45/0xa0
[    1.122415]  [<ffffffffa14342b3>] pci_device_probe+0x103/0x150
[    1.122422]  [<ffffffffa1528513>] driver_probe_device+0x223/0x430
[    1.122423]  [<ffffffffa15287ff>] __driver_attach+0xdf/0xf0
[    1.122424]  [<ffffffffa1528720>] ? driver_probe_device+0x430/0x430
[    1.122425]  [<ffffffffa1525e9c>] bus_for_each_dev+0x6c/0xc0
[    1.122426]  [<ffffffffa1527c1e>] driver_attach+0x1e/0x20
[    1.122427]  [<ffffffffa1527630>] bus_add_driver+0x170/0x270
[    1.122428]  [<ffffffffc0162000>] ? 0xffffffffc0162000
[    1.122429]  [<ffffffffa1529120>] driver_register+0x60/0xe0
[    1.122430]  [<ffffffffc0162000>] ? 0xffffffffc0162000
[    1.122431]  [<ffffffffa14327cc>] __pci_register_driver+0x4c/0x50
[    1.122437]  [<ffffffffc00f2860>] drm_pci_init+0xe0/0x110 [drm]
[    1.122437]  [<ffffffffc0162000>] ? 0xffffffffc0162000
[    1.122439]  [<ffffffffc016203c>] cirrus_init+0x3c/0x1000 [cirrus]
[    1.122441]  [<ffffffffa1002190>] do_one_initcall+0x50/0x180
[    1.122445]  [<ffffffffa12225d9>] ? kmem_cache_alloc_trace+0x159/0x1b0
[    1.122447]  [<ffffffffa11b30c9>] ? do_init_module+0x27/0x1ef
[    1.122447]  [<ffffffffa11b3101>] do_init_module+0x5f/0x1ef
[    1.122451]  [<ffffffffa112a977>] load_module+0x22b7/0x29d0
[    1.122452]  [<ffffffffa1127550>] ? __symbol_put+0x60/0x60
[    1.122453]  [<ffffffffa112b1f6>] SYSC_init_module+0x166/0x180
[    1.122454]  [<ffffffffa112b32e>] SyS_init_module+0xe/0x10
[    1.122455]  [<ffffffffa1003c57>] do_syscall_64+0x67/0x160
[    1.122461]  [<ffffffffa17fa3e1>] entry_SYSCALL64_slow_path+0x25/0x25
[    1.122471] Code: 85 7b 01 00 00 48 8b 00 48 39 c2 75 e3 49 8b 54 24 08 49 8b 86 98 02 00 00 4c 89 f7 4c 89 55 b0 48 8b 92 08 04 00 00 48 8b 40 10 <48> 83 7a 18 00 74 09 48 85 c0 0f 84 52 01 00 00 ff d0 4c 8b 55 
[    1.122475] RIP  [<ffffffffc017f7d8>] drm_pick_crtcs+0x118/0x290 [drm_kms_helper]
[    1.122476]  RSP <ffff999936adf7d0>
[    1.122476] CR2: 0000000000000018
[    1.122477] ---[ end trace 52e0620db783c8e6 ]---

And the `udevadm settle`'s stack is here:

[<ffffffff9c2648b4>] poll_schedule_timeout+0x44/0x70
[<ffffffff9c265f61>] do_sys_poll+0x4a1/0x550
[<ffffffff9c266101>] SyS_poll+0x71/0x130
[<ffffffff9c003c57>] do_syscall_64+0x67/0x160
[<ffffffff9c7ff721>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff

I have not tried to update kernel and switching back to cirrus yet.

Comment 5 Marian Csontos 2016-08-19 14:11:05 UTC

After updating kernel from updates-testing, and switching back to cirrus VM boots and works fine.

Comment 6 Adam Williamson 2016-08-22 17:22:30 UTC

so the kernel theory seems plausible at least, and a new kernel is currently headed to stable:

https://fedorahosted.org/rel-eng/ticket/6463#comment:8

so I'll mark this bug as ON_QA to indicate that we're hopeful the new kernel will fix it, and we'll check the first compose that happens after that stable push.

Comment 7 Adam Williamson 2016-08-23 16:03:54 UTC

Cloud image builds did succeed for the 20160822.n.1 compose. There was an issue with appliance-creator (ARM appliance) images which prevented the compose from fully completing, but I'm pretty sure this bug is now fixed.

Note You need to log in before you can comment on or make changes to this bug.