Bug 1670798 - Unable to start service libvirtd
Summary: Unable to start service libvirtd
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-glance
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: ---
Assignee: Cyril Roelandt
QA Contact: Mike Abrams
Tana
URL:
Whiteboard:
Depends On: 1658652 1673101
Blocks: 1671406
TreeView+ depends on / blocked
 
Reported: 2019-01-30 11:45 UTC by Arie Bregman
Modified: 2019-07-12 12:41 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1671406 (view as bug list)
Environment:
Last Closed: 2019-07-12 12:41:05 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
The differences between a successful build and latest failed build (27.81 KB, text/plain)
2019-02-19 18:06 UTC, Arie Bregman
no flags Details

Description Arie Bregman 2019-01-30 11:45:14 UTC
Description of problem:

Updating an existing image with the following command fails:

openstack overcloud image upload --update-existing

The error:

<requests.packages.urllib3.connection.HTTPConnection object at 0x7feb820cead0>: Failed to establish a new connection: [Errno 111] Connection refused


Version-Release number of selected component (if applicable): Latest OSP 9 (11-29.1)


How reproducible: 100%

Comment 2 Cyril Roelandt 2019-01-30 14:22:54 UTC
Could we see the Glance log (from glance-api.log)? This would help understanding the issue.

Comment 6 Cyril Roelandt 2019-02-05 19:17:31 UTC
Could you both please post your glance-api logs?

Comment 7 Cyril Roelandt 2019-02-06 17:09:38 UTC
So, looking at the log:

2019-01-29 09:43:09.715 7098 ERROR glance InternalError: Unknown OpenSSL error. This error is commonly encountered when another library is not cleaning up the OpenSSL error stack. If you are using cryptography with another library that uses OpenSSL try disabling it before reporting a bug. Otherwise please file an issue at https://github.com/pyca/cryptography/issues with information on how to reproduce this. ([_OpenSSLErrorWithText(code=2148712748L, lib=128, func=300, reason=300, reason_text="error:8012C12C:lib(128):dev_urandom_fd:open('/dev/urandom') failed.")])


On the machine /dev/urandom seems to work as expected:
[root@shark15 ~]# python -c "open('/dev/urandom')"
[root@shark15 ~]#

Did we upgrade openssl, or the cryptography Python package recently? When did this start to fail?

Comment 10 Cyril Roelandt 2019-02-07 17:05:18 UTC
About SYS_getrandom:
[root@shark15 ~]# grep SYS_getrandom /usr/include/sys/syscall.h 
[root@shark15 ~]# grep SYS_getrandom /usr/include/asm/unistd.h 
[root@shark15 ~]# grep SYS_getrandom /usr/include/bits/syscall.h 
# define SYS_getrandom __NR_getrandom


About /dev/urandom:
>>> import os, fcntl
>>> fd = os.open('/dev/urandom', os.O_RDONLY)
>>> os.fstat(fd)
posix.stat_result(st_mode=8630, st_ino=2057, st_dev=5L, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1542849212, st_mtime=1542849212, st_ctime=1542849212)
>>> flags = fcntl.fcntl(fd, fcntl.F_GETFD)
>>> fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
0



Thanks for helping with this :)

Comment 11 Christian Heimes 2019-02-13 11:46:35 UTC
The result looks good. I still wonder why cryptography isn't using SYS_getrandom on your system. Either the syscall is not availabe, blocked, or the cryptography package has been build on a system without SYS_getrandom constant.

Please run the script:
---
from cryptography.hazmat.backends.openssl.backend import backend

print("RAND_bytes")
buf = backend._ffi.new("unsigned char[]", 500)
res = backend._lib.RAND_bytes(buf, 500)
assert res == 1
assert backend._ffi.buffer(buf)[:] != "\x00" * 500
---
with "strace python testscript.py" and show me the last 20 lines or so. It should look like this:


stat("/usr/lib64/python3.7/site-packages/cryptography/hazmat/primitives/kdf/scrypt.py", {st_mode=S_IFREG|0644, st_size=2252, ...}) = 0
stat("/usr/lib64/python3.7/site-packages/cryptography/hazmat/primitives/kdf/scrypt.py", {st_mode=S_IFREG|0644, st_size=2252, ...}) = 0
openat(AT_FDCWD, "/usr/lib64/python3.7/site-packages/cryptography/hazmat/primitives/kdf/__pycache__/scrypt.cpython-37.pyc", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2077, ...}) = 0
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=2077, ...}) = 0
read(3, "B\r\r\n\0\0\0\0\200,q[\314\10\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0"..., 2078) = 2077
read(3, "", 1)                          = 0
close(3)                                = 0
getrandom("\xe1", 1, GRND_NONBLOCK)     = 1
write(1, "RAND_bytes\n", 11RAND_bytes
)            = 11
futex(0x7fb3b5ab08b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
getrandom("\xfa\xaf\xc3\x2c\x93\xa6\x2f\x64\x39\x04\x33\xd8\x63\x3e\xf6\xfd\x51\xe7\xb2\x87\x11\xbb\x1e\xc5\xad\xdc\x52\x52\xa1\x79\xf7\x76"..., 500, GRND_NONBLOCK) = 500
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fb3c330b030}, {sa_handler=0x7fb3c3507310, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fb3c330b030}, 8) = 0
sigaltstack(NULL, {ss_sp=0x5572e8962f70, ss_flags=0, ss_size=8192}) = 0
sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
exit_group(0)                           = ?
+++ exited with 0 +++

The first getrandom call is the code that detects if getrandom is available. The second is the call from the script.

Comment 12 Christian Heimes 2019-02-13 11:53:37 UTC
"dev_urandom_fd:open('/dev/urandom') failed" can also occur when the process is running out of file descriptors. Please check your ulimit and system-wide fd limits.

python-requests supports the builtin ssl module and PyOpenSSL. The library behaves differently when you have PyOpenSSL installed, see RHBZ #1567862, RHBZ #1491508, and https://pagure.io/freeipa/issue/5442 for some background.

Comment 13 Arie Bregman 2019-02-13 12:03:44 UTC
fstat(6, {st_mode=S_IFREG|0644, st_size=2733, ...}) = 0
read(6, "", 4096)                       = 0
close(6)                                = 0
munmap(0x7feff7049000, 4096)            = 0
close(5)                                = 0
brk(NULL)                               = 0x12e5000
brk(0x130c000)                          = 0x130c000
open("/dev/random", O_RDONLY)           = 5
fcntl(5, F_GETFD)                       = 0
fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
poll([{fd=5, events=POLLIN}], 1, -1)    = 1 ([{fd=5, revents=POLLIN}])
close(5)                                = 0
open("/dev/urandom", O_RDONLY)          = 5
fcntl(5, F_GETFD)                       = 0
fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
close(4)                                = 0
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feff7049000
write(1, "RAND_bytes\n", 11RAND_bytes
)            = 11
futex(0x7fefea009d78, FUTEX_WAKE_PRIVATE, 2147483647) = 0
fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
read(5, "F\222\211 \320\36\335\20\241\307\302\360b\337\373\354\261\24\341\345\27h!\r\371\322\331zJV\231\353"..., 500) = 500
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7feff68515d0}, {0x7feff6b74680, [], SA_RESTORER, 0x7feff68515d0}, 8) = 0
close(15)                               = 0
fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
exit_group(0)                           = ?
+++ exited with 0 +++


I checked and it doesn't reach the limit of file descriptors.

Comment 14 Christian Heimes 2019-02-13 12:22:04 UTC
The strace solves the mystery of "dev_urandom_fd:open('/dev/urandom')". There is no syscall to getrandom. That means python-cryptography package was compiled on a machine without SYS_getrandom constant. A simple rebuild might help.

Comment 16 Cyril Roelandt 2019-02-14 16:44:27 UTC
@Arie: do you think we could rebuild the python-cryptography package and try again?

Comment 18 Arie Bregman 2019-02-19 11:08:11 UTC
(In reply to Cyril Roelandt from comment #16)
> @Arie: do you think we could rebuild the python-cryptography package and try
> again?

I tried, nothing changed.

Also, I compared packages from a working build to latest build which fails and both are using the same package: python2-cryptography-1.7.2-2.el7.x86_64

Comment 19 Cyril Roelandt 2019-02-19 17:07:39 UTC
> I tried, nothing changed.

Does the build leave a log somewhere? Do you think we could either:
- use gdb to go step by step in the C code of python-cryptography
- patch python-cryptography and rebuild the package?

Also, on shark15, SYS_getrandom seems to be defined and working as expected:

$ cat test.c
#include <stdio.h>
#include <sys/syscall.h>

#if !defined(__linux__) || !defined(SYS_getrandom)
#error "SYS_getrandom not available"
#endif

int
main(void)
{
        long n;
        char dest[1];
        n = syscall(SYS_getrandom, dest, sizeof(dest), 0);
        if (n == sizeof(dest)) {
                printf("OK\n");
        } else {
                printf("Failure\n");
        }
        return 0;
}

$ ./test
OK



> Also, I compared packages from a working build to latest build which fails and both are using the same package: python2-cryptography-1.7.2-2.el7.x86_64

Are there any interesting differences between these two setups? Could we see a diff of the installed packages?

Comment 20 Arie Bregman 2019-02-19 18:06:48 UTC
Created attachment 1536451 [details]
The differences between a successful build and latest failed build

Comment 21 Arie Bregman 2019-02-19 18:22:59 UTC
(In reply to Cyril Roelandt from comment #19)
> > I tried, nothing changed.
> 
> Does the build leave a log somewhere? Do you think we could either:
> - use gdb to go step by step in the C code of python-cryptography
> - patch python-cryptography and rebuild the package?
> 

We can but do you believe we should? given that there is no difference in python2-cryptography version between the latest failed build and latest successful build

> Also, on shark15, SYS_getrandom seems to be defined and working as expected:
> 
> $ cat test.c
> #include <stdio.h>
> #include <sys/syscall.h>
> 
> #if !defined(__linux__) || !defined(SYS_getrandom)
> #error "SYS_getrandom not available"
> #endif
> 
> int
> main(void)
> {
>         long n;
>         char dest[1];
>         n = syscall(SYS_getrandom, dest, sizeof(dest), 0);
>         if (n == sizeof(dest)) {
>                 printf("OK\n");
>         } else {
>                 printf("Failure\n");
>         }
>         return 0;
> }
> 
> $ ./test
> OK
> 
> 
> 
> > Also, I compared packages from a working build to latest build which fails and both are using the same package: python2-cryptography-1.7.2-2.el7.x86_64
> 
> Are there any interesting differences between these two setups? Could we see
> a diff of the installed packages?

I attached the diff to the bug. No difference in crypto and glance versions.
There is a difference in openssl packages. Since currently, it's the only lead I have, I'll try to deploy with the previous working puddle and update only the ssl packages and see if it makes any difference.

Comment 22 Arie Bregman 2019-02-19 19:52:53 UTC
The major difference between the last successful build and the latest one is the jump in libvirt version from 3.9.0 to 4.5.0

This is causing to current error (please ignore the initial reported issue):

Feb 19 14:12:09 undercloud-0.redhat.local libvirtd[14071]: 2019-02-19 19:12:09.296+0000: 14071: error : virModuleLoadFile:53 : internal error: Failed to load module '/usr/lib64/libvirt/storage-backend/libvirt_storage_backend_rbd.so': /usr/lib64/libvirt/storage-backend/libvirt_storage_backend_rbd.so: undefined symbol: rbd_diff_iterate2

Comment 23 Cyril Roelandt 2019-02-19 20:06:10 UTC
> I attached the diff to the bug. No difference in crypto and glance versions.
There is a difference in openssl packages. Since currently, it's the only lead I have, I'll try to deploy with the previous working puddle and update only the ssl packages and see if it makes any difference.

I would love that.


> The major difference between the last successful build and the latest one is the jump in libvirt version from 3.9.0 to 4.5.0
> This is causing to current error (please ignore the initial reported issue):

Just to be clear, we should forget about python-cryptography, at least for now?


> undefined symbol: rbd_diff_iterate2

This looks like https://bugzilla.redhat.com/show_bug.cgi?id=1316911 . Could you try the suggested solution?

Comment 24 Arie Bregman 2019-02-19 20:18:58 UTC
(In reply to Cyril Roelandt from comment #23)
> > I attached the diff to the bug. No difference in crypto and glance versions.
> There is a difference in openssl packages. Since currently, it's the only
> lead I have, I'll try to deploy with the previous working puddle and update
> only the ssl packages and see if it makes any difference.
> 
> I would love that.
> 
> 
> > The major difference between the last successful build and the latest one is the jump in libvirt version from 3.9.0 to 4.5.0
> > This is causing to current error (please ignore the initial reported issue):
> 
> Just to be clear, we should forget about python-cryptography, at least for
> now?
>
 
Yes, I believe the initial description was misleading as it was taken from one specific build that failed with a completely different failure when compared to another 10 builds I checked today. All of which fail while trying to restart libvirtd service.

> 
> > undefined symbol: rbd_diff_iterate2
> 
> This looks like https://bugzilla.redhat.com/show_bug.cgi?id=1316911 . Could
> you try the suggested solution?

I tried it but it didn't work for me. Quite strange since the solution for both this bug and the RHEL one: https://bugzilla.redhat.com/show_bug.cgi?id=1658652 is to update librbd1 but all the packages I have on the system already updated

[stack@undercloud-0 ~]$ sudo yum update
Loaded plugins: search-disabled-repos
No packages marked for update

Comment 25 Cyril Roelandt 2019-02-20 12:47:18 UTC
I would say this is a libvirt issue now. Should we retarget this?

Comment 26 Arie Bregman 2019-02-20 13:37:27 UTC
(In reply to Cyril Roelandt from comment #25)
> I would say this is a libvirt issue now. Should we retarget this?

Yes, I'll change the details

Comment 27 Arie Bregman 2019-02-20 13:49:31 UTC
I have opened a new bug since there is too much "noise" here regarding the initially reported issue and not the actual one we are experiencing.

Not sure if we want to close this immediately or wait.

Comment 28 Cyril Roelandt 2019-02-20 14:11:38 UTC
Maybe this bug should depend on the new bug you opened. What do you think?

Comment 29 Arie Bregman 2019-02-20 14:35:49 UTC
Sounds right to me :)

Comment 31 Cyril Roelandt 2019-02-28 21:09:58 UTC
@Slawek: This is now a libvirt bug, see the "Depends On" field.

Comment 37 Cyril Roelandt 2019-05-21 17:45:12 UTC
@Paul: I think we need OP to tell us whether this is still an issue :) Otherwise we'll close this bug.

Comment 39 Cyril Roelandt 2019-06-03 11:50:33 UTC
Care to elaborate?

Regarding the depends-on:

- https://bugzilla.redhat.com/show_bug.cgi?id=1658652 has been verified, but I'm not sure whether it is in the new puddle

- https://bugzilla.redhat.com/show_bug.cgi?id=1673101 is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1658652, which has also been verified, but may or may not be in the puddle.

Also, there are errors in that jenkins link you sent, but I see no "OpenSSL error". It might be worth re-stating what the issue really is.

Please keep in mind that OSP9 is going to be unsupported soon.

Comment 41 Cyril Roelandt 2019-06-18 21:25:26 UTC
"Unable to start service libvirtd: Job for libvirtd.service failed because the control process exited with error code. See "systemctl status libvirtd.service" and "journalctl -xe" for details."

You could always run these commands and see what happens, but this is an issue with libvirtd, not with glance, I'm afraid.


Note You need to log in before you can comment on or make changes to this bug.