Bug 99839

Summary: Valgrind warnings when running test-bus
Product: dbus Reporter: Philip Withnall <bugzilla>
Component: coreAssignee: D-Bus Maintainers <dbus>
Status: RESOLVED FIXED QA Contact: D-Bus Maintainers <dbus>
Severity: minor    
Priority: medium CC: bugzilla
Version: git masterKeywords: patch
Hardware: All   
OS: All   
Whiteboard: review+
i915 platform: i915 features:
Attachments: dbus: Fix writing off the end of an fd_set when testing with Valgrind
dbus: Fix writing off the end of an fd_set when testing with Valgrind

Description Philip Withnall 2017-02-16 15:07:31 UTC
On git master (1c390c0 with the patch from bug #99793 applied), if I run:

# Stuff from AM_TEST_ENVIRONMENT:
export XDG_RUNTIME_DIR=/opt/gnome/build/dbus/test/XDG_RUNTIME_DIR
export DBUS_FATAL_WARNINGS=1
export DBUS_TEST_DAEMON=/opt/gnome/build/dbus/bus/dbus-daemon
export DBUS_TEST_DBUS_LAUNCH=/opt/gnome/build/dbus/tools/dbus-launch
export DBUS_TEST_DBUS_MONITOR=/opt/gnome/build/dbus/tools/dbus-monitor
export DBUS_TEST_DBUS_SEND=/opt/gnome/build/dbus/tools/dbus-send
export DBUS_TEST_DATA=/opt/gnome/build/dbus/test/data
export DBUS_TEST_HOMEDIR=/opt/gnome/build/dbus/dbus
libtool exec valgrind --tool=memcheck --leak-check=full --leak-resolution=high $builddir/dbus/bus/test-bus

I get:

==28587== Command: /opt/gnome/build/dbus/bus/.libs/lt-test-bus
==28587== 
==28587== Warning: invalid file descriptor 1024 in syscall close()
==28587== Warning: invalid file descriptor 1025 in syscall close()
==28587== Warning: invalid file descriptor 1026 in syscall close()
==28587== Warning: invalid file descriptor 1027 in syscall close()
==28587==    Use --log-fd=<number> to select an alternative log fd.
==28587== Warning: invalid file descriptor 1028 in syscall close()
==28587== Warning: invalid file descriptor 1029 in syscall close()
==28587== Warning: invalid file descriptor 1030 in syscall close()
==28587== Invalid read of size 8
==28587==    at 0x47097B: _dbus_check_fdleaks_enter (dbus-message-util.c:185)
==28587==    by 0x46C4F8: test_pre_hook (test-main.c:76)
==28587==    by 0x46C786: main (test-main.c:129)
==28587==  Address 0x74b0800 is 0 bytes after a block of size 128 alloc'd
==28587==    at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==28587==    by 0x47076A: _dbus_check_fdleaks_enter (dbus-message-util.c:155)
==28587==    by 0x46C4F8: test_pre_hook (test-main.c:76)
==28587==    by 0x46C786: main (test-main.c:129)
…
/opt/gnome/build/dbus/bus/.libs/lt-test-bus: Running expire list test
/opt/gnome/build/dbus/bus/.libs/lt-test-bus: checking for memleaks
==28587== Invalid read of size 8
==28587==    at 0x470C12: _dbus_check_fdleaks_leave (dbus-message-util.c:230)
==28587==    by 0x46C5AA: test_post_hook (test-main.c:88)
==28587==    by 0x46C806: main (test-main.c:133)
==28587==  Address 0x74b0800 is 0 bytes after a block of size 128 alloc'd
==28587==    at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==28587==    by 0x47076A: _dbus_check_fdleaks_enter (dbus-message-util.c:155)
==28587==    by 0x46C4F8: test_pre_hook (test-main.c:76)
==28587==    by 0x46C786: main (test-main.c:129)
…
/opt/gnome/build/dbus/bus/.libs/lt-test-bus: Running SHA1 connection test
==28587== Syscall param epoll_ctl(event) points to uninitialised byte(s)
==28587==    at 0x58F751A: epoll_ctl (syscall-template.S:84)
==28587==    by 0x48E70B: socket_set_epoll_add (dbus-socket-set-epoll.c:160)
==28587==    by 0x46D93C: _dbus_socket_set_add (dbus-socket-set.h:79)
==28587==    by 0x46E85A: _dbus_loop_add_watch (dbus-mainloop.c:296)
==28587==    by 0x418C2A: add_server_watch (bus.c:116)
==28587==    by 0x4ED9A8B: _dbus_watch_list_set_functions (dbus-watch.c:321)
==28587==    by 0x4EC1E0B: dbus_server_set_watch_functions (dbus-server.c:953)
==28587==    by 0x4191D4: setup_server (bus.c:242)
==28587==    by 0x419FC9: process_config_first_time_only (bus.c:501)
==28587==    by 0x41B315: bus_context_new (bus.c:839)
==28587==    by 0x46B0AC: bus_context_new_test (test.c:295)
==28587==    by 0x448322: bus_dispatch_sha1_test (dispatch.c:5057)
==28587==  Address 0xffeffed88 is on thread 1's stack
==28587==  in frame #1, created by socket_set_epoll_add (dbus-socket-set-epoll.c:142)
==28587== 
==28587== Syscall param epoll_ctl(event) points to uninitialised byte(s)
==28587==    at 0x58F751A: epoll_ctl (syscall-template.S:84)
==28587==    by 0x48E8EE: socket_set_epoll_enable (dbus-socket-set-epoll.c:202)
==28587==    by 0x46D9E1: _dbus_socket_set_enable (dbus-socket-set.h:94)
==28587==    by 0x46E600: refresh_watches_for_fd (dbus-mainloop.c:266)
==28587==    by 0x46E8DA: _dbus_loop_add_watch (dbus-mainloop.c:307)
==28587==    by 0x42D1D9: add_connection_watch (connection.c:357)
==28587==    by 0x4ED9A8B: _dbus_watch_list_set_functions (dbus-watch.c:321)
==28587==    by 0x4E81BC7: dbus_connection_set_watch_functions (dbus-connection.c:4963)
==28587==    by 0x42EA55: bus_connections_setup_connection (connection.c:788)
==28587==    by 0x418E6A: new_connection_callback (bus.c:174)
==28587==    by 0x4EC3E01: _dbus_transport_debug_pipe_new (dbus-server-debug-pipe.c:312)
==28587==    by 0x4EC416F: _dbus_transport_open_debug_pipe (dbus-server-debug-pipe.c:408)
==28587==  Address 0xffeffed68 is on thread 1's stack
==28587==  in frame #1, created by socket_set_epoll_enable (dbus-socket-set-epoll.c:194)
==28587== 
==28587== Syscall param epoll_ctl(event) points to uninitialised byte(s)
==28587==    at 0x58F751A: epoll_ctl (syscall-template.S:84)
==28587==    by 0x48EABE: socket_set_epoll_disable (dbus-socket-set-epoll.c:257)
==28587==    by 0x46DA30: _dbus_socket_set_disable (dbus-socket-set.h:101)
==28587==    by 0x46E63B: refresh_watches_for_fd (dbus-mainloop.c:268)
==28587==    by 0x46E96D: _dbus_loop_toggle_watch (dbus-mainloop.c:319)
==28587==    by 0x42D2BA: toggle_connection_watch (connection.c:375)
==28587==    by 0x4EDA111: _dbus_watch_list_toggle_watch (dbus-watch.c:460)
==28587==    by 0x4E72F3E: protected_change_watch (dbus-connection.c:722)
==28587==    by 0x4E730D1: _dbus_connection_toggle_watch_unlocked (dbus-connection.c:787)
==28587==    by 0x4ED4CF5: check_read_watch (dbus-transport-socket.c:228)
==28587==    by 0x4ED5C0A: do_authentication (dbus-transport-socket.c:488)
==28587==    by 0x4ED76AE: socket_handle_watch (dbus-transport-socket.c:974)
==28587==  Address 0xffeffebb8 is on thread 1's stack
==28587==  in frame #1, created by socket_set_epoll_disable (dbus-socket-set-epoll.c:233)

These errors are repeated a few times.

---

smcv suggested the high FDs are Valgrind magic:

% valgrind dash
$ ls /proc/$$/fd
0  1  10  1024	1025  1026  1027  1028	1029  1030  2

So we can probably just add code to ignore them. We should check FD_SETSIZE before using FD_SET in any case.

IIRC the epoll problems are false positives, but we might as well zero-initialise the epoll event struct anyway.

---

Other things to do (this serves as a reminder for me for when I get back to this bug):
 - Try the tests under other Valgrind tools (helgrind, drd, etc.)
 - Try other tests under memcheck
Comment 1 Simon McVittie 2017-02-16 15:08:54 UTC
(In reply to Philip Withnall from comment #0)
> IIRC the epoll problems are false positives, but we might as well
> zero-initialise the epoll event struct anyway.

I think that's the one remaining patch on Bug #88808. I should finish that.
Comment 2 Simon McVittie 2017-02-16 15:15:15 UTC
(In reply to Philip Withnall from comment #0)
> ==28587== Warning: invalid file descriptor 1024 in syscall close()

That's the fix for Bug #83899 (see also Bug #73689) being confused by the cognitive dissonance between valgrind (which pretends we only have low fds) and the kernel as accessed directly via /proc (which knows we also have Valgrind's internal fds).

We could just #include <dbus/dbus-valgrind-internal.h>, and ignore fds > 1024 in _dbus_close_all() when RUNNING_ON_VALGRIND is nonzero.

> ==28587== Invalid read of size 8
> ==28587==    at 0x47097B: _dbus_check_fdleaks_enter (dbus-message-util.c:185)
> ==28587==    by 0x46C4F8: test_pre_hook (test-main.c:76)
> ==28587==    by 0x46C786: main (test-main.c:129)
> ==28587==  Address 0x74b0800 is 0 bytes after a block of size 128 alloc'd

Similarly, this is Valgrind's special high fds causing us to overrun the fd_set. We could just ignore fds above the bounds of a fd_set - we never deliberately allocate non-contiguous fds anyway.
Comment 3 Philip Withnall 2017-02-22 10:59:09 UTC
Created attachment 129822 [details] [review]
dbus: Fix writing off the end of an fd_set when testing with Valgrind

If the test-bus test is run under Valgrind, its code to detect FD leaks
accidentally writes off the end of the fd_set it uses, as Valgrind opens
some high FDs (≥1024) for internal use.

Detect if it’s running under Valgrind and ignore those FDs; and
otherwise assert that the FDs are within the size of an fd_set.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 4 Philip Withnall 2017-02-22 11:01:12 UTC
(In reply to Philip Withnall from comment #3)
> Created attachment 129822 [details] [review] [review]
> dbus: Fix writing off the end of an fd_set when testing with Valgrind
> 
> If the test-bus test is run under Valgrind, its code to detect FD leaks
> accidentally writes off the end of the fd_set it uses, as Valgrind opens
> some high FDs (≥1024) for internal use.
> 
> Detect if it’s running under Valgrind and ignore those FDs; and
> otherwise assert that the FDs are within the size of an fd_set.
> 
> Signed-off-by: Philip Withnall <withnall@endlessm.com>

This fixes the Valgrind FD problems. For the epoll_event problems, I’ve attached a patch in bug #88808.

I briefly ran a few other tests (test-corrupt, for example) in Valgrind and found some other leaks, which look like they’re in the tests themselves. I’ll try and find some time to actually investigate and fix them at some point.
Comment 5 Simon McVittie 2017-02-22 15:14:04 UTC
Comment on attachment 129822 [details] [review]
dbus: Fix writing off the end of an fd_set when testing with Valgrind

Review of attachment 129822 [details] [review]:
-----------------------------------------------------------------

::: dbus/dbus-message-util.c
@@ +183,4 @@
>            if (fd == dirfd (d))
>              continue;
>  
> +#ifdef WITH_VALGRIND

There shouldn't be any need for this #ifdef, dbus-valgrind-internal.h unconditionally defines the various macros (for instance RUNNING_ON_VALGRIND expands to 0).

I think I'd prefer "fd >= 1024" or whatever specific offset valgrind starts at.

If you do need the #ifdef for whatever reason, shouldn't the assertion be outside it?
Comment 6 Philip Withnall 2017-02-22 16:08:38 UTC
Comment on attachment 129822 [details] [review]
dbus: Fix writing off the end of an fd_set when testing with Valgrind

Review of attachment 129822 [details] [review]:
-----------------------------------------------------------------

::: dbus/dbus-message-util.c
@@ +183,4 @@
>            if (fd == dirfd (d))
>              continue;
>  
> +#ifdef WITH_VALGRIND

The #ifdef is needed for the (probably quite common) case where dbus-daemon is compiled without --with-valgrind. --with-valgrind=no is the default. In that case, RUNNING_ON_VALGRIND is 0 even when running under Valgrind, and the assertion causes the test to fail.

My thinking was that:
 • The ultimate aim here is to add `_dbus_assert (fd < FD_SETSIZE)`, since otherwise we write off the end of the fd_set.
 • That works whenever we’re not running under Valgrind. Under Valgrind, we have high-numbered FDs and it aborts with a false positive.
 • RUNNING_ON_VALGRIND is not 100% reliable at telling us we’re under Valgrind, due to needing to be compiled with --with-valgrind.
 => The assertion has to only be enabled when we have WITH_VALGRIND.
Comment 7 Philip Withnall 2017-02-22 16:23:42 UTC
Comment on attachment 129822 [details] [review]
dbus: Fix writing off the end of an fd_set when testing with Valgrind

Review of attachment 129822 [details] [review]:
-----------------------------------------------------------------

::: dbus/dbus-message-util.c
@@ +183,4 @@
>            if (fd == dirfd (d))
>              continue;
>  
> +#ifdef WITH_VALGRIND

> I think I'd prefer "fd >= 1024" or whatever specific offset valgrind starts at.

Fair point. I’ll have to check what they use — there is a chance it’s FD_SETSIZE.

I can’t immediately find anything relevant in the Valgrind code base or its installed headers. :-\
Comment 8 Simon McVittie 2017-02-24 18:41:51 UTC
(In reply to Philip Withnall from comment #6)
> My thinking was that:
>  • The ultimate aim here is to add `_dbus_assert (fd < FD_SETSIZE)`, since
> otherwise we write off the end of the fd_set.
>  • That works whenever we’re not running under Valgrind. Under Valgrind, we
> have high-numbered FDs and it aborts with a false positive.

Well, if we write off the end of the fd_set, that's always a bug regardless.

Since this is just best-effort stuff for detecting whether we leaked any fds, how about this?

* If fd < FD_SETSIZE, we track it

* If fd >= FD_SET_SIZE, we emit
  _dbus_verbose ("fd %d unexpectedly large, cannot track whether it is leaked\n", fd);
  or something, and do not track it

This can all be unconditional; it doesn't need to test RUNNING_ON_VALGRIND.

(Rationale: we know we don't actually allocate thousands of fds, if only because on many OSs and configurations we'd run out of fd quota if we tried; we don't go to any specific effort to open fds of large magnitude; and POSIX guarantees that things like open() and socket() will always allocate the lowest available fd. So if we leak anything, realistically it's going to be < 100.)
Comment 9 Simon McVittie 2017-02-24 19:46:41 UTC
(In reply to Philip Withnall from comment #0)
> ==28587== Warning: invalid file descriptor 1024 in syscall close()
> ==28587== Warning: invalid file descriptor 1025 in syscall close()
> ==28587== Warning: invalid file descriptor 1026 in syscall close()
> ==28587== Warning: invalid file descriptor 1027 in syscall close()
> ==28587==    Use --log-fd=<number> to select an alternative log fd.
> ==28587== Warning: invalid file descriptor 1028 in syscall close()
> ==28587== Warning: invalid file descriptor 1029 in syscall close()
> ==28587== Warning: invalid file descriptor 1030 in syscall close()

If you want to address these warnings, *this* is where we'd need to use RUNNING_ON_VALGRIND, I think.

The log message led me via https://codesearch.debian.net/search?q=package%3Avalgrind+invalid+file+descriptor&page=1 to https://codesearch.debian.net/search?q=package%3Avalgrind+soft_limit and from there to https://sources.debian.net/src/valgrind/1:3.12.0-1.1/coregrind/m_main.c/?hl=1515#L1515 which indicates that valgrind adjusts the soft fd limit up a bit, then puts its own reserved fds just beyond that. But it seems like maybe overkill to have a block that gets the rlimit specifically so that we can skip closing fds beyond it...
Comment 10 Philip Withnall 2017-03-02 10:18:48 UTC
Created attachment 130024 [details] [review]
dbus: Fix writing off the end of an fd_set when testing with Valgrind

If the test-bus test is run under Valgrind, its code to detect FD leaks
accidentally writes off the end of the fd_set it uses, as Valgrind opens
some high FDs (≥1024) for internal use.

Ignore those FDs. Realistically, they are never going to be leaks — in
order to have a false negative from omitting this check, D-Bus would
have to allocate and not leak all the FDs up to FD_SETSIZE, and then
leak the first FD over that which it allocated. D-Bus never allocates
anywhere near that number of FDs concurrently.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
Comment 11 Simon McVittie 2017-04-07 11:55:38 UTC
Fixed in git for 1.11.12, thanks
Comment 12 Simon McVittie 2017-04-07 11:56:54 UTC
(In reply to Simon McVittie from comment #9)
> (In reply to Philip Withnall from comment #0)
> > ==28587== Warning: invalid file descriptor 1024 in syscall close()
> > ==28587== Warning: invalid file descriptor 1025 in syscall close()
> > ==28587== Warning: invalid file descriptor 1026 in syscall close()
> > ==28587== Warning: invalid file descriptor 1027 in syscall close()
> > ==28587==    Use --log-fd=<number> to select an alternative log fd.
> > ==28587== Warning: invalid file descriptor 1028 in syscall close()
> > ==28587== Warning: invalid file descriptor 1029 in syscall close()
> > ==28587== Warning: invalid file descriptor 1030 in syscall close()
> 
> If you want to address these warnings, *this* is where we'd need to use
> RUNNING_ON_VALGRIND, I think.

I'm assuming you don't want to bother with that. If you do, please reopen, or open another bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.