Bug 20048

Summary: XtAppPending can block uninterruptably.
Product: xorg Reporter: Stephen Turnbull <stephenjturnbull>
Component: Lib/XtAssignee: Jeremy Huddleston Sequoia <jeremyhu>
Status: RESOLVED MOVED QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: medium CC: cosmos, jeremyhu
Version: unspecified   
Hardware: x86 (IA32)   
OS: Mac OS X (All)   
Whiteboard: 2011BRB_Reviewed
i915 platform: i915 features:
Attachments:
Description Flags
Patch to handle POLLNVAL condition as an exception.
none
Patch to prevent reentering the WaitLoop if block==False. none

Description Stephen Turnbull 2009-02-11 00:11:00 UTC
Sorry about the version; according to MacPorts it's libXt-1.0.5_0.  I don't know what release of X.org that corresponds to.

Mac OS X 10.5.6 (Intel Core Duo MacBook Pro) and Mac OS X 10.4.11 + security update 2008-008 (iBook G4).  This seems to have started with recent updates to Mac OS X (10.5.5 or 10.5.6 for Leopard, and security update 2008-007 or so for Tiger).  However, I've been frequently updating X.org with MacPorts as well, so it might be recent X.org versions rather than any change in Mac OS X.

In XEmacs current head version, if I start "xemacs -vanilla" (ie, no user configuration), xemacs will initialize itself and start its toplevel read-eval-redisplay loop.  This eventually settles into a loop which calls XtAppPending, which calls _XtWaitForSomething, which calls IOWait, which is a wrapper for poll(2) in the MacPorts configuration of libXt.

This works fine (normal editing etc is possible) until I start an asynchronous process, for example with the Lisp code

(start-process "echo" nil "echo")

The process does get started AFAICT, and XEmacs reports the status return from start-process, then freezes in a tight loop in _XtWaitForSomething:

^Z
Program received signal SIGTSTP, Stopped (user).
0x929b9e0e in poll$UNIX2003 ()
(gdb) next
Single stepping until exit from function poll$UNIX2003, 
which has no line number information.
IoWait (wt=<value temporarily unavailable, due to optimizations>, wf=<value temporarily unavailable, due to optimizations>) at NextEvent.c:353
353	}
(gdb) 
_XtWaitForSomething (app=0xe111a0, ignoreEvents=0, ignoreTimers=<value temporarily unavailable, due to optimizations>, ignoreInputs=<value temporarily unavailable, due to optimizations>, ignoreSignals=1, block=0, drop_lock=0, howlong=0x0) at NextEvent.c:619
619		if (nfds == -1) {
(gdb) next
696	    if (nfds == 0) {
(gdb) 
706	    if (block && howlong != NULL)
(gdb) 
709	    if (ignoreInputs && ignoreEvents) {
(gdb) 
715		FindInputs (app, &wf, nfds,
(gdb) 
** Stepping over inlined function code. **
466	    if (!ignoreInputs) {
(gdb) 
454		for (ii = 0; ii < wf->num_dpys; ii++, fdlp++) {
(gdb) 
466	    if (!ignoreInputs) {
(gdb) 
FindInputs [inlined] () at NextEvent.c:467
467		fdlp = &wf->fdlist[wf->num_dpys];
(gdb) 
FindInputs [inlined] () at NextEvent.c:468
468		for (ii = wf->num_dpys; ii < wf->fdlistlen; ii++, fdlp++) {
(gdb) 
FindInputs [inlined] () at NextEvent.c:467
467		fdlp = &wf->fdlist[wf->num_dpys];
(gdb) 
FindInputs [inlined] () at NextEvent.c:468
468		for (ii = wf->num_dpys; ii < wf->fdlistlen; ii++, fdlp++) {
(gdb) 
FindInputs [inlined] () at NextEvent.c:470
470		    if (fdlp->revents) {
(gdb) 
FindInputs [inlined] () at NextEvent.c:468
468		for (ii = wf->num_dpys; ii < wf->fdlistlen; ii++, fdlp++) {
(gdb) 
FindInputs [inlined] () at NextEvent.c:470
470		    if (fdlp->revents) {
(gdb) 
FindInputs [inlined] () at NextEvent.c:471
471			if (fdlp->revents & (XPOLL_READ|POLLHUP|POLLERR)
(gdb) 
FindInputs [inlined] () at NextEvent.c:477
477			if (fdlp->revents & XPOLL_WRITE)
(gdb) 
482		    if (condition) {
(gdb) 
FindInputs [inlined] () at NextEvent.c:468
468		for (ii = wf->num_dpys; ii < wf->fdlistlen; ii++, fdlp++) {
(gdb) 
719	    if (dpy_no >= 0 || found_input) {
(gdb) 
588		AdjustTimes (app, block, howlong, ignoreTimers, &wt);
(gdb) 
** Stepping over inlined function code. **
590		if (block && app->block_hook_list) {
(gdb) 
608		if (app->rebuild_fdlist)
(gdb) 
612		if (drop_lock) {
(gdb) 
618		nfds = IoWait (&wt, &wf);
(gdb) 

Heh-heh, heh-heh, heh-heh Beavis, let's do that again! :-(

If I enable printing of signals in gdb, I see SIGIO being delivered over and over again, but I can't tell what those are for.  Moving the mouse or clicking or typing in the window should queue up input I suppose, but doesn't break the infloop although gdb reports it's generating SIGIOs.  nfds and wf look like this in the infloop:

(gdb) print nfds
$3 = 1
(gdb) print wf
$4 = {fdlist = 0xeb1b50, stack = 0xbffff170, fdlistlen = 3, num_dpys = 1}
gdb) print wf.fdlist[0]
$5 = {fd = 6, events = 1, revents = 0}
(gdb) print wf.fdlist[1]
$6 = {fd = 4, events = 195, revents = 0}
(gdb) print wf.fdlist[2]
$7 = {fd = 7, events = 195, revents = 32}

revents = 32 is POLLNVAL.

lsof identifies the fds in question as

xemacs  88335 steve    4   PIPE 0x5464f40     16384           
xemacs  88335 steve    6u  unix 0x5a6d198       0t0           ->0xbc15cc0
xemacs  88335 steve    7u   CHR      15,3       0t0  73199364 /dev/ptmx

The PIPE in fd=4  is used internally by XEmacs's own event loop, the unix socket fd=6 is X I assume, and the character special fd=7 is associated with the subprocess, but I don't know the details.

This configuration is just ignored by _XtWaitForSomething, and it just keeps reproducing itself (I don't understand poll, so I can't guess whether the signal is being reraised within every loop or if the state of the fd persists over the loop).  Thus we just skip all the way down to goto WaitLoop ... lather, rinse, repeat ....

XtAppPending promises not to block, so I think this is a pretty serious bug.

See also Bug #13471.
Comment 1 Stephen Turnbull 2009-02-12 07:02:10 UTC
Confirmed with XEmacs 21.4.22 (latest stable release), using the MacPorts libraries (@1.0.5_0).

On Leopard, both with the Mercurial HEAD version of XEmacs 21.5 and with XEmacs 21.4 the problem does not occur with Xquartz project libraries (version 2.3.2.1).
Comment 2 Stephen Turnbull 2009-02-12 07:19:35 UTC
Xquartz project libraries, version 2.4.0-alpha2, also work correctly for both XEmacs 21.5 HEAD and XEmacs 21.4.22.  So it's related to the MacPorts build.
Comment 3 Stephen Turnbull 2009-02-13 22:41:53 UTC
I tried rebuilding libXt with the latest version of NextEvent.c, which rebuilds the fdList every time through the WaitLoop.  This didn't help.

I tried changing the unconditional goto WaitLoop at the end of _XTWaitForSomething() to

    if (block)
        goto WaitLoop;
    else {
#ifdef USE_POLL
        XtStackFree ((XtPointer) wf.fdlist, fdlist);
#endif
        return -1;
    }

This helps in the sense that it now blocks in XtAppProcessEvent, which documents the behavior.  I still get an infloop in _XtWaitForSomething but that is presumably an XEmacs or Mac OS X bug.  I'll take that to appropriate channels, but any advice would be appreciated.

Comment 4 Stephen Turnbull 2009-02-14 01:03:45 UTC
Created attachment 22934 [details] [review]
Patch to handle POLLNVAL condition as an exception.
Comment 5 Stephen Turnbull 2009-02-14 01:05:22 UTC
Created attachment 22935 [details] [review]
Patch to prevent reentering the WaitLoop if block==False.
Comment 6 Stephen Turnbull 2009-02-14 01:07:47 UTC
In light testing, the attached patch patch-NextEvent.c-XPOLL_EXCEPT seems to win for me.  I don't really understand all ramifications, so I can't actually recommend it, and I've only tried it on Mac OS X "Leopard" 10.5.6.

On the other hand, it seems rather safe because it only triggers in the case that an app receives an fd with POLLNVAL set, which would seem to count as an exceptional condition, and is apparently very rare since nobody else is experiencing the infloop.

I also attach a patch to respect block=False in the WaitLoop (patch-NextEvent.c-BLOCK_CORRECTLY), which seem to me to be The RightThang[tm]. for XtAppPending, but again I don't understand the other uses of _XtWaitForSomething so I can't recommend it whole-heartedly.

Comment 7 Jeremy Huddleston Sequoia 2009-02-14 12:19:42 UTC
The reason this works in the xquartz.macosforge.org build is that we actually do:

--- libXt-1.0.5.orig/configure.ac	2008-02-12 19:02:35.000000000 -0800
+++ libXt-1.0.5/configure.ac	2008-02-12 19:02:41.000000000 -0800
@@ -43,8 +43,6 @@ PKG_CHECK_MODULES(XT, sm x11 xproto kbpr
 AC_CHECK_HEADER([alloca.h], AC_DEFINE(INCLUDE_ALLOCA_H, 1, [Define to 1 if Xalloca.h should include <alloca.h>]))
 
 # Map function checks to old Imake #defines
-AC_CHECK_FUNC(poll, AC_DEFINE(USE_POLL,1,
-        [Define to 1 if you have the `poll' function.]))
 AC_CHECK_FUNC(snprintf, AC_DEFINE(USE_SNPRINTF,1,
         [Define to 1 if you have the `snprintf' function.]))


I inherited that, but the previous maintainer said that emacs exhibited a bug when we used poll() in libXt, but neither he nor I had the time to trace it any further.
Comment 8 Stephen Turnbull 2009-02-14 21:11:36 UTC
OK, that doesn't help localize the bug very much, since it completely changes the behavior of the event loop with respect to IO.  

Do you have a pointer to the previous discussion offhand?  I plan to keep looking since the USE_POLL code is much more straightforward, and XtAppPending should never block, even temporarily, but I guess using select() instead of poll() is a reasonable workaround for MacPorts.  Not to mention that something is very likely wrong with the Emacsen, since somehow they're passing around dead fds.

FWIW, I'm on the emacs-devel list and haven't seen anything there; nor have others on the XEmacs lists complained about it, so it's not an urgent problem.
But I would prefer that it not be closed.
Comment 9 Jeremy Huddleston Sequoia 2009-02-18 12:42:14 UTC
Stephen, sorry I don't have a link to that discussion.  As I mentioned, it was something that I inherited and didn't dig deep into (when you inherit 200+ bugs, it's hard to find time to satisfy curiosities when there's not really something "broken").

The patches look good for me... although the *real* problem is figuring out why POLLNVAL is being returned when the fd is still valid.
Comment 10 Ben Byer 2009-02-18 23:14:32 UTC
(In reply to comment #9)
> The patches look good for me... although the *real* problem is figuring out why
> POLLNVAL is being returned when the fd is still valid.

Unfortunately, xnu (the OS X kernel) does not properly support poll() on character devices (in this case, /dev/ptmx).   This is being tracked by Apple bug number 3710161 -- if you Google for that, you'll find a couple other references to it e.g. http://lists.apple.com/archives/darwin-dev/2005/May/msg00220.html

I think that POLLNVAL means "don't poll() me plz".  The real solution for this is for autoconf to automatically detect that we are building for OS X and disable poll().
 

Comment 11 Jeremy Huddleston Sequoia 2009-02-18 23:25:25 UTC
That's curious... so then why is Stephen eventually getting valid data at some point?
Comment 12 Jeremy Huddleston Sequoia 2011-09-24 23:54:10 UTC
Jamey, didn't you say you touched something like this recently?  Is this still a problem?
Comment 13 Jamey Sharp 2011-09-25 10:11:08 UTC
This symptom doesn't appear to be related to libX11 or libxcb, but we did fix a similar-looking bug in libxcb. Here's the patch that we used there:

http://cgit.freedesktop.org/xcb/libxcb/commit/?id=f0565e8f06aadf760a9065a97b8cf5ab9cbd18de

Perhaps that will inspire a solution?
Comment 14 Jeremy Huddleston Sequoia 2011-10-03 09:55:05 UTC
*** Bug 13471 has been marked as a duplicate of this bug. ***
Comment 15 Jeremy Huddleston Sequoia 2012-01-10 00:33:54 UTC
Resolved on git master in the same way as libxcb handled it:

To ssh://git.freedesktop.org/git/xorg/lib/libXt
   0d8ef50..70bb9e2  master -> master
Comment 16 Jeremy Huddleston Sequoia 2012-03-12 18:45:39 UTC
I've reverted this change from libXt as it caused some regressions.
#47203 #47216

Reopening for further investigation, but I think this might not be necessary any more since I believe we should fall through to _XIOError() from _XEventsQueued() from XEventsQueued() from FindInputs() because xcb should be flagging the relevant connection as broken and we check xcb_connection_has_error() in _XEventsQueued.
Comment 17 Musaab 2014-07-08 06:38:22 UTC
Comment on attachment 22935 [details] [review]
Patch to prevent reentering the WaitLoop if block==False.

>--- src/NextEvent.c.orig	2006-07-12 03:05:55.000000000 +0900
>+++ src/NextEvent.c	2009-02-14 17:22:36.000000000 +0900
>@@ -722,7 +722,14 @@
> #endif
> 	return dpy_no;
>     }
>-    goto WaitLoop;
>+    if (block)
>+        goto WaitLoop;
>+    else {
>+#ifdef USE_POLL
>+	XtStackFree ((XtPointer) wf.fdlist, fdlist);
>+#endif
>+        return -1;
>+    }
> }
> 
> #define IeCallProc(ptr) \
Comment 18 Subrata 2017-08-03 12:28:16 UTC
I am also frequently getting same kind of error on a 32 bit Linux X application. When this error happened I forcefully dumped core and see the back trace. Below is the part of back trace.

#11 0xf7381864 in _XIOError () from /lib/libX11.so.6
#12 0xf737ed2f in _XEventsQueued () from /lib/libX11.so.6
#13 0xf736f770 in XEventsQueued () from /lib/libX11.so.6
#14 0xf74a6c26 in _XtWaitForSomething () from /lib/libXt.so.6
#15 0xf74a7eda in XtAppProcessEvent () from /lib/libXt.so.6

Is this error related with this bug ?? Is this issue have some fix till now??
Please help.
Comment 19 Stephen Turnbull 2017-08-15 08:48:33 UTC
Hi, Subrata

I'm the OP, not a developer, but it seems the developers aren't able to respond immediately so I'll tell you what I can.

I'd like to begin by advising you that the wait loop is quite complex, so the partial trace you sent is not enough to confirm any relationship with this bug.  Please add the whole trace you get so that the developers can more accurately diagnose the situation.

Now, in the case that I reported, there were two interacting problems.  The first was that the loop entry test didn't handle all possible conditions, only those that the developers recognized as valid.  This is a general practice in Xorg, apparently, and I was told my patches attached to the issue which validate the condition would not be accepted. The second problem was that due to an incomplete implementation on Apple's part (which didn't get fixed in the 3 years I tracked this bug), an invalid file descriptor was left in the table, and so it was possible to enter the loop because of an invalid condition.  Since nothing in the loop can fix the invalid condition, this led to an infinite loop.  I couldn't find a way to fix or workaround the invalid file descriptor, either, and an alternative configure option to Xt that used select() rather than poll() gave acceptable results so I eventually gave up on this bug.

AFAIK there have never been any bugs in Linux that would cause an invalid file descriptor to be left in the table, or a spurious report of POLLNVAL.  So if there is any relation of your issue to this bug, the most likely cause is in the program that is freezing.  That is, the application is allocating a file descriptor, it doesn't get closed but somehow becomes invalid, and thus causes this behavior.

Of course it is not impossible that a new bug has been introduced into Linux or into Xorg, but I think your best use of time would be to check on the possibility that the application has a bug in its resource cleanup as described above.

It appears that libxcb uses a patch similar in intent to my "handle POLLNVAL" patch above.  Perhaps if you switch to a libxcb-based build of Xorg you'll get better results.  (That's a wild guess since I have no idea what your Xorg configuration looks like.)

Good luck, sorry I can't be more helpful.
Comment 20 GitLab Migration User 2018-08-10 20:01:48 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/lib/libxt/issues/3.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.