Monday, November 24, 2008

Poll set to handle poll, eventports, epoll, kqueue and Windows IO Completion

This blog describes background and implementation of a poll
set to monitor many sockets in one or several receive
threads. The blog is intended as a description but also to
enable some feedback on the design.

I've been spending some time working out all the gory details
of starting and stopping lots of send threads, connection
threads and receive threads over the last few months.

The receive threads will be monitoring a number of socket
connections and as soon as data is available ensure that
the data is received and forwarded to the proper user
thread for execution.

However listening on many sockets is a problem which needs
a scalable solution. Almost every operating system on the
planet has some solution to this problem. The problem is
that they all have different solutions. So I decided to
make a simple interface to those socket monitoring solutions.

First my requirements. I will handle a great number of socket
connections from each thread, I will have the ability to
move socket connections from receive thread to another thread
to dynamically adapt to the usage scenarios. However mostly
the receive thread will wait for events on a set of socket
connections, handle them as they arrive and go back waiting
for more events. On the operating system side I aim at
supporting Linux, OpenSolaris, Mac OS X, FreeBSD and Windows.

One receive thread might be required to listen to socket
connections from many clusters. So there is no real limit
to the number of sockets a receive thread can handle. I
decided however to put a compile time limit in there since
e.g. epoll requires this at create time. This is currently
set to 1024. So if more sockets are needed another receive
thread is needed even if not needed from a performance
point of view.

The implementation aims to cover 5 different implementations.

epoll
-----
epoll interface is a Linux-only interface which uses
epoll_create to create an epoll file descriptor, then
epoll_ctl is used to add/drop file descriptors to the epoll
set. Finally epoll_wait is used to wait on the events to
arrive. Socket connections remain in the epoll set as
long as they are not explicitly removed or closed.

poll
----
Poll is the standard which is there simply to make sure it
works also on older platforms that have none of the other
mechanisms supported. Here there is only one system call,
the poll-call and all the state of the poll set needs to
be taken care of by this implementation.

kqueue
------
kqueue achieves more or less the same thing as epoll, it does
so however with a more complex interface that can support a
lot more things such as polling for completed processes and
i-nodes and so forth. It has a kqueue-method to create the
kqueue file descriptor and then a kevent call which is used
both to add, drop and listen to events on the kqueue socket.
kqueue exists in BSD OS:s such as FreeBSD and Mac OS X.

eventports
----------
eventports is again a very similar implementation to epoll which
has the calls port_create to create the eventport file descriptor.
It has a port_associate call to add a socket to the eventport set.
It has a port_dissociate call to drop a socket from the set.
It has a port_getn call to wait on the events arriving. There is
however a major difference in that after an event arriving in a
port_getn call the socket is removed from the set and has to be
added back. From an implementation point of view this mainly
complicated my design of error handling.

Windows IO Completion
---------------------
I have only skimmed this yet and it differs mainly in being a tad
complex and also in that events from the set can be distributed to
more than one thread. However this feature will not be used in this
design, also I have currently not implemented this yet, I need to
get all the bits together on building on Windows done first.

Implementation
--------------
The implementation is done in C but I still wanted to have
a clear object-oriented interface. To achieve this I
created two header files ic_poll_set.h which declares all
the public parts and ic_poll_set_int.h which defines the
private and the public data structures used. This means that
the internals of the IC_POLL_SET-object is hidden from the
user of this interface.

Here is the public part of the interface (the code is GPL:ed
but it isn't released yet):

Copyright (C) 2008 iClaustron AB, All rights reserved
struct ic_poll_connection
{
int fd;
guint32 index;
void *user_obj;
int ret_code;
};
typedef struct ic_poll_connection IC_POLL_CONNECTION;

struct ic_poll_set;
typedef struct ic_poll_set IC_POLL_SET;
struct ic_poll_operations
{
/*
The poll set implementation isn't multi-thread safe. It's intended to be
used within one thread, the intention is that one can have several
poll sets, but only one per thread. Thus no mutexes are needed to
protect the poll set.

ic_poll_set_add_connection is used to add a socket connection to the
poll set, it requires only the file descriptor and a user object of
any kind. The poll set implementation will ensure that this file
descriptor is checked together with the other file descriptors in
the poll set independent of the implementation in the underlying OS.

ic_poll_set_remove_connection is used to remove the file descriptor
from the poll set.

ic_check_poll_set is the method that goes to check which socket
connections are ready to receive.

ic_get_next_connection is used in a loop where it is called until it
returns NULL after a ic_check_poll_set call, the output from
ic_get_next_connection is prepared already at the time of the
ic_check_poll_set call. ic_get_next_connection will return a
IC_POLL_CONNECTION object. It is possible that ic_check_poll_set
can return without error whereas the IC_POLL_CONNECTION can still
have an error in the ret_code in the object. So it is important to
both check this return code as well as the return code from the
call to ic_check_poll_set (this is due to the implementation using
eventports on Solaris).

ic_free_poll_set is used to free the poll set, it will also if the
implementation so requires close any file descriptor of the poll
set.

ic_is_poll_set_full can be used to check if there is room for more
socket connections in the poll set. The poll set has a limited size
(currently set to 1024) set by a compile time parameter.
*/
int (*ic_poll_set_add_connection) (IC_POLL_SET *poll_set,
int fd,
void *user_obj);
int (*ic_poll_set_remove_connection) (IC_POLL_SET *poll_set,
int fd);
int (*ic_check_poll_set) (IC_POLL_SET *poll_set,
int ms_time);
const IC_POLL_CONNECTION*
(*ic_get_next_connection) (IC_POLL_SET *poll_set);
void (*ic_free_poll_set) (IC_POLL_SET *poll_set);
gboolean (*ic_is_poll_set_full) (IC_POLL_SET *poll_set);
};
typedef struct ic_poll_operations IC_POLL_OPERATIONS;

/* Creates a new poll set */
IC_POLL_SET* ic_create_poll_set();

struct ic_poll_set
{
IC_POLL_OPERATIONS poll_ops;
};

16 comments:

jim said...

why not use libevent?

Olaf van der Spek said...

Why don't you use a library instead of writing your own wrapper?

Unknown said...

Hi! Have you looked at libevent?

http://monkey.org/~provos/libevent/

It's under a BSD license and includes all the various enhanced polling methods for each OS. Might be one less thing to debug and maintain. You could then just have one libevent instance per listening thread and use queues to pass connections around to worker threads (what I've done in the past).

Anonymous said...

On top of the previous replies - we actually include libevent in 6.0 for the new thread pooling functionality..

Mikael Ronstrom said...

Yes, I had looked at libevent,
I did again based on the
comments and noted that it's
likely that I will introduce
more bugs by using it than
using my own interface since
my needs are really simple
and the libevent header file
only for describing the
interface is 2x bigger than my
implementation. Also it still
lacks implementation of
eventports (has /dev/poll instead)
and lacks Windows IO completion
(I don't currently have it either).
Also the risk of libevent containing
bugs in itself isn't totally unlikely.

So I'll continue my current path
mostly based on that my requirements
are simpler but use the libevent
code to recheck my implementation.

Thx for the comments, still interested
though in comments about the interface
as well :)

Mikael Ronstrom said...

Oops, my comment on eventports not
being in libevent was obviously
wrong I noted.

Olaf van der Spek said...

> since my needs are really simple
and the libevent header file
only for describing the
interface is 2x bigger than my
implementation.

The point is?
Code duplication should be avoided.

> Also it still
lacks implementation of
eventports (has /dev/poll instead)
and lacks Windows IO completion
(I don't currently have it either).

If you're using C++, maybe Boost Asio is also an option.

> Also the risk of libevent containing bugs in itself isn't totally unlikely.

And your code is guaranteed to be bug free?

Mikael Ronstrom said...

Was a bit curious how libevent handled
the error case in the call to
port_associate after an event occured.
It didn't, it ignored the return code
from reassociate.

Mikael Ronstrom said...

Code duplication should be avoided
I completely agree on. It's kind of
interesting to have this discussion
since I'm usually at the other end
of the fence.

My code is obviously not certain
to be bug free. My point was simply
that integrating a library means that
you have to fully understand the
library and its code.

I have used glib in my project as a
library to avoid most of the hassles
with various things. I then tried to
use it for some hash table and
discovered that it crashes on memory
allocation failures, not what I wanted.

So my point is twosome, that integrating
a library means a lot of work which can
cause bugs to occur. Also a library
might be programmed with a different
tolerance for errors than you're
prepared to live with.

So in this case I find the best method
to be to keep my implementation (also
because I like to avoid callbacks
as much as possible since they have a
tendency to make the code less readable). However at the same time
reading the code in libevent carefully
to ensure that I take of all error
cases properly.

I might find on the way that this means
that my implementation will be more or
less libevent but then I have no
problem in switching. Writing this
library myself means also that I am
in a much better position to take
full advantage and understand how to
use libevent and even change it
according to my needs if that is
needed.

Mikael Ronstrom said...

Code duplication should be avoided
I completely agree on. It's kind of
interesting to have this discussion
since I'm usually at the other end
of the fence.

My code is obviously not certain
to be bug free. My point was simply
that integrating a library means that
you have to fully understand the
library and its code.

I have used glib in my project as a
library to avoid most of the hassles
with various things. I then tried to
use it for some hash table and
discovered that it crashes on memory
allocation failures, not what I wanted.

So my point is twosome, that integrating
a library means a lot of work which can
cause bugs to occur. Also a library
might be programmed with a different
tolerance for errors than you're
prepared to live with.

So in this case I find the best method
to be to keep my implementation (also
because I like to avoid callbacks
as much as possible since they have a
tendency to make the code less readable). However at the same time
reading the code in libevent carefully
to ensure that I take of all error
cases properly.

I might find on the way that this means
that my implementation will be more or
less libevent but then I have no
problem in switching. Writing this
library myself means also that I am
in a much better position to take
full advantage and understand how to
use libevent and even change it
according to my needs if that is
needed.

Mikael Ronstrom said...

Concluded that my implementation is
essentially a simplified version of the
epoll.c, epoll_sub.c, evport.c, kqueue.c
and poll.c file in the libevent.

Olaf van der Spek said...

I don't think it's necessary to understand the implementation of a library. Understanding the interface should be enough.

I assume you've filed a bug report / feature request about the glib hash tables?

Not integrating a library but writing it yourself probably causes more bugs. But that certainly depends on the quality of the library. ;)

Anonymous said...

Since you are using Glib already, why not use Glib's IO channels?

Mark Callaghan said...

Mikael,

This isn't an issue for NDB yet as there is little development for community NDB, but this is a point over which community and official MySQL will diverge. A lot of the library code used throughout MySQL (check out mysys/*) has no documentation, not even for the interfaces. It is much easier for us to replace such code with open-source libraries.

Anonymous said...

I really appreciate the effort you've put into your blog, and the commitment you've had to writing in it. By the way; I've used the 6.4 MySQL Cluster for a few weeks and find that the user interface is simple and really easy to use.
Thumbs up!

Kalle said...

Ok, so now when you have concluded that your implementation is basically a simplified version, does that mean you will switch?

If you switch, before you switch, let me give you another suggestion.

Why not use libev, even when it uses libevent emulation it is still faster than libevent. Google knows where it exists.

Cheers!