Monday, November 14, 2011

Abstraction in C++ using I/O Functors

In Abstraction in C Using Source and Sinks, I wrote about how useful I've found it to abstract out I/O interfaces so that I could write software that didn't have to know from where its input was coming or to where its output was going: sockets, standard I/O library FILE pointers, memory buffers, the system log, etc. That C-based project, Concha, was a clean room reimplementation of work I had done for a client back in 2007. What I didn't mention was that it was in turn inspired by work I had done in 2006 for a C++-based project, Desperado. Now that I'm building yet another project, Hayloft, on top of Desperado, I'm reminded how much I like the Desperado I/O abstraction.

The Desperado I/O abstraction defines two interfaces, Input and Output. These interfaces make use of the ability in C++ to overload the parentheses operator to create functors (a.k.a. function objects): objects that can be manipulated through a function call interface. Looking at code, you will think you are looking at function calls. What you are really seeing are instance method calls against an object that overloads the parentheses operator.
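If you haven't run into functors before, here is a minimal sketch of the mechanism. The Counter class is purely hypothetical (it has nothing to do with Desperado); it only shows that several overloads of the parentheses operator can coexist, selected by argument list, and that using one looks exactly like a function call.

```cpp
// Hypothetical example class, not part of Desperado.
class Counter {
public:
    Counter() : count(0) {}
    // No arguments: advance the counter and return the new value.
    int operator() () { return ++count; }
    // One argument: a different overload that sets the counter.
    int operator() (int value) { count = value; return count; }
private:
    int count;
};

// Calls the functor three times and returns the result of the last call.
int demo() {
    Counter next;
    next();         // looks like a function call; really next.operator()()
    next();
    return next();
}
```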

The Input interface defines the following four operations.

int operator() ();

This returns the next character as an integer, or EOF (end of file) to indicate that no more input is available.

int operator() (int ch);

This pushes a single character back into the input stream. (This vastly simplifies the implementation of many common parsing algorithms that require one-token look ahead.) The interface guarantees that at least one character can be pushed back. It need not be the most recent character read. How the push back is actually done is up to the underlying implementation; buffering it inside the object in the derived class implementation is acceptable. If the push back is successful, the character is returned, otherwise EOF is returned.

ssize_t operator() (char * buffer, size_t size);

This returns a line terminated by a newline character (\n) or EOF. The optional size parameter allows the caller to place a limit on the number of characters returned. The result is guaranteed to be NUL terminated as long as the buffer is at least one byte in length. The actual number of characters input is returned, or EOF if none.

ssize_t operator() (void * buffer, size_t minimum, size_t maximum);

This is typically used for unformatted data. The minimum parameter indicates the minimum number of characters to be returned. The implementation blocks until that many characters are available or EOF is reached. The maximum parameter indicates the maximum number of characters to be returned if it can be done without blocking. Specifying a minimum of zero is a common way to implement non-blocking I/O using polling. Specifying the same value for minimum and maximum simply blocks for a fixed amount of data. Using the value one for minimum results in something similar to the POSIX read system call. The actual number of characters input is returned, or EOF if none.
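To make the minimum and maximum semantics concrete, here is a rough sketch of how such an operation might be built on top of a POSIX file descriptor. This is not Desperado's code: readBlock is a hypothetical free function, using poll with a zero timeout is just one possible way of asking "can I read more without blocking?", and details like EINTR handling are ignored.

```cpp
#include <cstddef>      // size_t
#include <cstdio>       // EOF
#include <poll.h>       // poll (POSIX)
#include <sys/types.h>  // ssize_t (POSIX)
#include <unistd.h>     // read, write, pipe, close (POSIX)

// Hypothetical helper: blocks until at least minimum bytes have been read,
// then keeps reading up to maximum only while data is immediately available.
ssize_t readBlock(int fd, void * buffer, size_t minimum, size_t maximum) {
    char * here = static_cast<char *>(buffer);
    size_t total = 0;
    while (total < maximum) {
        if (total >= minimum) {
            // Minimum satisfied: continue only if a read would not block.
            struct pollfd pfd;
            pfd.fd = fd;
            pfd.events = POLLIN;
            pfd.revents = 0;
            if (poll(&pfd, 1, 0) <= 0) { break; }
            if ((pfd.revents & POLLIN) == 0) { break; }
        }
        ssize_t rc = read(fd, here + total, maximum - total);
        if (rc <= 0) { break; }  // end of file or error
        total += static_cast<size_t>(rc);
    }
    return (total > 0) ? static_cast<ssize_t>(total) : EOF;
}
```

With a minimum of zero this sketch never blocks, and with a minimum of one it behaves much like a plain read.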

That's it. No open or close: those are the job of either the caller, or of the implementation's constructor and destructor. The Input base class isn't pure: it actually implements all of these operators, returning EOF for all operations. That makes the base class the equivalent of /dev/null.

Since all implementations derive from the Input base class, you can pass a pointer or reference to any implementation to a function expecting an Input pointer or reference and it will read its data from that object without having any idea what the actual underlying data source is. Desperado implements a variety of derived classes, such as DescriptorInput (its constructor takes a file descriptor as an argument), FileInput (a FILE pointer), BufferInput (a read/write memory buffer), DataInput (a read-only memory buffer), and PathInput (a path name in the file system). All of these derived classes implement the full Input interface.
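Here's a sketch of what that looks like in practice. None of this is the actual Desperado source: the Input class below is my cut-down reconstruction from the description above (with no default arguments), StringInput is a hypothetical derived class that reads from a C string, and number is a hypothetical parser that uses the push back functor for one-character look ahead.

```cpp
#include <cctype>       // std::isdigit
#include <cstddef>      // size_t
#include <cstdio>       // EOF
#include <sys/types.h>  // ssize_t (POSIX)

// Sketch of a do-nothing base class in the spirit of Input.
class Input {
public:
    virtual ~Input() {}
    virtual int operator() () { return EOF; }       // next character
    virtual int operator() (int) { return EOF; }    // push back
    virtual ssize_t operator() (char * buffer, size_t size) {
        if (size > 0) { buffer[0] = '\0'; }         // always NUL terminate
        return EOF;
    }
    virtual ssize_t operator() (void *, size_t, size_t) { return EOF; }
};

// Hypothetical derived class: reads characters from a C string.
class StringInput : public Input {
public:
    explicit StringInput(const char * s) : next(s), pushed(EOF) {}
    using Input::operator();  // keep the inherited overloads visible
    virtual int operator() () {
        if (pushed != EOF) { int ch = pushed; pushed = EOF; return ch; }
        if (*next == '\0') { return EOF; }
        return static_cast<unsigned char>(*next++);
    }
    virtual int operator() (int ch) {
        // Only one character of push back is buffered in this sketch.
        if ((ch == EOF) || (pushed != EOF)) { return EOF; }
        pushed = ch;
        return ch;
    }
private:
    const char * next;
    int pushed;
};

// Parses an unsigned decimal number from any Input; returns -1 if the
// input does not start with a digit.
long number(Input & input) {
    long value = -1;
    int ch;
    while ((ch = input()) != EOF) {
        if (!std::isdigit(ch)) { input(ch); break; }  // not ours: push it back
        value = ((value < 0) ? 0 : (value * 10)) + (ch - '0');
    }
    return value;
}
```

The number function never learns what kind of Input it was handed. Given the base class itself, it simply sees EOF right away.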

The Hayloft project leverages this in its Parameter class. Parameter takes an Input reference or pointer, uses the character input functor, and reads a parameter value into a C++ std::string that is an instance variable named parameter. It has no idea where this value is coming from: a file, a socket, a memory location, what have you. Here's the complete implementation of the method in Parameter that does this.

(I apologize in advance for any violence done to this and other code snippets by the Blogger formatter - which truly sucks at code examples even when editing raw HTML - and for any typos I make when transcribing this from actual working source code.)

void Parameter::source(Input & input, size_t maximum) {
    int ch;
    while (maximum > 0) {
        ch = input();
        if ((ch == EOF) || (ch == '\0') || (ch == '\n')) { break; }
        parameter += ch;
        --maximum;
    }
}

You can see the Input functor called on the fourth line: the input object reference is used just as if it were a function.

The Output interface defines the following five operations.

int operator() (int c);

A single character in an integer is emitted. The character is returned or EOF if unsuccessful.

ssize_t operator() (const char * s, size_t size);

A NUL terminated string is emitted. The optional size parameter places a limit on the number of characters emitted. The actual number of characters emitted is returned or EOF if none.

ssize_t operator() (const char * format, va_list ap);

A variable length argument list is emitted according to the printf-style format string. The actual number of characters emitted is returned or EOF if none.

ssize_t operator() (const void * buffer, size_t minimum, size_t maximum);

At least a minimum and no more than a maximum number of bytes are emitted. As before, more than the minimum is emitted if it can be done without blocking. The actual number of characters emitted is returned or EOF if none.

int operator() ();

Any data buffered in the underlying implementation are flushed to the output stream. A non-negative number is returned for success, EOF for failure.

Similar to the Input interface, the Output base class is not pure: it implements all of these functors, each of which throws the data away and returns success. The Output base class is also the equivalent of /dev/null.
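As a sketch only, a base class like that might look something like the following. The return values I chose for "success" are my assumptions, not Desperado's documented behavior, and StringOutput and greet are hypothetical names invented for this example.

```cpp
#include <cstdarg>      // va_list
#include <cstddef>      // size_t
#include <cstdio>       // EOF
#include <cstring>      // std::strlen
#include <string>
#include <sys/types.h>  // ssize_t (POSIX)

// Sketch of a base class in the spirit of Output: discards everything.
class Output {
public:
    virtual ~Output() {}
    virtual int operator() (int c) { return c; }
    virtual ssize_t operator() (const char * s, size_t size) {
        size_t length = std::strlen(s);
        return static_cast<ssize_t>((length < size) ? length : size);
    }
    virtual ssize_t operator() (const char *, va_list) { return 0; }
    virtual ssize_t operator() (const void *, size_t, size_t maximum) {
        return static_cast<ssize_t>(maximum);
    }
    virtual int operator() () { return 0; }  // flush: nothing to do
};

// Hypothetical derived class: collects characters and strings in memory.
class StringOutput : public Output {
public:
    using Output::operator();  // keep the inherited overloads visible
    virtual int operator() (int c) {
        text += static_cast<char>(c);
        return c;
    }
    virtual ssize_t operator() (const char * s, size_t size) {
        size_t length = std::strlen(s);
        if (length > size) { length = size; }
        text.append(s, length);
        return static_cast<ssize_t>(length);
    }
    const std::string & str() const { return text; }
private:
    std::string text;
};

// Emits a short greeting to whatever Output it is given.
ssize_t greet(Output & output) {
    ssize_t n = output("hello, world", 5);  // at most five characters
    output('!');
    output();                               // flush
    return n;
}
```

Handing greet a StringOutput leaves "hello!" in the string; handing it the base class throws the same data away.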

As you might expect, Desperado implements a variety of derived classes: DescriptorOutput, FileOutput, BufferOutput, PathOutput, and SyslogOutput (which writes all of its output to the system log).

The Desperado Print class makes effective use of Output functors. Its constructor takes a single Output reference as its only parameter.

Print::Print(Output & output);

Print defines its own functor which takes a variable length argument list. Here is its entire implementation.

ssize_t Print::operator() (const char * format, ...) {
    va_list ap;
    ssize_t rc;
    va_start(ap, format);
    rc = output(format, ap);
    va_end(ap);
    return rc;
}

You can see the output reference that was passed to the constructor used just like a function call on the fifth line.

In an application, you can now write code like this, which I do all the time. (Warning: head explosion may be imminent.)

extern Output * myoutputp;
Print printf(*myoutputp);

printf("An error occurred!\n");
printf("errno=%d\n", errno);

This illustrates one of the greatest strengths and greatest weaknesses of C++: if you encountered this while reading code, you would likely assume that the printf was the standard I/O function you know and love. But it isn't at all; it's a Print object. And that Print object has no idea where its output is going. Depending on the actual type of its constructor argument, it could be a socket, a file, or even a memory buffer.
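To show the memory buffer case end to end, here is a self-contained sketch. The Output class is a cut-down stand-in with only the formatted functor, StringOutput is a hypothetical implementation that formats into a std::string through a fixed-size scratch buffer (my assumption, not how Desperado does it), and Print follows the shape shown above.

```cpp
#include <cstdarg>      // va_list, va_start, va_end
#include <cstddef>      // size_t
#include <cstdio>       // std::vsnprintf, EOF
#include <string>
#include <sys/types.h>  // ssize_t (POSIX)

// Cut-down stand-in for Output: only the formatted functor.
class Output {
public:
    virtual ~Output() {}
    virtual ssize_t operator() (const char *, va_list) { return 0; }
};

// Hypothetical Output that appends formatted text to a std::string.
class StringOutput : public Output {
public:
    virtual ssize_t operator() (const char * format, va_list ap) {
        char buffer[256];  // scratch space; longer output is truncated
        int rc = std::vsnprintf(buffer, sizeof(buffer), format, ap);
        if (rc < 0) { return EOF; }
        if (rc >= static_cast<int>(sizeof(buffer))) {
            rc = static_cast<int>(sizeof(buffer)) - 1;
        }
        text.append(buffer, static_cast<size_t>(rc));
        return rc;
    }
    std::string text;
};

// Same shape as the Print class described in the text.
class Print {
public:
    explicit Print(Output & out) : output(out) {}
    ssize_t operator() (const char * format, ...) {
        va_list ap;
        va_start(ap, format);
        ssize_t rc = output(format, ap);
        va_end(ap);
        return rc;
    }
private:
    Output & output;
};

// Looks like standard I/O, but the text lands in a string.
std::string report(int code) {
    StringOutput sink;
    Print printf(sink);  // the local object hides the library function
    printf("An error occurred!\n");
    printf("errno=%d\n", code);
    return sink.text;
}
```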

It is possible to have a class that offers both an Input and an Output interface. Hayloft does this for its Packet class. Packet implements an infinite (as long as memory holds out anyway) bi-directional memory buffer. Although it offers specialized methods to prepend and append data, it also exposes both an Input and Output interface so that a Packet can be used as a data sink (by passing its Output interface) or as a data source (through its Input interface). A Packet can be used in this manner as a ring buffer.

This pattern is so common that Desperado defines an InputOutput interface for such implementations. This interface defines two methods that each return a reference to the appropriate interface.

Input & input();

Output & output();

This allows you to do things like

extern Packet * mypacketp;
Print printf(mypacketp->output());

to collect printed output into a Packet.
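Here is a sketch of how a class might expose both sides. Everything in it is an assumption made for illustration: Input and Output are cut-down stand-ins with only the single-character functors, I've made the two InputOutput methods pure virtual, and Fifo is a hypothetical unbounded byte queue, far simpler than Packet.

```cpp
#include <cstddef>  // size_t
#include <cstdio>   // EOF
#include <deque>

// Cut-down stand-ins: only the single-character functors.
class Input {
public:
    virtual ~Input() {}
    virtual int operator() () { return EOF; }
};

class Output {
public:
    virtual ~Output() {}
    virtual int operator() (int c) { return c; }
};

// The two-method interface; pure virtual here by assumption.
class InputOutput {
public:
    virtual ~InputOutput() {}
    virtual Input & input() = 0;
    virtual Output & output() = 0;
};

// Hypothetical implementation: an unbounded first-in first-out byte queue.
class Fifo : public InputOutput {
public:
    Fifo() : source(data), sink(data) {}
    virtual Input & input() { return source; }
    virtual Output & output() { return sink; }
private:
    typedef std::deque<unsigned char> Queue;
    class Source : public Input {
    public:
        explicit Source(Queue & q) : queue(q) {}
        virtual int operator() () {
            if (queue.empty()) { return EOF; }
            int ch = queue.front();
            queue.pop_front();
            return ch;
        }
    private:
        Queue & queue;
    };
    class Sink : public Output {
    public:
        explicit Sink(Queue & q) : queue(q) {}
        virtual int operator() (int c) {
            queue.push_back(static_cast<unsigned char>(c));
            return c;
        }
    private:
        Queue & queue;
    };
    Queue data;     // declared first: source and sink refer to it
    Source source;
    Sink sink;
};

// Moves everything available from an Input to an Output; returns the count.
size_t transfer(Input & from, Output & to) {
    size_t n = 0;
    int ch;
    while ((ch = from()) != EOF) { to(ch); ++n; }
    return n;
}
```

The transfer function can drain one Fifo into another without knowing that either end is a Fifo.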

I/O functors illustrate one of those capabilities that makes C++ both extremely powerful and often hard to understand, because it allows one to use C++ not just as a programming language but as a meta-language, effectively creating a new domain specific language with its own operations, one that just happens to look vaguely like C++. It also means that while reading C++ code, especially in large code bases, it can be very difficult to make any assumptions about what is going on without both a broad and a deep understanding of both C++ and the underlying code.