summaryrefslogtreecommitdiff
path: root/ThirdParty/luasocket/gem/ltn012.tex
diff options
context:
space:
mode:
Diffstat (limited to 'ThirdParty/luasocket/gem/ltn012.tex')
-rw-r--r--ThirdParty/luasocket/gem/ltn012.tex695
1 files changed, 695 insertions, 0 deletions
diff --git a/ThirdParty/luasocket/gem/ltn012.tex b/ThirdParty/luasocket/gem/ltn012.tex
new file mode 100644
index 0000000..8027ecc
--- /dev/null
+++ b/ThirdParty/luasocket/gem/ltn012.tex
@@ -0,0 +1,695 @@
+\documentclass[10pt]{article}
+\usepackage{fancyvrb}
+\usepackage{url}
+\DefineVerbatimEnvironment{lua}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
+\DefineVerbatimEnvironment{C}{Verbatim}{fontsize=\small,commandchars=\@\#\%}
+\DefineVerbatimEnvironment{mime}{Verbatim}{fontsize=\small,commandchars=\$\#\%}
+\newcommand{\stick}[1]{\vbox{\setlength{\parskip}{0pt}#1}}
+\newcommand{\bl}{\ensuremath{\mathtt{\backslash}}}
+\newcommand{\CR}{\texttt{CR}}
+\newcommand{\LF}{\texttt{LF}}
+\newcommand{\CRLF}{\texttt{CR~LF}}
+\newcommand{\nil}{\texttt{nil}}
+
+\title{Filters, sources, sinks, and pumps\\
+ {\large or Functional programming for the rest of us}}
+\author{Diego Nehab}
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+Certain data processing operations can be implemented in the
+form of filters. A filter is a function that can process
+data received in consecutive invocations, returning partial
+results each time it is called. Examples of operations that
+can be implemented as filters include the end-of-line
+normalization for text, Base64 and Quoted-Printable transfer
+content encodings, the breaking of text into lines, SMTP
+dot-stuffing, and there are many others. Filters become
+even more powerful when we allow them to be chained together
+to create composite filters. In this context, filters can be
+seen as the internal links in a chain of data transformations.
+Sources and sinks are the corresponding end points in these
+chains. A source is a function that produces data, chunk by
+chunk, and a sink is a function that takes data, chunk by
+chunk. Finally, pumps are procedures that actively drive
+data from a source to a sink, and indirectly through all
+intervening filters. In this article, we describe the design of an
+elegant interface for filters, sources, sinks, chains, and
+pumps, and we illustrate each step with concrete examples.
+\end{abstract}
+
+\section{Introduction}
+
+Within the realm of networking applications, we are often
+required to apply transformations to streams of data. Examples
+include the end-of-line normalization for text, Base64 and
+Quoted-Printable transfer content encodings, breaking text
+into lines with a maximum number of columns, SMTP
+dot-stuffing, \texttt{gzip} compression, HTTP chunked
+transfer coding, and the list goes on.
+
+Many complex tasks require a combination of two or more such
+transformations, and therefore a general mechanism for
+promoting reuse is desirable. In the process of designing
+\texttt{LuaSocket~2.0}, we repeatedly faced this problem.
+The solution we reached proved to be very general and
+convenient. It is based on the concepts of filters, sources,
+sinks, and pumps, which we introduce below.
+
+\emph{Filters} are functions that can be repeatedly invoked
+with chunks of input, successively returning processed
+chunks of output. Naturally, the result of
+concatenating all the output chunks must be the same as the
+result of applying the filter to the concatenation of all
+input chunks. In fancier language, filters \emph{commute}
+with the concatenation operator. More importantly, filters
+must handle input data correctly no matter how the stream
+has been split into chunks.
+
+A \emph{chain} is a function that transparently combines the
+effect of one or more filters. The interface of a chain is
+indistinguishable from the interface of its component
+filters. This allows a chained filter to be used wherever
+an atomic filter is accepted. In particular, chains can be
+themselves chained to create arbitrarily complex operations.
+
+Filters can be seen as internal nodes in a network through
+which data will flow, potentially being transformed many
+times along the way. Chains connect these nodes together.
+The initial and final nodes of the network are
+\emph{sources} and \emph{sinks}, respectively. Less
+abstractly, a source is a function that produces new chunks
+of data every time it is invoked. Conversely, sinks are
+functions that give a final destination to the chunks of
+data they receive in sucessive calls. Naturally, sources
+and sinks can also be chained with filters to produce
+filtered sources and sinks.
+
+Finally, filters, chains, sources, and sinks are all passive
+entities: they must be repeatedly invoked in order for
+anything to happen. \emph{Pumps} provide the driving force
+that pushes data through the network, from a source to a
+sink, and indirectly through all intervening filters.
+
+In the following sections, we start with a simplified
+interface, which we later refine. The evolution we present
+is not contrived: it recreates the steps we ourselves
+followed as we consolidated our understanding of these
+concepts within our application domain.
+
+\subsection{A simple example}
+
+The end-of-line normalization of text is a good
+example to motivate our initial filter interface.
+Assume we are given text in an unknown end-of-line
+convention (including possibly mixed conventions) out of the
+commonly found Unix (\LF), Mac OS (\CR), and
+DOS (\CRLF) conventions. We would like to be able to
+use the folowing code to normalize the end-of-line markers:
+\begin{quote}
+\begin{lua}
+@stick#
+local CRLF = "\013\010"
+local input = source.chain(source.file(io.stdin), normalize(CRLF))
+local output = sink.file(io.stdout)
+pump.all(input, output)
+%
+\end{lua}
+\end{quote}
+
+This program should read data from the standard input stream
+and normalize the end-of-line markers to the canonic
+\CRLF\ marker, as defined by the MIME standard.
+Finally, the normalized text should be sent to the standard output
+stream. We use a \emph{file source} that produces data from
+standard input, and chain it with a filter that normalizes
+the data. The pump then repeatedly obtains data from the
+source, and passes it to the \emph{file sink}, which sends
+it to the standard output.
+
+In the code above, the \texttt{normalize} \emph{factory} is a
+function that creates our normalization filter, which
+replaces any end-of-line marker with the canonic marker.
+The initial filter interface is
+trivial: a filter function receives a chunk of input data,
+and returns a chunk of processed data. When there are no
+more input data left, the caller notifies the filter by invoking
+it with a \nil\ chunk. The filter responds by returning
+the final chunk of processed data (which could of course be
+the empty string).
+
+Although the interface is extremely simple, the
+implementation is not so obvious. A normalization filter
+respecting this interface needs to keep some kind of context
+between calls. This is because a chunk boundary may lie between
+the \CR\ and \LF\ characters marking the end of a single line. This
+need for contextual storage motivates the use of
+factories: each time the factory is invoked, it returns a
+filter with its own context so that we can have several
+independent filters being used at the same time. For
+efficiency reasons, we must avoid the obvious solution of
+concatenating all the input into the context before
+producing any output chunks.
+
+To that end, we break the implementation into two parts:
+a low-level filter, and a factory of high-level filters. The
+low-level filter is implemented in C and does not maintain
+any context between function calls. The high-level filter
+factory, implemented in Lua, creates and returns a
+high-level filter that maintains whatever context the low-level
+filter needs, but isolates the user from its internal
+details. That way, we take advantage of C's efficiency to
+perform the hard work, and take advantage of Lua's
+simplicity for the bookkeeping.
+
+\subsection{The Lua part of the filter}
+
+Below is the complete implementation of the factory of high-level
+end-of-line normalization filters:
+\begin{quote}
+\begin{lua}
+@stick#
+function filter.cycle(lowlevel, context, extra)
+ return function(chunk)
+ local ret
+ ret, context = lowlevel(context, chunk, extra)
+ return ret
+ end
+end
+%
+
+@stick#
+function normalize(marker)
+ return filter.cycle(eol, 0, marker)
+end
+%
+\end{lua}
+\end{quote}
+
+The \texttt{normalize} factory simply calls a more generic
+factory, the \texttt{cycle}~factory, passing the low-level
+filter~\texttt{eol}. The \texttt{cycle}~factory receives a
+low-level filter, an initial context, and an extra
+parameter, and returns a new high-level filter. Each time
+the high-level filer is passed a new chunk, it invokes the
+low-level filter with the previous context, the new chunk,
+and the extra argument. It is the low-level filter that
+does all the work, producing the chunk of processed data and
+a new context. The high-level filter then replaces its
+internal context, and returns the processed chunk of data to
+the user. Notice that we take advantage of Lua's lexical
+scoping to store the context in a closure between function
+calls.
+
+\subsection{The C part of the filter}
+
+As for the low-level filter, we must first accept
+that there is no perfect solution to the end-of-line marker
+normalization problem. The difficulty comes from an
+inherent ambiguity in the definition of empty lines within
+mixed input. However, the following solution works well for
+any consistent input, as well as for non-empty lines in
+mixed input. It also does a reasonable job with empty lines
+and serves as a good example of how to implement a low-level
+filter.
+
+The idea is to consider both \CR\ and~\LF\ as end-of-line
+\emph{candidates}. We issue a single break if any candidate
+is seen alone, or if it is followed by a different
+candidate. In other words, \CR~\CR~and \LF~\LF\ each issue
+two end-of-line markers, whereas \CR~\LF~and \LF~\CR\ issue
+only one marker each. It is easy to see that this method
+correctly handles the most common end-of-line conventions.
+
+With this in mind, we divide the low-level filter into two
+simple functions. The inner function~\texttt{pushchar} performs the
+normalization itself. It takes each input character in turn,
+deciding what to output and how to modify the context. The
+context tells if the last processed character was an
+end-of-line candidate, and if so, which candidate it was.
+For efficiency, we use Lua's auxiliary library's buffer
+interface:
+\begin{quote}
+\begin{C}
+@stick#
+@#define candidate(c) (c == CR || c == LF)
+static int pushchar(int c, int last, const char *marker,
+ luaL_Buffer *buffer) {
+ if (candidate(c)) {
+ if (candidate(last)) {
+ if (c == last)
+ luaL_addstring(buffer, marker);
+ return 0;
+ } else {
+ luaL_addstring(buffer, marker);
+ return c;
+ }
+ } else {
+ luaL_pushchar(buffer, c);
+ return 0;
+ }
+}
+%
+\end{C}
+\end{quote}
+
+The outer function~\texttt{eol} simply interfaces with Lua.
+It receives the context and input chunk (as well as an
+optional custom end-of-line marker), and returns the
+transformed output chunk and the new context.
+Notice that if the input chunk is \nil, the operation
+is considered to be finished. In that case, the loop will
+not execute a single time and the context is reset to the
+initial state. This allows the filter to be reused many
+times:
+\begin{quote}
+\begin{C}
+@stick#
+static int eol(lua_State *L) {
+ int context = luaL_checkint(L, 1);
+ size_t isize = 0;
+ const char *input = luaL_optlstring(L, 2, NULL, &isize);
+ const char *last = input + isize;
+ const char *marker = luaL_optstring(L, 3, CRLF);
+ luaL_Buffer buffer;
+ luaL_buffinit(L, &buffer);
+ if (!input) {
+ lua_pushnil(L);
+ lua_pushnumber(L, 0);
+ return 2;
+ }
+ while (input < last)
+ context = pushchar(*input++, context, marker, &buffer);
+ luaL_pushresult(&buffer);
+ lua_pushnumber(L, context);
+ return 2;
+}
+%
+\end{C}
+\end{quote}
+
+When designing filters, the challenging part is usually
+deciding what to store in the context. For line breaking, for
+instance, it could be the number of bytes that still fit in the
+current line. For Base64 encoding, it could be a string
+with the bytes that remain after the division of the input
+into 3-byte atoms. The MIME module in the \texttt{LuaSocket}
+distribution has many other examples.
+
+\section{Filter chains}
+
+Chains greatly increase the power of filters. For example,
+according to the standard for Quoted-Printable encoding,
+text should be normalized to a canonic end-of-line marker
+prior to encoding. After encoding, the resulting text must
+be broken into lines of no more than 76 characters, with the
+use of soft line breaks (a line terminated by the \texttt{=}
+sign). To help specifying complex transformations like
+this, we define a chain factory that creates a composite
+filter from one or more filters. A chained filter passes
+data through all its components, and can be used wherever a
+primitive filter is accepted.
+
+The chaining factory is very simple. The auxiliary
+function~\texttt{chainpair} chains two filters together,
+taking special care if the chunk is the last. This is
+because the final \nil\ chunk notification has to be
+pushed through both filters in turn:
+\begin{quote}
+\begin{lua}
+@stick#
+local function chainpair(f1, f2)
+ return function(chunk)
+ local ret = f2(f1(chunk))
+ if chunk then return ret
+ else return ret .. f2() end
+ end
+end
+%
+
+@stick#
+function filter.chain(...)
+ local f = select(1, ...)
+ for i = 2, select('@#', ...) do
+ f = chainpair(f, select(i, ...))
+ end
+ return f
+end
+%
+\end{lua}
+\end{quote}
+
+Thanks to the chain factory, we can
+define the Quoted-Printable conversion as such:
+\begin{quote}
+\begin{lua}
+@stick#
+local qp = filter.chain(normalize(CRLF), encode("quoted-printable"),
+ wrap("quoted-printable"))
+local input = source.chain(source.file(io.stdin), qp)
+local output = sink.file(io.stdout)
+pump.all(input, output)
+%
+\end{lua}
+\end{quote}
+
+\section{Sources, sinks, and pumps}
+
+The filters we introduced so far act as the internal nodes
+in a network of transformations. Information flows from node
+to node (or rather from one filter to the next) and is
+transformed along the way. Chaining filters together is our
+way to connect nodes in this network. As the starting point
+for the network, we need a source node that produces the
+data. In the end of the network, we need a sink node that
+gives a final destination to the data.
+
+\subsection{Sources}
+
+A source returns the next chunk of data each time it is
+invoked. When there is no more data, it simply returns~\nil.
+In the event of an error, the source can inform the
+caller by returning \nil\ followed by the error message.
+
+Below are two simple source factories. The \texttt{empty} source
+returns no data, possibly returning an associated error
+message. The \texttt{file} source yields the contents of a file
+in a chunk by chunk fashion:
+\begin{quote}
+\begin{lua}
+@stick#
+function source.empty(err)
+ return function()
+ return nil, err
+ end
+end
+%
+
+@stick#
+function source.file(handle, io_err)
+ if handle then
+ return function()
+ local chunk = handle:read(2048)
+ if not chunk then handle:close() end
+ return chunk
+ end
+ else return source.empty(io_err or "unable to open file") end
+end
+%
+\end{lua}
+\end{quote}
+
+\subsection{Filtered sources}
+
+A filtered source passes its data through the
+associated filter before returning it to the caller.
+Filtered sources are useful when working with
+functions that get their input data from a source (such as
+the pumps in our examples). By chaining a source with one or
+more filters, such functions can be transparently provided
+with filtered data, with no need to change their interfaces.
+Here is a factory that does the job:
+\begin{quote}
+\begin{lua}
+@stick#
+function source.chain(src, f)
+ return function()
+ if not src then
+ return nil
+ end
+ local chunk, err = src()
+ if not chunk then
+ src = nil
+ return f(nil)
+ else
+ return f(chunk)
+ end
+ end
+end
+%
+\end{lua}
+\end{quote}
+
+\subsection{Sinks}
+
+Just as we defined an interface for a source of data, we can
+also define an interface for a data destination. We call
+any function respecting this interface a sink. In our first
+example, we used a file sink connected to the standard
+output.
+
+Sinks receive consecutive chunks of data, until the end of
+data is signaled by a \nil\ input chunk. A sink can be
+notified of an error with an optional extra argument that
+contains the error message, following a \nil\ chunk.
+If a sink detects an error itself, and
+wishes not to be called again, it can return \nil,
+followed by an error message. A return value that
+is not \nil\ means the sink will accept more data.
+
+Below are two useful sink factories.
+The table factory creates a sink that stores
+individual chunks into an array. The data can later be
+efficiently concatenated into a single string with Lua's
+\texttt{table.concat} library function. The \texttt{null} sink
+simply discards the chunks it receives:
+\begin{quote}
+\begin{lua}
+@stick#
+function sink.table(t)
+ t = t or {}
+ local f = function(chunk, err)
+ if chunk then table.insert(t, chunk) end
+ return 1
+ end
+ return f, t
+end
+%
+
+@stick#
+local function null()
+ return 1
+end
+
+function sink.null()
+ return null
+end
+%
+\end{lua}
+\end{quote}
+
+Naturally, filtered sinks are just as useful as filtered
+sources. A filtered sink passes each chunk it receives
+through the associated filter before handing it down to the
+original sink. In the following example, we use a source
+that reads from the standard input. The input chunks are
+sent to a table sink, which has been coupled with a
+normalization filter. The filtered chunks are then
+concatenated from the output array, and finally sent to
+standard out:
+\begin{quote}
+\begin{lua}
+@stick#
+local input = source.file(io.stdin)
+local output, t = sink.table()
+output = sink.chain(normalize(CRLF), output)
+pump.all(input, output)
+io.write(table.concat(t))
+%
+\end{lua}
+\end{quote}
+
+\subsection{Pumps}
+
+Although not on purpose, our interface for sources is
+compatible with Lua iterators. That is, a source can be
+neatly used in conjunction with \texttt{for} loops. Using
+our file source as an iterator, we can write the following
+code:
+\begin{quote}
+\begin{lua}
+@stick#
+for chunk in source.file(io.stdin) do
+ io.write(chunk)
+end
+%
+\end{lua}
+\end{quote}
+
+Loops like this will always be present because everything
+we designed so far is passive. Sources, sinks, filters: none
+of them can do anything on their own. The operation of
+pumping all data a source can provide into a sink is so
+common that it deserves its own function:
+\begin{quote}
+\begin{lua}
+@stick#
+function pump.step(src, snk)
+ local chunk, src_err = src()
+ local ret, snk_err = snk(chunk, src_err)
+ if chunk and ret then return 1
+ else return nil, src_err or snk_err end
+end
+%
+
+@stick#
+function pump.all(src, snk, step)
+ step = step or pump.step
+ while true do
+ local ret, err = step(src, snk)
+ if not ret then
+ if err then return nil, err
+ else return 1 end
+ end
+ end
+end
+%
+\end{lua}
+\end{quote}
+
+The \texttt{pump.step} function moves one chunk of data from
+the source to the sink. The \texttt{pump.all} function takes
+an optional \texttt{step} function and uses it to pump all the
+data from the source to the sink.
+Here is an example that uses the Base64 and the
+line wrapping filters from the \texttt{LuaSocket}
+distribution. The program reads a binary file from
+disk and stores it in another file, after encoding it to the
+Base64 transfer content encoding:
+\begin{quote}
+\begin{lua}
+@stick#
+local input = source.chain(
+ source.file(io.open("input.bin", "rb")),
+ encode("base64"))
+local output = sink.chain(
+ wrap(76),
+ sink.file(io.open("output.b64", "w")))
+pump.all(input, output)
+%
+\end{lua}
+\end{quote}
+
+The way we split the filters here is not intuitive, on
+purpose. Alternatively, we could have chained the Base64
+encode filter and the line-wrap filter together, and then
+chain the resulting filter with either the file source or
+the file sink. It doesn't really matter.
+
+\section{Exploding filters}
+
+Our current filter interface has one serious shortcoming.
+Consider for example a \texttt{gzip} decompression filter.
+During decompression, a small input chunk can be exploded
+into a huge amount of data. To address this problem, we
+decided to change the filter interface and allow exploding
+filters to return large quantities of output data in a chunk
+by chunk manner.
+
+More specifically, after passing each chunk of input to
+a filter, and collecting the first chunk of output, the
+user must now loop to receive other chunks from the filter until no
+filtered data is left. Within these secondary calls, the
+caller passes an empty string to the filter. The filter
+responds with an empty string when it is ready for the next
+input chunk. In the end, after the user passes a
+\nil\ chunk notifying the filter that there is no
+more input data, the filter might still have to produce too
+much output data to return in a single chunk. The user has
+to loop again, now passing \nil\ to the filter each time,
+until the filter itself returns \nil\ to notify the
+user it is finally done.
+
+Fortunately, it is very easy to modify a filter to respect
+the new interface. In fact, the end-of-line translation
+filter we presented earlier already conforms to it. The
+complexity is encapsulated within the chaining functions,
+which must now include a loop. Since these functions only
+have to be written once, the user is rarely affected.
+Interestingly, the modifications do not have a measurable
+negative impact in the performance of filters that do
+not need the added flexibility. On the other hand, for a
+small price in complexity, the changes make exploding
+filters practical.
+
+\section{A complex example}
+
+The LTN12 module in the \texttt{LuaSocket} distribution
+implements all the ideas we have described. The MIME
+and SMTP modules are tightly integrated with LTN12,
+and can be used to showcase the expressive power of filters,
+sources, sinks, and pumps. Below is an example
+of how a user would proceed to define and send a
+multipart message, with attachments, using \texttt{LuaSocket}:
+\begin{quote}
+\begin{mime}
+local smtp = require"socket.smtp"
+local mime = require"mime"
+local ltn12 = require"ltn12"
+
+local message = smtp.message{
+ headers = {
+ from = "Sicrano <sicrano@example.com>",
+ to = "Fulano <fulano@example.com>",
+ subject = "A message with an attachment"},
+ body = {
+ preamble = "Hope you can see the attachment" .. CRLF,
+ [1] = {
+ body = "Here is our logo" .. CRLF},
+ [2] = {
+ headers = {
+ ["content-type"] = 'image/png; name="luasocket.png"',
+ ["content-disposition"] =
+ 'attachment; filename="luasocket.png"',
+ ["content-description"] = 'LuaSocket logo',
+ ["content-transfer-encoding"] = "BASE64"},
+ body = ltn12.source.chain(
+ ltn12.source.file(io.open("luasocket.png", "rb")),
+ ltn12.filter.chain(
+ mime.encode("base64"),
+ mime.wrap()))}}}
+
+assert(smtp.send{
+ rcpt = "<fulano@example.com>",
+ from = "<sicrano@example.com>",
+ source = message})
+\end{mime}
+\end{quote}
+
+The \texttt{smtp.message} function receives a table
+describing the message, and returns a source. The
+\texttt{smtp.send} function takes this source, chains it with the
+SMTP dot-stuffing filter, connects a socket sink
+with the server, and simply pumps the data. The message is never
+assembled in memory. Everything is produced on demand,
+transformed in small pieces, and sent to the server in chunks,
+including the file attachment which is loaded from disk and
+encoded on the fly. It just works.
+
+\section{Conclusions}
+
+In this article, we introduced the concepts of filters,
+sources, sinks, and pumps to the Lua language. These are
+useful tools for stream processing in general. Sources provide
+a simple abstraction for data acquisition. Sinks provide an
+abstraction for final data destinations. Filters define an
+interface for data transformations. The chaining of
+filters, sources and sinks provides an elegant way to create
+arbitrarily complex data transformations from simpler
+components. Pumps simply push the data through.
+
+\section{Acknowledgements}
+
+The concepts described in this text are the result of long
+discussions with David Burgess. A version of this text has
+been released on-line as the Lua Technical Note 012, hence
+the name of the corresponding LuaSocket module, LTN12. Wim
+Couwenberg contributed to the implementation of the module,
+and Adrian Sietsma was the first to notice the
+correspondence between sources and Lua iterators.
+
+
+\end{document}