## Data

### Reading Data Files

The first thing to consider is this: do you actually need to write a custom file
reader? And if the answer is yes, the next question is: can you write the reader
in as clear a way as possible? Correctness, Robustness, and Speed; pick the first
two and the third can be sorted out later, _if necessary_.

A common sort of data file is the configuration file format commonly used on Unix
systems. This format is often called a _property_ file in the Java world.

    # Read timeout in seconds
    read.timeout=10

    # Write timeout in seconds
    write.timeout=10

Here is a simple Lua implementation:

    -- property file parsing with Lua string patterns
    props = {}
    for line in io.lines() do
        if line:find('#',1,true) ~= 1 and not line:find('^%s*$') then
            local var,value = line:match('([^=]+)=(.*)')
            props[var] = value
        end
    end

Very compact, but it suffers from the same disease as equivalent Perl programs:
it uses odd string patterns which are 'lexically noisy'. Noisy code like this
slows the casual reader down. (For an even more direct way of doing this, see the
next section, 'Reading Configuration Files'.)

Another implementation, using the Penlight libraries:

    -- property file parsing with extended string functions
    require 'pl'
    stringx.import()
    props = {}
    for line in io.lines() do
        if not line:startswith('#') and not line:isspace() then
            local var,value = line:splitv('=')
            props[var] = value
        end
    end

This is more self-documenting; it is generally better to make the code express
the _intention_, rather than having to scatter comments everywhere - comments are
necessary, of course, but mostly to give the higher view of your intention that
cannot be expressed in code. It is slightly slower, true, but in practice the
speed of this script is determined by I/O, so further optimization is unnecessary.

### Reading Unstructured Text Data

Text data is sometimes unstructured, for example a file containing words. The
`pl.input` module has a number of functions which make processing such files
easier. For example, a script to count the number of words in standard input
using `input.words`:

    -- countwords.lua
    require 'pl'
    local k = 0
    for w in input.words(io.stdin) do
        k = k + 1
    end
    print('count',k)

Or this script to calculate the average of a set of numbers using `input.numbers`:

    -- average.lua
    require 'pl'
    local k = 0
    local sum = 0
    for n in input.numbers(io.stdin) do
        sum = sum + n
        k = k + 1
    end
    print('average',sum/k)

These scripts can be improved further by _eliminating loops_. In the last case,
there is a perfectly good function `seq.sum` which takes a sequence of numbers
and returns both the total and the count:

    -- average2.lua
    require 'pl'
    local total,n = seq.sum(input.numbers())
    print('average',total/n)

A further simplification here is that if `numbers` or `words` are not passed an
argument, they will grab their input from standard input. The first script can
be rewritten:

    -- countwords2.lua
    require 'pl'
    print('count',seq.count(input.words()))

A useful feature of a sequence generator like `numbers` is that it can read from
a string source.
Here is a script to calculate the sums of the numbers on each
line in a file:

    -- sums.lua
    require 'pl'
    for line in io.lines() do
        print(seq.sum(input.numbers(line)))
    end

### Reading Columnar Data

It is very common to find data in columnar form, either space- or comma-separated,
perhaps with an initial set of column headers. Here is a typical example:

    EventID Magnitude LocationX LocationY LocationZ
    981124001 2.0 18988.4 10047.1 4149.7
    981125001 0.8 19104.0 9970.4 5088.7
    981127003 0.5 19012.5 9946.9 3831.2
    ...

`input.fields` is designed to extract several columns, given some delimiter
(defaulting to whitespace). Here is a script to calculate the average X location of
all the events:

    -- avg-x.lua
    require 'pl'
    io.read() -- skip the header line
    local sum,count = seq.sum(input.fields {3})
    print(sum/count)

`input.fields` is passed either a field count, or a list of column indices,
starting at one as usual. So in this case we're only interested in column 3. If
you pass it a field count, then you get every field up to that count:

    for id,mag,locX,locY,locZ in input.fields (5) do
        ....
    end

`input.fields` by default tries to convert each field to a number. It will skip
lines which clearly don't match the pattern, but will abort the script if there
are any fields which cannot be converted to numbers.

The second parameter is a delimiter, by default spaces. ' ' is understood to mean
'any number of spaces', i.e. '%s+'. Any Lua string pattern can be used.

The third parameter is a _data source_, by default standard input (defined by
`input.create_getter`.) It assumes that the data source has a `read` method which
brings in the next line, i.e. it is a 'file-like' object. As a special case, a
string will be split into its lines:

    > for x,y in input.fields(2,' ','10 20\n30 40\n') do print(x,y) end
    10 20
    30 40

Note the default behaviour for bad fields, which is to show the offending line
number:

    > for x,y in input.fields(2,' ','10 20\n30 40x\n') do print(x,y) end
    10 20
    line 2: cannot convert '40x' to number

This behaviour of `input.fields` is appropriate for a script which you want to
fail immediately with an appropriate _user_ error message if conversion fails.
The fourth optional parameter is an options table: `{no_fail=true}` means that
conversion is attempted but if it fails it just returns the string, rather as AWK
would operate. You are then responsible for checking the type of the returned
field. `{no_convert=true}` switches off conversion altogether and all fields are
returned as strings.
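For instance, here is a minimal sketch of `{no_fail=true}` in action, checking
the returned field types by hand (reusing the string-source style shown above):

    require 'pl'
    -- with no_fail, '40x' comes through as a string instead of aborting
    for x,y in input.fields(2,' ','10 20\n30 40x\n',{no_fail=true}) do
        if type(y) == 'number' then
            print(x,y)
        else
            print('bad field:',y)
        end
    end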
@lookup pl.data

Sometimes it is useful to bring a whole dataset into memory, for operations such
as extracting columns. Penlight provides a flexible reader specifically for
reading this kind of data, using the `data` module. Given a file looking like this:

    x,y
    10,20
    2,5
    40,50

Then `data.read` will create a table like this, with each row represented by a
sublist:

    > t = data.read 'test.txt'
    > pretty.dump(t)
    {{10,20},{2,5},{40,50},fieldnames={'x','y'},delim=','}

You can now analyze this returned table using the supplied methods. For instance,
the method `column_by_name` returns a table of all the values of that column.

    -- testdata.lua
    require 'pl'
    d = data.read('fev.txt')
    for _,name in ipairs(d.fieldnames) do
        local col = d:column_by_name(name)
        if type(col[1]) == 'number' then
            local total,n = seq.sum(col)
            utils.printf("Average for %s is %f\n",name,total/n)
        end
    end

`data.read` tries to be clever when given data; by default it expects a first
line of column names, unless any of them are numbers. It tries to deduce the
column delimiter by looking at the first line. Sometimes it guesses wrong; these
things can be specified explicitly. The second optional parameter is an options
table, which can override `delim` (a string pattern), `fieldnames` (a list or
comma-separated string), specify `no_convert` (default is to convert), `numfields`
(indices of columns known to be numbers, as a list) and `thousands_dot` (when the
thousands separator in Excel CSV is '.').

A very powerful feature is a way to execute SQL-like queries on such data:

    -- queries on tabular data
    require 'pl'
    local d = data.read('xyz.txt')
    local q = d:select('x,y,z where x > 3 and z < 2 sort by y')
    for x,y,z in q do
        print(x,y,z)
    end

Please note that the format of queries is restricted to the following syntax:

    FIELDLIST [ 'where' CONDITION ] [ 'sort by' FIELD [asc|desc]]

Any valid Lua code can appear in `CONDITION`; remember it is _not_ SQL and you
have to use `==` (this warning comes from experience.)

For this to work, _field names must be Lua identifiers_. So `read` will massage
fieldnames so that all non-alphanumeric chars are replaced with underscores.
However, the `original_fieldnames` field always contains the original un-massaged
fieldnames.

`read` can handle standard CSV files fine, although it doesn't try to be a
full-blown CSV parser. With the `csv=true` option, it's possible to have
double-quoted fields, which may contain commas; then trailing commas become
significant as well.

Spreadsheet programs are not always the best tool to
process such data, strange as this might seem to some people. This is a toy CSV
file; to appreciate the problem, imagine thousands of rows and dozens of columns
like this:

    Department Name,Employee ID,Project,Hours Booked
    sales,1231,overhead,4
    sales,1255,overhead,3
    engineering,1501,development,5
    engineering,1501,maintenance,3
    engineering,1433,maintenance,10

The task is to reduce the dataset to a relevant set of rows and columns, perhaps
do some processing on row data, and write the result out to a new CSV file. The
`write_row` method uses the delimiter to write the row to a file;
`Data.select_row` is like `Data.select`, except it iterates over _rows_, not
fields; this is necessary if we are dealing with a lot of columns!

    names = {[1501]='don',[1433]='dilbert'}
    keepcols = {'Employee_ID','Hours_Booked'}
    t:write_row (outf,{'Employee','Hours_Booked'})
    q = t:select_row {
        fields=keepcols,
        where=function(row) return row[1]=='engineering' end
    }
    for row in q do
        row[1] = names[row[1]]
        t:write_row(outf,row)
    end

`Data.select_row` and `Data.select` can be passed a table specifying the query: a
list of field names, a function defining the condition and an optional parameter
`sort_by`. It isn't really necessary here, but if we had a more complicated row
condition (such as belonging to a specified set) then it is not generally
possible to express such a condition as a query string, without resorting to
hackery such as global variables; a condition function, as sketched below,
handles such cases cleanly.
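For instance, a set-membership condition might look like this (the `wanted` set
is hypothetical):

    -- keep only rows whose department belongs to a given set
    local wanted = {engineering=true, sales=true}
    local q = t:select_row {
        fields = keepcols,
        where = function(row) return wanted[row[1]] end
    }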
With 1.0.3, you can specify explicit conversion functions for selected columns.
For instance, this is a log file with a Unix date stamp:

    Time Message
    1266840760 +# EE7C0600006F0D00C00F06010302054000000308010A00002B00407B00
    1266840760 closure data 0.000000 1972 1972 0
    1266840760 ++ 1266840760 EE 1
    1266840760 +# EE7C0600006F0D00C00F06010302054000000408020A00002B00407B00
    1266840764 closure data 0.000000 1972 1972 0

We would like the first column as an actual date object, so the `convert`
field sets an explicit conversion for column 1. (Note that we have to explicitly
convert the string to a number first.)

    Date = require 'pl.Date'

    function date_convert (ds)
        return Date(tonumber(ds))
    end

    d = data.read(f,{convert={[1]=date_convert},last_field_collect=true})

This gives us a two-column dataset, where the first column contains `Date` objects
and the second column contains the rest of the line. Queries can then easily
pick out events on a day of the week:

    q = d:select "Time,Message where Time:weekday_name()=='Sun'"

Data does not have to come from files, nor does it necessarily come from the lab
or the accounts department. On Linux, `ps aux` gives you a full listing of all
processes running on your machine. It is straightforward to feed the output of
this command into `data.read` and perform useful queries on it. Notice that
non-identifier characters like '%' get converted into underscores:

    require 'pl'
    f = io.popen 'ps aux'
    s = data.read (f,{last_field_collect=true})
    f:close()
    print(s.fieldnames)
    print(s:column_by_name 'USER')
    qs = 'COMMAND,_MEM where _MEM > 5 and USER=="steve"'
    for name,mem in s:select(qs) do
        print(mem,name)
    end

I've always been an admirer of the AWK programming language; with `filter` you
can get Lua programs which are just as compact:

    -- printxy.lua
    require 'pl'
    data.filter 'x,y where x > 3'

It is common enough to have data files without a header line of field names.
`data.read` makes a special exception for such files if all fields are numeric.
Since there are no column names to use in query expressions, you can use AWK-like
column indexes, e.g. '$1,$2 where $1 > 3'. I have a little executable script on
my system called `lf` which looks like this:

    #!/usr/bin/env lua
    require 'pl.data'.filter(arg[1])

And it can be used generally as a filter command to extract columns from data.
(The column specifications may be expressions or even constants.)

    $ lf '$1,$5/10' < test.dat

(As with AWK, please note the single-quotes used in this command; this prevents
the shell from trying to expand the column indexes. If you are on Windows, then you
must quote the expression in double-quotes so
it is passed as one argument to your batch file.)

As a tutorial resource, have a look at `test-data.lua` in the PL tests directory
for other examples of use, plus comments.

The data returned by `read` or constructed by `Data.copy_select` from a query is
basically just an array of rows: `{{1,2},{3,4}}`. So you may use `read` to pull
in any array-like dataset, and process it with any function that expects such a
representation. In particular, the functions in `array2d` will work fine with
this data. In fact, these functions are available as methods; e.g.
`array2d.flatten` can be called directly like so to give us a one-dimensional list:

    v = data.read('dat.txt'):flatten()

The data is also in exactly the right shape to be treated as matrices by
[LuaMatrix](http://lua-users.org/wiki/LuaMatrix):

    > matrix = require 'matrix'
    > m = matrix(data.read 'mat.txt')
    > = m
    1 0.2 0.3
    0.2 1 0.1
    0.1 0.2 1
    > = m^2 -- same as m*m
    1.07 0.46 0.62
    0.41 1.06 0.26
    0.24 0.42 1.05

`write` will write matrices back to files for you.

Finally, for the curious, the global variable `_DEBUG` can be used to print out
the actual iterator function which a query generates and dynamically compiles. By
using code generation, we can get pretty much optimal performance out of
arbitrary queries.

    > lua -lpl -e "_DEBUG=true" -e "data.filter 'x,y where x > 4 sort by x'" < test.txt
    return function (t)
        local i = 0
        local v
        local ls = {}
        for i,v in ipairs(t) do
            if v[1] > 4 then
                ls[#ls+1] = v
            end
        end
        table.sort(ls,function(v1,v2)
            return v1[1] < v2[1]
        end)
        local n = #ls
        return function()
            i = i + 1
            v = ls[i]
            if i > n then return end
            return v[1],v[2]
        end
    end

    10,20
    40,50

### Reading Configuration Files

The `config` module provides a simple way to convert several kinds of
configuration files into a Lua table. Consider this simple example:

    # test.config
    # Read timeout in seconds
    read.timeout=10

    # Write timeout in seconds
    write.timeout=5

    #acceptable ports
    ports = 1002,1003,1004

This can be easily brought in using `config.read` and the result shown using
`pretty.write`:

    -- readconfig.lua
    local config = require 'pl.config'
    local pretty = require 'pl.pretty'

    local t = config.read(arg[1])
    print(pretty.write(t))

and the output of `lua readconfig.lua test.config` is:

    {
      ports = {
        1002,
        1003,
        1004
      },
      write_timeout = 5,
      read_timeout = 10
    }

That is, `config.read` will bring in all key/value pairs, ignore # comments, and
ensure that the key names are proper Lua identifiers by replacing non-identifier
characters with '_'. If the values are numbers, then they will be converted. (So
the value of `t.write_timeout` is the number 5). In addition, any values which
are separated by commas will be converted likewise into an array.

Any line can be continued with a backslash. So this will all be considered one
line:

    names=one,two,three, \
    four,five,six,seven, \
    eight,nine,ten

Windows-style INI files are also supported. The section structure of INI files
translates naturally to nested tables in Lua:

    ; test.ini
    [timeouts]
    read=10 ; Read timeout in seconds
    write=5 ; Write timeout in seconds
    [portinfo]
    ports = 1002,1003,1004

The output is:

    {
      portinfo = {
        ports = {
          1002,
          1003,
          1004
        }
      },
      timeouts = {
        write = 5,
        read = 10
      }
    }

You can now refer to the write timeout as `t.timeouts.write`.

As a final example of the flexibility of `config.read`, if passed this simple
comma-delimited file

    one,two,three
    10,20,30
    40,50,60
    1,2,3

it will produce the following table:

    {
      { "one", "two", "three" },
      { 10, 20, 30 },
      { 40, 50, 60 },
      { 1, 2, 3 }
    }

`config.read` isn't designed to read all CSV files in general, but is intended to
support some Unix configuration files not structured as key-value pairs, such as
'/etc/passwd'.

This function is intended to be a Swiss Army Knife of configuration readers, but
it does have to make assumptions, and you may not like them.
So there is an
optional extra parameter which allows some control: a table that may have
the following fields:

    {
       variablilize = true,
       convert_numbers = tonumber,
       trim_space = true,
       list_delim = ',',
       trim_quotes = true,
       ignore_assign = false,
       keysep = '=',
       smart = false,
    }

`variablilize` is the option that converted `write.timeout` in the first example
to the valid Lua identifier `write_timeout`. If `convert_numbers` is true, then
an attempt is made to convert any string that starts like a number. You can
specify your own function (say one that will convert a string like '5224 kb' into
a number.)

`trim_space` ensures that there is no starting or trailing whitespace in
values, and `list_delim` is the character that will be used to decide whether to
split a value up into a list (it may be a Lua string pattern such as '%s+'.)

For instance, the password file in Unix is colon-delimited:

    t = config.read('/etc/passwd',{list_delim=':'})

This produces the following output on my system (only the last two entries shown):

    {
      ...
      {
        "user",
        "x",
        "1000",
        "1000",
        "user,,,",
        "/home/user",
        "/bin/bash"
      },
      {
        "sdonovan",
        "x",
        "1001",
        "1001",
        "steve donovan,28,,",
        "/home/sdonovan",
        "/bin/bash"
      }
    }

You can get this into a more sensible format, where the usernames are the keys,
with this (note that the `tablex.pairmap` function must return value, key!):

    t = tablex.pairmap(function(k,v) return v,v[1] end,t)

and you get:

    { ...
      sdonovan = {
        "sdonovan",
        "x",
        "1001",
        "1001",
        "steve donovan,28,,",
        "/home/sdonovan",
        "/bin/bash"
      }
    ...
    }

Many common Unix configuration files can be read by tweaking these parameters.
For `/etc/fstab`, the options `{list_delim='%s+',ignore_assign=true}` will
correctly separate the columns. It's common to find 'KEY VALUE' assignments in
files such as `/etc/ssh/ssh_config`; the options `{keysep=' '}` make
`config.read` return a table where each KEY has a value VALUE.

Files in the Linux `procfs` usually use ':' as the field delimiter:

    > t = config.read('/proc/meminfo',{keysep=':'})
    > = t.MemFree
    220140 kB

That result is a string, since `tonumber` doesn't like it, but defining the
`convert_numbers` option as `function(s) return tonumber((s:gsub(' kB$','')))
end` will get the memory figures as actual numbers in the result. (The extra
parentheses are necessary so that `tonumber` only gets the first result from
`gsub`). From `tests/test-config.lua`:

    testconfig([[
     MemTotal:        1024748 kB
     MemFree:          220292 kB
    ]],
    { MemTotal = 1024748, MemFree = 220292 },
    {
      keysep = ':',
      convert_numbers = function(s)
        s = s:gsub(' kB$','')
        return tonumber(s)
      end
    }
    )

The `smart` option lets `config.read` make a reasonable guess for you; there
are examples in `tests/test-config.lua`, but basically these common file
formats (and those following the same pattern) can be processed directly in
smart mode: '/etc/fstab', '/proc/XXXX/status', 'ssh_config' and 'updatedb.conf'.

Please note that `config.read` can be passed a _file-like object_; if it's not a
string and supports the `read` method, then that will be used. For instance, to
read a configuration from a string, use `stringio.open`.
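A minimal sketch of that last point (the configuration string here is made up):

    local config = require 'pl.config'
    local stringio = require 'pl.stringio'

    -- stringio.open wraps a string in a file-like object with a read method
    local t = config.read(stringio.open 'depth=10\nlabel=test\n')
    -- t.depth is the number 10; t.label is the string 'test'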
<a id="lexer"/>

### Lexical Scanning

Although Lua's string pattern matching is very powerful, there are times when
something more powerful is needed. `pl.lexer.scan` provides lexical scanners
which _tokenize_ a string, classifying tokens into numbers, strings, etc.

    > lua -lpl
    Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
    > tok = lexer.scan 'alpha = sin(1.5)'
    > = tok()
    iden alpha
    > = tok()
    = =
    > = tok()
    iden sin
    > = tok()
    ( (
    > = tok()
    number 1.5
    > = tok()
    ) )
    > = tok()
    (nil)

The scanner is a function, which is repeatedly called and returns the _type_ and
_value_ of the token. Recognized basic types are 'iden', 'string', 'number' and
'space', and everything else is represented by itself. Note that by default the
scanner will skip any 'space' tokens.

'comment' and 'keyword' aren't applicable to the plain scanner, which is not
language-specific, but a scanner which understands Lua is available. It
recognizes the Lua keywords, and understands both short and long comments and
strings.

    > for t,v in lexer.lua 'for i=1,n do' do print(t,v) end
    keyword for
    iden i
    = =
    number 1
    , ,
    iden n
    keyword do

A lexical scanner is useful where you have highly-structured data which is not
nicely delimited by newlines. For example, here is a snippet of an in-house file
format which it was my task to maintain:

    points
    (818344.1,-20389.7,-0.1),(818337.9,-20389.3,-0.1),(818332.5,-20387.8,-0.1)
    ,(818327.4,-20388,-0.1),(818322,-20387.7,-0.1),(818316.3,-20388.6,-0.1)
    ,(818309.7,-20389.4,-0.1),(818303.5,-20390.6,-0.1),(818295.8,-20388.3,-0.1)
    ,(818290.5,-20386.9,-0.1),(818285.2,-20386.1,-0.1),(818279.3,-20383.6,-0.1)
    ,(818274,-20381.2,-0.1),(818274,-20380.7,-0.1);

Here is code to extract the points using `pl.lexer`:

    -- assume 's' contains the text above...
    local lexer = require 'pl.lexer'
    local expecting = lexer.expecting
    local append = table.insert

    local tok = lexer.scan(s)

    local points = {}
    local t,v = tok() -- should be 'iden','points'

    while t ~= ';' do
        local c = {}
        expecting(tok,'(')
        c.x = expecting(tok,'number')
        expecting(tok,',')
        c.y = expecting(tok,'number')
        expecting(tok,',')
        c.z = expecting(tok,'number')
        expecting(tok,')')
        t,v = tok() -- either ',' or ';'
        append(points,c)
    end

The `expecting` function grabs the next token and if the type doesn't match, it
throws an error. (`pl.lexer`, unlike other PL libraries, raises errors if
something goes wrong, so you should wrap your code in `pcall` to catch the error
gracefully.)
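For instance, a minimal sketch of such a guard (the malformed input is invented
for illustration):

    local lexer = require 'pl.lexer'

    -- expecting() raises on a type mismatch, so run the parse protected
    local ok, err = pcall(function()
        local tok = lexer.scan '[1,2)'   -- malformed: '[' where '(' is expected
        lexer.expecting(tok,'(')         -- this mismatch raises an error
    end)
    if not ok then print('parse failed: '..tostring(err)) end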
The scanners all have a second optional argument, which is a table that controls
whether you want to exclude spaces and/or comments. The default for `lexer.lua`
is `{space=true,comments=true}`. There is a third optional argument which
determines how string and number tokens are to be processed.

The ultimate highly-structured data is of course, program source. Here is a
snippet from 'test-lexer.lua':

    require 'pl'

    lines = [[
    for k,v in pairs(t) do
        if type(k) == 'number' then
            print(v) -- array-like case
        else
            print(k,v)
        end
    end
    ]]

    ls = List()
    for tp,val in lexer.lua(lines,{space=true,comments=true}) do
        assert(tp ~= 'space' and tp ~= 'comment')
        if tp == 'keyword' then ls:append(val) end
    end
    test.asserteq(ls,List{'for','in','do','if','then','else','end','end'})

Here is a useful little utility that identifies all common global variables found
in a Lua module (ignoring those declared locally for the moment):

    -- testglobal.lua
    require 'pl'

    local txt,err = utils.readfile(arg[1])
    if not txt then return print(err) end

    local globals = List()
    for t,v in lexer.lua(txt) do
        if t == 'iden' and _G[v] then
            globals:append(v)
        end
    end
    pretty.dump(seq.count_map(globals))

Rather than dumping the whole list, with its duplicates, we pass it through
`seq.count_map` which turns the list into a table where the keys are the values,
and the associated values are the number of times those values occur in the
sequence. Typical output looks like this:

    {
      type = 2,
      pairs = 2,
      table = 2,
      print = 3,
      tostring = 2,
      require = 1,
      ipairs = 4
    }

You could further pass this through `tablex.keys` to get a unique list of
symbols. This can be useful when writing 'strict' Lua modules, where all global
symbols must be defined as locals at the top of the file.

For a more detailed use of `lexer.scan`, please look at `testxml.lua` in the
examples directory.

### XML

New in the 0.9.7 release is some support for XML. This is a large topic, and
Penlight does not provide a full XML stack, which is properly the task of a more
specialized library.

#### Parsing and Pretty-Printing

The semi-standard XML parser in the Lua universe is [lua-expat](http://matthewwild.co.uk/projects/luaexpat/).
In particular,
it has a function called `lxp.lom.parse` which will parse XML into the Lua Object
Model (LOM) format. However, it does not provide a way to convert this data back
into XML text. `xml.parse` will use this function, _if_ `lua-expat` is
available, and otherwise falls back to a pure Lua parser originally written by
Roberto Ierusalimschy.

The resulting document object knows how to render itself as a string, which is
useful for debugging:

    > d = xml.parse "<nodes><node id='1'>alice</node></nodes>"
    > = d
    <nodes><node id='1'>alice</node></nodes>
    > pretty.dump (d)
    {
      {
        "alice",
        attr = {
          "id",
          id = "1"
        },
        tag = "node"
      },
      attr = {
      },
      tag = "nodes"
    }

Looking at the actual shape of the data reveals the structure of LOM (walked
through in the sketch after this list):

 * every element has a `tag` field with its name
 * plus an `attr` field which is a table containing the attributes as fields, and
also as an array. It is always present.
 * the children of the element are the array part of the element, so `d[1]` is
the first child of `d`, etc.
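A minimal sketch of navigating that structure directly, using the document
parsed above:

    require 'pl'
    local d = xml.parse "<nodes><node id='1'>alice</node></nodes>"
    print(d.tag)         --> nodes
    print(d[1].tag)      --> node
    print(d[1].attr.id)  --> 1
    print(d[1][1])       --> alice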
It could be argued that having attributes also as the array part of `attr` is not
essential (you cannot depend on attribute order in XML) but that's how
it goes with this standard.

`lua-expat` is another _soft dependency_ of Penlight; generally, the fallback
parser is good enough for straightforward XML as is commonly found in
configuration files, etc. `doc.basic_parse` is not intended to be a proper
conforming parser (it's only sixty lines) but it handles simple kinds of
documents that do not have comments or DTD directives. It is intelligent enough
to ignore the `<?xml` directive and that is about it.

You can get pretty-printing by explicitly calling `xml.tostring` and passing it
the initial indent and the per-element indent:

    > = xml.tostring(d,'','  ')

    <nodes>
      <node id='1'>alice</node>
    </nodes>

There is a fourth argument which is the _attribute indent_:

    > a = xml.parse "<frodo name='baggins' age='50' type='hobbit'/>"
    > = xml.tostring(a,'','  ','  ')

    <frodo
      type='hobbit'
      name='baggins'
      age='50'
    />

#### Parsing and Working with Configuration Files

It's common to find configurations expressed with XML these days. It's
straightforward to 'walk' the [LOM](http://matthewwild.co.uk/projects/luaexpat/lom.html)
data and extract the data in the form you want:

    require 'pl'

    local config = [[
    <config>
        <alpha>1.3</alpha>
        <beta>10</beta>
        <name>bozo</name>
    </config>
    ]]
    local d,err = xml.parse(config)

    local t = {}
    for item in d:childtags() do
        t[item.tag] = item[1]
    end

    pretty.dump(t)
    --->
    {
      beta = "10",
      alpha = "1.3",
      name = "bozo"
    }

The only gotcha is that here we must use the `Doc:childtags` method, which will
skip over any text elements.

A more involved example is this excerpt from `serviceproviders.xml`, which is
usually found at `/usr/share/mobile-broadband-provider-info/serviceproviders.xml`
on Debian/Ubuntu Linux systems.

    d = xml.parse [[
    <serviceproviders format="2.0">
    ...
    <country code="za">
        <provider>
            <name>Cell-c</name>
            <gsm>
                <network-id mcc="655" mnc="07"/>
                <apn value="internet">
                    <username>Cellcis</username>
                    <dns>196.7.0.138</dns>
                    <dns>196.7.142.132</dns>
                </apn>
            </gsm>
        </provider>
        <provider>
            <name>MTN</name>
            <gsm>
                <network-id mcc="655" mnc="10"/>
                <apn value="internet">
                    <dns>196.11.240.241</dns>
                    <dns>209.212.97.1</dns>
                </apn>
            </gsm>
        </provider>
        <provider>
            <name>Vodacom</name>
            <gsm>
                <network-id mcc="655" mnc="01"/>
                <apn value="internet">
                    <dns>196.207.40.165</dns>
                    <dns>196.43.46.190</dns>
                </apn>
                <apn value="unrestricted">
                    <name>Unrestricted</name>
                    <dns>196.207.32.69</dns>
                    <dns>196.43.45.190</dns>
                </apn>
            </gsm>
        </provider>
        <provider>
            <name>Virgin Mobile</name>
            <gsm>
                <apn value="vdata">
                    <dns>196.7.0.138</dns>
                    <dns>196.7.142.132</dns>
                </apn>
            </gsm>
        </provider>
    </country>
    ....
    </serviceproviders>
    ]]

Getting the names of the providers per-country is straightforward:

    local t = {}
    for country in d:childtags() do
        local providers = {}
        t[country.attr.code] = providers
        for provider in country:childtags() do
            table.insert(providers,provider:child_with_name('name'):get_text())
        end
    end

    pretty.dump(t)
    -->
    {
      za = {
        "Cell-c",
        "MTN",
        "Vodacom",
        "Virgin Mobile"
      }
      ....
    }

#### Generating XML with 'xmlification'

This feature is inspired by the `htmlify` function used by
[Orbit](http://keplerproject.github.com/orbit/) to simplify HTML generation,
except that no function environment magic is used; the `tags` function returns a
set of _constructors_ for elements of the given tag names.

    > nodes, node = xml.tags 'nodes, node'
    > = node 'alice'
    <node>alice</node>
    > = nodes { node {id='1','alice'}}
    <nodes><node id='1'>alice</node></nodes>

The flexibility of Lua tables is very useful here, since both the attributes and
the children of an element can be encoded naturally. The argument to these tag
constructors is either a single value (like a string) or a table where the
attributes are the named keys and the children are the array values.
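For instance, mixing an attribute with a text child in one call might look like
this (a sketch reusing the `node` constructor defined above):

    > = node {id='2', 'bob'}
    <node id='2'>bob</node>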
#### Generating XML using Templates

A template is a little XML document which contains dollar-variables. The `subst`
method on a document is fed an array of tables containing values for these
variables. Note how the parent tag name is specified:

    > templ = xml.parse "<node id='$id'>$name</node>"
    > = templ:subst {tag='nodes', {id=1,name='alice'},{id=2,name='john'}}
    <nodes><node id='1'>alice</node><node id='2'>john</node></nodes>

Substitution is closely related to _filtering_ documents. One of the annoying things
about XML is that it is a document markup language first, and a data language
second. Standard parsers will assume you really care about all those extra
text elements. Consider this fragment, which has been changed by a five-year-old:

    T = [[
    <weather>
        boops!
        <current_conditions>
            <condition data='$condition'/>
            <temp_c data='$temp'/>
            <bo>whoops!</bo>
        </current_conditions>
    </weather>
    ]]

Conformant parsers will give you text elements containing the line feed after
`<current_conditions>`, even though this makes handling the data more irritating.

    local function parse (str)
        return xml.parse(str,false,true)
    end

The second argument means 'string, not file' and the third argument means use the
built-in Lua parser (instead of LuaExpat, if available) which _by default_ is not
interested in keeping such strings.

How to remove the string `boops!`? `clone` (also called `filter` when called as a
method) copies a LOM document. It can be passed a filter function, which is applied
to each string found. The powerful thing about this is that this function receives
structural information - the parent node, and whether this was a tag name, a text
element or an attribute name:

    d = parse (T)
    c = d:filter(function(s,kind,parent)
        print(stringx.strip(s),kind,parent and parent.tag or '?')
        if kind == '*TEXT' and #parent > 1 then return nil end
        return s
    end)
    --->
    weather *TAG ?
    boops! *TEXT weather
    current_conditions *TAG weather
    condition *TAG current_conditions
    $condition data condition
    temp_c *TAG current_conditions
    $temp data temp_c
    bo *TAG current_conditions
    whoops! *TEXT bo

We can pull out 'boops!' and not 'whoops!' by discarding text elements which are
not the single child of an element.

#### Extracting Data using Templates

Matching goes in the opposite direction. We have a document, and would like to
extract values from it using a pattern.

A common use of this is parsing the XML result of API queries. The
[(undocumented and subsequently discontinued) Google Weather
API](http://blog.programmableweb.com/2010/02/08/googles-secret-weather-api/) is a
good example.
Grabbing the result of
`http://www.google.com/ig/api?weather=Johannesburg,ZA` we get something like
this, after pretty-printing:

    <xml_api_reply version='1'>
      <weather module_id='0' tab_id='0' mobile_zipped='1' section='0' row='0' mobile_row='0'>
        <forecast_information>
          <city data='Johannesburg, Gauteng'/>
          <postal_code data='Johannesburg,ZA'/>
          <latitude_e6 data=''/>
          <longitude_e6 data=''/>
          <forecast_date data='2010-10-02'/>
          <current_date_time data='2010-10-02 18:30:00 +0000'/>
          <unit_system data='US'/>
        </forecast_information>
        <current_conditions>
          <condition data='Clear'/>
          <temp_f data='75'/>
          <temp_c data='24'/>
          <humidity data='Humidity: 19%'/>
          <icon data='/ig/images/weather/sunny.gif'/>
          <wind_condition data='Wind: NW at 7 mph'/>
        </current_conditions>
        <forecast_conditions>
          <day_of_week data='Sat'/>
          <low data='60'/>
          <high data='89'/>
          <icon data='/ig/images/weather/sunny.gif'/>
          <condition data='Clear'/>
        </forecast_conditions>
        ....
      </weather>
    </xml_api_reply>

Assume that the above XML has been read into `google`. The idea is to write a
pattern looking like a template, and use it to extract some values of interest:

    t = [[
      <weather>
        <current_conditions>
          <condition data='$condition'/>
          <temp_c data='$temp'/>
        </current_conditions>
      </weather>
    ]]

    local res, ret = google:match(t)
    pretty.dump(res)

And the output is:

    {
      condition = "Clear",
      temp = "24"
    }

The `match` method can be passed a LOM document or some text, which will be
parsed first.

But what if we need to extract values from repeated elements? Match templates may
contain 'array matches' which are enclosed in '{{..}}':

    <weather>
      {{<forecast_conditions>
        <day_of_week data='$day'/>
        <low data='$low'/>
        <high data='$high'/>
        <condition data='$condition'/>
      </forecast_conditions>}}
    </weather>

And the match result is:

    {
      {
        low = "60",
        high = "89",
        day = "Sat",
        condition = "Clear",
      },
      {
        low = "53",
        high = "86",
        day = "Sun",
        condition = "Clear",
      },
      {
        low = "57",
        high = "87",
        day = "Mon",
        condition = "Clear",
      },
      {
        low = "60",
        high = "84",
        day = "Tue",
        condition = "Clear",
      }
    }

With this array of tables, you can use `tablex` or `List`
to reshape into the desired form, if you choose. Just as with reading a Unix password
file with `config`, you can make the array into a map of days to conditions using:

    tablex.pairmap('|k,v| v,v.day',conditions)

(Here using the alternative string lambda option.)
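For comparison, the same reshaping with a plain function argument (recall that
`pairmap` must return value, key):

    -- key each forecast table by its 'day' field
    tablex.pairmap(function(k,v) return v,v.day end, conditions)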
However, xml matches can shape the structure of the output. By replacing the `day_of_week`
line of the template with `<day_of_week data='$_'/>` we get the same effect; `$_` is
a special symbol that means that this captured value (or simply _capture_) becomes the key.

Note that `$NUMBER` means a numerical index, so
that `$1` is the first element of the resulting array, and so forth. You can mix
numbered and named captures, but it's strongly advised to make the numbered captures
form a proper array sequence (everything from `1` to `n` inclusive). `$0` has a
special meaning; if it is the only capture (`{[0]='foo'}`) then the table is
collapsed into 'foo'.

    <weather>
      {{<forecast_conditions>
        <day_of_week data='$_'/>
        <low data='$1'/>
        <high data='$2'/>
        <condition data='$3'/>
      </forecast_conditions>}}
    </weather>

Now the result is:

    {
      Tue = {
        "60",
        "84",
        "Clear"
      },
      Sun = {
        "53",
        "86",
        "Clear"
      },
      Sat = {
        "60",
        "89",
        "Clear"
      },
      Mon = {
        "57",
        "87",
        "Clear"
      }
    }

Applying matches to this config file poses another problem, because the actual
tags matched are themselves meaningful.

    <config>
      <alpha>1.3</alpha>
      <beta>10</beta>
      <name>bozo</name>
    </config>

So there are tag 'wildcards', which are element names ending with a hyphen.

    <config>
      {{<key->$value</key->}}
    </config>

You will then get `{{alpha='1.3'},...}`. The most convenient format would be
returned by this (note that `_-` behaves just like `$_`):

    <config>
      {{<_->$0</_->}}
    </config>

which would return `{alpha='1.3',beta='10',name='bozo'}`.

We could play this game endlessly, and encode ways of converting captures, but
the scheme is complex enough, and it's easy to do the conversion later:

    local numbers = {alpha=true,beta=true}
    for k,v in pairs(res) do
        if numbers[k] then res[k] = tonumber(v) end
    end

#### HTML Parsing

HTML is an unusually degenerate form of XML, and Dennis Schridde has contributed
a feature which makes parsing it easier. For instance, from the tests:

    doc = xml.parsehtml [[
    <BODY>
    Hello dolly<br>
    HTML is <b>slack</b><br>
    </BODY>
    ]]

    asserteq(xml.tostring(doc),[[
    <body>
    Hello dolly<br/>
    HTML is <b>slack</b><br/></body>]])

That is, all tags are converted to lowercase, and empty HTML elements like `br`
are properly closed; attributes do not need to be quoted.

Also, DOCTYPE directives and comments are skipped. For truly badly formed HTML,
this is not the tool for you!