diff --git a/Data/Libraries/Penlight/docs_topics/06-data.md b/Data/Libraries/Penlight/docs_topics/06-data.md
new file mode 100644
index 0000000..e067b6b
--- /dev/null
+++ b/Data/Libraries/Penlight/docs_topics/06-data.md
@@ -0,0 +1,1262 @@
+## Data
+
+### Reading Data Files
+
+The first thing to consider is this: do you actually need to write a custom file
+reader? And if the answer is yes, the next question is: can you write the reader
+in as clear a way as possible? Correctness, Robustness, and Speed; pick the first
+two and the third can be sorted out later, _if necessary_.
+
+A common sort of data file is the configuration file format commonly used on Unix
+systems. This format is often called a _property_ file in the Java world.
+
+ # Read timeout in seconds
+ read.timeout=10
+
+ # Write timeout in seconds
+ write.timeout=10
+
+Here is a simple Lua implementation:
+
+ -- property file parsing with Lua string patterns
+    props = {}
+ for line in io.lines() do
+ if line:find('#',1,true) ~= 1 and not line:find('^%s*$') then
+ local var,value = line:match('([^=]+)=(.*)')
+ props[var] = value
+ end
+ end
+
+Very compact, but it suffers from the same disease as equivalent Perl programs:
+it uses odd string patterns which are 'lexically noisy'. Noisy code like this
+slows the casual reader down. (For an even more direct way of doing this, see the
+later section 'Reading Configuration Files'.)
+
+Another implementation, using the Penlight libraries:
+
+ -- property file parsing with extended string functions
+ require 'pl'
+ stringx.import()
+    props = {}
+ for line in io.lines() do
+ if not line:startswith('#') and not line:isspace() then
+ local var,value = line:splitv('=')
+ props[var] = value
+ end
+ end
+
+This is more self-documenting; it is generally better to make the code express
+the _intention_, rather than scattering comments everywhere - comments are
+necessary, of course, but mostly to give the higher-level view of your intention
+that cannot be expressed in code. It is slightly slower, true, but in practice the
+speed of this script is determined by I/O, so further optimization is unnecessary.
+
+### Reading Unstructured Text Data
+
+Text data is sometimes unstructured, for example a file containing words. The
+`pl.input` module has a number of functions which make processing such files
+easier. For example, a script to count the number of words in standard input
+using `input.words`:
+
+ -- countwords.lua
+ require 'pl'
+    local k = 0
+ for w in input.words(io.stdin) do
+ k = k + 1
+ end
+ print('count',k)
+
+Or this script to calculate the average of a set of numbers using `input.numbers`:
+
+ -- average.lua
+ require 'pl'
+    local k = 0
+ local sum = 0
+ for n in input.numbers(io.stdin) do
+ sum = sum + n
+ k = k + 1
+ end
+ print('average',sum/k)
+
+These scripts can be improved further by _eliminating loops_. In the last case,
+there is a perfectly good function `seq.sum` which can take a sequence of
+numbers and add them up for us:
+
+ -- average2.lua
+ require 'pl'
+ local total,n = seq.sum(input.numbers())
+ print('average',total/n)
+
+A further simplification here is that if `numbers` or `words` are not passed an
+argument, they will grab their input from standard input. The first script can
+be rewritten:
+
+ -- countwords2.lua
+ require 'pl'
+ print('count',seq.count(input.words()))
+
+A useful feature of a sequence generator like `numbers` is that it can read from
+a string source. Here is a script to calculate the sums of the numbers on each
+line in a file:
+
+    -- sums.lua
+    require 'pl'
+    for line in io.lines() do
+      print(seq.sum(input.numbers(line)))
+    end
+
+### Reading Columnar Data
+
+It is very common to find data in columnar form, either space or comma-separated,
+perhaps with an initial set of column headers. Here is a typical example:
+
+ EventID Magnitude LocationX LocationY LocationZ
+ 981124001 2.0 18988.4 10047.1 4149.7
+ 981125001 0.8 19104.0 9970.4 5088.7
+ 981127003 0.5 19012.5 9946.9 3831.2
+ ...
+
+`input.fields` is designed to extract several columns, given some delimiter
+(defaulting to whitespace). Here is a script to calculate the average X location of
+all the events:
+
+ -- avg-x.lua
+ require 'pl'
+ io.read() -- skip the header line
+ local sum,count = seq.sum(input.fields {3})
+ print(sum/count)
+
+`input.fields` is passed either a field count, or a list of column indices,
+starting at one as usual. So in this case we're only interested in column 3. If
+you pass it a field count, then you get every field up to that count:
+
+ for id,mag,locX,locY,locZ in input.fields (5) do
+ ....
+ end
+
+`input.fields` by default tries to convert each field to a number. It will skip
+lines which clearly don't match the pattern, but will abort the script if there
+are any fields which cannot be converted to numbers.
+
+The second parameter is a delimiter, by default spaces. ' ' is understood to mean
+'any number of spaces', i.e. '%s+'. Any Lua string pattern can be used.
+
+The third parameter is a _data source_, by default standard input (defined by
+`input.create_getter`.) It assumes that the data source has a `read` method which
+brings in the next line, i.e. it is a 'file-like' object. As a special case, a
+string will be split into its lines:
+
+ > for x,y in input.fields(2,' ','10 20\n30 40\n') do print(x,y) end
+ 10 20
+ 30 40
+
+Note the default behaviour for bad fields, which is to show the offending line
+number:
+
+ > for x,y in input.fields(2,' ','10 20\n30 40x\n') do print(x,y) end
+ 10 20
+ line 2: cannot convert '40x' to number
+
+This behaviour of `input.fields` is appropriate for a script which you want to
+fail immediately with an appropriate _user_ error message if conversion fails.
+The fourth optional parameter is an options table: `{no_fail=true}` means that
+conversion is attempted but if it fails it just returns the string, rather as AWK
+would operate. You are then responsible for checking the type of the returned
+field. `{no_convert=true}` switches off conversion altogether and all fields are
+returned as strings.
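+
+For instance, here is a quick sketch of the `no_fail` behaviour, reusing the string
+source from the earlier examples (the printed output is shown approximately):
+
+    > for x,y in input.fields(2,' ','10 20\n30 40x\n',{no_fail=true}) do print(x,type(y)) end
+    10      number
+    30      string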
+
+@lookup pl.data
+
+Sometimes it is useful to bring a whole dataset into memory, for operations such
+as extracting columns. Penlight provides a flexible reader specifically for
+reading this kind of data, using the `data` module. Given a file looking like this:
+
+ x,y
+ 10,20
+ 2,5
+ 40,50
+
+Then `data.read` will create a table like this, with each row represented by a
+sublist:
+
+ > t = data.read 'test.txt'
+ > pretty.dump(t)
+ {{10,20},{2,5},{40,50},fieldnames={'x','y'},delim=','}
+
+You can now analyze this returned table using the supplied methods. For instance,
+the method `column_by_name` returns a table of all the values of that column.
+
+ -- testdata.lua
+ require 'pl'
+ d = data.read('fev.txt')
+ for _,name in ipairs(d.fieldnames) do
+ local col = d:column_by_name(name)
+ if type(col[1]) == 'number' then
+ local total,n = seq.sum(col)
+ utils.printf("Average for %s is %f\n",name,total/n)
+ end
+ end
+
+`data.read` tries to be clever when given data; by default it expects a first
+line of column names, unless any of them are numbers. It tries to deduce the
+column delimiter by looking at the first line. Sometimes it guesses wrong, but these
+things can be specified explicitly. The second optional parameter is an options
+table which can override `delim` (a string pattern) and `fieldnames` (a list or
+comma-separated string), and can specify `no_convert` (the default is to convert),
+`numfields` (indices of columns known to be numbers, as a list) and `thousands_dot`
+(when the thousands separator in Excel CSV is '.').
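+
+For example, a sketch of overriding those guesses explicitly (the file name and
+column layout here are invented):
+
+    local d = data.read('events.csv', {
+       delim = ',',                   -- comma-separated
+       fieldnames = 'id,mag,x,y,z',   -- the file has no header, so name the columns ourselves
+       numfields = {1,2,3,4,5}        -- all five columns are numeric
+    })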
+
+A very powerful feature is a way to execute SQL-like queries on such data:
+
+ -- queries on tabular data
+ require 'pl'
+ local d = data.read('xyz.txt')
+ local q = d:select('x,y,z where x > 3 and z < 2 sort by y')
+ for x,y,z in q do
+ print(x,y,z)
+ end
+
+Please note that the format of queries is restricted to the following syntax:
+
+ FIELDLIST [ 'where' CONDITION ] [ 'sort by' FIELD [asc|desc]]
+
+Any valid Lua code can appear in `CONDITION`; remember it is _not_ SQL and you
+have to use `==` (this warning comes from experience.)
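+
+For instance, a condition using `==` together with a descending sort might look like
+this (a sketch; the field names come from the `xyz.txt` example above and the values
+are invented):
+
+    local q = d:select('x,z where y == 20 and x > 0 sort by z desc')
+    for x,z in q do print(x,z) end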
+
+For this to work, _field names must be Lua identifiers_. So `read` will massage
+fieldnames so that all non-alphanumeric chars are replaced with underscores.
+However, the `original_fieldnames` field always contains the original un-massaged
+fieldnames.
+
+`read` can handle standard CSV files fine, although it doesn't try to be a
+full-blown CSV parser. With the `csv=true` option, it's possible to have
+double-quoted fields, which may contain commas; then trailing commas become
+significant as well.
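+
+As a sketch (the file is hypothetical), reading such quoted fields looks like this:
+
+    -- staff.csv contains a header plus lines such as:  "Smith, John",engineering,5
+    local d = data.read('staff.csv', {csv=true})
+    -- d[1][1] is now the single field 'Smith, John', comma and all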
+
+Spreadsheet programs are not always the best tool to
+process such data, strange as this might seem to some people. This is a toy CSV
+file; to appreciate the problem, imagine thousands of rows and dozens of columns
+like this:
+
+ Department Name,Employee ID,Project,Hours Booked
+ sales,1231,overhead,4
+ sales,1255,overhead,3
+ engineering,1501,development,5
+ engineering,1501,maintenance,3
+ engineering,1433,maintenance,10
+
+The task is to reduce the dataset to a relevant set of rows and columns, perhaps
+do some processing on row data, and write the result out to a new CSV file. The
+`write_row` method uses the delimiter to write the row to a file;
+`Data.select_row` is like `Data.select`, except it iterates over _rows_, not
+fields; this is necessary if we are dealing with a lot of columns!
+
+ names = {[1501]='don',[1433]='dilbert'}
+ keepcols = {'Employee_ID','Hours_Booked'}
+ t:write_row (outf,{'Employee','Hours_Booked'})
+ q = t:select_row {
+ fields=keepcols,
+ where=function(row) return row[1]=='engineering' end
+ }
+ for row in q do
+ row[1] = names[row[1]]
+ t:write_row(outf,row)
+ end
+
+`Data.select_row` and `Data.select` can be passed a table specifying the query; a
+list of field names, a function defining the condition and an optional parameter
+`sort_by`. It isn't really necessary here, but if we had a more complicated row
+condition (such as belonging to a specified set) then it is not generally
+possible to express such a condition as a query string, without resorting to
+hackery such as global variables.
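+
+For example, the 'specified set' case might look like this sketch (assuming, as in the
+file above, that the employee ID is the second column of the raw row):
+
+    local wanted = {[1501]=true, [1433]=true}
+    q = t:select_row {
+        fields = keepcols,
+        where = function(row) return wanted[row[2]] end
+    }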
+
+With 1.0.3, you can specify explicit conversion functions for selected columns.
+For instance, this is a log file with a Unix date stamp:
+
+ Time Message
+ 1266840760 +# EE7C0600006F0D00C00F06010302054000000308010A00002B00407B00
+ 1266840760 closure data 0.000000 1972 1972 0
+ 1266840760 ++ 1266840760 EE 1
+ 1266840760 +# EE7C0600006F0D00C00F06010302054000000408020A00002B00407B00
+ 1266840764 closure data 0.000000 1972 1972 0
+
+We would like the first column as an actual date object, so the `convert`
+field sets an explicit conversion for column 1. (Note that we have to explicitly
+convert the string to a number first.)
+
+ Date = require 'pl.Date'
+
+ function date_convert (ds)
+ return Date(tonumber(ds))
+ end
+
+ d = data.read(f,{convert={[1]=date_convert},last_field_collect=true})
+
+This gives us a two-column dataset, where the first column contains `Date` objects
+and the second column contains the rest of the line. Queries can then easily
+pick out events on a day of the week:
+
+ q = d:select "Time,Message where Time:weekday_name()=='Sun'"
+
+Data does not have to come from files, nor does it necessarily come from the lab
+or the accounts department. On Linux, `ps aux` gives you a full listing of all
+processes running on your machine. It is straightforward to feed the output of
+this command into `data.read` and perform useful queries on it. Notice that
+non-identifier characters like '%' get converted into underscores:
+
+ require 'pl'
+ f = io.popen 'ps aux'
+ s = data.read (f,{last_field_collect=true})
+ f:close()
+ print(s.fieldnames)
+ print(s:column_by_name 'USER')
+ qs = 'COMMAND,_MEM where _MEM > 5 and USER=="steve"'
+ for name,mem in s:select(qs) do
+ print(mem,name)
+ end
+
+I've always been an admirer of the AWK programming language; with `filter` you
+can get Lua programs which are just as compact:
+
+ -- printxy.lua
+ require 'pl'
+ data.filter 'x,y where x > 3'
+
+It is common enough to have data files without a header of field names.
+`data.read` makes a special exception for such files if all fields are numeric.
+Since there are no column names to use in query expressions, you can use AWK-like
+column indexes, e.g. '$1,$2 where $1 > 3'. I have a little executable script on
+my system called `lf` which looks like this:
+
+ #!/usr/bin/env lua
+ require 'pl.data'.filter(arg[1])
+
+And it can be used generally as a filter command to extract columns from data.
+(The column specifications may be expressions or even constants.)
+
+ $ lf '$1,$5/10' < test.dat
+
+(As with AWK, please note the single quotes used in this command; they prevent
+the shell from trying to expand the column indexes. If you are on Windows, then you
+must quote the expression in double quotes so that
+it is passed as one argument to your batch file.)
+
+As a tutorial resource, have a look at `test-data.lua` in the PL tests directory
+for other examples of use, plus comments.
+
+The data returned by `read` or constructed by `Data.copy_select` from a query is
+basically just an array of rows: `{{1,2},{3,4}}`. So you may use `read` to pull
+in any array-like dataset, and process it with any function that expects such a
+structure. In particular, the functions in `array2d` will work fine with
+this data. In fact, these functions are available as methods; e.g.
+`array2d.flatten` can be called directly like so to give us a one-dimensional list:
+
+ v = data.read('dat.txt'):flatten()
+
+The data is also in exactly the right shape to be treated as matrices by
+[LuaMatrix](http://lua-users.org/wiki/LuaMatrix):
+
+ > matrix = require 'matrix'
+ > m = matrix(data.read 'mat.txt')
+ > = m
+ 1 0.2 0.3
+ 0.2 1 0.1
+ 0.1 0.2 1
+ > = m^2 -- same as m*m
+ 1.07 0.46 0.62
+ 0.41 1.06 0.26
+ 0.24 0.42 1.05
+
+`write` will write matrices back to files for you.
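+
+For instance, a minimal sketch of saving the squared matrix (the output file name is
+invented; if `data.write` objects to the matrix object itself, convert it to a plain
+table of rows first):
+
+    data.write(m^2, 'mat2.txt')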
+
+Finally, for the curious, the global variable `_DEBUG` can be used to print out
+the actual iterator function which a query generates and dynamically compiles. By
+using code generation, we can get pretty much optimal performance out of
+arbitrary queries.
+
+ > lua -lpl -e "_DEBUG=true" -e "data.filter 'x,y where x > 4 sort by x'" < test.txt
+ return function (t)
+ local i = 0
+ local v
+ local ls = {}
+ for i,v in ipairs(t) do
+ if v[1] > 4 then
+ ls[#ls+1] = v
+ end
+ end
+ table.sort(ls,function(v1,v2)
+ return v1[1] < v2[1]
+ end)
+ local n = #ls
+ return function()
+ i = i + 1
+ v = ls[i]
+ if i > n then return end
+ return v[1],v[2]
+ end
+ end
+
+ 10,20
+ 40,50
+
+### Reading Configuration Files
+
+The `config` module provides a simple way to convert several kinds of
+configuration files into a Lua table. Consider the simple example:
+
+ # test.config
+ # Read timeout in seconds
+ read.timeout=10
+
+ # Write timeout in seconds
+ write.timeout=5
+
+ #acceptable ports
+ ports = 1002,1003,1004
+
+This can be easily brought in using `config.read` and the result shown using
+`pretty.write`:
+
+ -- readconfig.lua
+ local config = require 'pl.config'
+    local pretty = require 'pl.pretty'
+
+ local t = config.read(arg[1])
+ print(pretty.write(t))
+
+and the output of `lua readconfig.lua test.config` is:
+
+ {
+ ports = {
+ 1002,
+ 1003,
+ 1004
+ },
+ write_timeout = 5,
+ read_timeout = 10
+ }
+
+That is, `config.read` will bring in all key/value pairs, ignore # comments, and
+ensure that the key names are proper Lua identifiers by replacing non-identifier
+characters with '_'. If the values are numbers, then they will be converted. (So
+the value of `t.write_timeout` is the number 5). In addition, any values which
+are separated by commas will be converted likewise into an array.
+
+Any line can be continued with a backslash. So this will all be considered one
+line:
+
+ names=one,two,three, \
+ four,five,six,seven, \
+ eight,nine,ten
+
+
+Windows-style INI files are also supported. The section structure of INI files
+translates naturally to nested tables in Lua:
+
+ ; test.ini
+ [timeouts]
+ read=10 ; Read timeout in seconds
+ write=5 ; Write timeout in seconds
+ [portinfo]
+ ports = 1002,1003,1004
+
+The output is:
+
+ {
+ portinfo = {
+ ports = {
+ 1002,
+ 1003,
+ 1004
+ }
+ },
+ timeouts = {
+ write = 5,
+ read = 10
+ }
+ }
+
+You can now refer to the write timeout as `t.timeouts.write`.
+
+As a final example of the flexibility of `config.read`, if passed this simple
+comma-delimited file
+
+ one,two,three
+ 10,20,30
+ 40,50,60
+ 1,2,3
+
+it will produce the following table:
+
+ {
+ { "one", "two", "three" },
+ { 10, 20, 30 },
+ { 40, 50, 60 },
+ { 1, 2, 3 }
+ }
+
+`config.read` isn't designed to read CSV files in general; this behaviour is intended to
+support Unix configuration files which are not structured as key-value pairs, such as
+'/etc/passwd'.
+
+This function is intended to be a Swiss Army Knife of configuration readers, but
+it does have to make assumptions, and you may not like them. So there is an
+optional extra parameter allowing some control: a table that may have the
+following fields:
+
+ {
+ variablilize = true,
+ convert_numbers = tonumber,
+ trim_space = true,
+ list_delim = ',',
+ trim_quotes = true,
+ ignore_assign = false,
+ keysep = '=',
+ smart = false,
+ }
+
+`variablilize` is the option that converted `write.timeout` in the first example
+to the valid Lua identifier `write_timeout`. If `convert_numbers` is true, then
+an attempt is made to convert any string that starts like a number. You can
+specify your own function (say one that will convert a string like '5224 kb' into
+a number.)
+
+`trim_space` ensures that there is no starting or trailing whitespace with
+values, and `list_delim` is the character that will be used to decide whether to
+split a value up into a list (it may be a Lua string pattern such as '%s+'.)
+
+For instance, the password file in Unix is colon-delimited:
+
+ t = config.read('/etc/passwd',{list_delim=':'})
+
+This produces the following output on my system (only the last two entries shown):
+
+ {
+ ...
+ {
+ "user",
+ "x",
+ "1000",
+ "1000",
+ "user,,,",
+ "/home/user",
+ "/bin/bash"
+ },
+ {
+ "sdonovan",
+ "x",
+ "1001",
+ "1001",
+ "steve donovan,28,,",
+ "/home/sdonovan",
+ "/bin/bash"
+ }
+ }
+
+You can get this into a more sensible format, where the usernames are the keys,
+with this (the `tablex.pairmap` function must return value, key!)
+
+ t = tablex.pairmap(function(k,v) return v,v[1] end,t)
+
+and you get:
+
+ { ...
+ sdonovan = {
+ "sdonovan",
+ "x",
+ "1001",
+ "1001",
+ "steve donovan,28,,",
+ "/home/sdonovan",
+ "/bin/bash"
+ }
+ ...
+ }
+
+Many common Unix configuration files can be read by tweaking these parameters.
+For `/etc/fstab`, the options `{list_delim='%s+',ignore_assign=true}` will
+correctly separate the columns. It's common to find 'KEY VALUE' assignments in
+files such as `/etc/ssh/ssh_config`; the option `{keysep=' '}` makes
+`config.read` return a table mapping each KEY to its VALUE.
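+
+Putting those two recipes into code (paths as found on a typical Linux system):
+
+    local config = require 'pl.config'
+    local fstab = config.read('/etc/fstab', {list_delim='%s+', ignore_assign=true})
+    local ssh   = config.read('/etc/ssh/ssh_config', {keysep=' '})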
+
+Files in the Linux `procfs` usually use ':' as the field delimiter:
+
+ > t = config.read('/proc/meminfo',{keysep=':'})
+ > = t.MemFree
+ 220140 kB
+
+That result is a string, since `tonumber` doesn't like it, but defining the
+`convert_numbers` option as `function(s) return tonumber((s:gsub(' kB$','')))
+end` will get the memory figures as actual numbers in the result. (The extra
+parentheses are necessary so that `tonumber` only gets the first result from
+`gsub`). From `tests/test-config.lua`:
+
+ testconfig([[
+ MemTotal: 1024748 kB
+ MemFree: 220292 kB
+ ]],
+ { MemTotal = 1024748, MemFree = 220292 },
+ {
+ keysep = ':',
+ convert_numbers = function(s)
+ s = s:gsub(' kB$','')
+ return tonumber(s)
+ end
+ }
+ )
+
+
+The `smart` option lets `config.read` make a reasonable guess for you; there
+are examples in `tests/test-config.lua`, but basically these common file
+formats (and those following the same pattern) can be processed directly in
+smart mode: '/etc/fstab', '/proc/XXXX/status', 'ssh_config' and 'updatedb.conf'.
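+
+For example, a sketch using one of the formats listed above (the available fields
+depend on your kernel):
+
+    local t = config.read('/proc/self/status', {smart=true})
+    print(t.Name)   -- the process name, e.g. 'lua'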
+
+Please note that `config.read` can be passed a _file-like object_; if it's not a
+string and supports the `read` method, then that will be used. For instance, to
+read a configuration from a string, use `stringio.open`.
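+
+For example, a small sketch of reading configuration text that is already held in a string:
+
+    local config   = require 'pl.config'
+    local stringio = require 'pl.stringio'
+
+    local t = config.read(stringio.open 'timeout=10\nports=1,2,3')
+    print(t.timeout, #t.ports)   --> 10    3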
+
+
+<a id="lexer"/>
+
+### Lexical Scanning
+
+Although Lua's string pattern matching is very powerful, there are times when
+something more sophisticated is needed. `pl.lexer.scan` provides lexical scanners
+which _tokenize_ a string, classifying tokens into numbers, strings, etc.
+
+ > lua -lpl
+ Lua 5.1.4 Copyright (C) 1994-2008 Lua.org, PUC-Rio
+ > tok = lexer.scan 'alpha = sin(1.5)'
+ > = tok()
+ iden alpha
+ > = tok()
+ = =
+ > = tok()
+ iden sin
+ > = tok()
+ ( (
+ > = tok()
+ number 1.5
+ > = tok()
+ ) )
+ > = tok()
+ (nil)
+
+The scanner is a function, which is repeatedly called and returns the _type_ and
+_value_ of the token. Recognized basic types are 'iden','string','number', and
+'space'. and everything else is represented by itself. Note that by default the
+scanner will skip any 'space' tokens.
+
+'comment' and 'keyword' aren't applicable to the plain scanner, which is not
+language-specific, but a scanner which understands Lua is available. It
+recognizes the Lua keywords, and understands both short and long comments and
+strings.
+
+ > for t,v in lexer.lua 'for i=1,n do' do print(t,v) end
+ keyword for
+ iden i
+ = =
+ number 1
+ , ,
+ iden n
+ keyword do
+
+A lexical scanner is useful where you have highly-structured data which is not
+nicely delimited by newlines. For example, here is a snippet of an in-house file
+format which it was my task to maintain:
+
+ points
+ (818344.1,-20389.7,-0.1),(818337.9,-20389.3,-0.1),(818332.5,-20387.8,-0.1)
+ ,(818327.4,-20388,-0.1),(818322,-20387.7,-0.1),(818316.3,-20388.6,-0.1)
+ ,(818309.7,-20389.4,-0.1),(818303.5,-20390.6,-0.1),(818295.8,-20388.3,-0.1)
+ ,(818290.5,-20386.9,-0.1),(818285.2,-20386.1,-0.1),(818279.3,-20383.6,-0.1)
+ ,(818274,-20381.2,-0.1),(818274,-20380.7,-0.1);
+
+Here is code to extract the points using `pl.lexer`:
+
+ -- assume 's' contains the text above...
+ local lexer = require 'pl.lexer'
+ local expecting = lexer.expecting
+ local append = table.insert
+
+ local tok = lexer.scan(s)
+
+ local points = {}
+ local t,v = tok() -- should be 'iden','points'
+
+ while t ~= ';' do
+        local c = {}
+ expecting(tok,'(')
+ c.x = expecting(tok,'number')
+ expecting(tok,',')
+ c.y = expecting(tok,'number')
+ expecting(tok,',')
+ c.z = expecting(tok,'number')
+ expecting(tok,')')
+ t,v = tok() -- either ',' or ';'
+ append(points,c)
+ end
+
+The `expecting` function grabs the next token and if the type doesn't match, it
+throws an error. (`pl.lexer`, unlike other PL libraries, raises errors if
+something goes wrong, so you should wrap your code in `pcall` to catch the error
+gracefully.)
+
+The scanners all have a second optional argument: a table controlling
+whether you want to exclude spaces and/or comments. The default for `lexer.lua`
+is `{space=true,comments=true}`. There is a third optional argument which
+determines how string and number tokens are to be processed.
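+
+For example, a sketch of keeping comment tokens while still filtering out spaces
+(the exact token values shown are illustrative):
+
+    > for t,v in lexer.lua('x = 1 -- one', {space=true, comments=false}) do print(t,v) end
+    iden    x
+    =       =
+    number  1
+    comment -- one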
+
+The ultimate in highly-structured data is, of course, program source. Here is a
+snippet from 'text-lexer.lua':
+
+ require 'pl'
+
+ lines = [[
+ for k,v in pairs(t) do
+ if type(k) == 'number' then
+ print(v) -- array-like case
+ else
+ print(k,v)
+ end
+ end
+ ]]
+
+ ls = List()
+ for tp,val in lexer.lua(lines,{space=true,comments=true}) do
+ assert(tp ~= 'space' and tp ~= 'comment')
+ if tp == 'keyword' then ls:append(val) end
+ end
+ test.asserteq(ls,List{'for','in','do','if','then','else','end','end'})
+
+Here is a useful little utility that identifies all common global variables found
+in a Lua module (ignoring those declared locally for the moment):
+
+ -- testglobal.lua
+ require 'pl'
+
+ local txt,err = utils.readfile(arg[1])
+ if not txt then return print(err) end
+
+ local globals = List()
+ for t,v in lexer.lua(txt) do
+ if t == 'iden' and _G[v] then
+ globals:append(v)
+ end
+ end
+ pretty.dump(seq.count_map(globals))
+
+Rather than dumping the whole list, with its duplicates, we pass it through
+`seq.count_map` which turns the list into a table where the keys are the values,
+and the associated values are the number of times those values occur in the
+sequence. Typical output looks like this:
+
+ {
+ type = 2,
+ pairs = 2,
+ table = 2,
+ print = 3,
+ tostring = 2,
+ require = 1,
+ ipairs = 4
+ }
+
+You could further pass this through `tablex.keys` to get a unique list of
+symbols. This can be useful when writing 'strict' Lua modules, where all global
+symbols must be defined as locals at the top of the file.
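+
+For example, the last line of `testglobal.lua` could become:
+
+    pretty.dump(tablex.keys(seq.count_map(globals)))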
+
+For a more detailed use of `lexer.scan`, please look at `testxml.lua` in the
+examples directory.
+
+### XML
+
+New in the 0.9.7 release is some support for XML. This is a large topic, and
+Penlight does not provide a full XML stack, which is properly the task of a more
+specialized library.
+
+#### Parsing and Pretty-Printing
+
+The semi-standard XML parser in the Lua universe is [lua-expat](http://matthewwild.co.uk/projects/luaexpat/).
+In particular,
+it has a function called `lxp.lom.parse` which will parse XML into the Lua Object
+Model (LOM) format. However, it does not provide a way to convert this data back
+into XML text. `xml.parse` will use this function, _if_ `lua-expat` is
+available, and otherwise falls back to a pure Lua parser originally written by
+Roberto Ierusalimschy.
+
+The resulting document object knows how to render itself as a string, which is
+useful for debugging:
+
+ > d = xml.parse "<nodes><node id='1'>alice</node></nodes>"
+ > = d
+ <nodes><node id='1'>alice</node></nodes>
+ > pretty.dump (d)
+ {
+ {
+ "alice",
+ attr = {
+ "id",
+ id = "1"
+ },
+ tag = "node"
+ },
+ attr = {
+ },
+ tag = "nodes"
+ }
+
+Looking at the actual shape of the data reveals the structure of LOM:
+
+ * every element has a `tag` field with its name
+ * plus an `attr` field, which is a table containing the attributes as fields, and
+also as an array. It is always present.
+ * the children of the element are the array part of the element, so `d[1]` is
+the first child of `d`, etc.
+
+It could be argued that having attributes also as the array part of `attr` is not
+essential (you cannot depend on attribute order in XML) but that's how
+it goes with this standard.
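+
+For example, navigating the parsed document from above:
+
+    > = d.tag
+    nodes
+    > = d[1].tag
+    node
+    > = d[1].attr.id
+    1
+    > = d[1][1]
+    alice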
+
+`lua-expat` is another _soft dependency_ of Penlight; generally, the fallback
+parser is good enough for straightforward XML as is commonly found in
+configuration files, etc. `doc.basic_parse` is not intended to be a proper
+conforming parser (it's only sixty lines) but it handles simple kinds of
+documents that do not have comments or DTD directives. It is intelligent enough
+to ignore the `<?xml` directive and that is about it.
+
+You can get pretty-printing by explicitly calling `xml.tostring` and passing it
+the initial indent and the per-element indent:
+
+ > = xml.tostring(d,'',' ')
+
+ <nodes>
+ <node id='1'>alice</node>
+ </nodes>
+
+There is a fourth argument which is the _attribute indent_:
+
+ > a = xml.parse "<frodo name='baggins' age='50' type='hobbit'/>"
+ > = xml.tostring(a,'',' ',' ')
+
+ <frodo
+ type='hobbit'
+ name='baggins'
+ age='50'
+ />
+
+#### Parsing and Working with Configuration Files
+
+It's common to find configurations expressed with XML these days. It's
+straightforward to 'walk' the [LOM](http://matthewwild.co.uk/projects/luaexpat/lom.html)
+data and extract the data in the form you want:
+
+ require 'pl'
+
+ local config = [[
+ <config>
+ <alpha>1.3</alpha>
+ <beta>10</beta>
+ <name>bozo</name>
+ </config>
+ ]]
+ local d,err = xml.parse(config)
+
+ local t = {}
+ for item in d:childtags() do
+ t[item.tag] = item[1]
+ end
+
+ pretty.dump(t)
+ --->
+ {
+ beta = "10",
+ alpha = "1.3",
+ name = "bozo"
+ }
+
+The only gotcha is that here we must use the `Doc:childtags` method, which will
+skip over any text elements.
+
+A more involved example is this excerpt from `serviceproviders.xml`, which is
+usually found at `/usr/share/mobile-broadband-provider-info/serviceproviders.xml`
+on Debian/Ubuntu Linux systems.
+
+ d = xml.parse [[
+ <serviceproviders format="2.0">
+ ...
+ <country code="za">
+ <provider>
+ <name>Cell-c</name>
+ <gsm>
+ <network-id mcc="655" mnc="07"/>
+ <apn value="internet">
+ <username>Cellcis</username>
+ <dns>196.7.0.138</dns>
+ <dns>196.7.142.132</dns>
+ </apn>
+ </gsm>
+ </provider>
+ <provider>
+ <name>MTN</name>
+ <gsm>
+ <network-id mcc="655" mnc="10"/>
+ <apn value="internet">
+ <dns>196.11.240.241</dns>
+ <dns>209.212.97.1</dns>
+ </apn>
+ </gsm>
+ </provider>
+ <provider>
+ <name>Vodacom</name>
+ <gsm>
+ <network-id mcc="655" mnc="01"/>
+ <apn value="internet">
+ <dns>196.207.40.165</dns>
+ <dns>196.43.46.190</dns>
+ </apn>
+ <apn value="unrestricted">
+ <name>Unrestricted</name>
+ <dns>196.207.32.69</dns>
+ <dns>196.43.45.190</dns>
+ </apn>
+ </gsm>
+ </provider>
+ <provider>
+ <name>Virgin Mobile</name>
+ <gsm>
+ <apn value="vdata">
+ <dns>196.7.0.138</dns>
+ <dns>196.7.142.132</dns>
+ </apn>
+ </gsm>
+ </provider>
+ </country>
+ ....
+ </serviceproviders>
+ ]]
+
+Getting the names of the providers per-country is straightforward:
+
+ local t = {}
+ for country in d:childtags() do
+ local providers = {}
+ t[country.attr.code] = providers
+ for provider in country:childtags() do
+ table.insert(providers,provider:child_with_name('name'):get_text())
+ end
+ end
+
+ pretty.dump(t)
+ -->
+ {
+ za = {
+ "Cell-c",
+ "MTN",
+ "Vodacom",
+ "Virgin Mobile"
+ }
+ ....
+ }
+
+#### Generating XML with 'xmlification'
+
+This feature is inspired by the `htmlify` function used by
+[Orbit](http://keplerproject.github.com/orbit/) to simplify HTML generation,
+except that no function environment magic is used; the `tags` function returns a
+set of _constructors_ for elements of the given tag names.
+
+ > nodes, node = xml.tags 'nodes, node'
+ > = node 'alice'
+ <node>alice</node>
+ > = nodes { node {id='1','alice'}}
+ <nodes><node id='1'>alice</node></nodes>
+
+The flexibility of Lua tables is very useful here, since both the attributes and
+the children of an element can be encoded naturally. The argument to these tag
+constructors is either a single value (like a string) or a table where the
+attributes are the named keys and the children are the array values.
+
+#### Generating XML using Templates
+
+A template is a little XML document which contains dollar-variables. The `subst`
+method on a document is fed an array of tables containing values for these
+variables. Note how the parent tag name is specified:
+
+ > templ = xml.parse "<node id='$id'>$name</node>"
+ > = templ:subst {tag='nodes', {id=1,name='alice'},{id=2,name='john'}}
+ <nodes><node id='1'>alice</node><node id='2'>john</node></nodes>
+
+Substitution is closely related to _filtering_ documents. One of the annoying things
+about XML is that it is a document markup language first, and a data language
+second. Standard parsers will assume you really care about all those extra
+text elements. Consider this fragment, which has been changed by a five-year-old:
+
+ T = [[
+ <weather>
+ boops!
+ <current_conditions>
+ <condition data='$condition'/>
+ <temp_c data='$temp'/>
+ <bo>whoops!</bo>
+ </current_conditions>
+ </weather>
+ ]]
+
+Conformant parsers will give you text elements with the line feed after `<current_conditions>`,
+even though this makes handling the data more irritating.
+
+ local function parse (str)
+ return xml.parse(str,false,true)
+ end
+
+The second argument means 'string, not file' and the third argument means 'use the built-in
+Lua parser' (instead of LuaExpat, if available), which _by default_ is not interested in
+keeping such strings.
+
+How do we remove the string `boops!`? `clone` (also called `filter` when called as a
+method) copies a LOM document. It can be passed a filter function, which is applied
+to each string found. The powerful thing is that this function receives
+structural information - the parent node, and whether this was a tag name, a text
+element or an attribute name:
+
+ d = parse (T)
+ c = d:filter(function(s,kind,parent)
+ print(stringx.strip(s),kind,parent and parent.tag or '?')
+ if kind == '*TEXT' and #parent > 1 then return nil end
+ return s
+ end)
+ --->
+ weather *TAG ?
+ boops! *TEXT weather
+ current_conditions *TAG weather
+ condition *TAG current_conditions
+ $condition data condition
+ temp_c *TAG current_conditions
+ $temp data temp_c
+ bo *TAG current_conditions
+ whoops! *TEXT bo
+
+We can pull out 'boops' and not 'whoops' by discarding text elements which are not
+the single child of an element.
+
+
+
+#### Extracting Data using Templates
+
+Matching goes in the opposite direction. We have a document, and would like to
+extract values from it using a pattern.
+
+A common use of this is parsing the XML result of API queries. The
+[(undocumented and subsequently discontinued) Google Weather
+API](http://blog.programmableweb.com/2010/02/08/googles-secret-weather-api/) is a
+good example. Grabbing the result of
+`http://www.google.com/ig/api?weather=Johannesburg,ZA` we get something like
+this, after pretty-printing:
+
+ <xml_api_reply version='1'>
+ <weather module_id='0' tab_id='0' mobile_zipped='1' section='0' row='0'
+mobile_row='0'>
+ <forecast_information>
+ <city data='Johannesburg, Gauteng'/>
+ <postal_code data='Johannesburg,ZA'/>
+ <latitude_e6 data=''/>
+ <longitude_e6 data=''/>
+ <forecast_date data='2010-10-02'/>
+ <current_date_time data='2010-10-02 18:30:00 +0000'/>
+ <unit_system data='US'/>
+ </forecast_information>
+ <current_conditions>
+ <condition data='Clear'/>
+ <temp_f data='75'/>
+ <temp_c data='24'/>
+ <humidity data='Humidity: 19%'/>
+ <icon data='/ig/images/weather/sunny.gif'/>
+ <wind_condition data='Wind: NW at 7 mph'/>
+ </current_conditions>
+ <forecast_conditions>
+ <day_of_week data='Sat'/>
+ <low data='60'/>
+ <high data='89'/>
+ <icon data='/ig/images/weather/sunny.gif'/>
+ <condition data='Clear'/>
+ </forecast_conditions>
+ ....
+ </weather>
+ </xml_api_reply>
+
+Assume that the above XML has been read into `google`. The idea is to write a
+pattern looking like a template, and use it to extract some values of interest:
+
+ t = [[
+ <weather>
+ <current_conditions>
+ <condition data='$condition'/>
+ <temp_c data='$temp'/>
+ </current_conditions>
+ </weather>
+ ]]
+
+ local res, ret = google:match(t)
+ pretty.dump(res)
+
+And the output is:
+
+ {
+ condition = "Clear",
+ temp = "24"
+ }
+
+The `match` method can be passed a LOM document or some text, which will be
+parsed first.
+
+But what if we need to extract values from repeated elements? Match templates may
+contain 'array matches' which are enclosed in '{{..}}':
+
+ <weather>
+ {{<forecast_conditions>
+ <day_of_week data='$day'/>
+ <low data='$low'/>
+ <high data='$high'/>
+ <condition data='$condition'/>
+ </forecast_conditions>}}
+ </weather>
+
+And the match result is:
+
+ {
+ {
+ low = "60",
+ high = "89",
+ day = "Sat",
+ condition = "Clear",
+ },
+ {
+ low = "53",
+ high = "86",
+ day = "Sun",
+ condition = "Clear",
+ },
+ {
+ low = "57",
+ high = "87",
+ day = "Mon",
+ condition = "Clear",
+ },
+ {
+ low = "60",
+ high = "84",
+ day = "Tue",
+ condition = "Clear",
+ }
+ }
+
+With this array of tables, you can use `tablex` or `List`
+to reshape it into the desired form, if you choose. Just as with reading a Unix password
+file with `config`, you can make the array into a map of days to conditions using:
+
+    tablex.pairmap('|k,v| v,v.day',conditions)
+
+(Here we use the alternative string lambda option.)
+
+However, xml matches can shape the structure of the output. By replacing the `day_of_week`
+line of the template with `<day_of_week data='$_'/>` we get the same effect; `$_` is
+a special symbol that means that this captured value (or simply _capture_) becomes the key.
+
+Note that `$NUMBER` means a numerical index, so
+that `$1` is the first element of the resulting array, and so forth. You can mix
+numbered and named captures, but it's strongly advised to make the numbered captures
+form a proper array sequence (everything from `1` to `n` inclusive). `$0` has a
+special meaning; if it is the only capture (`{[0]='foo'}`) then the table is
+collapsed into 'foo'.
+
+ <weather>
+ {{<forecast_conditions>
+ <day_of_week data='$_'/>
+ <low data='$1'/>
+ <high data='$2'/>
+ <condition data='$3'/>
+ </forecast_conditions>}}
+ </weather>
+
+Now the result is:
+
+ {
+ Tue = {
+ "60",
+ "84",
+ "Clear"
+ },
+ Sun = {
+ "53",
+ "86",
+ "Clear"
+ },
+ Sat = {
+ "60",
+ "89",
+ "Clear"
+ },
+ Mon = {
+ "57",
+ "87",
+ "Clear"
+ }
+ }
+
+Applying matches to this config file poses another problem, because the actual
+tags matched are themselves meaningful.
+
+ <config>
+ <alpha>1.3</alpha>
+ <beta>10</beta>
+ <name>bozo</name>
+ </config>
+
+So there are tag 'wildcards' which are element names ending with a hyphen.
+
+ <config>
+ {{<key->$value</key->}}
+ </config>
+
+You will then get `{{alpha='1.3'},...}`. The most convenient format would be
+returned by this (note that `_-` behaves just like `$_`):
+
+ <config>
+ {{<_->$0</_->}}
+ </config>
+
+which would return `{alpha='1.3',beta='10',name='bozo'}`.
+
+We could play this game endlessly, and encode ways of converting captures, but
+the scheme is complex enough, and it's easy to do the conversion later
+
+ local numbers = {alpha=true,beta=true}
+ for k,v in pairs(res) do
+      if numbers[k] then res[k] = tonumber(v) end
+ end
+
+
+#### HTML Parsing
+
+HTML is an unusually degenerate form of XML, and Dennis Schridde has contributed
+a feature which makes parsing it easier. For instance, from the tests:
+
+ doc = xml.parsehtml [[
+ <BODY>
+ Hello dolly<br>
+ HTML is <b>slack</b><br>
+ </BODY>
+ ]]
+
+ asserteq(xml.tostring(doc),[[
+ <body>
+ Hello dolly<br/>
+ HTML is <b>slack</b><br/></body>]])
+
+That is, all tags are converted to lowercase, and empty HTML elements like `br`
+are properly closed; attributes do not need to be quoted.
+
+Also, DOCTYPE directives and comments are skipped. For truly badly formed HTML,
+this is not the tool for you!
+
+
+