From 42ec7286b2d36a9ba22925f816a17cb1cc2aa5ce Mon Sep 17 00:00:00 2001
From: chai
Date: Sat, 30 Oct 2021 11:32:16 +0800
Subject: + Penlight
---
 .../Libraries/Penlight/docs/manual/06-data.md.html | 1633 ++++++++++++++++++++
 1 file changed, 1633 insertions(+)
 create mode 100644 Data/Libraries/Penlight/docs/manual/06-data.md.html

diff --git a/Data/Libraries/Penlight/docs/manual/06-data.md.html b/Data/Libraries/Penlight/docs/manual/06-data.md.html
new file mode 100644
index 0000000..585e23e
## Data

### Reading Data Files

The first thing to consider is this: do you actually need to write a custom file
reader? And if the answer is yes, the next question is: can you write the reader
in as clear a way as possible? Correctness, robustness and speed; pick the first
two, and the third can be sorted out later, if necessary.

A common sort of data file is the configuration file format used on Unix
systems; in the Java world this format is often called a property file:

    # Read timeout in seconds
    read.timeout=10

    # Write timeout in seconds
    write.timeout=10

Here is a simple Lua implementation:

    -- property file parsing with Lua string patterns
    props = {}
    for line in io.lines() do
        if line:find('#',1,true) ~= 1 and not line:find('^%s*$') then
            local var,value = line:match('([^=]+)=(.*)')
            props[var] = value
        end
    end

Very compact, but it suffers from the same disease as equivalent Perl programs:
it uses odd string patterns which are 'lexically noisy'. Noisy code like this
slows the casual reader down. (For an even more direct way of doing this, see
the next section, 'Reading Configuration Files'.)

Another implementation, using the Penlight libraries:

    -- property file parsing with extended string functions
    require 'pl'
    stringx.import()
    props = {}
    for line in io.lines() do
        if not line:startswith('#') and not line:isspace() then
            local var,value = line:splitv('=')
            props[var] = value
        end
    end

This is more self-documenting; it is generally better to make the code express
the intention, rather than having to scatter comments everywhere. Comments are
necessary, of course, but mostly to give the higher view of your intention that
cannot be expressed in code. This version is slightly slower, true, but in
practice the speed of such a script is determined by I/O, so further
optimization is unnecessary.

### Reading Unstructured Text Data

Text data is sometimes unstructured, for example a file containing words. The
pl.input module has a number of functions which make processing such files
easier. For example, a script to count the number of words in standard input
using input.words:

    -- countwords.lua
    require 'pl'
    local k = 0
    for w in input.words(io.stdin) do
        k = k + 1
    end
    print('count',k)

Or this script to calculate the average of a set of numbers using input.numbers:

    -- average.lua
    require 'pl'
    local k = 0
    local sum = 0
    for n in input.numbers(io.stdin) do
        sum = sum + n
        k = k + 1
    end
    print('average',sum/k)

These scripts can be improved further by eliminating the explicit loops. In the
last case, there is a perfectly good function seq.sum which can take a sequence
of numbers and calculate the sum and count for us:

    -- average2.lua
    require 'pl'
    local total,n = seq.sum(input.numbers())
    print('average',total/n)

A further simplification here is that if numbers or words are not passed an
argument, they will grab their input from standard input. The first script can
be rewritten:

    -- countwords2.lua
    require 'pl'
    print('count',seq.count(input.words()))

A useful feature of a sequence generator like numbers is that it can read from
a string source. Here is a script to calculate the sums of the numbers on each
line in a file:

    -- sums.lua
    require 'pl'
    for line in io.lines() do
        print(seq.sum(input.numbers(line)))
    end

### Reading Columnar Data

It is very common to find data in columnar form, either space or comma-separated,
perhaps with an initial set of column headers. Here is a typical example:

    EventID    Magnitude    LocationX    LocationY    LocationZ
    981124001    2.0    18988.4    10047.1    4149.7
    981125001    0.8    19104.0    9970.4    5088.7
    981127003    0.5    19012.5    9946.9    3831.2
    ...

input.fields is designed to extract several columns, given some delimiter
(defaulting to whitespace). Here is a script to calculate the average X location
of all the events:

    -- avg-x.lua
    require 'pl'
    io.read() -- skip the header line
    local sum,count = seq.sum(input.fields {3})
    print(sum/count)

input.fields is passed either a field count, or a list of column indices,
starting at one as usual. So in this case we're only interested in column 3. If
you pass it a field count, then you get every field up to that count:

    for id,mag,locX,locY,locZ in input.fields (5) do
        ...
    end

input.fields by default tries to convert each field to a number. It will skip
lines which clearly don't match the pattern, but will abort the script if there
are any fields which cannot be converted to numbers.

The second parameter is a delimiter, by default spaces. ' ' is understood to
mean 'any number of spaces', i.e. '%s+'. Any Lua string pattern can be used.

The third parameter is a data source, by default standard input (defined by
input.create_getter). It assumes that the data source has a read method which
brings in the next line, i.e. it is a 'file-like' object. As a special case, a
string will be split into its lines:

    > for x,y in input.fields(2,' ','10 20\n30 40\n') do print(x,y) end
    10      20
    30      40

Note the default behaviour for bad fields, which is to show the offending line
number:

    > for x,y in input.fields(2,' ','10 20\n30 40x\n') do print(x,y) end
    10      20
    line 2: cannot convert '40x' to number

This behaviour of input.fields is appropriate for a script which you want to
fail immediately with an informative user error message if conversion fails.
The fourth optional parameter is an options table: {no_fail=true} means that
conversion is attempted but if it fails it just returns the string, much as AWK
would operate. You are then responsible for checking the type of the returned
field. {no_convert=true} switches off conversion altogether and all fields are
returned as strings.

Sometimes it is useful to bring a whole dataset into memory, for operations such
as extracting columns. Penlight provides a flexible reader specifically for
reading this kind of data, using the data module. Given a file looking like this:

    x,y
    10,20
    2,5
    40,50

Then data.read will create a table like this, with each row represented by a
sublist:

    > t = data.read 'test.txt'
    > pretty.dump(t)
    {{10,20},{2,5},{40,50},fieldnames={'x','y'},delim=','}

You can now analyze this returned table using the supplied methods. For
instance, the method column_by_name returns a table of all the values of that
column.

    -- testdata.lua
    require 'pl'
    d = data.read('fev.txt')
    for _,name in ipairs(d.fieldnames) do
        local col = d:column_by_name(name)
        if type(col[1]) == 'number' then
            local total,n = seq.sum(col)
            utils.printf("Average for %s is %f\n",name,total/n)
        end
    end

data.read tries to be clever when given data; by default it expects a first
line of column names, unless any of them are numbers. It tries to deduce the
column delimiter by looking at the first line. Sometimes it guesses wrong, so
these things can be specified explicitly. The second optional parameter is an
options table, which can override delim (a string pattern), fieldnames (a list
or comma-separated string), specify no_convert (default is to convert),
numfields (indices of columns known to be numbers, as a list) and thousands_dot
(when the thousands separator in Excel CSV is '.').

A very powerful feature is a way to execute SQL-like queries on such data:

    -- queries on tabular data
    require 'pl'
    local d = data.read('xyz.txt')
    local q = d:select('x,y,z where x > 3 and z < 2 sort by y')
    for x,y,z in q do
        print(x,y,z)
    end

Please note that the format of queries is restricted to the following syntax:

    FIELDLIST [ 'where' CONDITION ] [ 'sort by' FIELD [asc|desc] ]

Any valid Lua code can appear in CONDITION; remember it is not SQL and you
have to use == (this warning comes from experience).

For this to work, field names must be Lua identifiers. So read will massage
fieldnames so that all non-alphanumeric chars are replaced with underscores.
However, the original_fieldnames field always contains the original un-massaged
fieldnames.

read can handle standard CSV files fine, although it doesn't try to be a
full-blown CSV parser. With the csv=true option, it's possible to have
double-quoted fields, which may contain commas; then trailing commas become
significant as well.

Spreadsheet programs are not always the best tool to process such data, strange
as this might seem to some people. This is a toy CSV file; to appreciate the
problem, imagine thousands of rows and dozens of columns like this:

    Department Name,Employee ID,Project,Hours Booked
    sales,1231,overhead,4
    sales,1255,overhead,3
    engineering,1501,development,5
    engineering,1501,maintenance,3
    engineering,1433,maintenance,10

The task is to reduce the dataset to a relevant set of rows and columns, perhaps
do some processing on row data, and write the result out to a new CSV file. The
write_row method uses the delimiter to write the row to a file;
Data.select_row is like Data.select, except it iterates over rows, not
fields; this is necessary if we are dealing with a lot of columns!

    -- assuming 't' is the Data object and 'outf' an open output file
    names = {[1501]='don',[1433]='dilbert'}
    keepcols = {'Employee_ID','Hours_Booked'}
    t:write_row (outf,{'Employee','Hours_Booked'})
    q = t:select_row {
        fields=keepcols,
        -- the condition sees the full original row; column 1 is the department
        where=function(row) return row[1]=='engineering' end
    }
    for row in q do
        row[1] = names[row[1]]   -- selected rows contain only keepcols fields
        t:write_row(outf,row)
    end

Data.select_row and Data.select can be passed a table specifying the query; a
list of field names, a function defining the condition and an optional parameter
sort_by. It isn't really necessary here, but if we had a more complicated row
condition (such as belonging to a specified set) then it is not generally
possible to express such a condition as a query string, without resorting to
hackery such as global variables.

Since 1.0.3, you can specify explicit conversion functions for selected columns.
For instance, this is a log file with a Unix date stamp:

    Time Message
    1266840760 +# EE7C0600006F0D00C00F06010302054000000308010A00002B00407B00
    1266840760 closure data 0.000000 1972 1972 0
    1266840760 ++ 1266840760 EE 1
    1266840760 +# EE7C0600006F0D00C00F06010302054000000408020A00002B00407B00
    1266840764 closure data 0.000000 1972 1972 0

We would like the first column as an actual date object, so the convert
field sets an explicit conversion for column 1. (Note that we have to explicitly
convert the string to a number first.)

    Date = require 'pl.Date'

    function date_convert (ds)
        return Date(tonumber(ds))
    end

    d = data.read(f,{convert={[1]=date_convert},last_field_collect=true})

This gives us a two-column dataset, where the first column contains Date objects
and the second column contains the rest of the line. Queries can then easily
pick out events on a day of the week:

    q = d:select "Time,Message where Time:weekday_name()=='Sun'"

Data does not have to come from files, nor does it necessarily come from the lab
or the accounts department. On Linux, ps aux gives you a full listing of all
processes running on your machine. It is straightforward to feed the output of
this command into data.read and perform useful queries on it. Notice that
non-identifier characters like '%' get converted into underscores:

    require 'pl'
    f = io.popen 'ps aux'
    s = data.read (f,{last_field_collect=true})
    f:close()
    print(s.fieldnames)
    print(s:column_by_name 'USER')
    qs = 'COMMAND,_MEM where _MEM > 5 and USER=="steve"'
    for name,mem in s:select(qs) do
        print(mem,name)
    end

I've always been an admirer of the AWK programming language; with filter you
can get Lua programs which are just as compact:

    -- printxy.lua
    require 'pl'
    data.filter 'x,y where x > 3'

It is common enough to have data files without a header of field names.
data.read makes a special exception for such files if all fields are numeric.
Since there are no column names to use in query expressions, you can use AWK-like
column indexes, e.g. '$1,$2 where $1 > 3'. I have a little executable script on
my system called lf which looks like this:

    #!/usr/bin/env lua
    require 'pl.data'.filter(arg[1])

And it can be used generally as a filter command to extract columns from data.
(The column specifications may be expressions or even constants.)

    $ lf '$1,$5/10' < test.dat

(As with AWK, please note the single-quotes used in this command; this prevents
the shell from trying to expand the column indexes. If you are on Windows, then
you must quote the expression in double-quotes so it is passed as one argument
to your batch file.)

As a tutorial resource, have a look at test-data.lua in the PL tests directory
for other examples of use, plus comments.

The data returned by read or constructed by Data.copy_select from a query is
basically just an array of rows: {{1,2},{3,4}}. So you may use read to pull
in any array-like dataset, and process it with any function that expects such an
implementation. In particular, the functions in array2d will work fine with
this data. In fact, these functions are available as methods; e.g.
array2d.flatten can be called directly like so to give us a one-dimensional list:

    v = data.read('dat.txt'):flatten()

The data is also in exactly the right shape to be treated as matrices by
LuaMatrix:

    > matrix = require 'matrix'
    > m = matrix(data.read 'mat.txt')
    > = m
    1       0.2     0.3
    0.2     1       0.1
    0.1     0.2     1
    > = m^2  -- same as m*m
    1.07    0.46    0.62
    0.41    1.06    0.26
    0.24    0.42    1.05

write will write matrices back to files for you.

Finally, for the curious, the global variable _DEBUG can be used to print out
the actual iterator function which a query generates and dynamically compiles.
By using code generation, we can get pretty much optimal performance out of
arbitrary queries.

    > lua -lpl -e "_DEBUG=true" -e "data.filter 'x,y where x > 4 sort by x'" < test.txt
    return function (t)
            local i = 0
            local v
            local ls = {}
            for i,v in ipairs(t) do
                if v[1] > 4  then
                        ls[#ls+1] = v
                end
            end
            table.sort(ls,function(v1,v2)
                return v1[1] < v2[1]
            end)
            local n = #ls
            return function()
                i = i + 1
                v = ls[i]
                if i > n then return end
                return v[1],v[2]
            end
    end

    10,20
    40,50

### Reading Configuration Files

The config module provides a simple way to convert several kinds of
configuration files into a Lua table. Consider this simple example:

    # test.config
    # Read timeout in seconds
    read.timeout=10

    # Write timeout in seconds
    write.timeout=5

    # acceptable ports
    ports = 1002,1003,1004

This can be easily brought in using config.read and the result shown using
pretty.write:

    -- readconfig.lua
    local config = require 'pl.config'
    local pretty = require 'pl.pretty'

    local t = config.read(arg[1])
    print(pretty.write(t))

and the output of lua readconfig.lua test.config is:

    {
      ports = {
        1002,
        1003,
        1004
      },
      write_timeout = 5,
      read_timeout = 10
    }

That is, config.read will bring in all key/value pairs, ignore # comments, and
ensure that the key names are proper Lua identifiers by replacing non-identifier
characters with '_'. If the values are numbers, then they will be converted (so
the value of t.write_timeout is the number 5). In addition, any values which
are separated by commas will be converted likewise into an array.

Any line can be continued with a backslash. So this will all be considered one
line:

    names=one,two,three, \
    four,five,six,seven, \
    eight,nine,ten

Windows-style INI files are also supported. The section structure of INI files
translates naturally to nested tables in Lua:

    ; test.ini
    [timeouts]
    read=10 ; Read timeout in seconds
    write=5 ; Write timeout in seconds
    [portinfo]
    ports = 1002,1003,1004

The output is:

    {
      portinfo = {
        ports = {
          1002,
          1003,
          1004
        }
      },
      timeouts = {
        write = 5,
        read = 10
      }
    }

You can now refer to the write timeout as t.timeouts.write.

As a final example of the flexibility of config.read, if passed this simple
comma-delimited file

    one,two,three
    10,20,30
    40,50,60
    1,2,3

it will produce the following table:

    {
      { "one", "two", "three" },
      { 10, 20, 30 },
      { 40, 50, 60 },
      { 1, 2, 3 }
    }

config.read isn't designed to read all CSV files in general; it is intended to
support some Unix configuration files which are not structured as key-value
pairs, such as '/etc/passwd'.

This function is intended to be a Swiss Army Knife of configuration readers, but
it does have to make assumptions, and you may not like them. So there is an
optional extra parameter which allows some control; it is a table that may have
the following fields:

    {
       variablilize = true,
       convert_numbers = tonumber,
       trim_space = true,
       list_delim = ',',
       trim_quotes = true,
       ignore_assign = false,
       keysep = '=',
       smart = false,
    }

variablilize is the option that converted write.timeout in the first example
to the valid Lua identifier write_timeout. If convert_numbers is true, then
an attempt is made to convert any string that starts like a number. You can
specify your own function (say one that will convert a string like '5224 kb'
into a number).

trim_space ensures that there is no starting or trailing whitespace with
values, and list_delim is the character that will be used to decide whether to
split a value up into a list (it may be a Lua string pattern such as '%s+').

For instance, the password file in Unix is colon-delimited:

    t = config.read('/etc/passwd',{list_delim=':'})

This produces the following output on my system (only the last two lines are
shown):

    {
      ...
      {
        "user",
        "x",
        "1000",
        "1000",
        "user,,,",
        "/home/user",
        "/bin/bash"
      },
      {
        "sdonovan",
        "x",
        "1001",
        "1001",
        "steve donovan,28,,",
        "/home/sdonovan",
        "/bin/bash"
      }
    }

You can get this into a more sensible format, where the usernames are the keys,
with this (the tablex.pairmap function must return value, key!):

    t = tablex.pairmap(function(k,v) return v,v[1] end,t)

and you get:

    { ...
      sdonovan = {
        "sdonovan",
        "x",
        "1001",
        "1001",
        "steve donovan,28,,",
        "/home/sdonovan",
        "/bin/bash"
      }
    ...
    }

Many common Unix configuration files can be read by tweaking these parameters.
For /etc/fstab, the options {list_delim='%s+',ignore_assign=true} will
correctly separate the columns. It's common to find 'KEY VALUE' assignments in
files such as /etc/ssh/ssh_config; the options {keysep=' '} make
config.read return a table where each KEY has a value VALUE.

Files in the Linux procfs usually use ':' as the field delimiter:

    > t = config.read('/proc/meminfo',{keysep=':'})
    > = t.MemFree
    220140 kB

That result is a string, since tonumber doesn't like it, but defining the
convert_numbers option as function(s) return tonumber((s:gsub(' kB$',''))) end
will get the memory figures as actual numbers in the result. (The extra
parentheses are necessary so that tonumber only gets the first result from
gsub.) From tests/test-config.lua:

    testconfig([[
    MemTotal:        1024748 kB
    MemFree:          220292 kB
    ]],
    { MemTotal = 1024748, MemFree = 220292 },
    {
      keysep = ':',
      convert_numbers = function(s)
        s = s:gsub(' kB$','')
        return tonumber(s)
      end
    }
    )

The smart option lets config.read make a reasonable guess for you; there
are examples in tests/test-config.lua, but basically these common file
formats (and those following the same pattern) can be processed directly in
smart mode: '/etc/fstab', '/proc/XXXX/status', 'ssh_config' and 'updatedb.conf'.

Please note that config.read can be passed a file-like object; if it's not a
string and supports the read method, then that will be used. For instance, to
read a configuration from a string, use stringio.open.

### Lexical Scanning

Although Lua's string pattern matching is very powerful, there are times when
something more powerful is needed. pl.lexer.scan provides lexical scanners
which tokenize a string, classifying tokens into numbers, strings, etc.

    > lua -lpl
    Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
    > tok = lexer.scan 'alpha = sin(1.5)'
    > = tok()
    iden    alpha
    > = tok()
    =       =
    > = tok()
    iden    sin
    > = tok()
    (       (
    > = tok()
    number  1.5
    > = tok()
    )       )
    > = tok()
    (nil)

The scanner is a function, which is repeatedly called and returns the type and
value of the token. Recognized basic types are 'iden', 'string', 'number' and
'space'; everything else is represented by itself. Note that by default the
scanner will skip any 'space' tokens.

'comment' and 'keyword' aren't applicable to the plain scanner, which is not
language-specific, but a scanner which understands Lua is available. It
recognizes the Lua keywords, and understands both short and long comments and
strings.

    > for t,v in lexer.lua 'for i=1,n do' do print(t,v) end
    keyword for
    iden    i
    =       =
    number  1
    ,       ,
    iden    n
    keyword do

A lexical scanner is useful where you have highly-structured data which is not
nicely delimited by newlines. For example, here is a snippet of an in-house file
format which it was my task to maintain:

    points
        (818344.1,-20389.7,-0.1),(818337.9,-20389.3,-0.1),(818332.5,-20387.8,-0.1)
        ,(818327.4,-20388,-0.1),(818322,-20387.7,-0.1),(818316.3,-20388.6,-0.1)
        ,(818309.7,-20389.4,-0.1),(818303.5,-20390.6,-0.1),(818295.8,-20388.3,-0.1)
        ,(818290.5,-20386.9,-0.1),(818285.2,-20386.1,-0.1),(818279.3,-20383.6,-0.1)
        ,(818274,-20381.2,-0.1),(818274,-20380.7,-0.1);

Here is code to extract the points using pl.lexer:

    -- assume 's' contains the text above...
    local lexer = require 'pl.lexer'
    local expecting = lexer.expecting
    local append = table.insert

    local tok = lexer.scan(s)

    local points = {}
    local t,v = tok() -- should be 'iden','points'

    while t ~= ';' do
        local c = {}
        expecting(tok,'(')
        c.x = expecting(tok,'number')
        expecting(tok,',')
        c.y = expecting(tok,'number')
        expecting(tok,',')
        c.z = expecting(tok,'number')
        expecting(tok,')')
        t,v = tok()  -- either ',' or ';'
        append(points,c)
    end

The expecting function grabs the next token and if the type doesn't match, it
throws an error. (pl.lexer, unlike other PL libraries, raises errors if
something goes wrong, so you should wrap your code in pcall to catch the error
gracefully.)

The scanners all have a second optional argument, which is a table which controls
whether you want to exclude spaces and/or comments. The default for lexer.lua
is {space=true,comments=true}. There is a third optional argument which
determines how string and number tokens are to be processed.

The ultimate highly-structured data is, of course, program source. Here is a
snippet from 'text-lexer.lua':

    require 'pl'

    lines = [[
    for k,v in pairs(t) do
        if type(k) == 'number' then
            print(v) -- array-like case
        else
            print(k,v)
        end
    end
    ]]

    ls = List()
    for tp,val in lexer.lua(lines,{space=true,comments=true}) do
        assert(tp ~= 'space' and tp ~= 'comment')
        if tp == 'keyword' then ls:append(val) end
    end
    test.asserteq(ls,List{'for','in','do','if','then','else','end','end'})

Here is a useful little utility that identifies all common global variables
found in a Lua module (ignoring those declared locally for the moment):

    -- testglobal.lua
    require 'pl'

    local txt,err = utils.readfile(arg[1])
    if not txt then return print(err) end

    local globals = List()
    for t,v in lexer.lua(txt) do
        if t == 'iden' and _G[v] then
            globals:append(v)
        end
    end
    pretty.dump(seq.count_map(globals))

Rather than dumping the whole list, with its duplicates, we pass it through
seq.count_map, which turns the list into a table where the keys are the values,
and the associated values are the number of times those values occur in the
sequence. Typical output looks like this:

    {
      type = 2,
      pairs = 2,
      table = 2,
      print = 3,
      tostring = 2,
      require = 1,
      ipairs = 4
    }

You could further pass this through tablex.keys to get a unique list of
symbols. This can be useful when writing 'strict' Lua modules, where all global
symbols must be defined as locals at the top of the file.

For a more detailed use of lexer.scan, please look at testxml.lua in the
examples directory.

### XML

New in the 0.9.7 release is some support for XML. This is a large topic, and
Penlight does not provide a full XML stack, which is properly the task of a more
specialized library.

#### Parsing and Pretty-Printing

The semi-standard XML parser in the Lua universe is lua-expat. In particular,
it has a function called lxp.lom.parse which will parse XML into the Lua Object
Model (LOM) format. However, it does not provide a way to convert this data back
into XML text. xml.parse will use this function if lua-expat is available, and
otherwise falls back to a pure Lua parser originally written by Roberto
Ierusalimschy.

The resulting document object knows how to render itself as a string, which is
useful for debugging:

    > d = xml.parse "<nodes><node id='1'>alice</node></nodes>"
    > = d
    <nodes><node id='1'>alice</node></nodes>
    > pretty.dump (d)
    {
      {
        "alice",
        attr = {
          "id",
          id = "1"
        },
        tag = "node"
      },
      attr = {
      },
      tag = "nodes"
    }

Looking at the actual shape of the data reveals the structure of LOM:

  * every element has a tag field with its name
  * plus an attr field, which is a table containing the attributes as fields,
    and also as an array; it is always present
  * the children of the element are the array part of the element, so d[1] is
    the first child of d, etc.

It could be argued that having attributes also as the array part of attr is not
essential (you cannot depend on attribute order in XML) but that's how
it goes with this standard.

lua-expat is another soft dependency of Penlight; generally, the fallback
parser is good enough for straightforward XML as is commonly found in
configuration files, etc. doc.basic_parse is not intended to be a proper
conforming parser (it's only sixty lines) but it handles simple kinds of
documents that do not have comments or DTD directives. It is intelligent enough
to ignore the <?xml directive and that is about it.

You can get pretty-printing by explicitly calling xml.tostring and passing it
the initial indent and the per-element indent:

    > = xml.tostring(d,'','  ')

    <nodes>
      <node id='1'>alice</node>
    </nodes>

There is a fourth argument which is the attribute indent:

    > a = xml.parse "<frodo name='baggins' age='50' type='hobbit'/>"
    > = xml.tostring(a,'','  ','  ')

    <frodo
      type='hobbit'
      name='baggins'
      age='50'
    />

#### Parsing and Working with Configuration Files

It's common to find configurations expressed with XML these days. It's
straightforward to 'walk' the LOM data and extract the data in the form you want:

    require 'pl'

    local config = [[
    <config>
        <alpha>1.3</alpha>
        <beta>10</beta>
        <name>bozo</name>
    </config>
    ]]
    local d,err = xml.parse(config)

    local t = {}
    for item in d:childtags() do
        t[item.tag] = item[1]
    end

    pretty.dump(t)
    --->
    {
      beta = "10",
      alpha = "1.3",
      name = "bozo"
    }

The only gotcha is that here we must use the Doc:childtags method, which will
skip over any text elements.

A more involved example is this excerpt from serviceproviders.xml, which is
usually found at /usr/share/mobile-broadband-provider-info/serviceproviders.xml
on Debian/Ubuntu Linux systems.

    d = xml.parse [[
    <serviceproviders format="2.0">
    ...
    <country code="za">
        <provider>
            <name>Cell-c</name>
            <gsm>
                <network-id mcc="655" mnc="07"/>
                <apn value="internet">
                    <username>Cellcis</username>
                    <dns>196.7.0.138</dns>
                    <dns>196.7.142.132</dns>
                </apn>
            </gsm>
        </provider>
        <provider>
            <name>MTN</name>
            <gsm>
                <network-id mcc="655" mnc="10"/>
                <apn value="internet">
                    <dns>196.11.240.241</dns>
                    <dns>209.212.97.1</dns>
                </apn>
            </gsm>
        </provider>
        <provider>
            <name>Vodacom</name>
            <gsm>
                <network-id mcc="655" mnc="01"/>
                <apn value="internet">
                    <dns>196.207.40.165</dns>
                    <dns>196.43.46.190</dns>
                </apn>
                <apn value="unrestricted">
                    <name>Unrestricted</name>
                    <dns>196.207.32.69</dns>
                    <dns>196.43.45.190</dns>
                </apn>
            </gsm>
        </provider>
        <provider>
            <name>Virgin Mobile</name>
            <gsm>
                <apn value="vdata">
                    <dns>196.7.0.138</dns>
                    <dns>196.7.142.132</dns>
                </apn>
            </gsm>
        </provider>
    </country>
    ....
    </serviceproviders>
    ]]

Getting the names of the providers per-country is straightforward:

    local t = {}
    for country in d:childtags() do
        local providers = {}
        t[country.attr.code] = providers
        for provider in country:childtags() do
            table.insert(providers,provider:child_with_name('name'):get_text())
        end
    end

    pretty.dump(t)
    -->
    {
      za = {
        "Cell-c",
        "MTN",
        "Vodacom",
        "Virgin Mobile"
      }
      ....
    }

#### Generating XML with 'xmlification'

This feature is inspired by the htmlify function used by
Orbit to simplify HTML generation, except that no function environment magic is
used; the tags function returns a set of constructors for elements of the given
tag names.

    > nodes, node = xml.tags 'nodes, node'
    > = node 'alice'
    <node>alice</node>
    > = nodes { node {id='1','alice'}}
    <nodes><node id='1'>alice</node></nodes>

The flexibility of Lua tables is very useful here, since both the attributes and
the children of an element can be encoded naturally. The argument to these tag
constructors is either a single value (like a string) or a table where the
attributes are the named keys and the children are the array values.

#### Generating XML using Templates

A template is a little XML document which contains dollar-variables. The subst
method on a document is fed an array of tables containing values for these
variables. Note how the parent tag name is specified:

    > templ = xml.parse "<node id='$id'>$name</node>"
    > = templ:subst {tag='nodes', {id=1,name='alice'},{id=2,name='john'}}
    <nodes><node id='1'>alice</node><node id='2'>john</node></nodes>

Substitution is closely related to filtering documents. One of the annoying
things about XML is that it is a document markup language first, and a data
language second. Standard parsers will assume you really care about all those
extra text elements. Consider this fragment, which has been changed by a
five-year-old:

    T = [[
      <weather>
        boops!
        <current_conditions>
          <condition data='$condition'/>
          <temp_c data='$temp'/>
          <bo>whoops!</bo>
        </current_conditions>
      </weather>
    ]]

Conformant parsers will give you text elements containing the line feed after
<current_conditions>, even though this makes handling the data more irritating.

    local function parse (str)
        return xml.parse(str,false,true)
    end

The second argument means 'string, not file', and the third argument means use
the built-in Lua parser (instead of LuaExpat, if available), which by default is
not interested in keeping such strings.

How do we remove the string 'boops!'? clone (also called filter when called as
a method) copies a LOM document. It can be passed a filter function, which is
applied to each string found. The powerful thing about this is that this
function receives structural information - the parent node, and whether this was
a tag name, a text element or an attribute name:

    d = parse (T)
    c = d:filter(function(s,kind,parent)
        print(stringx.strip(s),kind,parent and parent.tag or '?')
        if kind == '*TEXT' and #parent > 1 then return nil end
        return s
    end)
    --->
    weather    *TAG    ?
    boops!    *TEXT    weather
    current_conditions    *TAG    weather
    condition    *TAG    current_conditions
    $condition    data    condition
    temp_c    *TAG    current_conditions
    $temp    data    temp_c
    bo    *TAG    current_conditions
    whoops!    *TEXT    bo

We can pull out 'boops' and not 'whoops' by discarding text elements which are
not the single child of an element.

#### Extracting Data using Templates

Matching goes in the opposite direction. We have a document, and would like to
extract values from it using a pattern.

A common use of this is parsing the XML result of API queries. The
(undocumented and subsequently discontinued) Google Weather API is a
good example. Grabbing the result of
http://www.google.com/ig/api?weather=Johannesburg,ZA we get something like
this, after pretty-printing:

    <xml_api_reply version='1'>
      <weather module_id='0' tab_id='0' mobile_zipped='1' section='0' row='0' mobile_row='0'>
        <forecast_information>
          <city data='Johannesburg, Gauteng'/>
          <postal_code data='Johannesburg,ZA'/>
          <latitude_e6 data=''/>
          <longitude_e6 data=''/>
          <forecast_date data='2010-10-02'/>
          <current_date_time data='2010-10-02 18:30:00 +0000'/>
          <unit_system data='US'/>
        </forecast_information>
        <current_conditions>
          <condition data='Clear'/>
          <temp_f data='75'/>
          <temp_c data='24'/>
          <humidity data='Humidity: 19%'/>
          <icon data='/ig/images/weather/sunny.gif'/>
          <wind_condition data='Wind: NW at 7 mph'/>
        </current_conditions>
        <forecast_conditions>
          <day_of_week data='Sat'/>
          <low data='60'/>
          <high data='89'/>
          <icon data='/ig/images/weather/sunny.gif'/>
          <condition data='Clear'/>
        </forecast_conditions>
        ....
      </weather>
    </xml_api_reply>

Assume that the above XML has been read into google. The idea is to write a
pattern looking like a template, and use it to extract some values of interest:

    t = [[
      <weather>
        <current_conditions>
          <condition data='$condition'/>
          <temp_c data='$temp'/>
        </current_conditions>
      </weather>
    ]]

    local res, ret = google:match(t)
    pretty.dump(res)

And the output is:

    {
      condition = "Clear",
      temp = "24"
    }

The match method can be passed a LOM document or some text, which will be
parsed first.

But what if we need to extract values from repeated elements? Match templates
may contain 'array matches' which are enclosed in '{{..}}':

    <weather>
      {{<forecast_conditions>
        <day_of_week data='$day'/>
        <low data='$low'/>
        <high data='$high'/>
        <condition data='$condition'/>
      </forecast_conditions>}}
    </weather>

And the match result is:

    {
      {
        low = "60",
        high = "89",
        day = "Sat",
        condition = "Clear",
      },
      {
        low = "53",
        high = "86",
        day = "Sun",
        condition = "Clear",
      },
      {
        low = "57",
        high = "87",
        day = "Mon",
        condition = "Clear",
      },
      {
        low = "60",
        high = "84",
        day = "Tue",
        condition = "Clear",
      }
    }

With this array of tables, you can use tablex or List to reshape it into the
desired form, if you choose. Just as with reading a Unix password file with
config, you can make the array into a map of days to conditions using:

    tablex.pairmap('|k,v| v,v.day',conditions)

(Here using the alternative string lambda option.)

However, xml matches can shape the structure of the output. By replacing the
day_of_week line of the template with <day_of_week data='$_'/> we get the same
effect; $_ is a special symbol that means that this captured value (or simply
capture) becomes the key.

Note that $NUMBER means a numerical index, so that $1 is the first element of
the resulting array, and so forth. You can mix numbered and named captures, but
it's strongly advised to make the numbered captures form a proper array sequence
(everything from 1 to n inclusive). $0 has a special meaning; if it is the only
capture ({[0]='foo'}) then the table is collapsed into 'foo'.

    <weather>
      {{<forecast_conditions>
        <day_of_week data='$_'/>
        <low data='$1'/>
        <high data='$2'/>
        <condition data='$3'/>
      </forecast_conditions>}}
    </weather>

Now the result is:

    {
      Tue = {
        "60",
        "84",
        "Clear"
      },
      Sun = {
        "53",
        "86",
        "Clear"
      },
      Sat = {
        "60",
        "89",
        "Clear"
      },
      Mon = {
        "57",
        "87",
        "Clear"
      }
    }

Applying matches to this config file poses another problem, because the actual
tags matched are themselves meaningful.

    <config>
        <alpha>1.3</alpha>
        <beta>10</beta>
        <name>bozo</name>
    </config>

So there are tag 'wildcards', which are element names ending with a hyphen.

    <config>
        {{<key->$value</key->}}
    </config>

You will then get {{alpha='1.3'},...}. The most convenient format would be
returned by this (note that _- behaves just like $_):

    <config>
        {{<_->$0</_->}}
    </config>

which would return {alpha='1.3',beta='10',name='bozo'}.

We could play this game endlessly, and encode ways of converting captures, but
the scheme is complex enough, and it's easy to do the conversion later:

    local numbers = {alpha=true,beta=true}
    for k,v in pairs(res) do
        if numbers[k] then res[k] = tonumber(v) end
    end

#### HTML Parsing

HTML is an unusually degenerate form of XML, and Dennis Schridde has contributed
a feature which makes parsing it easier. For instance, from the tests:

    doc = xml.parsehtml [[
    <BODY>
    Hello dolly<br>
    HTML is <b>slack</b><br>
    </BODY>
    ]]

    asserteq(xml.tostring(doc),[[
    <body>
    Hello dolly<br/>
    HTML is <b>slack</b><br/></body>]])

That is, all tags are converted to lowercase, and empty HTML elements like br
are properly closed; attributes do not need to be quoted.

Also, DOCTYPE directives and comments are skipped. For truly badly formed HTML,
this is not the tool for you!