XTRAN Example — Normalize Vocabulary in DSV Data
Scenario — you are receiving delimiter-separated value (DSV) data from various sources, and each source has its own vocabulary. You need to normalize terms across those multiple data sources.
XTRAN to the rescue!
The following example uses an XTRAN rules file comprising 124 non-comment lines of "meta-code" (XTRAN's rules language) to process delimiter-separated value data and normalize vocabulary in it, substituting preferred terms for synonyms.
The rules took one hour to write and ½ hour to debug. (That's right, only 1½ hours total!)
You can specify a preferred term for each DSV field, along with a set of synonyms for it. The rules will then, for each specified DSV field, change all synonyms to the preferred term and write out the results.
You specify preferred terms and their synonyms, for each DSV field that has them, via a "synonyms" file. Empty lines and lines starting with the file's comment marker are ignored. Each specification line provides:
- the DSV field number, 1-n, to be processed
- the preferred term for that field
- one or more synonyms for the preferred term
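As a rough illustration of how such specifications could be turned into a lookup table, here is a short Python sketch (not XTRAN meta-code). The line layout it assumes (field number, preferred term, then synonyms, all comma-separated), the "#" comment prefix, and the function name `parse_synonym_specs` are assumptions made for illustration only. The sample terms are chosen to match the data in this example.

```python
def parse_synonym_specs(lines, delimiter=",", comment_prefix="#"):
    """Build {(field_number, synonym): preferred_term} from specification lines.

    Assumed (hypothetical) layout per line:
        field number, preferred term, one or more synonyms
    """
    table = {}
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Skip empty lines and comment lines (the "#" prefix is an assumption).
        if not line or line.startswith(comment_prefix):
            continue
        parts = [part.strip() for part in line.split(delimiter)]
        if len(parts) < 3:
            raise ValueError(
                f"line {line_number}: need field number, preferred term, and a synonym"
            )
        field_number = int(parts[0])
        if field_number < 1:
            raise ValueError(f"line {line_number}: field numbers start at 1")
        preferred = parts[1]
        # Key on (field, synonym) so the same word can be a synonym in one
        # field and an ordinary value in another.
        for synonym in parts[2:]:
            table[(field_number, synonym)] = preferred
    return table


# Hypothetical specification lines, in the assumed layout.
specs = [
    "# field, preferred term, synonyms",
    "",
    "3,country,location,venue",
    "5,age,years",
]
table = parse_synonym_specs(specs)
print(table[(3, "venue")])  # prints: country
```

Keying the table on the field number together with the synonym keeps each lookup a single direct access, and it prevents a term in one field from being rewritten by a rule meant for another.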
Note that the rules automatically accommodate DSV data with missing fields.
Here is an English paraphrase of the XTRAN rules:
Open synonym specifications file read-only
For each input line
    Parse specification, store in data base
Close synonym specifications file
Open input DSV data file read-only
Create output file
For each DSV data input line
    Initialize output line to empty
    For each of input line's DSV fields
        If field value is synonym for preferred term (direct DB lookup)
            Add preferred term to output line
        Else
            Add field value to output line
    Write output line
Close input and output files
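The second half of that paraphrase (reading the DSV data, substituting terms field by field, and writing the result) can be sketched in Python as follows. This is only an illustration of the logic, not the XTRAN rules themselves. The function name `normalize_dsv`, the file names, and the in-memory dictionary standing in for XTRAN's content-addressable data base are all hypothetical, and the sketch splits on the delimiter naively, so it does not handle quoted fields that contain the delimiter.

```python
import os
import tempfile


def normalize_dsv(input_path, output_path, table, delimiter=","):
    """Copy a DSV file, replacing synonyms with preferred terms.

    `table` maps (field_number, synonym) to a preferred term; field numbers
    are 1-based. Fields without an entry are copied unchanged.
    """
    with open(input_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            output_fields = []  # "Initialize output line to empty"
            fields = line.rstrip("\n").split(delimiter)
            # Only the fields actually present are visited, so short
            # records with missing trailing fields need no special case.
            for field_number, value in enumerate(fields, start=1):
                # Direct lookup; fall back to the original value.
                output_fields.append(table.get((field_number, value), value))
            dst.write(delimiter.join(output_fields) + "\n")


# Hypothetical lookup table, chosen to match this example's sample data.
table = {
    (3, "location"): "country",
    (3, "venue"): "country",
    (5, "years"): "age",
}

with tempfile.TemporaryDirectory() as work_dir:
    input_path = os.path.join(work_dir, "input.dsv")
    output_path = os.path.join(work_dir, "output.dsv")
    with open(input_path, "w", encoding="utf-8") as f:
        f.write("Jones,Arthur,location,USA,years,25\n")
        f.write("Baldwin,Agnes,location,USA\n")
    normalize_dsv(input_path, output_path, table)
    with open(output_path, encoding="utf-8") as f:
        result = f.read()

print(result, end="")
```

Processing one line at a time keeps memory use independent of the size of the data file; only the lookup table is held in memory.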
How can such powerful and generalized data processing be automated in only 1½ hours and 124 lines of rules? Because there is so much capability already available as part of XTRAN's rules language. These rules take advantage of the following functionality:
- Text file input and output
- Text manipulation
- Delimited list manipulation
- Regular expression matching
- Environment variable manipulation
- Content-addressable data bases
The XTRAN input and output shown below are unedited.
Here is a flowchart for this process, in which the elements are color coded:
- BLUE for XTRAN versions (runnable programs)
- ORANGE for XTRAN rules (text files)
- PURPLE for text data files
Input to XTRAN — synonym specifications:
Input to XTRAN — DSV data to process:
Jones,Arthur,location,USA,years,25
Murgatroyd,Gertrude,venue,France,age,40
Baldwin,Agnes,location,USA
Smith,Fred,country,Germany,years,19
Output from XTRAN — normalized DSV data:
Jones,Arthur,country,USA,age,25
Murgatroyd,Gertrude,country,France,age,40
Baldwin,Agnes,country,USA
Smith,Fred,country,Germany,age,19