XTRAN Example — Normalize Vocabulary in DSV Data

Scenario: you are receiving delimiter-separated values (DSV) data from various sources, and each source has its own vocabulary. You need to normalize terms across those data sources.

XTRAN to the rescue!

The following example uses an XTRAN rules file comprising 124 non-comment lines of "meta-code" (XTRAN's rules language) to process delimiter-separated value data and normalize vocabulary in it, substituting preferred terms for synonyms.

The rules took one hour to write and ½ hour to debug. (That's right, only 1½ hours total!)

You can specify a preferred term for each DSV field, along with a set of synonyms for it. The rules will then, for each specified DSV field, change all synonyms to the preferred term and write out the results.

You specify the preferred terms and their synonyms, for each DSV field that has them, in a "synonyms" file, one specification per line, in the following format. Empty lines and lines starting with ; are ignored.

<dsvfld>,<prftrm>,<syntrm>[,<syntrm>...]

where:

	`<dsvfld>`	is the DSV field number, 1-n, to be processed
	`<prftrm>`	is the preferred term for that field
	`<syntrm>`	are one or more synonyms for `<prftrm>`
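
To make the synonyms file format concrete, here is a minimal Python sketch of how such specifications could be parsed. It is not XTRAN meta-code and not the actual rules; the function name `parse_synonym_specs` and the choice of a dictionary keyed by (field number, synonym) are assumptions made for illustration.

```python
def parse_synonym_specs(lines):
    """Build a lookup table: (field number, synonym) -> preferred term."""
    table = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Empty lines and lines starting with ';' are ignored.
        if not line or line.startswith(";"):
            continue
        # Format: <dsvfld>,<prftrm>,<syntrm>[,<syntrm>...]
        field, preferred, *synonyms = line.split(",")
        for synonym in synonyms:
            table[(int(field), synonym)] = preferred
    return table


# Illustrative specifications (assumed comment text; the two
# specification lines match the sample input shown later).
specs = [
    "; vocabulary for country and age fields",
    "",
    "3,country,location,venue",
    "5,age,years",
]
table = parse_synonym_specs(specs)
```

Keying the table on both the field number and the synonym keeps each lookup to a single dictionary access, and it lets the same word be a synonym in one field while remaining untouched in another.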

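Note that the rules automatically accommodate DSV data with missing fields; a record with fewer fields than the synonyms file references, such as the third sample record below, is still normalized and written out with the fields it has.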

Here is an English paraphrase of the XTRAN rules:

    Open synonym specifications file read-only
    For each input line
        Parse specification, store in data base
    Close synonym specifications file
    Open input DSV data file read-only
    Create output file
    For each DSV data input line
        Initialize output line to empty
        For each of input line's DSV fields
            If field value is synonym for preferred term (direct DB lookup)
                Add preferred term to output line
            Else
                Add field value to output line
        Write output line
    Close input and output files
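
The per-line loop in the paraphrase above can be sketched in Python as follows. This is an illustration under stated assumptions, not the XTRAN rules themselves: `normalize_line` is a hypothetical helper, the lookup table is written out by hand rather than read from a synonyms file, and fields are split on a plain comma with no handling of quoted delimiters.

```python
def normalize_line(line, table, delimiter=","):
    """Replace each field that is a known synonym with its preferred term."""
    fields = line.split(delimiter)
    # Field numbers are 1-based, matching the synonyms file.
    normalized = [
        table.get((number, value), value)
        for number, value in enumerate(fields, start=1)
    ]
    return delimiter.join(normalized)


# Hypothetical lookup table: (field number, synonym) -> preferred term.
table = {
    (3, "location"): "country",
    (3, "venue"): "country",
    (5, "years"): "age",
}

data = [
    "Jones,Arthur,location,USA,years,25",
    "Baldwin,Agnes,location,USA",  # fewer fields than the table references
]
result = [normalize_line(line, table) for line in data]
```

Because the loop only visits the fields a record actually has, a short record passes through without special handling, and a field whose value is not a listed synonym is copied unchanged.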

How can such powerful and generalized data processing be automated in only 1½ hours and 124 lines of rules? Because there is so much capability already available as part of XTRAN's rules language. These rules take advantage of the following functionality:

Text file input and output
Text manipulation
Delimited list manipulation
Regular expression matching
Environment variable manipulation
Content-addressable data bases

The input to and output from XTRAN shown below are unedited.

Process Flowchart

[Flowchart of this process, with elements color coded: blue for XTRAN versions (runnable programs), orange for XTRAN rules (text files), purple for text data files.]

Input to XTRAN — synonym specifications:

3,country,location,venue
5,age,years

Input to XTRAN — DSV data to process:

Jones,Arthur,location,USA,years,25
Murgatroyd,Gertrude,venue,France,age,40
Baldwin,Agnes,location,USA
Smith,Fred,country,Germany,years,19

Output from XTRAN — normalized DSV data:

Jones,Arthur,country,USA,age,25
Murgatroyd,Gertrude,country,France,age,40
Baldwin,Agnes,country,USA
Smith,Fred,country,Germany,age,19