XTRAN Example — Count Words in Text


Introduction & strategy

Scenario — you want to count word occurrences in text, and you want to either ignore some words or count only specified words.

XTRAN to the rescue!

The following analyses use XTRAN rules comprising 143 non-comment lines of XTRAN's rules language ("meta-code") to count words in text — English prose in this case, but the rules can handle any language (human or computer) — in fact, any text input.

These rules took less than one hour to create and less than ½ hour to debug.  (That's right, just over one hour total!)

The results output is alphabetically ordered and in delimiter-separated value (DSV) form, suitable for input to a spreadsheet or database for further analysis, graphing, etc.

The rules force all input to lower case before counting the words, to avoid issues with upper vs. lower case in the text.
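The core of the approach described above (force the input to lower case, count the words, emit alphabetically ordered delimiter-separated lines) can be sketched in a few lines of Python. This is only an illustration, not the XTRAN rules themselves. The function names `count_words` and `to_dsv`, the definition of a word as a run of letters, digits, or apostrophes, and the comma delimiter are all assumptions.

```python
import re
from collections import Counter


def count_words(text):
    """Count occurrences of each word, ignoring upper vs. lower case."""
    # Assumption: a word is a run of letters, digits, or apostrophes.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def to_dsv(counts, delimiter=","):
    """Return alphabetically ordered 'word<delimiter>count' lines."""
    return [f"{word}{delimiter}{count}" for word, count in sorted(counts.items())]


if __name__ == "__main__":
    for line in to_dsv(count_words("The cat saw the other cat.")):
        print(line)
```

Lower-casing before counting is what lets "The" and "the" land in the same bucket, and sorting the items before formatting gives the alphabetical order.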

The rules allow you to optionally specify, via environment variables, the following features:

If needed, you can use existing XTRAN rules to summarize the resulting word counts across multiple documents.

We have also adapted these rules to mine HTML Web pages for keywords.

How can such powerful and generalized text manipulation be automated in less than 1½ hours and only 143 code lines of XTRAN rules?  Because there is so much capability already available as part of XTRAN's rules language.  These rules take advantage of the following functionality:

The XTRAN rules used for both scenarios below are exactly the same.  The input to and output from XTRAN are shown untouched.

Process Flowchart

Here is a flowchart for this process, in which the elements are color coded:

The process is the same for both scenarios.

[Figure: process flowchart]

Common XTRAN Input

The text input to XTRAN for analysis is the same for both scenarios:

This is some text on which we will do a word count, writing out the results as
CSV.  The word count will have been filtered according to the contents of a
named filter file.

Scenario 1 — Ignore uninteresting words

In this scenario, we tell the XTRAN rules to ignore a list of insignificant words we provide.  We also specify a format of %07d?%9s for the output lines.

Input to XTRAN — words to ignore:

a     do    the   which
and   have  this  will
are   is    to    would
as    of    we
be    on    when
been  some  where

(By the way, we used XTRAN rules to format the list above into columns from its "one item per line" file.) 

Output from XTRAN:

0000001? contents
0000002?    count
0000001?      csv
0000001?     file
0000001?   filter
0000001? filtered
0000001?    named
0000001?      out
0000001?  results
0000001?     text
0000002?     word
0000001?  writing
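For readers who want to experiment with the "ignore list" idea outside XTRAN, here is a minimal Python sketch. It is not the XTRAN rules: the names `IGNORE`, `TEXT`, `count_except`, and `format_lines` are hypothetical, a word is assumed to be a run of letters, and only the first sentence of the sample text is used to keep the example short. The format string `%07d?%9s` is the one given above, applied with Python's printf-style `%` operator.

```python
import re
from collections import Counter

# The words to ignore, taken from the list above.
IGNORE = set("""
a and are as be been do have is of on some
the this to we when where which will would
""".split())

# First sentence of the sample input only (an abbreviation for this sketch).
TEXT = ("This is some text on which we will do a word count, "
        "writing out the results as CSV.")


def count_except(text, ignore):
    """Count lower-cased words, skipping any word in the ignore set."""
    # Assumption: a word is a run of letters.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in ignore)


def format_lines(counts, fmt="%07d?%9s"):
    """Format each (count, word) pair, in alphabetical word order."""
    return [fmt % (count, word) for word, count in sorted(counts.items())]


if __name__ == "__main__":
    for line in format_lines(count_except(TEXT, IGNORE)):
        print(line)
```

In this format, `%07d` zero-pads the count to seven digits and `%9s` right-justifies the word in a nine-character field, which is why the words line up on the right.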

Scenario 2 — Process only interesting words

In this scenario, we tell the XTRAN rules to count only a list of specific words we provide.  We also specify a format of %-10s%d for the output lines.

Input to XTRAN — words to count:


Output from XTRAN:

contents  1
count     2
csv       1
filter    1
filtered  1
text      1
word      2
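The "count only these words" variant can be sketched the same way in Python. Again, this is an illustration rather than the XTRAN rules. The list of words to count is not shown above, so the `ONLY` set below is a guess inferred from the words that appear in the output; the names `count_only` and `format_lines` are likewise hypothetical, and a word is assumed to be a run of letters.

```python
import re
from collections import Counter

# Hypothetical list of words to count, inferred from the output above.
ONLY = {"contents", "count", "csv", "filter", "filtered", "text", "word"}

# The common sample input from earlier in this example.
TEXT = (
    "This is some text on which we will do a word count, writing out the "
    "results as CSV.  The word count will have been filtered according to "
    "the contents of a named filter file."
)


def count_only(text, only):
    """Count lower-cased words, keeping just the words in the given set."""
    # Assumption: a word is a run of letters.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in only)


def format_lines(counts, fmt="%-10s%d"):
    """Format each (word, count) pair, in alphabetical word order."""
    return [fmt % (word, count) for word, count in sorted(counts.items())]


if __name__ == "__main__":
    for line in format_lines(count_only(TEXT, ONLY)):
        print(line)
```

Here `%-10s` left-justifies the word in a ten-character field and `%d` appends the count, so the counts form a column after the words.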