XTRAN Example — Count Words in Text

The rules take the following inputs:

Characters that delimit words, defaulting to <SPACE> <TAB> , ; . ? !
The name of a text file containing either (but not both):
- Words to ignore in the counts. Typically these will be pronouns, connectives, prepositions, etc., whose counts would not be interesting.
- Words to process in the counts, ignoring all other words.
The format of each word count in the output, specified as a printf() format spec, which must contain exactly one text string (%s) spec for the word and exactly one integer (%d) spec for the count, in either order, defaulting to %7d,%s
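
To make the format spec rule concrete, here is a minimal Python sketch (not XTRAN's rules language) that checks a spec for exactly one %s and one %d and applies it in whichever order they appear. The names format_count and DEFAULT_FORMAT are hypothetical, and the sketch relies on Python's % operator, which accepts printf()-style %d and %s conversions.

```python
import re

DEFAULT_FORMAT = "%7d,%s"  # the default named above

# One printf-style conversion: optional flags, width, precision, then the type
_CONVERSION = re.compile(r"%[-+ 0#]*\d*(?:\.\d+)?([a-zA-Z%])")


def format_count(fmt, word, count):
    """Format one word count, accepting %s and %d in either order."""
    # Collect the conversion types, skipping the literal "%%"
    kinds = [m.group(1) for m in _CONVERSION.finditer(fmt)]
    kinds = [k for k in kinds if k != "%"]
    if sorted(kinds) != ["d", "s"]:
        raise ValueError("format needs exactly one %s and one %d: " + repr(fmt))
    # Pass the arguments in the order the spec expects them
    args = (word, count) if kinds[0] == "s" else (count, word)
    return fmt % args


print(format_count(DEFAULT_FORMAT, "word", 2))
print(format_count("%-10s%d", "word", 2))
```

Detecting the order from the spec itself is one possible design; it lets the caller always pass the word and the count the same way.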

If needed, you can use existing XTRAN rules to summarize the resulting word counts across multiple documents.

We have also adapted these rules to mine HTML Web pages for keywords.

How can such powerful and generalized text manipulation be automated in less than 1½ hours and in only 143 code lines of XTRAN rules? Because so much capability is already available as part of XTRAN's rules language. These rules take advantage of the following functionality:

Text file input and output
Text manipulation
Text formatting
Delimited list manipulation
Regular expression matching
Environment variable manipulation
Content-addressable data bases
Creating new meta-functions written in meta-code, which we call user meta-functions

The XTRAN rules used for both scenarios are exactly the same. The input to and output from XTRAN shown below are untouched.

Process Flowchart

Here is a flowchart for this process, in which the elements are color coded:

BLUE for XTRAN versions (runnable programs)
ORANGE for XTRAN rules (text files)
PURPLE for text data files

The process is the same for both scenarios.

Common XTRAN Input

The text input to XTRAN for analysis is the same for both scenarios:

This is some text on which we will do a word count, writing out the results as
CSV.  The word count will have been filtered according to the contents of a
named filter file.
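
As a rough illustration of the word-splitting step, the following Python sketch (not XTRAN's rules language) applies the default delimiters to this input. The name split_words is hypothetical. Treating line breaks as delimiters and folding words to lowercase are assumptions; the lowercase "csv" in the outputs further down suggests that case is folded.

```python
import re

DEFAULT_DELIMITERS = " \t,;.?!"  # <SPACE> <TAB> , ; . ? !

SAMPLE = """This is some text on which we will do a word count, writing out the results as
CSV.  The word count will have been filtered according to the contents of a
named filter file."""


def split_words(text, delimiters=DEFAULT_DELIMITERS):
    """Split text into lowercase words on any run of delimiter characters."""
    # Character class built from the delimiters; line breaks added by assumption
    pattern = "[" + re.escape(delimiters) + r"\r\n]+"
    # re.split leaves empty strings at the edges, so drop them
    return [w.lower() for w in re.split(pattern, text) if w]


words = split_words(SAMPLE)
print(len(words), words[:4])
```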

Scenario 1 — Ignore uninteresting words

In this scenario, we tell the XTRAN rules to ignore a list of insignificant words we provide. We also specify a format of %07d?%9s for the output lines.

Input to XTRAN — words to ignore:

a     do    the   which
and   have  this  will
are   is    to    would
as    of    we
be    on    when
been  some  where

(By the way, we used XTRAN rules to format the list above into columns from its "one item per line" file.)

Output from XTRAN:

0000001?according
0000001? contents
0000002?    count
0000001?      csv
0000001?     file
0000001?   filter
0000001? filtered
0000001?    named
0000001?      out
0000001?  results
0000001?     text
0000002?     word
0000001?  writing

Scenario 2 — Process only interesting words

In this scenario, we tell the XTRAN rules to count only a list of specific words we provide. We also specify a format of %-10s%d for the output lines.

Input to XTRAN — words to count:

contents
count
csv
filter
filtered
text
word

Output from XTRAN:

contents  1
count     2
csv       1
filter    1
filtered  1
text      1
word      2
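
For readers who want to check the two outputs above, here is a minimal end-to-end sketch in Python (not XTRAN's rules language). It is an approximation under stated assumptions, not the XTRAN rules themselves: count_words and its mode switch are hypothetical names, line breaks are treated as delimiters, words are folded to lowercase, and results are sorted alphabetically to match the listings.

```python
import re
from collections import Counter

TEXT = """This is some text on which we will do a word count, writing out the results as
CSV.  The word count will have been filtered according to the contents of a
named filter file."""


def count_words(text, filter_words, mode, delimiters=" \t,;.?!"):
    """Return (word, count) pairs sorted by word.

    mode is a hypothetical switch: "ignore" drops the filter words,
    "only" counts nothing but the filter words.
    """
    pattern = "[" + re.escape(delimiters) + r"\r\n]+"
    words = [w.lower() for w in re.split(pattern, text) if w]
    listed = set(filter_words)
    if mode == "ignore":
        kept = (w for w in words if w not in listed)
    else:
        kept = (w for w in words if w in listed)
    return sorted(Counter(kept).items())


ignore = ("a and are as be been do have is of on some the this to "
          "we when where which will would").split()
only = "contents count csv filter filtered text word".split()

# Scenario 1: count first, then word; Scenario 2: word first, then count
scenario1 = ["%07d?%9s" % (n, w) for w, n in count_words(TEXT, ignore, "ignore")]
scenario2 = ["%-10s%d" % (w, n) for w, n in count_words(TEXT, only, "only")]

print("\n".join(scenario1))
print("\n".join(scenario2))
```

The same function serves both scenarios; only the filter list, the mode, and the format spec change.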
