XTRAN Example — Analyze Word Phrases in Text





Introduction

Usage patterns of text phrases are important in a broad range of pursuits.  Some examples:

One way to analyze text phrase usage is Markov analysis, which tallies the frequency with which sequences of words occur.  Extending the length of the transition chains beyond the usual 2 lets it detect repetition of entire phrases.  And because it analyzes a sequence of delimited tokens, it is insensitive to physical text layout such as line breaks, indentation, and white space.
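As a rough illustration of the idea (in Python, not XTRAN's rules language), the sketch below tallies every run of consecutive tokens of a chosen length.  The function name `tally_chains` and its `chain_length` parameter are hypothetical, chosen only for this illustration.

```python
from collections import Counter

def tally_chains(tokens, chain_length=2):
    """Count every run of `chain_length` consecutive tokens."""
    counts = Counter()
    for start in range(len(tokens) - chain_length + 1):
        counts[tuple(tokens[start:start + chain_length])] += 1
    return counts

# Line breaks, tabs, and extra spaces vanish once the text is split into tokens.
text = "a penny saved is a penny earned\n   a penny saved\tis a penny earned"
tokens = text.split()

pairs = tally_chains(tokens, chain_length=2)    # the usual chain length of 2
phrases = tally_chains(tokens, chain_length=7)  # long chains catch whole phrases
```

With a chain length of 7, the repeated sentence shows up as a single chain counted twice, even though its second copy is laid out differently in the text.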



Strategy

The following example uses an XTRAN rules file comprising 243 non-comment lines of "meta-code" (XTRAN's rules language) to analyze word phrases in text.

The rules took 4¼ hours to write and 4½ hours to debug.  (That's right, less than 9 hours total!)

The rules use an XTRAN content-addressable data base to store, organize, and tally the word transitions.

You can optionally specify, to the XTRAN rules, the following analysis controls via environment variables:

The XTRAN rules output, for each phrase that satisfied the analysis controls specified, a count of its occurrences, zero-filled for sorting, followed by the phrase itself.
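To see why zero-filling helps, here is a small Python sketch of the same output layout.  The seven-digit width and comma separator are taken from the outputs shown in the scenarios; the function name `format_tally` is hypothetical.

```python
def format_tally(count, words, width=7):
    """Render one line: zero-filled count, a comma, then the phrase."""
    return f"{count:0{width}d},{' '.join(words)}"

lines = [
    format_tally(3, ["i", "regret"]),
    format_tally(12, ["xtran", "rules"]),
]

# A plain text sort now agrees with numeric order: "0000012" sorts after "0000003".
# Without zero-filling, "12" would sort before "3".
lines.sort(reverse=True)
```

Because the counts are fixed width, any ordinary text sort utility can rank the phrases by frequency without parsing the numbers.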

The XTRAN rules used for this example are not specific to analyzing prose text; they can be used unchanged to analyze any sequence of delimited tokens.  So they effectively constitute a generalized Markov analysis engine.

How can such powerful and generalized analysis be automated in less than 9 hours and only 243 lines of rules?  Because there is so much capability already available as part of XTRAN's rules language.  The rules used for this example take advantage of the following functionality provided by that rules language:

The three scenarios that follow all use exactly the same rules, taking advantage of the controls described above to determine the nature of the analysis.

The input to and outputs from XTRAN are untouched.



Process Flowchart

Here is a flowchart for this process (which is the same for all three scenarios), in which the elements are color coded:

[Figure: data flowchart]

Common XTRAN Input

The text input to XTRAN for analysis is the same for all three scenarios:

This is a sample of prose we will process with XTRAN rules.  It spans a number
of lines, and has a few interesting words and phrases in it for the rules to
analyze.  The quick brown fox jumped over the lazy dog.  I regret that I have
but one life to give for my country.  Give me liberty or give me death.  A
penny saved is a penny earned.

Now we will repeat a few phrases to show how the XTRAN rules tally them.  A
penny saved is a penny earned.  I regret that I have but one life to give for
my country.  A penny not saved is a penny wasted.  The lazy dog woke up and
chased the quick brown fox away.  I regret that I have but one penny to give to
charity, but a penny is better than nothing.


Scenario 1 — Analyze Significant Word Pairs

In this scenario, we are interested in the occurrence of significant word pairs.  This is useful, for example, in analyzing patterns of adjectives and adverbs used to modify nouns and verbs.

To accomplish this, we tell the XTRAN rules to:



Input to XTRAN — words to ignore:

a       is      the
and     it      them
but     me      this
few     my      to
for     now     we
has     of      will
have    or      with
how     over
in      that

(By the way, we used XTRAN rules to format the list above into columns from its "one item per line" file.) 



Output from XTRAN:

0000001,sample prose
0000001,prose process
0000001,process xtran
0000002,xtran rules
0000001,spans number
0000001,number lines
0000001,interesting words
0000001,words phrases
0000001,phrases rules
0000001,rules analyze
0000002,quick brown
0000002,brown fox
0000001,fox jumped
0000001,jumped lazy
0000002,lazy dog
0000003,i regret
0000003,regret i
0000003,i one
0000002,one life
0000002,life give
0000002,give country
0000001,give liberty
0000001,liberty give
0000001,give death
0000002,penny saved
0000003,saved penny
0000002,penny earned
0000001,repeat phrases
0000001,phrases show
0000001,show xtran
0000001,rules tally
0000001,penny not
0000001,not saved
0000001,penny wasted
0000001,dog woke
0000001,woke up
0000001,up chased
0000001,chased quick
0000001,fox away
0000001,one penny
0000001,penny give
0000001,give charity
0000001,penny better
0000001,better than
0000001,than nothing
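The behavior seen in this output can be approximated in a few lines of Python.  This is only a sketch, not the XTRAN rules themselves: the names `segments` and `significant_pairs` are hypothetical, and two behaviors are inferred from the output above rather than documented, namely that words are lowercased and that chains do not cross punctuation (there is no "lines interesting" or "charity penny" pair).

```python
import re
from collections import Counter

def segments(text):
    """Yield lowercase word lists, one per run of text between punctuation marks."""
    for chunk in re.split(r"[.,;:!?]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", chunk)
        if words:
            yield words

def significant_pairs(text, ignore):
    """Tally adjacent word pairs after dropping the words in `ignore`."""
    counts = Counter()
    for words in segments(text):
        kept = [w for w in words if w not in ignore]
        for pair in zip(kept, kept[1:]):
            counts[pair] += 1
    return counts

sample = ("The quick brown fox jumped over the lazy dog. "
          "The lazy dog woke up and chased the quick brown fox away.")
pairs = significant_pairs(sample, ignore={"the", "over", "and"})

for (first, second), count in pairs.items():
    print(f"{count:07d},{first} {second}")
```

Dropping the ignored words before pairing is what lets "jumped over the lazy" collapse into the pair "jumped lazy".  A `Counter` keeps keys in first-insertion order, so the lines print in order of first occurrence.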


Scenario 2 — Analyze for Sequences of Interesting Words

In this scenario, we are interested in the occurrence of chains of specific words, ignoring all intervening words.  This is useful, for example, in analyzing the expression of concepts that are of interest, and that are typically described using well-known combinations of words.

This analysis can also be useful in identifying ambiguity introduced into prose by the use of homonyms, and in establishing how to disambiguate such homonyms automatically using their context.

To accomplish this, we tell the XTRAN rules to:



Input to XTRAN — words to process:

analyze
country
death
liberty
life
penny
phrases
rules
words
xtran


Output from XTRAN:

0000001,xtran rules
0000002,life country
0000001,liberty death
0000003,penny penny
0000001,phrases xtran rules
0000001,words phrases rules analyze
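A comparable Python sketch of this scenario follows.  It is an approximation, not the XTRAN rules: the names `segments` and `keyword_chains` are hypothetical, and it assumes (from the output above) that each chain is built within one punctuation-delimited run of text and that a lone word of interest does not count as a chain.

```python
import re
from collections import Counter

def segments(text):
    """Yield lowercase word lists, one per run of text between punctuation marks."""
    for chunk in re.split(r"[.,;:!?]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", chunk)
        if words:
            yield words

def keyword_chains(text, keep, min_length=2):
    """Tally the chain of `keep` words in each segment, skipping everything else."""
    counts = Counter()
    for words in segments(text):
        chain = tuple(w for w in words if w in keep)
        if len(chain) >= min_length:
            counts[chain] += 1
    return counts

sample = ("A penny saved is a penny earned. Give me liberty or give me death. "
          "I have but one penny to give to charity, but a penny is better than nothing.")
chains = keyword_chains(sample, keep={"death", "liberty", "penny"})
```

In the last sentence of `sample`, the comma splits the two occurrences of "penny" into separate segments, so neither contributes a chain.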


Scenario 3 — Analyze "Keyword in Context" (KWIC)

In this scenario, we are interested in seeing the context in which each of a set of keywords is used in the text being analyzed.

To accomplish this, we tell the XTRAN rules to:



Input to XTRAN — keywords to process:

death
fox
penny


Output from XTRAN:

0000001,give me liberty or give me death
0000002,a penny saved is a penny earned
0000001,but a penny is better than nothing
0000001,a penny not saved is a penny wasted
0000001,the quick brown fox jumped over the lazy dog
0000001,the lazy dog woke up and chased the quick brown fox away
0000001,i regret that i have but one penny to give to charity
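The KWIC behavior can be sketched in Python as well.  Again this is an approximation rather than the XTRAN rules: the names `segments` and `keyword_in_context` are hypothetical, and the choice of a punctuation-delimited run of words as the "context" is inferred from the output above (for example, "but a penny is better than nothing" appears on its own).

```python
import re
from collections import Counter

def segments(text):
    """Yield lowercase word lists, one per run of text between punctuation marks."""
    for chunk in re.split(r"[.,;:!?]+", text.lower()):
        words = re.findall(r"[a-z0-9']+", chunk)
        if words:
            yield words

def keyword_in_context(text, keywords):
    """Tally each segment that contains at least one of the keywords."""
    counts = Counter()
    for words in segments(text):
        if any(w in keywords for w in words):
            counts[" ".join(words)] += 1
    return counts

sample = ("A penny saved is a penny earned. Give me liberty or give me death. "
          "A penny saved is a penny earned. I have but one life to give.")
contexts = keyword_in_context(sample, keywords={"death", "fox", "penny"})
```

A segment is counted once per occurrence of the segment, not once per keyword it contains, which matches the count of 2 for the repeated "penny" sentence in the output above.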