XTRAN Example — Mine Web Pages for Keywords

Scenario: You know that to help your Web site show up in searches, you need to include significant keywords on each page. So you want to identify and catalog the significant words in your site that the Web spiders will index.

XTRAN to the rescue!

The following example uses an XTRAN rules file comprising 175 non-comment lines of XTRAN's rules language ("meta-code") to analyze HTML and count significant words in nonmarkup text — English in this example, but the rules can handle Web pages in any language.

These rules are actually an adaptation of our existing XTRAN rules to count words in text. It took less than one hour to adapt those rules to mine HTML, and ten minutes to debug the result. (That's right, about an hour total!)

The output is alphabetically ordered and in delimiter-separated value (DSV) form, suitable for input to a spreadsheet or data base for further analysis, graphing, etc.

The HTML mining rules for this example can easily be enhanced to produce DSV output that can be interactively queried using existing XTRAN rules.

The rules force all analyzed text to lower case before counting the words, to avoid issues with upper vs. lower case.

The rules allow you to optionally specify, via environment variables:

- The characters that delimit words; the default is
  ```
      <SPACE><TAB>,:;.?!()[]{}<>^'"+*/=%&
  ```
- The name of a text file containing either (but not both):
  - Words to ignore in the counts. Typically these are pronouns, connectives, prepositions, etc., whose counts are not interesting.
  - Words to report in the counts, ignoring all other words. This means you can use these same rules to tally occurrences of specific words across a Web site.
- The name of a text file containing a series of regular expressions to match HTML tags whose contents are to be ignored.
- Whether to write a label row at the beginning of the output file (a common practice for DSV data); the default is no label row.
- Whether to prefix each word count with the module name; the default is no module name.
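
To make the counting behavior concrete, here is a minimal sketch in Python (not XTRAN meta-code) of lower-casing text, splitting it on the default delimiter characters, and applying either an ignore list or a report list. The function name `count_words` and its parameters are hypothetical, and treating line breaks as delimiters is an assumption.

```python
from collections import Counter

# Default delimiters from the list above; <SPACE> and <TAB> written literally.
DEFAULT_DELIMITERS = " \t,:;.?!()[]{}<>^'\"+*/=%&"

def count_words(text, delimiters=DEFAULT_DELIMITERS, ignore=None, report=None):
    """Count lower-cased words in text, split on any delimiter character.

    ignore and report are sets of lower-case words; give one or neither.
    """
    if ignore is not None and report is not None:
        raise ValueError("specify words to ignore or words to report, not both")
    # Assumption: line breaks also separate words.
    table = str.maketrans({ch: " " for ch in delimiters + "\r\n"})
    counts = Counter()
    for word in text.lower().translate(table).split():
        if ignore is not None and word in ignore:
            continue
        if report is not None and word not in report:
            continue
        counts[word] += 1
    return counts

counts = count_words("Stealth radar, police radar!", ignore={"police"})
print(sorted(counts.items()))  # [('radar', 2), ('stealth', 1)]
```

Because the apostrophe is a delimiter, a word such as "won't" splits into "won" and "t", which is why both appear in the ignore list used in the example below.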

We have actually used these rules to mine each page of this very Web site for significant words, accumulating a list of about 3,300 words to ignore across the entire site.

We then used existing XTRAN rules to summarize keyword occurrences across the entire site, giving about 7,000 occurrences of about 850 keywords.
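
The summarizing rules themselves are not shown here. As a rough illustration in Python, the sketch below totals per-page counts into site-wide figures. It assumes the per-page output was written with the module-name prefix, so that each row reads module, word, count; that row layout, the function name `summarize`, and the sample data are all hypothetical.

```python
import csv
import io
from collections import Counter

def summarize(dsv_text, delimiter=","):
    """Total word counts across pages from rows of module, word, count."""
    totals = Counter()
    for row in csv.reader(io.StringIO(dsv_text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        _module, word, count = row
        totals[word] += int(count)
    return totals

sample = "page1,radar,2\npage1,stealth,1\npage2,radar,1\n"
totals = summarize(sample)
print(len(totals), "keywords,", sum(totals.values()), "occurrences")
# prints: 2 keywords, 4 occurrences
```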

How can such powerful HTML analysis be automated in only one hour and 175 lines of rules? Because there is so much capability already available as part of XTRAN's rules language. The rules used for this example take advantage of the following functionality provided by that rules language:

- Text file input and output
- Text manipulation
- Text formatting
- Delimited list manipulation
- Regular expression matching
- Environment variable manipulation
- Content-addressable data bases
- "Per statement" recursive iterators
- Access to XTRAN's Internal Representation (XIR)

The input to and output from XTRAN are untouched.

Process Flowchart

Here is a flowchart for this process, in which the elements are color coded:

- BLUE for XTRAN versions (runnable programs)
- ORANGE for XTRAN rules (text files)
- RED for code
- PURPLE for text data files

Input to XTRAN — Web page to process (rendered):

Dangerous Dan's Nocturnal Aviation Company

Do you need to get somewhere discreetly, under the cover of darkness? Dangerous Dan can get you there! Our airplanes are stealth equipped, so your adversary's radar won't find you. They run without lights, so our pilots are especially vigilant to avoid collisions with other aircraft, mountains, etc. They are also experts in evasive maneuvers.

We serve snacks, assuming clear weather and no hostile fire.

And if you need discreet ground transportation at your destination, be sure to check out Dangerous Dan's Nocturnal Limousine Service! Our limousines are also stealth equipped, to avoid police radar, and our drivers are thoroughly trained in evasive driving.

Input to XTRAN — Web page to process (raw HTML):

<h4 align="center">Dangerous Dan's Nocturnal Aviation Company</h4>

<p>Do you need to get somewhere discreetly, under the cover of
darkness?&nbsp; <b>Dangerous Dan</b> can get you there!&nbsp; Our airplanes
are stealth equipped, so your adversary's radar won't find you.&nbsp; They
run without lights, so our pilots are especially vigilant to avoid collisions
with other aircraft, mountains, etc.&nbsp; They are also experts in evasive
maneuvers.</p>

<p>We serve snacks, assuming clear weather and no hostile fire.</p>

<p>And if you need discreet ground transportation at your destination, be
sure to check out <b>Dangerous Dan's Nocturnal Limousine
Service</b>!&nbsp; Our limousines are also stealth equipped, to avoid
police radar, and our drivers are thoroughly trained in evasive
driving.</p>

Input to XTRAN — words to ignore:

also
and
are
assuming
at
avoid
be
can
check
clear
cover
dan
dangerous
do
drivers
driving
equipped
etc
find
get
if
in
need
no
of
other
our
out
run
s
serve
service
so
somewhere
sure
t
the
there
they
thoroughly
to
under
we
with
without
won
you
your

Input to XTRAN — Regular expressions for HTML tags to ignore:

^H[1-9]$
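
The effect of this pattern is that heading contents are left out of the counts. A minimal Python sketch of the idea follows, using the standard library's `html.parser`; it is not how XTRAN does it. The class and function names are hypothetical, and matching the pattern against upper-cased tag names is an assumption made because the sample pattern is upper case while the sample HTML uses lower-case tags.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect nonmarkup text, skipping the contents of ignored tags."""

    def __init__(self, ignore_patterns):
        # convert_charrefs=True decodes entities such as &nbsp; in text.
        super().__init__(convert_charrefs=True)
        self.ignore = [re.compile(p) for p in ignore_patterns]
        self.skip_tag = None  # name of the ignored element we are inside
        self.depth = 0        # nesting depth of that same tag name
        self.chunks = []

    def _ignored(self, tag):
        # html.parser reports lower-case tag names; compare in upper case
        # (assumption, see the lead-in).
        return any(p.search(tag.upper()) for p in self.ignore)

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is None:
            if self._ignored(tag):
                self.skip_tag, self.depth = tag, 1
        elif tag == self.skip_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag:
            self.depth -= 1
            if self.depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        if self.skip_tag is None:
            self.chunks.append(data)

def extract_text(html, ignore_patterns):
    parser = TextExtractor(ignore_patterns)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)

page = '<h4 align="center">Dangerous Dan</h4><p>Stealth&nbsp;<b>radar</b></p>'
print(extract_text(page, [r"^H[1-9]$"]).split())  # ['Stealth', 'radar']
```

Here `&nbsp;` is decoded to a no-break space, which Python's `split()` treats as whitespace, so it never shows up as a word.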

Output from XTRAN:

adversary,1
aircraft,1
airplanes,1
collisions,1
darkness,1
destination,1
discreet,1
discreetly,1
especially,1
evasive,2
experts,1
fire,1
ground,1
hostile,1
lights,1
limousine,1
limousines,1
maneuvers,1
mountains,1
nocturnal,1
pilots,1
police,1
radar,2
snacks,1
stealth,2
trained,1
transportation,1
vigilant,1
weather,1
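
For readers who want to produce output in the same shape, here is a small Python sketch that writes counts as alphabetically ordered DSV rows, with the optional label row and module-name prefix described earlier. The function name `format_dsv`, the label text, and the position of the module column are assumptions for illustration.

```python
def format_dsv(counts, module=None, label_row=False, delimiter=","):
    """Return counts as alphabetically ordered DSV text."""
    rows = []
    if label_row:
        labels = ["word", "count"]  # hypothetical label text
        if module:
            labels.insert(0, "module")
        rows.append(delimiter.join(labels))
    for word in sorted(counts):
        fields = [word, str(counts[word])]
        if module:
            fields.insert(0, module)  # assumed: module name comes first
        rows.append(delimiter.join(fields))
    return "\n".join(rows)

print(format_dsv({"radar": 2, "evasive": 2, "fire": 1}))
```

With the defaults (no label row, no module name), the rows take the same word,count form as the output above.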