XTRAN Example — Cross Link & Index HTML Documents
Anatomy of a Complex HTML Document with Variants
A document (or Web site) written in HTML may comprise many "chapters" (HTML source files), typically front-ended by a Table of Contents that links to all of the chapter files (and an index). An example is the XTRAN User's Manual.
In some situations, we may have variants of the document, in which not all chapters participate in every variant. For instance, in the case of the XTRAN User's Manual, each XTRAN licensee receives a variant that is tailored for the licensed activity. So someone licensing XTRAN for, say, translation of PL/I to C++ will get an XTRAN User's Manual containing only those chapters relevant to that activity. Someone else licensing XTRAN for, say, analysis of mainframe assembler will receive a different variant, containing a different selection of chapters.
Each variant must have its own title page and a table of contents that links to the chapters participating in that variant, and also links to the index we will create for the variant.
Objective: Cross Link and Index HTML Documents
For maximum convenience of use, we would like for each occurrence of a significant term anywhere in a variant of the document to be cross linked to that term's definition, which is likely to be in a different chapter (a practice widely seen in Wikis). However, we don't want to cross link the occurrence of a term whose definition is in a chapter not included in our variant. We also don't want to cross link a term occurrence that's in close proximity to the term's definition.
We also want to have a thorough alphabetical index at the end of our document. Obviously, it should include only entries whose bookmarks are in chapters included in our variant. For the reader's convenience, we would also like for each index entry to show the section and chapter in which it occurs, creating what is commonly called a Key Word in Context (KWIC) index. And we would like to have, at the index's start, a letter-by-letter index to the index.
As any professional indexer will tell you, the hard work is deciding how and where to index terms in the document. Although many attempts have been made to automate this process, with some success, the only way to guarantee a really good job is for a knowledgeable human being to carefully control the indexing process and "fine tune" the final product.
However, once the basic indexing of the chapters has been done, we can automate the rest of the work — cross referencing item occurrences in a document variant's chapters, and generating the document variant's index.
How XTRAN Can Help
HTML is one of many computer languages XTRAN manipulates.
XTRAN represents each tag, segment of nonmarkup text, and end tag as an HTML "statement", and represents each attribute of a tag as a "statement attribute", possibly with a value.
XTRAN's Internal Representation (XIR) of HTML is essentially the same as for all other computer languages XTRAN manipulates, including assemblers, 3GLs such as COBOL and C++, 4GLs such as RPG and Natural, meta-data languages such as XML, scripting languages, Web languages, data base languages such as SQL, and Domain Specific Languages. This means that the full power of XTRAN's rules language is available to manipulate HTML.
This example employs 1,068 code lines of XTRAN's rules language (which we also call "meta-code"). It shows how we can use several versions of XTRAN to automatically cross link and index all occurrences of specified text items. The example assumes that the target to which each item is to be cross linked and indexed is marked with a bookmark that has an appropriate name. Such a bookmark may or may not enclose related text.
We use two (illegal) HTML <A> attributes, NOLINK and NOINDEX, to control, in the original HTML files, whether each bookmark is cross linked and/or indexed. We remove these attributes in the process of cross linking, so they don't show up in the final document variant.
We also use a set of XTRAN styling rules for rendering HTML that specifies, for each tag, whether it and its end tag (if any) are to get preceding and/or following line breaks in the output.
How can such powerful and generalized HTML manipulation be automated in only 1,068 code lines of XTRAN rules? Because there is so much capability already available as part of XTRAN's rules language. These rules take advantage of the following functionality:
- Text file input and output
- Text manipulation
- Text formatting
- Delimited list manipulation
- Environment variable manipulation
- Content-addressable data bases
- "Per statement" recursive iterator
- Statement and expression pattern matching and replacement
- Access to XTRAN's Internal Representation (XIR)
- Navigation in XIR
- Creating new meta-functions written in meta-code, which we call user meta-functions
- Meta-variable and meta-function pointers
Our strategy is as follows:
- We maintain a Master Document comprising all chapters for all document
variants. We insert, into all HTML files, bookmarks in the form
<A NAME="xxx" [[NOLINK]] [[NOINDEX]]>
where the optional NOLINK and NOINDEX attributes control our cross linking and indexing activities.
All changes to the document must be made to these versions.
- As part of the Master Document, we maintain, for each document variant, a
title page with a table of contents that links to the chapter HTML files that
comprise that variant, and also links to the index we will create for that
variant.
- For convenience, we maintain, for each document variant, a text file named
<variant>.nam that lists all of the chapter HTML files that comprise that variant.
- We run an HTML analysis version of XTRAN with
rules that extract bookmark data from all chapters of the Master Document
(STEP 1 below), creating a bookmark data file used by all subsequent
steps, for each document variant. This data file includes, for each
bookmark:
  - The bookmark name.
  - Text included in the bookmark, or if none, the bookmark name again.
  - The most recent header text preceding the bookmark (what section it's in).
  - The title of the chapter the bookmark is in.
  - The name of the HTML file the bookmark is in.
  - Whether the bookmark is to be cross linked and/or indexed, based on the presence or absence of NOLINK and NOINDEX attributes in the original HTML. If it is to be neither cross linked nor indexed, we don't record an entry for it.
- Because cross linking won't work if there are duplications among the items
to be cross linked, we run a rules-only version of
XTRAN with rules that find and report any such
duplications in the bookmark data file (STEP 2 below). We then
edit the data file and reorder bookmark data as needed to eliminate embedded
duplications. We also edit the Master Document HTML files to eliminate
exact bookmark name duplications. We rerun XTRAN
with the duplication checking rules after every change, iterating this process
until it reports no duplications.
- For each document variant, we run an HTML re-engineering version
of XTRAN with rules that read the bookmark data, then
find each occurrence of each bookmark name in the chapters comprising that
variant and cross link that occurrence to the relevant bookmark, excluding
occurrences of names whose bookmarks are in chapters not included in our
variant (STEP 3 below).
For example, assuming that there is, in
then an occurrence, perhaps in a different chapter file, of
in nonmarkup text would be replaced with
creating a link the reader can follow to the bookmark in
However, if the bookmark's text occurs in the most recent header, the rules don't insert a cross link, since it would probably just link to the same immediate area.
Note that if the bookmark occurrence is in the same chapter file as the bookmark's definition, we don't need to specify that file name, resulting in:
This step also removes the (illegal) NOLINK and NOINDEX attributes from the bookmarks. We write the resulting HTML to the variant's subdirectory.
- For each variant, we run an HTML re-engineering version
of XTRAN with rules that read the bookmark data and
generate an HTML index for that variant, excluding bookmarks in chapters not
included in our variant, and creating an alphabetical "index to the
index" at its start (STEP 4 below). We write the resulting
HTML to the variant's subdirectory, satisfying the link to it we put in the
variant's Table of Contents.
Here is a flowchart of this process, in which the elements are color coded:
- BLUE for XTRAN versions (runnable programs)
- ORANGE for XTRAN rules (text files)
- RED for HTML files
- PURPLE for text data files
The following is an English paraphrase of the major XTRAN rules used for this example.
Rules that extract bookmark data from HTML (step 1 above)
We run an HTML analysis version of XTRAN with these rules on each HTML file in the Master Document, after the Master Document has been changed or new chapters have been added to it. We first delete the bookmark data file, so that it will be recreated "from scratch".
Read and parse HTML to be analyzed
Open bookmark data file to append
For each HTML "statement", recursively
    If <H1> (chapter heading) tag
        Remember its text
    Else if <Hn> (heading) tag
        Remember its text
    Else if <A NAME="xxx"> tag
        If item is to be cross linked and/or indexed
            Write bookmark information to data file
Close bookmark data file
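To make the paraphrase above concrete, here is a minimal Python sketch of the same extraction pass. It is not XTRAN meta-code and not the actual rules; it uses Python's standard html.parser, and the names BookmarkExtractor and extract_bookmarks, plus the record field names, are assumptions made for illustration. It also assumes that an <H1> counts as the most recent header until a later heading replaces it.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class BookmarkExtractor(HTMLParser):
    """Collect one record per <A NAME="..."> bookmark (illustrative only)."""

    def __init__(self, file_name):
        super().__init__()
        self.file_name = file_name
        self.chapter_title = ""    # text of the most recent <H1>
        self.section = ""          # text of the most recent heading
        self.records = []
        self._heading = None       # heading tag currently open, if any
        self._heading_text = []
        self._open = None          # bookmark record currently open, if any

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases names; valueless attributes map to None.
        names = dict(attrs)
        if tag in HEADINGS:
            self._heading, self._heading_text = tag, []
        elif tag == "a" and names.get("name"):
            link = "nolink" not in names
            index = "noindex" not in names
            if link or index:      # record nothing if neither applies
                self._open = {"name": names["name"], "text": [],
                              "section": self.section,
                              "chapter": self.chapter_title,
                              "file": self.file_name,
                              "link": link, "index": index}

    def handle_data(self, data):
        if self._heading:
            self._heading_text.append(data)
        if self._open is not None:
            self._open["text"].append(data)

    def handle_endtag(self, tag):
        if tag == self._heading:
            text = "".join(self._heading_text).strip()
            if tag == "h1":
                self.chapter_title = text
            self.section = text
            self._heading = None
        elif tag == "a" and self._open is not None:
            rec = self._open
            # Fall back to the bookmark name when no text is enclosed.
            rec["text"] = "".join(rec["text"]).strip() or rec["name"]
            self.records.append(rec)
            self._open = None


def extract_bookmarks(html_text, file_name):
    """Return the bookmark records found in one chapter's HTML text."""
    parser = BookmarkExtractor(file_name)
    parser.feed(html_text)
    parser.close()
    return parser.records
```

Calling extract_bookmarks once per chapter file and appending the results to one list would play the role of the bookmark data file in this sketch.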
Rules that check bookmark data for name duplication (step 2 above)
We run a rules-only version of XTRAN with these rules, after bookmark data have been extracted from all HTML files in the Master Document, to check the bookmark data for name duplication that could mess up the cross linking process.
Read bookmark data from file for all chapters
Create bookmark duplication output file
For each bookmark
    If bookmark is not to be cross linked
        Continue
    For each bookmark following this one
        If bookmark is not to be cross linked
            Continue
        If same bookmark name
            Write information to output file
        If bookmark name contains our bookmark name
            Write information to output file
Close output file
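The pairwise check paraphrased above can be sketched in a few lines of Python. This is an illustration under assumptions, not the XTRAN rules: the function name find_duplications and the record keys "name" and "link" are hypothetical, and the sketch reports an exact match once rather than also reporting it as a containment.

```python
def find_duplications(bookmarks):
    """Report name clashes among bookmarks that are to be cross linked.

    bookmarks: list of dicts with "name" (str) and "link" (bool) keys,
    in the same order as the bookmark data file (assumed layout).
    """
    reports = []
    for i, first in enumerate(bookmarks):
        if not first["link"]:
            continue
        for later in bookmarks[i + 1:]:
            if not later["link"]:
                continue
            if later["name"] == first["name"]:
                # Exact duplicate: fixed by editing the Master Document.
                reports.append(("exact", first["name"], later["name"]))
            elif first["name"] in later["name"]:
                # A later name contains an earlier one: fixed by reordering.
                reports.append(("embedded", first["name"], later["name"]))
    return reports


if __name__ == "__main__":
    data = [{"name": "rules", "link": True},
            {"name": "rules language", "link": True}]
    for kind, earlier, later in find_duplications(data):
        print(kind, repr(earlier), repr(later))
```

Only a later name that contains an earlier one is reported, which fits the reordering remedy described in the strategy: presumably the longer name must precede the shorter one so that it is matched first during cross linking.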
Rules that insert cross links in HTML (step 3 above)
We run an HTML re-engineering version of XTRAN with these rules on each HTML chapter in a document variant, to recreate that variant after changes or additions to the Master Document.
Read list of chapters for our variant from file
Read bookmark data from file for our variant only
Read and parse HTML from Master Document version of chapter
For each HTML "statement", recursively
    If it's <A NAME="xxx"> tag with NOLINK and/or NOINDEX attributes
        Remove them
        Continue
    If it isn't non-markup text
        Continue
    If it's already enclosed in <A> tag (it's a bookmark or already indexed)
        Continue
    For each bookmark name to be cross linked
        If the bookmark's text occurs in the most recent header
            Continue
        For each occurrence of bookmark name in this text
            Replace occurrence with a link to item's bookmark
            Set to continue with 1st replacement "statement"
Output re-engineered HTML for chapter to variant's subdirectory
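The inner loops of the paraphrase above, which handle one segment of nonmarkup text, might look roughly like the following Python sketch. It is an illustration, not the XTRAN rules: the function name cross_link, the record keys, the sample bookmark names, and the exact form of the generated <A HREF> tag are all assumptions. Pieces that have already been turned into links are never searched again, so a shorter name cannot be linked inside a longer one that was processed first.

```python
import re


def cross_link(text, bookmarks, current_file, recent_header):
    """Return `text` with bookmark-name occurrences wrapped in links.

    bookmarks: dicts with "name", "text", "file", and "link" keys
    (assumed layout), in the order they should be tried.
    """
    pieces = [(text, False)]           # (fragment, already linked?)
    for bm in bookmarks:
        # Skip bookmarks not to be linked, or named in the nearest header.
        if not bm["link"] or bm["text"] in recent_header:
            continue
        target = "#" + bm["name"]
        if bm["file"] != current_file:
            target = bm["file"] + target   # the file is needed only across chapters
        pattern = re.compile(re.escape(bm["name"]))
        new_pieces = []
        for piece, linked in pieces:
            if linked:
                new_pieces.append((piece, True))
                continue
            pos = 0
            for m in pattern.finditer(piece):
                new_pieces.append((piece[pos:m.start()], False))
                link = '<A HREF="%s">%s</A>' % (target, m.group())
                new_pieces.append((link, True))
                pos = m.end()
            new_pieces.append((piece[pos:], False))
        pieces = new_pieces
    return "".join(piece for piece, _ in pieces)


if __name__ == "__main__":
    bms = [{"name": "meta-code", "text": "meta-code",
            "file": "chapter1.html", "link": True},
           {"name": "code", "text": "code",
            "file": "chapter2.html", "link": True}]
    print(cross_link("Write meta-code, then code.", bms,
                     "chapter2.html", "Overview"))
```

In this hypothetical run the first name links to chapter1.html#meta-code because its bookmark is in another chapter, while the second links to #code alone because its bookmark is in the chapter being processed.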
Rules that generate HTML index from bookmark data (step 4 above)
We run an HTML re-engineering / generating version of XTRAN with these rules, once for each document variant, to create its index, after changes or additions to the Master Document.
Read list of chapters for our variant from file
Read bookmark data from file for our variant only
Generate HTML index header
For each bookmark item
    If bookmark is not to be indexed
        Continue
    Record bookmark text and sequential number (for sorting)
    If new starting character
        Record bookmark text's starting character (for sorting)
Sort starting characters
Generate alphabetical "index to the index" using starting characters
Sort bookmark data texts
Generate start of HTML table
For each bookmark item, alphabetically
    If item is not to be indexed
        Continue
    If new starting character
        Generate header for it as table row, including bookmark for "index to the index"
    Generate index entry as table row, including its section and chapter
Generate end of HTML table
Generate end of HTML document
Write out HTML we've generated as document index, to variant's subdirectory
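As a rough illustration of the index generation paraphrased above, the following Python sketch builds the "index to the index" and a table of entries from bookmark records. It is not the XTRAN rules: the function name build_index, the record keys, the "idx-" anchor prefix, and the table layout are assumptions. It also assumes every bookmark text is nonempty, which holds if the text falls back to the bookmark name as described for the bookmark data file.

```python
from html import escape


def build_index(bookmarks, variant_files):
    """Return HTML text for one variant's index (illustrative layout)."""
    # Keep only indexable bookmarks whose chapters are in this variant.
    entries = sorted(
        (bm for bm in bookmarks
         if bm["index"] and bm["file"] in variant_files),
        key=lambda bm: bm["text"].lower())
    # Starting characters, in the order the sorted entries introduce them.
    letters = list(dict.fromkeys(bm["text"][0].upper() for bm in entries))

    out = ["<HTML><HEAD><TITLE>Index</TITLE></HEAD><BODY>",
           "<H1>Index</H1>", "<P>"]
    # The "index to the index": one link per starting character.
    out += ['<A HREF="#idx-%s">%s</A>' % (escape(c), escape(c))
            for c in letters]
    out += ["</P>", "<TABLE>"]
    current = None
    for bm in entries:
        first = bm["text"][0].upper()
        if first != current:
            current = first
            out.append('<TR><TH COLSPAN="3"><A NAME="idx-%s">%s</A></TH></TR>'
                       % (escape(first), escape(first)))
        # Entry row: linked text, then its section and chapter.
        out.append('<TR><TD><A HREF="%s#%s">%s</A></TD>'
                   '<TD>%s</TD><TD>%s</TD></TR>'
                   % (bm["file"], bm["name"], escape(bm["text"]),
                      escape(bm["section"]), escape(bm["chapter"])))
    out += ["</TABLE>", "</BODY></HTML>"]
    return "\n".join(out)
```

Showing the section and chapter beside each entry is what gives the reader the context described in the objective.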
Normally, we generate each document variant into its own subdirectory, so the HTML file names need not change. In this example, however, all of the HTML files live in the same directory, so we manually adjusted their file names and links to them. These are the only changes made by hand, and would not normally be necessary.
Input to, and Output from, XTRAN
This example uses a "mini" version of the XTRAN User's Manual to show the effects of the cross linking and indexing procedures. This "mini" version is not proprietary; the actual XTRAN User's Manual is proprietary and requires a nondisclosure agreement. Also, this "mini" version is for demonstration purposes only and is not necessarily current. Therefore, its contents do not constitute any representation by XTRAN, LLC about XTRAN.
The actual XTRAN User's Manual (including all variants) has more than 1,200 bookmarks in over 50 HTML files containing more than 20,000 nonmarkup text items; cross linking it involves as many as 25 million checks for cross links to insert.
- var1ttl.html —
Title page for variant 1, with links to its chapters and to the index we will
create for it
- var2ttl.html — Title page for variant 2,
with links to its chapters and to the index we will create for it
- chapter1.html — Participates in variants 1 and 2
- chapter2.html — Participates in variants 1 and 2
- chapter3.html — Participates in variant 2
Variant Chapter Lists
These small text files are read by the XTRAN rules to determine which chapters participate in which document variants.
; master.nam — Chapters in XTRAN User's Manual, all variants
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of master.nam

; variant1.nam — Chapters in XTRAN User's Manual variant 1
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
chapter1.html
chapter2.html
;
; End of variant1.nam

; variant2.nam — Chapters in XTRAN User's Manual variant 2
; Revised 2002-02-09.1258 by S. F. Heffner
;
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of variant2.nam
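For readers who want to experiment with chapter lists like these outside XTRAN, here is a small Python sketch of reading one. It assumes, from the listings above, that lines starting with ";" are comments and that every other nonblank line is one chapter file name; the function name read_chapter_list is hypothetical and this is not how the XTRAN rules themselves are written.

```python
def read_chapter_list(text):
    """Return the chapter file names listed in the text of a .nam file.

    Assumes ";" starts a comment line and each remaining nonblank line
    holds exactly one file name.
    """
    chapters = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(";"):
            chapters.append(line)
    return chapters


if __name__ == "__main__":
    variant1 = """\
; variant1.nam
;
var1ttl.html
chapter1.html
chapter2.html
;
; End of variant1.nam
"""
    print(read_chapter_list(variant1))
```

The resulting list can then be used to restrict the bookmark data to the bookmarks whose files belong to the variant being built.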