XTRAN Example — Cross Link & Index HTML Documents
Anatomy of a Complex HTML Document with Variants
A document (or Web site) written in HTML may comprise many "chapters" (HTML source files), typically with a title page and a Table of Contents that links to all of the chapter files. An example is the XTRAN User's Manual.
In some situations, we may have variants of the document, in which not all chapters participate in every variant. In the case of the XTRAN User's Manual, each XTRAN licensee receives a variant that is tailored for the licensed activity. Someone licensing XTRAN for, say, translation of Pascal to C would get an XTRAN User's Manual containing only those chapters relevant to that activity. Someone else licensing XTRAN for, say, analysis of VAX assembler would receive a different variant, containing a different selection of chapters.
Each variant must have its own title page and table of contents, which links to the chapters that participate in that variant and to the index we will create for it.
Objective: Cross Link and Index HTML Documents
For maximum convenience of use, we would like for each occurrence of a significant term anywhere in a variant of the document to be cross linked to that term's definition, which is likely to be in a different chapter. However, we don't want to cross link the occurrence of a term whose definition is in a chapter not included in our variant.
We also want to have a thorough alphabetical index at the end of our document. Obviously, it should include only entries whose bookmarks are in chapters included in our variant. For the reader's convenience, we would also like for each index entry to show the section and chapter in which it occurs, for context, and we would like to have a letter by letter index to the index at the start.
As any professional indexer will tell you, the hard work is deciding how and where to index terms in the document. Although many attempts have been made to automate this process, with some success, the only way to guarantee a really good job is for a knowledgeable human being to carefully control the indexing process and "fine tune" the final product.
However, once the basic indexing of the chapters has been done, we can automate the rest of the work — cross referencing item occurrences in a document variant's chapters and generating the document variant's index.
How XTRAN Can Help
XTRAN treats HTML as a computer language, in which XTRAN represents as a "statement" each tag, segment of nonmarkup text, and end tag. XTRAN represents each attribute of a tag as a "statement attribute", possibly with a value.
XTRAN's internal representation of HTML is essentially the same as for all other computer languages XTRAN manipulates, including assemblers, 3GLs such as Pascal and PL/I, 4GLs such as Natural, meta-data languages such as XML, scripting languages, Web languages, data base languages, and Domain Specific Languages. This means that the full power of XTRAN's rules language is available to manipulate HTML.
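As a rough illustration of this "statement" view of HTML, the following Python sketch uses the standard library's html.parser module to turn a fragment into a flat list of tags, nonmarkup text segments, and end tags, each tag carrying its attributes. It is not XTRAN's internal representation, which is not shown here; the class name StatementLister and the tuple layout are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class StatementLister(HTMLParser):
    """Hypothetical helper: lists HTML "statements" in document order."""

    def __init__(self):
        super().__init__()
        self.statements = []

    def handle_starttag(self, tag, attrs):
        # Attributes become a dict; a valueless attribute maps to None.
        self.statements.append(("tag", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.statements.append(("end tag", tag, {}))

    def handle_data(self, data):
        # Skip segments that are only whitespace between tags.
        if data.strip():
            self.statements.append(("text", data, {}))


lister = StatementLister()
lister.feed('<P>See <A NAME="rule" NOINDEX>rules</A>.</P>')
lister.close()
for statement in lister.statements:
    print(statement)
```

Note that html.parser lowercases tag and attribute names, so the NOINDEX attribute appears as the key "noindex" with no value.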
This example shows how we can use several versions of XTRAN to automatically cross link and index all occurrences of specified text items. The example assumes that the target to which each item is to be cross linked and indexed is marked with a bookmark that has an appropriate name. Such a bookmark may or may not enclose related text.
We use two (illegal) HTML <A> attributes, NOLINK and NOINDEX, to control, in the original HTML files, whether each bookmark is cross linked and/or indexed. These attributes are removed in the process of cross linking, so they don't show up in the final document variant.
We also use a set of XTRAN styling rules for HTML output that specify, for each tag, whether it and its end tag (if any) are to get preceding and/or following line breaks in the output.
Our strategy is as follows:
- We maintain a Master Document comprising all chapters for all document variants. Into these HTML files we insert bookmarks, in the form
<A NAME="xxx" [[NOLINK]] [[NOINDEX]]>
where the optional NOLINK and NOINDEX attributes control subsequent cross linking and indexing. All changes to the document must be made to these versions.
- As part of the Master Document, we maintain, for each document variant, a title page with a table of contents that links to the chapter HTML files that comprise that variant, and also links to the index we will create for that variant.
- We maintain, for each document variant, a text file named <variant>.nam that lists all of the chapter HTML files that comprise that variant.
- We use a version of XTRAN that does HTML analysis, with rules that extract bookmark information from all chapters of the Master Document (STEP 1 below), creating a bookmark data file used by all subsequent steps, for all document variants. This data file includes, for each bookmark:
  - The bookmark name.
  - Text included in the bookmark, or if none, the bookmark name again.
  - The most recent header text preceding the bookmark (what section it's in).
  - The title of the chapter the bookmark is in.
  - The name of the HTML file the bookmark is in.
  - Whether the bookmark is to be cross linked and/or indexed, based on the presence or absence of NOLINK and NOINDEX attributes in the original HTML. If it is to be neither cross linked nor indexed, we don't record an entry for it.
- Because cross linking won't work if there are duplications among the items
to be cross linked, we use a version of XTRAN that
only evaluates rules, with rules that find and report any such duplications in
the bookmark data file (STEP 2 below). We then edit the data file
and reorder bookmark data to eliminate embedded duplications. We also
edit the Master Document HTML files to eliminate bookmark name
duplications. We rerun the duplication checking rules after every such edit.
- For each document variant, we use a version
of XTRAN that does HTML re-engineering, with rules
that read the bookmark data, then find each occurrence of each bookmark name in
the chapters comprising that variant and cross link that occurrence to the
relevant bookmark, excluding occurrences of names whose bookmarks are in
chapters not included in our variant (STEP 3 below).
For example, an occurrence of a bookmark's name in nonmarkup text, perhaps in a different chapter file from the bookmark itself, would be replaced with an <A HREF> tag enclosing that text, creating a link the reader can follow to the bookmark.
However, if the bookmark's text occurs in the most recent header, the rules don't insert a cross link, since it would probably just link to the same immediate area.
This step also removes the (illegal) NOLINK and NOINDEX attributes from the bookmarks. We write the resulting HTML to the variant's subdirectory.
- For each variant, we use a version of XTRAN that
does HTML re-engineering, with rules that read the bookmark data and generate
an HTML index for that variant, excluding bookmarks in chapters not included in
our variant, and including an alphabetical "index to the index" at
its start (STEP 4 below). We write the resulting HTML to the variant's subdirectory.
Here is a flowchart of this process, in which the elements are color coded:
- BLUE for XTRAN versions (runnable programs)
- ORANGE for XTRAN rules (text files)
- RED for HTML files
- PURPLE for text data files
The following is an English paraphrase of the major XTRAN rules used for this example.
Rules that extract bookmark data from HTML
These rules are run, with a version of XTRAN that analyzes HTML, on each HTML file in the Master Document, after the Master Document has been changed or new chapters have been added to it. We first delete the bookmark data file, so that it will be recreated "from scratch".
Read and parse HTML to be analyzed
Open bookmark data file to append
For each HTML "statement", recursively
    If <H1> (chapter heading) tag
        Remember its text
    Else if <Hn> (heading) tag
        Remember its text
    Else if <A NAME="xxx"> tag
        If item is to be cross linked and/or indexed
            Write bookmark information to data file
Close bookmark data file
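The extraction rules paraphrased above can be approximated outside XTRAN. The following Python sketch, built on the standard library's html.parser, collects one record per bookmark with the fields the data file is described as holding. The class name BookmarkExtractor, the dictionary keys, and the sample HTML are assumptions for illustration; the actual data file format is not shown in this example, so writing the records to a file is left out.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BookmarkExtractor(HTMLParser):
    """Hypothetical sketch: one record per <A NAME=...> bookmark in a chapter."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename
        self.chapter = ""          # text of the most recent <H1>
        self.section = ""          # text of the most recent <H2>..<H6>
        self.records = []
        self._heading = None       # heading tag currently open, if any
        self._heading_text = []
        self._bookmark = None      # bookmark record currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)        # names arrive lowercased
        if tag in HEADINGS:
            self._heading, self._heading_text = tag, []
        elif tag == "a" and "name" in attrs:
            link = "nolink" not in attrs
            index = "noindex" not in attrs
            if link or index:      # neither wanted: record nothing
                self._bookmark = {
                    "name": attrs["name"], "text": "",
                    "section": self.section, "chapter": self.chapter,
                    "file": self.filename, "link": link, "index": index,
                }

    def handle_data(self, data):
        if self._heading is not None:
            self._heading_text.append(data)
        if self._bookmark is not None:
            self._bookmark["text"] += data

    def handle_endtag(self, tag):
        if tag == self._heading:
            text = " ".join("".join(self._heading_text).split())
            if tag == "h1":
                self.chapter = text
            else:
                self.section = text
            self._heading = None
        elif tag == "a" and self._bookmark is not None:
            record = self._bookmark
            # A bookmark that encloses no text falls back to its name.
            record["text"] = " ".join(record["text"].split()) or record["name"]
            self.records.append(record)
            self._bookmark = None


extractor = BookmarkExtractor("chapter1.html")
extractor.feed(
    '<H1>Intro</H1><H2>Terms</H2><P>An <A NAME="rule" NOINDEX>XTRAN rule</A>, '
    '<A NAME="skip" NOLINK NOINDEX>x</A>, <A NAME="empty"></A>'
)
extractor.close()
```

Running one extractor per chapter file and concatenating the records would correspond to appending to a single bookmark data file.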
Rules that check bookmark data for name duplication
These rules are run, with a version of XTRAN that only evaluates rules, after bookmark data have been extracted from all HTML files in the Master Document, to check the bookmark data for name duplication that could mess up the cross linking process.
Read bookmark data from file for all chapters
Create bookmark duplication output file
For each bookmark
    If bookmark is not to be cross linked
        Continue
    For each bookmark following this one
        If bookmark is not to be cross linked
            Continue
        If same bookmark name
            Write information to output file
        If bookmark name contains our bookmark name
            Write information to output file
Close output file
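The duplication check paraphrased above is a pairwise comparison of each cross linkable bookmark with the ones that follow it. A minimal Python sketch follows, assuming each bookmark is a dictionary with "name" and "link" keys (hypothetical names, not XTRAN's). Unlike the paraphrase, it reports an identical name once, as a duplicate, rather than also reporting it as an embedded name.

```python
def find_conflicts(bookmarks):
    """Return (kind, earlier_name, later_name) for each problem pair.

    bookmarks: list of dicts in data file order, each with a "name" string
    and a "link" flag (True if the bookmark is to be cross linked).
    """
    conflicts = []
    for i, ours in enumerate(bookmarks):
        if not ours["link"]:
            continue
        for theirs in bookmarks[i + 1:]:
            if not theirs["link"]:
                continue
            if theirs["name"] == ours["name"]:
                conflicts.append(("duplicate", ours["name"], theirs["name"]))
            elif ours["name"] in theirs["name"]:
                # A later, longer name contains this earlier one.
                conflicts.append(("embedded", ours["name"], theirs["name"]))
    return conflicts


sample = [
    {"name": "rule", "link": True},
    {"name": "ruleset", "link": True},
    {"name": "rule", "link": True},
    {"name": "rule", "link": False},   # not cross linked, so ignored
]
conflicts = find_conflicts(sample)
```

Only a later name that contains an earlier one is reported, which fits the remedy of reordering the bookmark data: presumably a longer name must be processed before a shorter name embedded in it.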
Rules that insert cross links in HTML
These rules are run, with a version of XTRAN that re-engineers HTML, on each HTML chapter in a document variant, to recreate that variant after changes or additions to the Master Document.
Read list of chapters for our variant from file
Read bookmark data from file for our variant only
Read and parse HTML from Master Document version of chapter
For each HTML "statement", recursively
    If it's <A NAME="xxx"> tag with NOLINK and/or NOINDEX attributes
        Remove them
        Continue
    If it isn't non-markup text
        Continue
    If it's already enclosed in <A> tag (it's a bookmark or already indexed)
        Continue
    For each bookmark name to be cross linked
        If the bookmark's text occurs in the most recent header
            Continue
        For each occurrence of bookmark name in this text
            Replace occurrence with a link to item's bookmark
            Set to continue with 1st replacement "statement"
Output re-engineered HTML for chapter to variant's subdirectory
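The core of the cross linking step, replacing occurrences of bookmark names in one segment of nonmarkup text, can be sketched in Python as follows. This is an illustration under stated assumptions, not XTRAN's rules: the function name cross_link, the dictionary keys, exact case-sensitive substring matching, and the file#name link form are all assumed here. Text that has already been turned into a link is never searched again, which is why the order of the bookmark data matters.

```python
from html import escape


def cross_link(text, bookmarks, variant_files, recent_header=""):
    """Return text with bookmark names wrapped in <A HREF=...> links.

    text: one segment of nonmarkup HTML source text.
    bookmarks: dicts with "name", "text", "file" and "link" keys.
    variant_files: set of chapter file names in this variant.
    """
    pieces = [(text, False)]              # (chunk, already_linked)
    for bm in bookmarks:
        name = bm["name"]
        if not bm["link"] or bm["file"] not in variant_files:
            continue                      # not linkable in this variant
        if bm["text"] in recent_header:
            continue                      # would link to the same area
        href = escape(f'{bm["file"]}#{name}')
        updated = []
        for chunk, linked in pieces:
            if linked or name not in chunk:
                updated.append((chunk, linked))
                continue
            for k, part in enumerate(chunk.split(name)):
                if k:                     # one link per occurrence
                    updated.append((f'<A HREF="{href}">{name}</A>', True))
                if part:
                    updated.append((part, False))
        pieces = updated
    return "".join(chunk for chunk, _ in pieces)


bookmarks = [
    {"name": "ruleset", "text": "ruleset", "file": "chapter2.html", "link": True},
    {"name": "rule", "text": "rule", "file": "chapter1.html", "link": True},
    {"name": "VAX", "text": "VAX", "file": "chapter3.html", "link": True},
]
variant1 = {"chapter1.html", "chapter2.html"}
linked = cross_link("Each rule belongs to a ruleset on VAX.", bookmarks, variant1)
print(linked)
```

In this sample "VAX" stays unlinked because its bookmark lives in chapter3.html, which is not part of the variant, and "ruleset" is listed before "rule" so that the longer name is linked first.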
Rules that generate HTML index from bookmark data
These rules are run, with a version of XTRAN that re-engineers and generates HTML, once for each document variant, to create its index, after changes or additions to the Master Document.
Read list of chapters for our variant from file
Read bookmark data from file for our variant only
Generate HTML index header
For each bookmark item
    If bookmark is not to be indexed
        Continue
    Record bookmark text and sequential number (for sorting)
    If new starting character
        Record bookmark text's starting character (for sorting)
Sort starting characters
Generate alphabetical "index to the index" using starting characters
Sort bookmark data texts
Generate start of HTML table
For each bookmark item, alphabetically
    If item is not to be indexed
        Continue
    If new starting character
        Generate header for it as table row, including bookmark for "index to the index"
    Generate index entry as table row, including its section and chapter
Generate end of HTML table
Write out HTML we've generated as document index, to variant's subdirectory
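The index generation paraphrased above amounts to filtering, sorting, and grouping the bookmark data by starting character. The Python sketch below shows one way to do that; the function name build_index, the dictionary keys, the idx- bookmark prefix, and the table layout are assumptions for illustration. It relies on Python's built-in sort and itertools.groupby instead of recording sequence numbers and starting characters separately.

```python
from html import escape
from itertools import groupby


def build_index(bookmarks, variant_files):
    """Return index HTML for the bookmarks that belong to one variant."""
    entries = sorted(
        (bm for bm in bookmarks if bm["index"] and bm["file"] in variant_files),
        key=lambda bm: bm["text"].upper(),
    )
    # Group the sorted entries by their (uppercased) starting character.
    groups = [
        (letter, list(items))
        for letter, items in groupby(entries, key=lambda bm: bm["text"].upper()[:1])
    ]
    lines = ["<H1>Index</H1>"]
    # The "index to the index": one link per starting character.
    lines.append("<P>" + " ".join(
        f'<A HREF="#idx-{escape(letter)}">{escape(letter)}</A>'
        for letter, _ in groups
    ) + "</P>")
    lines.append("<TABLE>")
    for letter, items in groups:
        lines.append(
            f'<TR><TH COLSPAN="3"><A NAME="idx-{escape(letter)}">'
            f"{escape(letter)}</A></TH></TR>"
        )
        for bm in items:
            href = escape(f'{bm["file"]}#{bm["name"]}')
            lines.append(
                f'<TR><TD><A HREF="{href}">{escape(bm["text"])}</A></TD>'
                f'<TD>{escape(bm["section"])}</TD>'
                f'<TD>{escape(bm["chapter"])}</TD></TR>'
            )
    lines.append("</TABLE>")
    return "\n".join(lines)


def _bm(name, text, file, index=True):
    # Hypothetical sample record; section and chapter are placeholders.
    return {"name": name, "text": text, "section": "Terms", "chapter": "Intro",
            "file": file, "link": True, "index": index}


sample = [
    _bm("rule", "rule", "chapter1.html"),
    _bm("ruleset", "Ruleset", "chapter2.html"),
    _bm("analysis", "analysis", "chapter1.html"),
    _bm("hidden", "hidden", "chapter1.html", index=False),
    _bm("vax", "VAX", "chapter3.html"),
]
page = build_index(sample, {"chapter1.html", "chapter2.html"})
```

Each table row carries the entry's section and chapter for context, and each starting character gets a header row whose bookmark is the target of the letter links at the top.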
Normally, we generate each document variant into its own subdirectory, so the HTML file names need not change. In this example, however, all of the HTML files live in the same directory, so we manually adjusted their file names and links to them. These are the only changes made by hand, and would not normally be necessary.
Input to, and Output from, XTRAN
This example uses a "mini" version of the XTRAN User's Manual to show the effects of the cross linking and indexing procedures. This "mini" version is not proprietary; the actual XTRAN User's Manual is proprietary and requires a nondisclosure agreement. Also, this "mini" version is for demonstration purposes only and is not necessarily current. Therefore, its contents do not constitute any representation by XTRAN, LLC about XTRAN.
The actual XTRAN User's Manual (including all variants) has more than 1,200 bookmarks in over 50 HTML files containing more than 20,000 nonmarkup text items; cross linking it involves as many as 25 million checks for cross links to insert.
- var1ttl.html: Title page for variant 1, with links to its chapters and to the index we will create for it
- var2ttl.html: Title page for variant 2, with links to its chapters and to the index we will create for it
- chapter1.html: Participates in variants 1 and 2
- chapter2.html: Participates in variants 1 and 2
- chapter3.html: Participates in variant 2
Variant Chapter Lists
These small text files are read by the XTRAN rules to determine which chapters participate in which document variants.
; master.nam -- Chapters in XTRAN User's Manual, all variants
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of master.nam

; variant1.nam -- Chapters in XTRAN User's Manual variant 1
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
chapter1.html
chapter2.html
;
; End of variant1.nam

; variant2.nam -- Chapters in XTRAN User's Manual variant 2
; Revised 2002-02-09.1258 by S. F. Heffner
;
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of variant2.nam
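Reading such a chapter list is straightforward in any language. The following Python sketch assumes that each file holds one entry per line and that lines starting with ";" are comments; both are inferred from the listings above rather than from a stated format, and the function name parse_chapter_list is hypothetical.

```python
def parse_chapter_list(text):
    """Return the HTML file names listed in the text of a .nam file."""
    chapters = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(";"):   # skip blanks and comments
            chapters.append(line)
    return chapters


variant1_text = """\
; variant1.nam
;
var1ttl.html
chapter1.html
chapter2.html
;
; End of variant1.nam
"""
variant1_files = parse_chapter_list(variant1_text)
print(variant1_files)
```

The resulting list can be turned into a set to test whether the chapter holding a given bookmark participates in the variant being built.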