XTRAN Example — Cross Link & Index HTML Documents


Anatomy of a Complex HTML Document with Variants

A document (or Web site) written in HTML may comprise many "chapters" (HTML source files), typically with a title page and a Table of Contents that links to all of the chapter files.  An example is the XTRAN User's Manual.

In some situations, we may have variants of the document, in which not all chapters participate in every variant.  In the case of the XTRAN User's Manual, each XTRAN licensee receives a variant that is tailored for the licensed activity.  Someone licensing XTRAN for, say, translation of Pascal to C would get an XTRAN User's Manual containing only those chapters relevant to that activity.  Someone else licensing XTRAN for, say, analysis of VAX assembler would receive a different variant, containing a different selection of chapters.

Each variant must have its own title page and table of contents, which links to the chapters that participate in that variant and to the index we will create for it.


Objective:  Cross Link and Index HTML Documents

For the reader's convenience, we would like each occurrence of a significant term anywhere in a variant of the document to be cross linked to that term's definition, which is likely to be in a different chapter.  However, we don't want to cross link an occurrence of a term whose definition is in a chapter not included in our variant, because such a link would have no target.

We also want a thorough alphabetical index at the end of our document.  Obviously, it should include only entries whose bookmarks are in chapters included in our variant.  For the reader's convenience, we would also like each index entry to show the section and chapter in which it occurs, for context, and we would like a letter by letter index to the index at the start.

As any professional indexer will tell you, the hard work is deciding how and where to index terms in the document.  Although many attempts have been made to automate this process, with some success, the only way to guarantee a really good job is for a knowledgeable human being to carefully control the indexing process and "fine tune" the final product.

However, once the basic indexing of the chapters has been done, we can automate the rest of the work:  cross referencing item occurrences in a document variant's chapters and generating the document variant's index.


How XTRAN Can Help

XTRAN treats HTML as a computer language, representing each tag, each segment of nonmarkup text, and each end tag as a "statement".  XTRAN represents each attribute of a tag as a "statement attribute", possibly with a value.

XTRAN's internal representation of HTML is essentially the same as for all other computer languages XTRAN manipulates, including assemblers, 3GLs such as Pascal and PL/I, 4GLs such as Natural, meta-data languages such as XML, scripting languages, Web languages, data base languages, and Domain Specific Languages.  This means that the full power of XTRAN's rules language is available to manipulate HTML.

This example shows how we can use several versions of XTRAN to automatically cross link and index all occurrences of specified text items.  The example assumes that the target to which each item is to be cross linked and indexed is marked with a bookmark that has an appropriate name.  Such a bookmark may or may not enclose related text.

We use two (illegal) HTML <A> attributes in the bookmarks, NOLINK and NOINDEX, to control, in the original HTML files, whether each bookmark is cross linked and/or indexed.  These attributes are removed in the process of cross linking, so they don't show up in the final document variant.

We also use a set of XTRAN styling rules for HTML output that specify, for each tag, whether it and its end tag (if any) are to get preceding and/or following line breaks in the output.


Strategy

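Our strategy, in outline, follows the four sets of rules described under "Rules":

    Extract bookmark data from each HTML file in the Master Document
    Check the bookmark data for name duplication
    Insert cross links in each chapter of each document variant
    Generate an index for each document variant

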
Process Flowchart

[Figure:  data flowchart of this process, with color coded elements]

Rules

The following is an English paraphrase of the major XTRAN rules used for this example.


Rules that extract bookmark data from HTML

These rules are run, with a version of XTRAN that analyzes HTML, on each HTML file in the Master Document, after the Master Document has been changed or new chapters have been added to it.  We first delete the bookmark data file, so that it will be recreated "from scratch".

    Read and parse HTML to be analyzed
    Open bookmark data file to append
    For each HTML "statement", recursively
        If <H1> (chapter heading) tag
            Remember its text
        Else if <Hn> (heading) tag
            Remember its text
        Else if <A NAME="xxx"> tag
            If item is to be cross linked and/or indexed
                Write bookmark information to data file
    Close bookmark data file
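
As a rough illustration only, the extraction paraphrased above might be approximated outside XTRAN with Python's standard html.parser module.  This is a sketch, not XTRAN's rules language:  the names BookmarkExtractor and extract_bookmarks and the tuple layout are assumptions made for the example, and it returns the records instead of appending them to a data file.

```python
from html.parser import HTMLParser


class BookmarkExtractor(HTMLParser):
    """Collects (name, chapter, section, nolink, noindex) per <A NAME=...>."""

    HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

    def __init__(self):
        super().__init__()
        self.chapter = ""      # text of the most recent <H1>
        self.section = ""      # text of the most recent <H2>..<H6>
        self._heading = None   # heading tag currently open, if any
        self._text = []        # text pieces seen inside the open heading
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)    # HTMLParser lowercases tag and attribute names
        if tag in self.HEADINGS:
            self._heading, self._text = tag, []
        elif tag == "a" and "name" in attrs:
            nolink, noindex = "nolink" in attrs, "noindex" in attrs
            # Record the item only if it is cross linked and/or indexed.
            if not (nolink and noindex):
                self.bookmarks.append(
                    (attrs["name"], self.chapter, self.section, nolink, noindex))

    def handle_data(self, data):
        if self._heading:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._heading:
            text = " ".join("".join(self._text).split())
            if tag == "h1":
                self.chapter, self.section = text, ""
            else:
                self.section = text
            self._heading = None


def extract_bookmarks(html_text):
    """Return the bookmark records found in one HTML chapter."""
    parser = BookmarkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks
```

One simplification to be aware of:  a bookmark placed inside a heading is recorded with the previous heading's text, because the heading is only remembered when its end tag is seen.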

Rules that check bookmark data for name duplication

These rules are run, with a version of XTRAN that only evaluates rules, after bookmark data have been extracted from all HTML files in the Master Document, to check the bookmark data for name duplication that could disrupt the cross linking process, such as two bookmarks with the same name or one bookmark name that contains another.

    Read bookmark data from file for all chapters
    Create bookmark duplication output file
    For each bookmark
        If bookmark is not to be cross linked
            Continue
        For each bookmark following this one
            If bookmark is not to be cross linked
                Continue
            If same bookmark name
                Write information to output file
            If bookmark name contains our bookmark name
                Write information to output file
    Close output file
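
The pairwise check paraphrased above can be sketched in Python as follows.  The function name find_name_conflicts and the (name, chapter, nolink) tuple layout are hypothetical, chosen only for this illustration.  Unlike the paraphrase, the sketch reports an exact duplicate once, not also as a "contains" case.

```python
def find_name_conflicts(bookmarks):
    """Report name clashes among bookmarks that are to be cross linked.

    bookmarks: list of (name, chapter, nolink) tuples (assumed layout).
    Returns a list of (kind, name, chapter, other_name, other_chapter).
    """
    conflicts = []
    # Bookmarks marked NOLINK take no part in cross linking, so skip them.
    linked = [(name, chapter) for name, chapter, nolink in bookmarks if not nolink]
    for i, (name, chapter) in enumerate(linked):
        # Compare only with the bookmarks that follow this one.
        for other, other_chapter in linked[i + 1:]:
            if other == name:
                conflicts.append(("duplicate", name, chapter, other, other_chapter))
            elif name in other:
                conflicts.append(("contains", name, chapter, other, other_chapter))
    return conflicts
```

As in the paraphrase, containment is tested in one direction only:  whether a later bookmark name contains the current one.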

Rules that insert cross links in HTML

These rules are run, with a version of XTRAN that re-engineers HTML, on each HTML chapter in a document variant, to recreate that variant after changes or additions to the Master Document.

    Read list of chapters for our variant from file
    Read bookmark data from file for our variant only
    Read and parse HTML from Master Document version of chapter
    For each HTML "statement", recursively
        If it's <A NAME="xxx"> tag with NOLINK and/or NOINDEX attributes
            Remove them
            Continue
        If it isn't non-markup text
            Continue
        If it's already enclosed in <A> tag (it's a bookmark
          or already indexed)
            Continue
        For each bookmark name to be cross linked
            If the bookmark's text occurs in the most recent header
                Continue
            For each occurrence of bookmark name in this text
                Replace occurrence with a link to item's bookmark
                Set to continue with 1st replacement "statement"
    Output re-engineered HTML for chapter to variant's subdirectory
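
The innermost step, replacing occurrences of bookmark names in one segment of nonmarkup text, might look like this in Python.  It is a simplified sketch under stated assumptions:  link_text and the targets mapping (bookmark name to chapter file) are hypothetical, matching is exact and case sensitive, and the "file#name" link form is assumed, not taken from the XTRAN rules.

```python
import re
from html import escape


def link_text(text, targets, heading=""):
    """Wrap each occurrence of a bookmark name in `text` with a link.

    targets: dict mapping bookmark name -> chapter file (assumed layout).
    heading: text of the most recent header; names occurring in it are skipped.
    """
    names = [n for n in targets if n and n not in heading]
    if not names:
        return text
    # Try longer names first so "tokenizer" wins over "token" at one position.
    names.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))

    def make_link(match):
        name = match.group(0)
        href = escape(f"{targets[name]}#{name}", quote=True)
        return f'<A HREF="{href}">{name}</A>'

    # A single pass over the text never rescans the links it has inserted.
    return pattern.sub(make_link, text)
```

For example, with targets for "token" and "tokenizer", the text "A tokenizer emits a token." gets one link for each term, and the longer name is not split by the shorter one.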

Rules that generate HTML index from bookmark data

These rules are run, with a version of XTRAN that re-engineers and generates HTML, once for each document variant, to create its index, after changes or additions to the Master Document.

    Read list of chapters for our variant from file
    Read bookmark data from file for our variant only
    Generate HTML index header
    For each bookmark item
        If bookmark is not to be indexed
            Continue
        Record bookmark text and sequential number (for sorting)
        If new starting character
            Record bookmark text's starting character (for sorting)
    Sort starting characters
    Generate alphabetical "index to the index" using starting characters
    Sort bookmark data texts
    Generate start of HTML table
    For each bookmark item, alphabetically
        If item is not to be indexed
            Continue
        If new starting character
            Generate header for it as table row, including bookmark for
              "index to the index"
        Generate index entry as table row, including its section and chapter
    Generate end of HTML table
    Write out HTML we've generated as document index, to variant's
      subdirectory
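
A compact Python sketch of the index generation paraphrased above follows.  The function build_index, the (text, href, section, chapter) tuple layout, and the "idx-" bookmark prefix are assumptions for illustration.  The sketch expects entries that have already been filtered to indexable items with non-empty text.

```python
from html import escape
from itertools import groupby


def first_letter(entry):
    """Upper-cased starting character of an entry's text."""
    return entry[0][0].upper()


def build_index(entries):
    """Build index HTML from (text, href, section, chapter) tuples."""
    entries = sorted(entries, key=lambda e: e[0].lower())
    letters = sorted({first_letter(e) for e in entries})
    # Letter by letter "index to the index" at the start.
    jump = " ".join(
        f'<A HREF="#idx-{escape(c)}">{escape(c)}</A>' for c in letters)
    out = [f"<P>{jump}</P>", "<TABLE>"]
    for letter, group in groupby(entries, key=first_letter):
        c = escape(letter)
        # Header row for each new starting character, with its bookmark.
        out.append(f'<TR><TH COLSPAN="3"><A NAME="idx-{c}">{c}</A></TH></TR>')
        for text, href, section, chapter in group:
            out.append(
                f'<TR><TD><A HREF="{escape(href)}">{escape(text)}</A></TD>'
                f"<TD>{escape(section)}</TD><TD>{escape(chapter)}</TD></TR>")
    out.append("</TABLE>")
    return "\n".join(out)
```

Sorting once and then grouping by starting character yields both the header rows and the alphabetical order of the entries within each group.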

NOTE

Normally, we generate each document variant into its own subdirectory, so the HTML file names need not change.  In this example, however, all of the HTML files live in the same directory, so we manually adjusted their file names and links to them.  These are the only changes made by hand, and would not normally be necessary.




Input to, and Output from, XTRAN

This example uses a "mini" version of the XTRAN User's Manual to show the effects of the cross linking and indexing procedures.  This "mini" version is not proprietary; the actual XTRAN User's Manual is proprietary and requires a nondisclosure agreement.  Also, this "mini" version is for demonstration purposes only and is not necessarily current.  Therefore, its contents do not constitute any representation by XTRAN, LLC about XTRAN.

The actual XTRAN User's Manual (including all variants) has more than 1,200 bookmarks in over 50 HTML files containing more than 20,000 nonmarkup text items; cross linking it involves as many as 25 million checks for cross links to insert.

Master Document

Variant Chapter Lists

These small text files are read by the XTRAN rules to determine which chapters participate in which document variants.

master.nam

; master.nam — Chapters in XTRAN User's Manual, all variants
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of master.nam

variant1.nam

; variant1.nam — Chapters in XTRAN User's Manual variant 1
; Revised 2002-02-09.1258 by S. F. Heffner
;
var1ttl.html
chapter1.html
chapter2.html
;
; End of variant1.nam

variant2.nam

; variant2.nam — Chapters in XTRAN User's Manual variant 2
; Revised 2002-02-09.1258 by S. F. Heffner
;
var2ttl.html
chapter1.html
chapter2.html
chapter3.html
;
; End of variant2.nam
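
Judging from the listings above, a chapter list holds one file name per line, with lines beginning with ";" serving as comments.  Under that assumption, reading a list and restricting bookmark data to one variant could be sketched in Python as follows.  The function names and the (name, chapter_file) tuple layout are hypothetical.

```python
def read_chapter_list(text):
    """Return the chapter file names from the text of a .nam file."""
    chapters = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and ";" comment lines.
        if line and not line.startswith(";"):
            chapters.append(line)
    return chapters


def bookmarks_for_variant(bookmarks, chapters):
    """Keep only bookmark records whose chapter file is in the variant.

    bookmarks: list of (name, chapter_file) tuples (assumed layout).
    """
    wanted = set(chapters)
    return [b for b in bookmarks if b[1] in wanted]
```

Applied to the text of variant1.nam, read_chapter_list yields var1ttl.html, chapter1.html, and chapter2.html, so a bookmark defined in chapter3.html would be dropped for that variant.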

Resulting Document Variant 1