Some of Tuesday's design decisions and issues.

Some of these will be subject to change/refinement, however... Let me know if you notice anything that looks wrong.

Sections of the site included in initial implementation:

News and Analysis, including site's Daily Update page.
Features.
Start work on a Glossary.
Some information for the InfoBank, although it's not yet clear what access will be given to this initially.
Contact details for Webmaster, Editor, etc.
Begin to collect data for Site Index.
Home Page, with links to all sections, Site DISCLAIMER, plus site blurb (possibly with link to longer blurb). Sunil will need to supply suitable text for latter two items.
Possibly a section for reports --- we'll have more idea exactly what's involved here when we actually see one!

Glossary: the structure of an entry

The TERM to be glossed.
Synonyms, Antonyms, related terms, see alsos (will this be structured?).
Brief definition.
Longer definition, if possible or necessary (reached via a link on the full Glossary page, but shown immediately if the entry is reached via a glossary link). This might include links to other Glossary entries, or to a definitive definition (which could be on an external site), but not to anywhere else (unless anybody's got a better case here).
Links: to a page for this term if there is one (e.g. in InfoBank), or perhaps to its Site Index entry.
Full glossary entry may be integrated with page for entity in InfoBank. This might also solve some problems of deciding where a link goes to.
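
As a rough illustration of the entry structure above, here is a minimal Python sketch. The class name, field names and render method are all hypothetical (nothing here is decided), and related terms are kept as a flat, unstructured list since that question is still open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlossaryEntry:
    """One glossary entry; field names are illustrative assumptions."""
    term: str                                          # the TERM to be glossed
    brief: str                                         # brief definition
    related: List[str] = field(default_factory=list)   # synonyms, antonyms, see alsos (unstructured)
    longer: Optional[str] = None                       # longer definition, if there is one
    links: List[str] = field(default_factory=list)     # e.g. InfoBank page or Site Index entry

    def render(self, full: bool = False) -> str:
        """Return plain text; the longer definition appears only when full=True."""
        lines = [self.term.upper(), self.brief]
        if self.related:
            lines.append("See also: " + ", ".join(self.related))
        if full and self.longer:
            lines.append(self.longer)
        return "\n".join(lines)
```

Calling render() would suit the full Glossary page, and render(full=True) the case where the reader arrives via a glossary link.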

Indexing Specification

Indexing data should be collected from very early on. The (human generated) Site Index may well become one of the most valuable ways to access the site.

Each indexed phrase requires a named anchor in the document, and a code pointing to this (which will convert into a URL) is stored with the entry.
The Index term or terms for the entry (defaults to the phrase).
We should probably record the actual phrase indexed with the entry.
Some kind of descriptor will be needed to describe the reference's context (e.g. a code for News Item + date).
A caption for the entry.
Some blurb for the index, e.g. giving the context sentence in which the reference occurs. By default, this will be the actual phrase itself.
A (very rough) percentage score giving the relevance or usefulness of this particular reference to the term (don't think hard about this, just pick a number).
Some very rough idea of an expiry date; e.g. a reference to a News item might be less relevant a year later than one in a Market Survey. The reference won't necessarily disappear after this date, but it may be displayed after more current references. Of course this is a very crude way to indicate that something has gradually declining significance, but anything more sophisticated will be too complicated.
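
The fields above, and the crude expiry rule, might be sketched as follows in Python. This is only an illustration: the class, the field names and the display_order function are hypothetical, and sorting by relevance within each group is an assumption, since the notes only say that expired references may be displayed after more current ones.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class IndexEntry:
    anchor_code: str                   # code pointing at the named anchor (converts into a URL)
    phrase: str                        # the actual phrase indexed
    context: str                       # descriptor, e.g. a code for News Item + date
    caption: str
    relevance: int                     # very rough percentage, 0-100
    expires: date                      # very rough expiry date
    terms: Optional[List[str]] = None  # index terms; default to the phrase
    blurb: Optional[str] = None        # defaults to the phrase

    def __post_init__(self):
        if self.terms is None:
            self.terms = [self.phrase]
        if self.blurb is None:
            self.blurb = self.phrase


def display_order(entries, today):
    """Current references first, expired ones after; within each
    group, higher relevance first (the latter is an assumption)."""
    return sorted(entries, key=lambda e: (e.expires < today, -e.relevance))
```

An expired reference stays in the index; it simply sorts below those that are still current.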

Eventually it would be nice to produce classified indices (e.g. index of Names, etc.) but some of this will be covered by database. A method to identify cognate terms would be useful.

Some security and administrative details

UserIDs will be based on a short representation of the subscriber's e-mail address. If possible, the full e-mail address will be accepted as an alternative.
Passwords for the initial release of the site will be set and issued by the site administrator, although the subscriber can of course request the password of their choice. (They cannot change their own password.)
The whole site is readable by anybody with a web account on NetLink's server, as is our password file! The passwords are scrambled of course, but this is easy to crack. Subscribers must NOT under any circumstances use a password that is the same as one that they use on any other system. I don't know exactly how this news should be explained to them, but we must in any case indemnify ourselves against liabilities arising from this problem in the site disclaimer, which should be on paper and signed before any account is activated!
NetLink's error log files are similarly public! A smart compliance officer will take subscribers off the site faster than you can say "click here".
The service agreement, including disclaimer, must stipulate that a UserId is for the use of one person only. This is not simply a neat scam to increase revenue --- when the site becomes interactive, those operating it could, for example, be held liable for any libellous or illegal material posted whose source could not be firmly identified. Subscribers must therefore agree to indemnify us against anything arising from use of their account details. However, it's very hard to see how we could claim to have taken due care to keep these details secret at present! (Flexible licences with concurrent user options will make this no longer a problem.)
Ditto user data.
The disclaimer will also appear on the HomePage. As it will not be necessary to go through this to access the site, the paper version is essential. In the longer term, a brief copy could be displayed every time a subscriber logs in to the site.
While very little on the web is truly secure, all of the above shortcomings are due to flaws in NetLink's system, and would not apply in the same way to a better configured server under our control.

Document Identifiers

Should work as filenames on any system (beware Mac limit of 31? characters --- pray that IMM's Macs' OS isn't so old as to make this smaller!).

A code (a couple of letters, say) for the type of document. This will probably consist of one letter for the smallest section in which the document lives, plus a code for its type (e.g. NI and NS could stand for News Item and Daily news page (with links to items) respectively).
A six-digit number unique to the main document of which this is a part; e.g. if the nodes being numbered are chapters of a report, this number belongs to the whole report. The number is padded with leading zeroes.
One letter code describing the level of the main document. It probably makes sense to start with, say, B or C, for Section (Level 1), then D for Chapter, etc. N.B: this is the opposite order to the original proposal. I think this code may simply be for convenience.
A dash-separated list of the various sub-section identifiers at their various levels, i.e. in a book, -1-2 would stand for Section 2 of Chapter 1.
Extra components of a document are given an abbreviation to add on, e.g. -abs for Abstract.
On a UNIX system, we may well replace some or all dashes with slashes, i.e. create a hierarchy of small sub-directories. This is a lousy idea on a Mac!
What do we do with included figures, tables, etc.? For now, I think we use a sub-directory.
These Document identifiers should work somewhat like URLs, although they are designed to be simple and foolproof for humans to use. For example, #-notation should be used to refer to a named section in a document in this system. The identifier structure may yet be altered to facilitate this URL-like behaviour, or simply to make it look more like URL structure.
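
To make the identifier scheme above more concrete, here is a minimal Python sketch that assembles and parses one. Several things are assumptions rather than decisions: that the type code is exactly two capital letters, that the type code, number and level letter are simply run together with no separator, that extra components are lower-case abbreviations, and that the filename limit is 31 characters. All function names are hypothetical.

```python
import re

MAX_FILENAME = 31  # Mac filename limit mentioned above (exact figure unconfirmed)

# Assumed layout: TT + NNNNNN + L + (-n)* + optional -extra
_DOC_ID = re.compile(r"^([A-Z]{2})(\d{6})([A-Z])((?:-\d+)*)(?:-([a-z]+))?$")


def make_doc_id(type_code, number, level, subsections=(), extra=None):
    """Assemble an identifier such as 'NI000042C-1-2-abs'."""
    doc_id = f"{type_code}{number:06d}{level}"          # number padded with leading zeroes
    doc_id += "".join(f"-{n}" for n in subsections)     # dash-separated sub-section list
    if extra:
        doc_id += f"-{extra}"                           # e.g. -abs for Abstract
    if len(doc_id) > MAX_FILENAME:
        raise ValueError(f"identifier too long for a filename: {doc_id}")
    return doc_id


def parse_doc_id(doc_id):
    """Split an identifier back into its components."""
    m = _DOC_ID.match(doc_id)
    if m is None:
        raise ValueError(f"not a document identifier: {doc_id}")
    type_code, number, level, subs, extra = m.groups()
    subsections = tuple(int(n) for n in subs.split("-") if n)
    return type_code, int(number), level, subsections, extra


def unix_path(doc_id):
    """Replace every dash with a slash, giving a sub-directory hierarchy."""
    return doc_id.replace("-", "/")
```

For example, make_doc_id("NI", 42, "C", (1, 2), "abs") gives "NI000042C-1-2-abs", and unix_path turns that into "NI000042C/1/2/abs".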

Some brief notes on Document and Node structure:

Documents' text is generally split into nodes, the largest being a Chapter (or equivalent), and the smallest a sub-section.
For example, a big article might correspond roughly to our conceptual idea of Chapter, whereas a small article might be regarded as a Section.
A text node which is sub-divided starts with a mini-ToC: this has only 2 levels, and only goes down as far as sub-subsections at the very worst. Usually it will only reach down to subsections.
ToCs generally have at least two levels (if they exist). I can't imagine circumstances where they should have more than three.
If a division of a doc is large enough to require its own ToC, then it should have a node of its own. Occasionally a ToC may occur at the start of a division within the body text of a node --- this should be a one-level ToC at most.
If it doesn't have a heading, then we can't see it!
There is a case for ToCs to include sections at a higher level, at least along the prefix to this node in the document tree, e.g. a Section ToC might carry the Chapter headings for the main document, and the Section names within its own chapter.
The following design decision might change with experience, however:
- The primary ToC for a main document usually has at most two levels.
- It will only ever have three if this is necessary to provide direct links to nodes of body text (we are talking big document here), and even then it might not.
Abstract ToCs (i.e. those without body text in the node) usually all live on sections of one page for all levels, so they will all download in one go.
Sub-subsections (level 3) never feature in ToCs of documents with level 0 (Chapter) nodes, and only rarely in documents with level 1 nodes.
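
The mini-ToC rules above (two levels by default, and nothing without a heading) could be sketched like this in Python. The Node class and mini_toc function are hypothetical names for illustration only; treating an empty heading as "no heading", and dropping everything beneath an unheaded division, are both assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    heading: str                                   # empty string means "no heading"
    children: List["Node"] = field(default_factory=list)


def mini_toc(node, max_levels=2):
    """Return indented ToC lines for the divisions below `node`,
    going at most `max_levels` deep."""
    lines = []

    def walk(children, depth):
        if depth > max_levels:
            return
        for child in children:
            if not child.heading:                  # no heading, so we can't see it
                continue
            lines.append("  " * (depth - 1) + child.heading)
            walk(child.children, depth + 1)

    walk(node.children, 1)
    return lines
```

Passing max_levels=3 would cover the rare big document whose primary ToC needs a third level.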

Tim Heap

Last modified: Thu Nov 13 18:59:52 GMT 1997