PETZOLD BOOK BLOG

Charles Petzold on writing books, reading books, and exercising the internal UTM



Roget’s Hierarchical Thesaurus in a Silverlight App

August 24, 2009
Roscoe, N.Y.

To most people these days, a Thesaurus (from the Greek for "treasury") is something invoked from a word processor to suggest synonyms and help prevent one's prose from appearing tired and repetitive. But the original conception of English scientist and physician Peter Mark Roget (1779 – 1869) was an assembly of all the words of the English language into 1000 categories, which are then grouped into various classes, sections, sub-sections, and sub-sub-sections — in short, a hierarchy.

I have not been able to find an 1852 first edition of Roget's Thesaurus of English Words and Phrases, Classified and Arranged so as to Facilitate the Expression of Ideas and Assist in Literary Composition on Google Book Search, but here's the 1853 second edition, which is probably close enough.

I thought it might be a fun challenge to write a Silverlight application that allows navigating through Roget's hierarchy. On the Project Gutenberg site I found what I needed: an ancient ASCII text file containing the bulk of a 1911 edition of Roget's Thesaurus in the public domain. This file was originally assembled by a company called MICRA, Inc. in 1991, and it is Gutenberg E-Text Number 22, which means it originates from the very early days of Project Gutenberg. The 1911 American edition of Roget's Thesaurus used to create this text file is also available on Google Book Search. MICRA then added a bunch of words to this file and (apparently) ran a spell check on it and flagged a bunch of words as "obsolete". By 1911, several more categories had been added to bring the total past 1000.

The first step for me was to write a program that converted this ASCII text file to XML — actually two XML files. The first one contains the hierarchy up to but not including the ~1000 categories. This came out to be 17.5K. The second XML file weighs in at 2.26M and contains all the word groups associated with the ~1000 categories. Both XML files required some hand cleaning.
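To give a rough idea of what such a conversion involves, here is a minimal sketch in Python. It is not the program described above. The input layout (CLASS and SECTION heading lines, then category lines beginning with #), the XML element names, and the First and Last attributes that tie a section to its range of category numbers are all assumptions made for illustration; the real Gutenberg file is considerably messier.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified input. The real Gutenberg file is messier.
SAMPLE = """\
CLASS I. ABSTRACT RELATIONS
SECTION I. EXISTENCE
#1. Existence. N. existence, being, entity
#2. Inexistence. N. inexistence, nonexistence
SECTION II. RELATION
#9. Relation. N. relation, bearing, reference
"""

# Assumed line shapes: "CLASS I. NAME", "SECTION II. NAME", "#12. Name. N. words"
HEADING = re.compile(r"(CLASS|SECTION) [IVXLC]+\. (.+)")
CATEGORY = re.compile(r"#(\d+)\. ([^.]+)\. (\w+)\. (.+)")


def convert(text):
    """Split the text into a small hierarchy tree and a large category tree."""
    hierarchy = ET.Element("Hierarchy")
    categories = ET.Element("Categories")
    current_class = current_section = None

    for raw in text.splitlines():
        line = raw.strip()
        heading = HEADING.fullmatch(line)
        category = CATEGORY.fullmatch(line)
        if heading and heading.group(1) == "CLASS":
            current_class = ET.SubElement(
                hierarchy, "Class", Name=heading.group(2).title())
        elif heading:
            current_section = ET.SubElement(
                current_class, "Section", Name=heading.group(2).title())
        elif category:
            number, name, part, words = category.groups()
            # The hierarchy tree only records which category numbers
            # a section spans; the words themselves go in the other tree.
            current_section.set("First", current_section.get("First", number))
            current_section.set("Last", number)
            cat = ET.SubElement(categories, "Category", Number=number, Name=name)
            group = ET.SubElement(cat, "Group", Part=part)
            group.text = words
    return hierarchy, categories


hierarchy, categories = convert(SAMPLE)
hierarchy_xml = ET.tostring(hierarchy, encoding="unicode")
categories_xml = ET.tostring(categories, encoding="unicode")
```

Keeping the two trees separate means the small hierarchy can be loaded up front, while the much larger category data is only consulted once a category is actually selected.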

The Silverlight app accesses these two XML files and presents the hierarchy in a way that seemed to me to be simple, useful, and intuitive, but which also allowed for some fun animations. Total coding time for producing the XML files and the Silverlight app: 30 hours.

You can run the program from this link:


Roget1911Experiment1.html

I targeted a browser client area of 1200 pixels wide and 960 pixels high. If you can't get your browser window that large, scroll bars will be displayed. Otherwise, scrolling has been almost entirely eliminated, but there is still a little that sometimes occurs in the word-list area at the bottom (the part that's colored Alice-blue).

Roget began his hierarchy with six classes: Abstract Relations, Space, Matter, Intellect, Volition, and Affections. (Obviously words like Volition and Affections betray the Victorian origins of this thing.) My program displays these six classes at the top of the screen. You click one to see the further hierarchy, and you keep clicking until you get a list of numbered categories in the ListBox at the lower left. (Some of these will have tooltips associated with them.) Click one of these categories to get the parts of speech (Noun, Verb, etc.), and then click one of those to get the groupings of words in the Alice-blue area.

Often the groupings of words have cross-references to other categories. These function as normal hyperlinks. Obsolete words are flagged with daggers. I have removed the additional flags for obsolete words added in 1991 and discussed in the roget15a.txt file available at the above Project Gutenberg link. Considering the state of the original file and the limited time I spent trying to fix it, it is extremely likely you'll find some errors or oddities in my program.

Getting a grip on Roget's hierarchy was the first challenge. It starts out simple with the six classes, but right away any chance that this is a nice balanced hierarchy with a lot of parallel structure entirely collapses. Four of the classes are divided into sections (as few as 3, as many as 8) but two of the classes (Intellect and Volition) are first divided into two divisions each, and then into sections.

Sometimes the sections are then composed of categories, but often the hierarchy goes further. The longest is: Matter, Organic Matter, Sensation, Special, and then either Sound (which has four nested sections) or Light (which has three). In some parts of the tree, only two categories are listed in the ListBox in the lower left; but the hierarchy Space, Motion, Motion with Reference to Direction has 38 categories, numbered 278 through 315. I agonized over how I would display such a wide range of categories, and I settled upon using a WrapPanel with a Vertical orientation as the ItemsPanel, so the ListBox actually gets wide to display multiple columns of up to 14 categories each.
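The column arithmetic is easy to check with a few lines of Python. This is only a sketch of the fill order of a vertically oriented wrapping panel (top to bottom, then on to a new column); the function name is made up, and in the real control the column height comes from the available pixels rather than from a fixed count of 14.

```python
def fill_columns(first, last, per_column=14):
    """Lay out category numbers top to bottom, starting a new column when one fills."""
    numbers = list(range(first, last + 1))
    return [numbers[i:i + per_column] for i in range(0, len(numbers), per_column)]


# The widest case mentioned above: categories 278 through 315.
columns = fill_columns(278, 315)
column_sizes = [len(column) for column in columns]
```

For the 38 categories of Motion with Reference to Direction this gives three columns, the last one only partly filled.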

The WrapPanel also came to the rescue in the part of the program that implemented the hyperlinks from one category to another. Something like this is fairly easy in WPF, but not in Silverlight. Of course I wanted to use a TextBlock to get text wrapping for multi-line entries, and it was easy enough putting multiple Run objects in a TextBlock, some of which were underlined and colored blue. But when the TextBlock is clicked, there is no easy way to determine which Run in the TextBlock got the mouse click. I ended up creating a TextBlock for each word (including each hyperlink), and then using a WrapPanel to do the text wrapping.
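The idea of one element per word can be sketched outside of Silverlight. In the Python below, each word becomes its own piece, a cross-reference carries the category number it points to, and a greedy wrap stands in for what the WrapPanel does. The {word:number} markup, the Piece type, and measuring widths in characters rather than pixels are all assumptions for illustration, not how the program stores or measures anything.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Piece:
    text: str
    link: Optional[int] = None  # target category number for a cross-reference


# Hypothetical markup: a cross-reference is written as {word:category-number}.
TOKEN = re.compile(r"\{([^:{}]+):(\d+)\}|(\S+)")


def split_pieces(group):
    """Turn one word group into a list of per-word pieces."""
    pieces = []
    for match in TOKEN.finditer(group):
        if match.group(1):
            pieces.append(Piece(match.group(1), int(match.group(2))))
        else:
            pieces.append(Piece(match.group(3)))
    return pieces


def wrap(pieces, max_width):
    """Greedy wrap, measuring width in characters as a stand-in for pixels."""
    lines, line, used = [], [], 0
    for piece in pieces:
        needed = len(piece.text) + (1 if line else 0)  # one space between pieces
        if line and used + needed > max_width:
            lines.append(line)
            line, used, needed = [], 0, len(piece.text)
        line.append(piece)
        used += needed
    if line:
        lines.append(line)
    return lines


pieces = split_pieces("existence, being, entity; see {truth:494}")
lines = wrap(pieces, 24)
```

Because every piece is a separate element, a click handler attached to a piece already knows which word was hit, so no hit-testing inside a larger block of text is needed.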

The code is a little too chaotic — disorderly, untidy, anarchical, disjointed — to post at this time, but I'll try to get it in shape if there's sufficient demand.


Comments:

This is fascinating! Thanks for sharing it.

Josh Smith, Mon, 24 Aug 2009 17:24:28 -0400 (EDT)

What I like best is the re-presentation of the hierarchy when one clicks the embedded links. Extremely cool!

— John Dell, Tue, 25 Aug 2009 02:49:44 -0400 (EDT)

This is excellent! I love the visual depiction that you have created, a great way to present hierarchical data! I would love to see the source for this if you have time. Thanks!

Matt Serbinski, Tue, 25 Aug 2009 08:21:38 -0400 (EDT)

This is so cool (couldn't think of a better word) in multiple ways: historically, ontologically, and for the GUI design. Thank you!

— Brad Williams, Tue, 25 Aug 2009 14:17:23 -0400 (EDT)

This is way Cool!

— Coal, Tue, 25 Aug 2009 19:30:19 -0400 (EDT)

I am a programmer who learned a lot by reading three editions of "Programming Windows" and who is now trying to master WPF, Silverlight, and ASP.NET. One project I have is to make my father's genealogy research available on a web site, and this blog entry gave me some new ideas. I have previously wrestled with the problem of how to present up to fifteen generations of ancestors on a (big) piece of paper, but computer screens offer a different type of challenge.

Mattias Wikstrom, Wed, 26 Aug 2009 02:26:46 -0400 (EDT)

Yes please, can you release the code, Charles?

— JPG, Wed, 26 Aug 2009 07:19:36 -0400 (EDT)

This is fantastic. A great way to browse and find interesting words!

Stuart, Wed, 26 Aug 2009 17:40:07 -0400 (EDT)

Charles,

This sample covers some aspects of what I am trying to figure out how to do.

- generate a connected graph and display (and your tree is a good example)

- trigger a rearrangement and zooming in when a node is clicked on

I would be interested in how you define your animations, whether they are in code or defined in markup.

— Doug, Fri, 28 Aug 2009 09:04:31 -0400 (EDT)

All the animations are defined in code. Every time a node is clicked, four animations are applied to the clicked node and each of the clicked node's siblings. All animations for all the siblings are consolidated in a single Storyboard, and they all have the same duration. All the animations target transforms: The clicked node and each of its siblings is translated along the X axis; the clicked node and each of its siblings is scaled in the X direction to possibly make siblings narrower than the clicked node. Each node's children are in a canvas. The two other animations scale that canvas in the X and Y directions, making it bigger if the node is selected, and smaller if the node is unselected. — Charles

This is a fantastic way of representing family trees, look forward to seeing the source code and taking it to the next level with a display photo on each node!

Steve, Sat, 26 Sep 2009 00:18:40 -0400 (EDT)

This is great. Look forward to seeing the code.

— John, Mon, 28 Sep 2009 03:45:50 -0400 (EDT)

Wow - this is a great way to show a nested set or hierarchical data. I also 'cut my teeth' on "Programming Windows" back in the 3.1 days. I learned so much. I'd love to see the code behind this application.

— Russell, Wed, 18 Nov 2009 22:57:00 -0500



(c) Copyright Charles Petzold
www.charlespetzold.com