Haskell approaches; matches!: A little spec writing

In my last posting, I discussed my views on the benefit of exploratory programming. However, even when I decide to start pounding the keyboard, with a priority on exploring the solution space, it's essential to first write a problem specification for all non-trivial programming problems. Most importantly, this practice prevents me from solving the wrong problem, that is, getting to the end of a solution or getting stuck somewhere after a lot of effort and finding out that I've assumed wrongly and implemented the wrong functionality. Solving the wrong problem can be a learning experience, though wasteful, for small projects. For large projects, it sometimes qualifies as a disaster. Almost as important, writing even a short and simple spec gives the writer a much better handle on the boundaries of the problem. This is then used as a launching point for sparking even deeper abstract thinking about what kind of solution will best fit the problem. Following is a very small scale example of mine, reflecting this idea:

During one of my Scheme programming sessions last year, I needed to reference the PCRE library documentation. (That's Perl Compatible Regular Expressions, the regular expression library that Chicken Scheme uses.) However, what PCRE provides for documentation on it's site is the PCRE man pages, concatenated together into one 150+ page length .txt file. No table of contents and, of course, no section links. So, I wrote a Scheme program to parse the PCRE text file and create an HTML page out of the sections that I was interested in, making a table of contents derived from each sections headings and sub-headings, that were links back into the correct location within the section. From analyzing the PCRE man-page text, it's easy to pick out main and sub headings. Main having text flush with the left margin, while the sub-headings are indented once (4 spaces) and the remaining body content is all block indented twice. The only minor complication would be creating the table of contents, because the sub headings need to be grouped and contained in an HTML UL tag. So a little extra abstraction would be needed to take care of each sub-heading group when it could be determined that a group span had ended. Here is the implementation of my Scheme script. This is how the resulting HTML page should look.

After I had learned enough Haskell to be slightly dangerous, I decided to see how this little conversion task translated over. I kept the Scheme implementation in mind as I worked through my Haskell one. I soon began realizing, WOW!, this is not going to translate very smoothly to Haskell at all. Scheme has good support for it's own vision of functional programming (not pure), but Haskell, as a purely functional, referentially transparent language, simply doesn't allow for the update of variables. Haskell forces a much different and usually more elegant code structure, but when writing code, it's usually not trivial deciding on the most appropriate abstractions to use. Since my explanation here is weak without an example, I'll walk through my first implementation in Haskell and then revisit with a better second implementation that I coded later, inspired by the creation of a simple problem specification.

Haskell script - Take 1

Here is the code of my first attempt. It's less than 100 lines and fairly straight-forward, so it shouldn't take much time to understand how it operates. I'll outline the main implementation ideas below:

It opens the text file and uses a regular expression pattern to match the sections that I am interested in:

 -- The regular expression section template.
sectPatt :: String -> String
sectPatt shead = "(?ms)^" ++ shead ++ ".*?^NAME.*?(^PCRE.*?)^SEE ALSO"

Here, the shead parameter indicates the unique man-page pattern.

The main heading and sub-heading patterns match against the section text, resulting in a list for both heading types of heading string, offset relation pairs.

The two heading lists are merged into a list of TocEntry data:

-- Level1 is a main heading and Level2 is a sub-heading.
-- Constructor arguments are text of heading, offset location with the section, and
-- heading anchor label.
data TocEntry = Level1 String Offset String | Level2 String Offset String

The index entry list is now in the correct order and is used to assemble the table of contents and final section text for each section.

This solution abstracts the headings out into a TocEntry data structure for them to be handily processed at the assembly stage. This implementation always seemed a little awkward and excessive for this short little script, but it worked and I didn't bother with contemplating on a better abstraction at the time.

Haskell script - Take 2

I had it in the back of my mind that a solution to this problem needed a better abstraction, so several months later, I built up enough resolve to try my hand at it again. This time I decided to write out a simple problem specification, something I probably should have done at the very beginning. Even if a problem seems too small to warrant any kind of formal specification, writing something like this organizes the true nature of the problem, nice-and-neat, in one's mind and can lead to subtle insights into a clearly better implementation. So, here are the basic properties I documented about this problem:

I only want to see two of the man page sections. Take the section beginning with label PCREPATTERN to be the 'Pattern Syntax and Semantics' section and the section beginning with label PCRESYNTAX to be the 'Quick Syntax Reference' section.
The lines without indent are the major headings and the lines
starting with one indent are the minor headings (sub-headings) within a major
heading. This is in contrast to the body content which begins with
two indents.
Each major heading will have a top level entry in the table of
contents and each minor heading will have a sub level entry under the
top level entry that it belongs to. Each table of contents (TOC) entry will hyperlink to
an anchor placed at that entry's heading line inside the body content.
The final structure is Pattern Syntax and Semantics TOC, Quick
Syntax Reference TOC, then followed by the bodies content of the two
sections in the respective order.
The procedure for building the TOC and anchors within the body
involves walking down the list of headings matched within the body,
each time picking the next individual heading based on the offset
location within the section body.
Each section starts with a major heading. When a minor heading is
encountered a sub-list is started that spans until a major heading is
encountered again or the until the section is ended.
If both heading lists have run out, check for previous minor headings
waiting to be made a sub-list.

That's the bulk of it, leaving out lesser details. Reflecting on this list and thinking around the edges gave me valuable insights that led to a different approach for Version 2.

You can see that the intermediate data type I had used for the first Haskell script is gone. It's not needed anymore because no explicit merging is performed. Each section's table of contents and body text recursively assemble themselves as the header lists are consumed. Alas, the magnitude of improvement here, over the first Haskell script, is debatable. I don't think I sped the program up much for something that only takes on the order of milliseconds to complete anyway. I see the real improvement being in the more direct style of implementation. No unnecessary abstractions and no code convolution. Haskell shines at allowing for a shorter and more direct implementation than probably any other programming language. This is very important for readability and maintainability in larger programs. Whenever a definite abstraction is needed, Haskell will bend over backwards to accommodate, but it's a good practice to carefully consider each additional abstraction and use sparingly and purposefully.

Haskell approaches; matches!

Friday, May 30, 2008

A little spec writing

Haskell script - Take 1

Haskell script - Take 2

1 comment:

About Me

Blog Archive