Thursday, September 6, 2012

Red-black & blue


One of my neglected hobbies involves enliterating programs for the literate programming wiki. Donald Knuth originally described a literate program as one whose primary goal is explaining the code to humans. (Structured nicely, it can even be torn apart and assembled into a real program with the right software.) Javadocs are nice, but they focus on describing the public functions independently. That's fine if I'm only after the API, but a new programmer - so the reasoning goes - should study how great programs work the way English majors deconstruct Shakespeare.

However, most of the gaps in the literate programming wiki are either code dumps I can't easily enliterate (a language I don't know or an algorithm I don't understand) or pages with no code at all. One page that I thought threaded that needle was a Python implementation of a red-black tree. So I began describing the tree node and so on, before I noticed that the implementation lacked deletion code.

Understand, I can do this, as I have seen the algorithms before. But they are tedious to implement, and I have other coding projects I want to pursue; this was undertaken in a different spirit. So I let it languish. I dumped in some pseudocode and went back to working on Simpletron (which I will eventually submit to the wiki). A month and a half passed and no one else touched it. So, I thought, why not check for something I can shoehorn in from GitHub.

Big mistake, of course. Little choices are the stock in trade of writing a program. I call the stored values values; so-and-so calls them keys. (I only found one implementation worth a damn, by headius.) I use tabs, he uses spaces, and the compiler/interpreter expects consistent indentation whichever is used. Suffice it to say, I wasted half an hour correcting those things in the copy-pasted code I inserted into the established code. Then I spent another half hour getting his native code to print nicely. Headius apparently doesn't care about color or whether the output looks vaguely treelike. Rather than leave you doubting me, here is an example; a rough sketch of the kind of printer I mean follows it. I had to add the periods so Blogger wouldn't trim the spaces.

20
 . . 10
 . . . . . 3
 . . . . . 15
 . . 70
 . . . . . 55
 . . . . . . . . 30
 . . . . . . . . 62
 . . . . . 80
 . . . . . . . . 98
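
That promised sketch is my own illustration, not headius' code or the wiki's. It assumes nodes with value, color, left, and right attributes (the stand-in Node below is just for the example) and indents each value by its depth:

  from collections import namedtuple

  # Stand-in node with the attributes I assume the tree code exposes.
  Node = namedtuple('Node', 'value color left right')

  def print_tree(node, depth=0):
      # Preorder walk, indenting each value by its depth; the node's color
      # could be tacked on (or used to colorize) at the same spot.
      if node is None:
          return
      print(' . ' * depth + str(node.value))
      print_tree(node.left, depth + 1)
      print_tree(node.right, depth + 1)

  # e.g. print_tree(Node(20, 'black', Node(10, 'red', None, None), Node(70, 'red', None, None)))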

The point here isn't that I'm wiser than headius. Rather, I am faced with three choices.

A) Throw out everything, get some reliable pseudocode, and build it myself. While I may relish the finished product more, it is a waste of time and a concession of failure on my part - i.e., I couldn't understand what was going on, so I threw good money after bad when something workable was at hand. Not likely.

B) Throw out everything I've enliterated on the wiki so far and enliterate this version that works completely. (After refactoring it, so I don't cringe looking at the one-letter variables and so on.) This is a blow to my pride, since I've essentially wasted my effort so far. However - after refactoring - this is the decision closest to my actual goal: just describing how the code works.

C) Keep what I have and translate a deletion algorithm to work with the code (and explanation) already present. One path involves translating headius' algorithm. The other involves translating pseudocode to Python. The problem with either involves reconciling assumptions we've made that are even more fundamental than whether to call the stored integer a value or a key. Headius uses shells rather than nulls as the leaves, which means he can query a leaf's parent and so on, whereas the current code has to check whether it was just handed a null. I haven't checked Cormen's pseudocode in a while, but it may feature the same assumption, or a different, equally incompatible one.
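
To make that incompatibility concrete, here is a rough sketch of the two leaf conventions - my own illustration, not code from either source. The Node class, the NIL shell, and the minimum helpers are stand-ins I made up for the example:

  # Convention 1: leaves are None, so every step has to guard against a null.
  def minimum_with_nulls(node):
      while node.left is not None:
          node = node.left
      return node

  # Convention 2: leaves are a shared sentinel shell that still carries a
  # color and a parent slot, so deletion's fix-up can query it like any node.
  class Node(object):
      def __init__(self, value, color='red'):
          self.value, self.color = value, color
          self.parent = self.left = self.right = None

  NIL = Node(None, color='black')   # one black shell stands in for every leaf
                                    # (a real tree would point new nodes' children at NIL)

  def minimum_with_sentinel(node):
      while node.left is not NIL:
          node = node.left
      return node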

The trick in deciding between the latter two choices comes down to how much work I perceive in translating and debugging a deletion algorithm versus re-enliterating and refactoring a code base that won't fight me. To be fair, I haven't enliterated much, as I left the insertion code unrefactored and uncommented. Printing is the same, thanks to that last half hour. The intro and some of the normal node description are the same or trivially retargetable. And yet, I shouldn't take its uncomplaining operation on faith. I need to test that this indeed deletes like a red-black tree should.
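
By "test" I mean something like checking the red-black invariants after every deletion. Here is a minimal sketch of the checks I have in mind, again assuming None leaves and nodes with color, left, and right attributes (the names are my assumption, not the wiki's):

  def check_red_black(node):
      # Sketch only: returns the black-height of the subtree, or raises
      # if a red node has a red child or the black-heights disagree.
      if node is None:
          return 1  # count the empty leaves as black
      if node.color == 'red':
          for child in (node.left, node.right):
              assert child is None or child.color == 'black', 'red node with red child'
      left_height = check_red_black(node.left)
      right_height = check_red_black(node.right)
      assert left_height == right_height, 'black-heights differ'
      return left_height + (1 if node.color == 'black' else 0)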

While I began this post thinking that I would argue myself into patching the deletion hole with translated pseudocode, I will take on headius' legacy instead. Make it better: readable, literate. It just won't show up instantly on the wiki. Honestly, I can't complain about that. In a similar realization to last time, I will restrict Simpletron work to describing its operation (filling in its wiki page). Besides learning Clojure, that means for coding practice I will turn to Java and either finish the game portion of Conway's trees or Edwin's game. (Conway's trees is an implementation of his game with different types of trees underlying the map.)

Wednesday, September 5, 2012

Really Simple


So I find out my assessment, potentially, on Wednesday and see what path the remainder of my life will take. It's a bit unsettling to think that - finally - I may have ended it. It's nice to think I can reapply, but if I'm working full-time or 'full' time, then I won't really have enough time to spare. Oh well.

I had thought of not reporting on my progress on Simpletron and the Simple compiler. That's silly, of course, because I've linked to my GitHub account, where you can see all the commits and delays and the current state. I actually gain by explaining what's going on (outside of the commits no one else will read).

Anyway, I was thinking about the nanopass compiler article and find it appealing as a way to change the Simple compiler. Rather than cart around useless rem commands, I can just delete them - unless, of course, a goto refers to one. And that goes for all the other line numbers too. But this is getting ahead of myself; I'll be more explicit for those who haven't read the article.

Very quickly, in case you don't recall or know that I've been working on Deitel's Simpletron/Simple compiler and what those are: Deitel is a publisher of computer texts. A few of the exercises involve simulating a CPU and then writing a compiler for a subset of the language BASIC. I've completed the canonical implementation of the exercises (albeit in Python rather than Java or one of the C's), and I'm currently working on the extension problems.

These potentially involve changing the structure of the compiler quite a bit, so I've been dragging my feet. A tiny part of that is anticipating breaking the compiler, but only a tiny part: reverting is effortless, and there isn't any real danger anyway since the changes are clear in my head. The more ambiguous question is when to split off wildly divergent extensions and when to freeze a version to document. (Not necessarily the same point.)

One extension involves changing the compiler into an interpreter. Rather than transforming Simple code into Simpletron instructions all at once, the interpreter puts each statement's instructions into memory and then runs just that command (or block, if it compiles to more than one instruction). This restricts, for example, goto commands to pointing at already-declared instructions.

Another extension (my own) involves introducing an assembler between the compiler and Simpletron. It isn't necessary, since the language is barely above the assembly level anyway. However, introducing that interface lets me fantasize about retargeting the compiler to another computer simulator (say, Notch's DCPU-16), or writing a compiler for another simple language (a Pascal? a Scheme, even?) that targets the Simpletron.

But those are changes I will pursue later. For the moment, I am more concerned with incorporating the eerie BASIC subroutine. Normally a subroutine would have a name and associated commands and be callable from anywhere. Because Simple requires line numbers, a client can call any part of the program as a subroutine. I can already goto any line (and depend on another pass to resolve lines I haven't encountered yet), but a gosub must return to the point after the gosub. Consider

  1. print 4
  2. return
  3. gosub 1
  4. gosub 2

One of the problems of writing programs at this level is unstructured programming like this. The print is fine, but then I move on to the return. If the return came after the gosub, then - upon reaching the gosub - I would insert, in the return's spot, a goto that points at the instruction right after the gosub. Here, I hit the return early, so what is supposed to go in its spot is ambiguous. I hadn't thought about it deeply, and it seems I still can't, as I'm about to argue myself into a hole.

Suffice it to say that it is late and I am tired. A return needs to initialize to a goto [next instruction] in case of this situation. But then what about after I've returned? Do I reset it to the next instruction or leave it pointing at the first gosub? If I'm to protect the client, then the former is arguable. But BASIC is so far down that I can't really justify it. If you want to make an infinite loop, who am I to stop you? If you want spaghetti code, I have no right to stop you. So why would I make subroutines better than the spec calls for?

Mainly because I know it can be better, and implementing a hexagonal wheel because it matches my presumed experience feels less useful than implementing the octagonal wheel I learned about earlier. In this case, that means named functions or regions of code. Those suggest some notion of scope, and I've even heard how compilers generally graduate to that level: an abstract syntax tree for each function.

This is why I want an intermediate layer I can manipulate, unacknowledged by the spec. If I examine the code and find forward or backward references (a certainty), then I can resolve the references before producing code. Consider a for loop

  1. for x = 3 to 5
  2. print x
  3. next

Without making an intermediate change to the Simple code, I have to save the information about the for somewhere, to be used when I hit the next command, which loops back to the for and increments x.

However, there is another paradigm, which I encountered in the nanopass compiler article. Rather than make ad hoc changes to the code between compiler passes, the professor who wrote it demanded that his students provide a grammar for the intermediate language that a particular pass expects and emits. So, in the above case, I would append the line number of the for to the next and that of the next to the for. The gosubs receive the line number where the return's goto will be stored. For clarity (a rough sketch of such a pass follows the examples):

  1. for x = 3 to 5 3
  2. print x
  3. next 1



  1. print 4
  2. return
  3. gosub 1 2
  4. gosub 2 2
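
Here is a rough sketch of what such an annotation pass might look like - my own toy over (line number, statement) pairs, not the real scompiler. In particular, find_return_after encodes my guess at how a gosub gets matched to a return:

  def annotate(simple_lines):
      # Sketch only: pair each next with its for (and vice versa), and hand each
      # gosub the line number of the return whose goto slot it will patch.
      annotated = dict(simple_lines)
      open_fors = []                                  # stack of pending for line numbers
      for number, statement in simple_lines:
          keyword = statement.split()[0]
          if keyword == 'for':
              open_fors.append(number)
          elif keyword == 'next':
              partner = open_fors.pop()
              annotated[partner] += ' %d' % number    # the for learns where its next is
              annotated[number] += ' %d' % partner    # the next learns where its for is
          elif keyword == 'gosub':
              target = int(statement.split()[1])
              annotated[number] += ' %d' % find_return_after(simple_lines, target)
      return [(number, annotated[number]) for number, _ in simple_lines]

  def find_return_after(simple_lines, target):
      # My guess at the matching rule: the first return at or after the target line.
      for number, statement in simple_lines:
          if number >= target and statement.split()[0] == 'return':
              return number

  # annotate([(1, 'for x = 3 to 5'), (2, 'print x'), (3, 'next')])
  # annotate([(1, 'print 4'), (2, 'return'), (3, 'gosub 1'), (4, 'gosub 2')])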

This requires, though, that I have the whole source available to me as a list of strings. Otherwise, the separate passes would open the same file multiple times and resolve their adjustments inefficiently. With the whole source in hand, though, I can make a number of nice transformations. For example, after all the forward references are taken care of, I can cut every line number and comment that isn't referenced. (Or even advance a referenced line number past a comment and then cut them all.)
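
As a taste of that kind of pass, here is a sketch of dropping unreferenced rems - again my own toy over (line number, statement) pairs, with the if ... goto form assumed from how I remember Deitel's Simple:

  def strip_dead_rems(simple_lines):
      # Sketch only: after the reference-resolving passes, drop any rem whose
      # line number is never the target of a goto, gosub, or if ... goto.
      referenced = set()
      for number, statement in simple_lines:
          words = statement.split()
          if words[0] in ('goto', 'gosub'):
              referenced.add(int(words[1]))
          elif words[0] == 'if' and 'goto' in words:
              referenced.add(int(words[-1]))          # e.g. "if x == y goto 50"
      return [(number, statement) for number, statement in simple_lines
              if not statement.startswith('rem') or number in referenced]

  # strip_dead_rems([(10, 'rem a comment'), (20, 'goto 40'), (40, 'rem kept: a goto lands here')])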

While I am quite excited to make that change, it flies in the face of both the interpreter version (which can't read in the whole file) and documenting the code. The documentation, to my mind, should reflect Deitel's description of how to build the Simple compiler. As I improve the compiler with functions or strings, some basic elements move farther from the original working version.

On the one hand, it is unlikely that anyone will look at my implementation, much less the wiki documentation for it. And yet, I've read a fair amount of the only implementation uploaded before mine (that I could find) of both the Simpletron and the Simple compiler. (The Simpletron alone is more common.) If someone were to reference mine as a guide, then a version with multiple passes that defy Deitel's spec (to satisfy one or more of the embellishments) would likely disincline that person to continue. S/he would either have to investigate where the old pass survives in the new version, construct the undocumented, canonical version from an old commit, or ignore Deitel and work from my template. Obviously, none of these are likely. So I would have just failed that person as a guide.

On the other hand, if I were to regard the literate programming wiki as my primary audience, I could justify completing the exercises and then writing the documentation. Of course, its readers could also see that I had gone beyond the canonical spec and mutter. They may not be invested in strict adherence, but if someone were to submit another implementation in a different language that only satisfied the canonical spec, there would be a significant disparity between the two.

In the end, I see that I should back out my for loop changes and refactor scompiler and simpletron to their very best selves for documentation. I'll need to move the 1.0 tag up so the wiki references the right version. Only then can I branch out with the extensions and the interpreter. Honestly, the interpreter doesn't need the extensions, so I won't feel bad about not rejiggering them for it.

I'm also working on some Java projects and learning Clojure, but that's another night's work.