Ruminations on Forth
Text Processing

April 23, 2016

Brad Nelson / @flagxor

Motivation - Word counts


16 middle
142 youth
130 strong
3 Resembling
11 hill,
52 heavenly
1 steep-up
1 climbed
17 majesty,
46 sacred
132 looks
          
It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures. — Alan Perlis

A "nice" type space

  • Scalars
  • Lists
  • Associative Arrays

Self consistent mental model

  • Lists of ordered "things"
  • Associative Arrays mapping thing to thing
  • Structures layed out as Associative Arrays

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    }
  ]
}
          

SNOBOL

  • String matching
  • Arrays
  • Associative Arrays (Tables)

SNOBOL


FOO = TABLE()
FOO<key> = value
          

Perl

  • Regex matching
  • Arrays (Lists)
  • Associative Arrays (Hashes)
    • String keys only

Perl


$x = 1
@y = [
%z = {
  'foo', 'bar',
  'baz', 'qux',
};
$z{foo} = 3;
          

Python

  • Regex matching
  • Arrays
  • Associative Arrays (Dict)

Python


data = open(filename).read()
lines = data.split('\n')
for line in lines:
  words = line.split(' ')
  for word in words:
    print word
          

Forth

Forth

  • No "real" built-in associative arrays
  • No "real" built-in array type
  • No "real" built-in string type
  • Bare support for string literals
  • String Compare!
Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy. — Alan Perlis

Boilerplate (Python)


#! /usr/bin/env python

import re
import sys
          

Boilerplate (Forth)


#! /usr/bin/env gforth
          

Argument Check (Python)


if len(sys.argv) != 2:
  print 'Usage: ./ten.py <src>'
  sys.exit(1)
          

Argument Check (Forth)


: usage   ." Usage: ten.fs <src>" cr bye ;
: usage-check   argc @ 2 <> if usage then ;
usage-check
          

Load the whole file (Python)


data = open(sys.argv[1]).read()
          

Load the whole file (Forth)


1 arg slurp-file
          

Split words (Python)


data = re.sub('[\n\r\t ]+', ' ', data)
words = [i for i in data.split(' ') if i != '']
          

Split words (Forth)


variable front
variable back
variable on-word
: space? ( ch -- ) dup 32 = over 10 = or over 13 = or swap 8 = or ;
: this-word ( -- a n ) front @ back @ over - ;
: process-word   this-word dup if on-word @ execute else 2drop then ;
: chomp ( a -- ) back ! process-word back @ 1+ front ! ;
: process-char ( a -- ) dup c@ space? if chomp else drop then ;
: split-words ( a n -- )
   over front ! over + dup rot ?do i process-char loop chomp ;
          

Tally counts (Python)


tally = {}
for word in words:
  if word not in tally:
    tally[word] = 0
  tally[word] += 1
          

Keep a list of words and counts (Forth)


variable word-list
: visit-list ( a op -- )
   begin over while dup >r over @ -rot execute r> repeat 2drop ;
: end-visit ( op a -- ) drop 0 ;
: add-word ( a n -- ) here >r word-list @ , , , 1 , r> word-list ! ;
: word>name ( a -- a n ) dup cell+ cell+ @ swap cell+ @ ;
: word>count ( a -- a ) 3 cells + ;
          

Simple hash function (Forth)


variable hash-value
: 0hash   0 hash-value ! ;
: +hash1 ( n -- )
   hash-value @ + dup 10 lshift + dup 6 rshift xor hash-value !  ;
: +hash ( a n -- ) over + swap ?do i c@ +hash1 loop ;
: nhash ( n -- n ) hash-value @ swap mod ;
          

Keep a hashtable (Forth)


50000000 constant table-size
table-size cells  allocate throw constant word-table
word-table table-size cells 0 fill
: table-cell ( a n -- a ) 0hash +hash table-size nhash cells word-table + ;
: add-to-table ( a a n -- ) table-cell ! ;
: get-from-table ( a n -- a ) table-cell @ ;
          

Tally up counts (Forth)


variable target   variable target-len
variable found
: match-word ( a -- )
   dup word>name target @ target-len @ compare 0=
   if found ! end-visit else drop then ;
: find-word ( a n -- )
   target-len ! target ! 0 found !
   target @ target-len @ get-from-table
   ['] match-word visit-list ;
: handle-word ( a n -- )
   2dup find-word found @ if 2drop 1 found @ word>count +!
   else 2dup add-word word-list @ -rot add-to-table then ;
' handle-word on-word !
          

Print list (Python)


for word in tally.keys():
  print tally[word], word
          

Print list (Forth)


: dump-word ( a -- ) dup word>count @ . word>name type cr ;
word-list @ ' dump-word visit-list
          

Ponderings

  • Arena allocation
  • Hashtable as cache

A Challenge!

  • Make it smaller
  • Make it faster

Slides at:
github.com/flagxor

Thank you