k on pdp11

pd.k compiler
architecture
size
stack
memory
calling convention
indirect calls
cond
switch
loops
io
link
emulator

ktye/k ktye.github.io/kdoc.htm
+ flp add  '  ech both bin lin
- neg sub  /  ovr/fix echright
* fst mul  \  scn/fix eachleft
% sqr div      / join   decode
! til key  mod \ split  encode
& wer min  $[a;b;...]     cond
| rev max  while[c;a;b;d;e;..]
< asc les  f:{x+y}   [bl;o;ck]
> dsc mor              "abc" c
= grp eql  01234567   1 2 3  i
~ not mtc   :+-*%&|   4 5 6. f
, enl cat  <>=~!,^#   2a300  z
^ srt cut  _$?@.     (1;2 3) L
# cnt tak           `a`b!5 6 D
_ flr drp  t,d t,t t,'t   join
$ str cst           k!t    key
? unq fnd  in       k?t  group
@ typ atx  @[x;i;+;y]    amend
. val cal  .[x;i;+;y]    dmend
                              
abs sin cos exp log find angle
imag conj  types:cisfzLDTvcdlx
?n(uniform) ?-n(normal) ?z(bi)
n?n(with)   random   -n?n(w/o)

pd.k

pd.k (source)

pd.k compiles k/os to pdp11 assembly instructions. the compiler translates an intermediate form of k's source to machine code. the intermediate form is given as a k table: +`t`p`i`s!(T;P;I;S).

 907_+`id`t`p`i`s!(!#T;T;P;I;S)
id  t   p   i  s                 this is the k-table representation
-----------------                of function tp(x) which returns the
907 fun 0   1  tp                type of k objects.
908 arg 907 0N k                 
909 sym 908 0  x                 in go this is:
910 res 907 0N i                 
911 ast 907 0N                   func tp(x uint64) int32 {
912 ret 911 0N i                  return int32(x >> 59)
913 cst 912 0N i                 } 
914 typ 913 0N k
915 shr 913 2  k
916 cst 915 0N k
917 typ 916 0N k
918 get 916 0N x
919 lit 915 56 k

it's relatively easy to query the table, e.g. listing the most used literals of type i:

 |^#'=I@&(S=`i)&T=`lit
0  |236
1  |173
8  |89
16 |83
4  |73
2  |45
15 |20
..

or calculate the ratio between function calls and definitions:

 (#&T=`cal)%#&T=`fun
7

to save code, this suggests that the calling convention should minimize instructions on the call side but can spend some more for the callee.

architecture

the pdp11 is a 16bit machine while k uses mainly 32bit integer operations but stores k values in 64bit integers.

there are 8 registers:

r0 r1 r2 r3	accumulator for 64 bit values
r4	return address in microcode
r5(fp)	stack frame for function calls
r6(sp)	stack pointer
r7(pc)	instruction pointer

of which we use the first 4 as the stack top for both 32bit and 64bit integers. since 32bit operations need multiple 16bit instructions, we combine most operations in microcode stored at a fixed address. for the caller this needs 2 words jsr pc,#addr. the microcode stores the return address in r4, does it's operation and jumps back. as opposed to function calls, microcode generally has a stack effect.

as an example, this is the code for adding both 32 and 64bit integers, assuming x is in r0..r4 and y the top 4 words of the stack: jsr pc,#addr. the code at #addr is:

mov (sp)+, r4   12604
add (sp)+, r0   62600
adc r1           5501
adc r2           5502
adc r3           5503
add (sp)+, r1   62601
adc r2           5502
adc r3           5503
add (sp)+, r2   62602
adc r3           5503
add (sp)+, r3   62603
jmp r4            104

size

with 16bit words we can address 64kb or 32 kilo words. this is not enough for both code and data. the pdp11/45 has an operation mode that splits code and data which effectively doubles addressable memory.

the code size of kos is 31626 words and just fits into the address space leaving another 64kb for data.

as a comparison, the same code compiles to 24478 bytes of wasm (37%).

to simplify the excercise, kos/2024 removed all floating point operations, which is indicated by thruck strough sections in the reference.

stack

the stack lives in the data section. k reserves a bucket within it's memory allocator for the cpu stack, that is not touched by k directly.

the stack space is 1kb below 11776 growing downwards.

most stack operations use 4 words at a time, storing the first 4 registers. since calling push/pop microcode uses 2 words already and it's a common operation, they are included in other microcodes where needed. e.g. any leaf node like get-local or literal needs to push the registers first, and all dyadic functions pop the second argument from the stack updating the first argument in the registers.

calling convention

the calling convention is arranged to use only two words for the caller and does most work within the function pro and epilogue.

in pd.k the function call is generated by:

cal:{(,/y),4727,(afun S x)}   /y:evaluated child nodes, x:node index

it evaluates the child nodes (function arguments) first then jumps to the function address that is calculated at linkage.

function arguments are evaluated sequentially, used for debugging, comparing function calls and arguments between the pdp11 and the go version.

the callee finds the last function argument in the registers r0..r3, the others are on the stack in reverse order.

the callee makes space for local variables on the stack and stores the old frame pointer.

the new fp points to the stack below the args and local variables.

as the first argument is in the registers, it will be pushed by any leaf node and can then be referenced by -8(fp)

caller                    count      stack
 push x y, z on r0..r3    (words)    y
                                     x
 jsr pc,#f                2          ret addr              4727 -f
callee
 mov fp,-(sp)             1      +-> old fp               10546
 sub 4+8*#loc,sp          2      |   gap(align)           162706 (4+8*#loc)
                                 |   [locs]
 mov sp,fp                1      |         ^-fp           10605
 (body)                          |   z (body pushes z)
 mov fp,sp                1                               10506
 add (4+8*#loc),sp        2      sp                       62706 (4+8*#loc)
 mov (sp)+,fp             1                               12605
 mov (sp)+,r4             1                               12604
 add 8*-1+#args,sp        2                               62706 (8*-1+#args)
 jmp r4                   1                                 104
total:                           caller(2) callee(12)
access:
 arg0 z                          -8(fp)
 argi y(1),x(2)..                8*i+nloc(fp)
 loci a(0),b(1)..                8*1+i(fp)

while functions always correct the stack pointer on exit, it is still necessary to keep the stack in balance. e.g. niladic functions need to push the registers before the call, otherwise the first leaf node might overwrite the callers last argument. similar for functions that don't return a result: it still needs one pop after the call, or the stack grow's within a loop.

indirect calls

indirect function calls calculate which function to call at runtime.

the compiler creates code that evaluates the arguments first, then calculates the function index and jump to 1004 where the function table starts.

cli:{$[1~#y;4727 -99999;,/1_y],(*y),4727 1004}  /indirect call

the function table header calculates the jump position. it stores the index in r4 and pops the first argument to the registers. it also moves the return address to the top stack position where the callee expects it. then the index is multiplied by 4 and added to the instruction pointer.

1004: mov r0, r4        10004         /store function index
      mov 2(sp), r0     16600   2     /pop first argument
      mov 4(sp), r1     16601   4
      mov 6(sp), r2     16602   6
      mov 10(sp), r3    16603  10
      mov (sp), 10(sp)  11666  10     /move return address
      add #10, sp       62706  10     /shrink stack
      asl r4             6304         /multiply index
      asl r4             6304
      add r4, pc        60407         /index function table
      jmp #f0             104  ..     /jump to function 
      jmp #f1             104
      ..

the intro is followed by jump instructions, one for each entry in the function table.

for the callee everything is in the same position as in a direct function call.

cond

conditionals come in two forms: if and if-else blocks, depending on the number of child nodes:

cnd:{(iff;ife)[3~#y].(x;y)}
iff:{$[(T 1+x)?`bnd`bor;(*y),5700 1401 402;(-4_*y)],62707,(O@2*#y 1),(y 1),po}
ife:{$[(T 1+x)?`bnd`bor;(*y),5700 1401 402;(-4_*y)],62707,(O@2*#b),(b:(y 1),62707,O@2*#y 2),(y 2),po}

as an example this is a function with a simple conditional:

func f(x int32) int32 {
        if x > 3 {
                return 2*x
        }
        return x
}

the function body compiles to:

jsr pc, #2512    4727   2512    /get x
jsr pc, #1116    4727   1116    /push also zeros r0..r3
mov #3, r0      12700      3    /literal 3
jsr pc, #2400    4727   2400    /compare x and y
bgt 1046         3002           /branch if greater (jump over next instr)
add #24, pc     62707     24    /otherwise: jump over if block
jsr pc, #1116    4727   1116    /if block..
mov #2, r0      12700      2
jsr pc, #2512    4727   2512
jsr pc, #1760    4727   1760
add #10, pc     62707     10    /return (jump to epilog)
jsr pc, #1142    4727   1142    /pop
jsr pc, #2512    4727   2512    /get x

switch

switch handles only sequential case expressions 0,1,2.., commonly used to switch over k types. non-matching cases are handled by the default block.

they are compiled by:

swc:{b:(1_y),\-55555 0;brk@-2_,/(!0
 (*y)                                /switch-expr in r0
 22700,O@-1+#b                       /cmp #(n-1),r0
 101402 62707,O@2*#*|b               /jump over default
 *|b                                 /default
 70027 4 10100 60007                 /mul #4,r0;mov r1,r0;add r0,pc (jump to case offset)
 ,/62707,/'O(4*|-1_!#b)+2*-2_0,+\#'b /branch offsets
 ,/-1_b                              /cases
 po)}                                /pop switch expr

every case gets an autmatic break statement appended. the default branch is handled first and needs a conditional. the remaining cases are implemented as a jump table.

break (and continue) placeholders are replaced by

brk:{j:(#x)-2+i:&x=-55555;x[i]:62707;x[1+i]:O 2*j;x}   /replace -55555 0(break)    with add #n,pc
cnt:{j:2+i:&x=-44444;x[i]:162707;x[1+i]:O 2*j;x}       /replace -44444 0(continue) with sub #n,pc

loop

loops have two forms: a simplyfied loop with a post conditional (like do-while in c) and a normal loop with a pre conditional. they are compiled by:

for:{(lop;slp)[I x].(x;y)}                    /slp:do-while; lop:while-do
lop:{p:$[#*y;po;!0];(cnt brk $[#*y;(y 0),5700 1002 62707,(O@2*#b);!0],b:(p,,/|1_y),-44444 0),p}
slp:{c:(*y),5700 1402 162707;b:po,cnt brk(-2_y 2),y 1;  pu,b,c,O 2+2*(#b)+#c}

io

io routines include reading a line from the terminal and printing output.

they are:

read: 12604 11600 10001 62706 10 105727 177560 100375 112721 177562 1375 160100 5400 5300 5001 104
write:12604 11601 62706 20 5700 1404 112127 177566 77003 5000 5001 104

io in pdp11 is memory mapped, that means moving data to/from a special memory location is all that needs to be done. tstb 177560 returns if input is available, moving a byte from 177562 reads a character.

these are implemented as hot loops, but the emulator stops and waits for input on the test instruction. the emulator continues when a newline is present.

linking

the compiler generates instructions as 16bit octal integers disguised as normal k ints, which are 32 bit. the domain ranges from 0 to 177777 (65k) all unsigned, which means negative integers are outside the domain of the code.

we use negative integers as unresolved addresses which are fixed during the last step.

e.g. small negative numbers are function addresses, -1 is the first function (0 is main, never called).

other negative numbers have a fixed meaning and are replaced by built-in microcode, such as push:-99999, get-literal:-77000, memcpy:-55000, add:-50000, etc.

emulator

the emulator is written in javascript. it's not a general pdp11 emulator, but grew along with the compiler and just implements what is needed.

kos has been manually booted. the code was loaded at 1000 and executed until the repl was about to read.
at this time all memory content was downloaded to an image k.img that is now loaded every time the page loads, ready to evaluate k input from the terminal.

it has 3 speeds: one instruction per second, one per animation frame, and unrestricted, as fast as possible without animation.
after printing the result, it shows an instruction histogram.

the first two also make noise (or music, that depends). they set the frequency according to the current instruction in octal scaled by 0.334*22050/0177777.
special recognition has the clz microcode in the allocator to calculate the bucket size.

ktye/kos for pdp11 2024.01.01

contents