Mark Alan Richards

Complex Primitives

computer-science

I have a crazy idea: create a cross-platform language, no not Java: something better. Primitives are supposed to the simplest form of data in a programming language. So how hard can it be to work with them...

Typical representations

Boolean is complex

In typical computing systems, everything is a 0 or a 1, except usually nothing is.

CPUs typically look at numbers at bit sets of length 8, 16, 32 or 64: not usually 1.

Although most have somewhere that this doesn't hold true, either with longer primitive sizes (128/256), special floating point versions (56, 80, ...) or slightly weird 31 bitset sizes (bad IBM).

The easiest way to manage boolean is to choose 0 as true or false (often false) and anything else as the opposite. However, what size of bitset do you use? If you use the defacto int then it might be different in different compilers (32bit vs 64bit).

Luckily, all you need to know is the size of the bitset and the offset in memory.

Integer is much more complex

So boolean is an offset in ram and a size of the bitset to use: all 0s then it's false and anything else it is true.

Integers share the problem that you need to know the size of the bitset, but suffer a further problem: order and signing.

The signed number part is simple: a bit is reserved at the most significant bit to be used for representing positive (zero) or negative (1: with 2s complement), but order gets complex..

Bits and Byte Ordering

So what is order: well that "most significant bit" is the problem. Endian of bits and bytes comes into play (and they're not always the same as each other).

The order of bits varies between processors and usually this problem is something that is more likely to affect you at a very low level (drivers, hardware, etc): to make it more fun most computers have more than one cpu. Your sound card, graphics card, network card, etc might all see bits a different way around: nevermind the busses.

At the software level you usually find everything is the same (let me know if I've got this wrong), but here you suffer byte order differences where different protocols (network, inter-process, file  formats) can each represent things larger than a byte in different order.  This isn't too difficult to solve (https://github.com/markalanrichards/bitcoin/blob/master/src/compat/byteswap.h) you just need to remember to use it everywhere your program interfaces with the world.

Floating Points

Luckily floating points are strictly described in IEEE-754, well I say luckily: except the complexities of implementing it mean that not all languages actually adhere to it: https://en.wikipedia.org/wiki/Criticism_of_Java#Floating_point_arithmetic

Characters

Characters and Strings are terrifying.

There are hundreds (maybe thousands) of character sets: in a large part because some base character sets (Latin1) have multiple versions for different languages. Not all are easy to work with (I remember something odd about Turkish EBCDIC and xml processing problems as symbols can be remapped). The simplest solution is to make everyone fit into a box and force UTF-8: then hope that nobody adds a BOM, let's hope UTF-8 never gets deprecated.

References

So you have a reference to some data.... how do you reference it and what kind of data might you have to reference:

The reference could be a nice compile time fixed size (like a pair of integers, I mean a pair of 64 bit integers).

It might be a variable size (String of characters) holding some JSON: so maybe a block of memory.

Or it might be a continuous stream (/dev/urandom)

Or it might be a channel of offset data (File on disk) with parts that might no longer be available later in the day or new parts that arrive whilst reading.

It's easiest to manage the fixed size case (c style) and then re-use the fixed size blocks for streams of data, but sometimes you need more complex references like File handles.

So a VLQ (https://en.wikipedia.org/wiki/Variable-length_quantity) might do for the simple case, and then a VLQ that contains References to further VLQs might be usable for the rest of the use cases.

Arrays and Lists

I don't think I really consider these to be primitive types

Great, I don't need to write these in my language: I can borrow them? Well maybe, except when it comes to mixing primitives with polymorphic types: well maybe I can still use them I guess I can just put the type into the box as the first entry.

It's all in the bus

Eventually, much of the data you use ends up moving through the buses on your system and they have different sizes and then on top of that you get fixed sized pages that move over buses, which you hope are an integer multiple of the bus size: typically the one everyone knows that isn't is the MTU (which varies between broadband, ethernet and modem systems).

So when you use these complex primitives you might not want to just use the language primitives: but optimise for the bus/packet/page sizes involved. Should these be primitives? Well that might depend on your architecture and for a cross platform language I guess you should let it be a language specific optimization to curry in an outside primitive.

So how to solve this for writing a new language?

Copy Scala and Groovy: use the JVM to solve this for you and give you a consistent view of the world and force everyone to map using Java data structure until later... although I'm tempted to checkout the CLR/Mono too.