Kai language
============
- inspired by concept of writing own langauge
by Jonathan Blow writing Jai
- inspired by state of C/C++ with all the UB
and complexity of C++11 to C++20
- too much legacy
- too many ways to do the same thing
- inspired by Nikolas Wirth interview in 2018
on youtube
Motivation:
C++ complexity, how many ways to just initialize a variable?
- with C++11 there are atleast 3 ways to initialize an int.
- atleast 4 ways to init a struct or class
int main()
{
struct Struct1 { int x = 0; };
struct Struct2 { int x; };
struct Object { Object(int a = 0) {} };
int x1 = 0;
int x2 = { 0 };
int x3{ 0 };
Struct1 s1;
Struct2 s2 = { 1 };
Struct2 s3{ 1 };
Struct2 s4 = Struct2{ 1 };
Object obj1;
Object obj2(1);
Object obj3{1};
Object obj4 = Object(1);
Object obj5 = Object{1};
}
- the preferred way and styles keep changing
- perhaps it is too much to design something which is enduring
- backwards compatibility is great but C++ doesn't recognize that backwards
compatibility has two edges - there is the backwards compat of the code
with new compilers, but also the backwards compat of running new code on
old machines. Not everywhere will get the new support for new compiler features
- if it is just syntactic sugar, backwards compat with old machines could be
maintained by having code adapters / converters, but no such thing exists
(originally C++ was implemented with CFront which was this, but those days
are long gone). Nothing converts C++20 in to C++98 or C but theoretically
it is possible, but not a great idea anymore.
- C was great because it thinly abstracted the underlying machine which allowed
real progress because routine code didn't need to be coded in assembler.
- C++ hitched a ride on the success of C and tried to elevate concepts to higher
levels. Unfortunately this muddies the design of C, in C it is WYSIWYG in that
when you add two variables you get addition, when you call a function, it goes
to that function etc. C is close to the machine. C++ allowed for C code to
compile with its compilers, so all C code was essentially C++ code too, and
so it was a type of embrace and extend strategy that has continued with the
evolution of C++. Generally embrace and extend is an evil strategy but it works.
It is a kind of parasitic strategy. It is retrospect that when it succeeds it
can be seen as evil, but failed or less successful attempts to embrace and
extend don't seem to be quite seen that way. There were other attempts to
embrace and extend C, but the most successful was C++ which became the main
line branch on which C continues. Java on the otherhand broke backwards
compatibility so was not really applying an embrace and extend strategy, so
was less evil and more pure in this respect and became wildly successful.
However Java is not Native, but can get native like performance but you also
must take with it its memory model which can hurt performance.
- C++ was designed around mutation which is bad. It is bad from a performance
point of view for the current evolution of CPUs. This needs to be explained
a bit. C++ has classes, a class is like a struct, but with member functions.
These member functions change the state of the member variables. This is how
C++ is taught at university.
class Color { public: setColor(int a,..) { r = a; } private: int r,g,b; }
In C you might not write it in such a way if you know what you are doing.
In functional programming, you don't mutate the state, you return a new
object each time. With CPUs we will reach a point where single threaded
performance plateaus but multi-threaded performance continues to grow.
Programming needs to be more multi-threaded. The issue with multi-threaded
programming is sychronization of memory between threads. A CPU thread can
work fast if it is operating on memory that no other thread is operating
on. Things that are referenced that mutate memory are bad for this.
Local variables on the stack are in the set of memory that a single thread
uses so these are fine to mutate. Pointers to things which can be mutated
or any global mutable state are bad. A object of a class in C++ though
is pretty much this, the this pointer is a pointer to mutable state.
If you have a pure function which doesn't mutate any of the input parameters
or any global state, and then outputs a new value, you can safely run this
on any thread. All of the program if written like this can more easily be
made to run different parts of the program on different threads.
Example in C:
struct Color { int r,g,b; };
Color newColor(int r, int g, int b) { return Color{r,g,b}; }
Color setRed(const Color& col, int r) { return Color{r,col.g,col.b}; }
In the above example, setting the red value didn't mutate the orginal color
object, instead it created a new color object with the specified red value.
Cross-cutting concerns, aspect oriented programming, subject-oriented programming:
Wikipedia: Subject-oriented programming as a "third dimension"[edit]
Method dispatch in object oriented programming can be thought of as
"two dimensional" in the sense that the code executed depends on both the
method name and the object in question. This can be contrasted[20] with
procedural programming, where a procedure name resolves directly, or one
dimensionally, onto a subroutine, and also to subject oriented programming,
where the sender or subject is also relevant to dispatch, constituting
a third dimension.
So concept here is dispatch depends on current context, not just the object.
Aspect oriented programming is simpler, it gives a way to intercept function
calls and inject code in to the preamble and postamble of any function. This
might be something which could be adapted to be used in a functional programming
context.
Design goals:
At first this langauge will be implemented as
a translator which parses this and outputs C code
which is then compiled with a C compiler (or C++ compiler).
Rational - C compilers are mature at creating
optimized code and will do better. Also aides in
debugging the output - checking with linting tools etc.
Can also be used to generate C code for use perhaps.
Perhaps extension to this is outputting javascript,
although enscripten/wasm allows converting C/C++ to
web, perhaps the quality of javascript code from
something like this might be better.
A later stage of development will be able to
output directly machine code without converting to
C first, perhaps by adapting the output to be the
IR (intermediate representation) for another compiler
tool chain such as LLVM or GNU.
Another direction perhaps is using this as a scripting
langauge, so if can build an interpreter, or similar is
JIT converting to JS and running it via V8 or similar.
And another stretch goal might be to be able to convert
to GLSL/CUDA/OpenCL to execute on a GPU or some mix.
Definately some modules will make writing something similar
to shader code in this language possible.
Proposed language grammar/syntax sample
Syntax inspired by a mix of pascal and C
a context free grammar
ideally no preprocessor, code can be processed in a single parse
where practical, fewer characters are required for a task
template programming should feel like normal programming
language designed to prevent common programmer errors
Functions are pure functions
- they do not mutate the input parameters
- they do not mutate any global state
- they can output a new object
- because they don't mutate global state or inputs, then the only thing a function
can do is return something. This means that an expression will do nothing unless
there is both a rhs and lhs. So need to be assigning every expression. eg: can't
call log("some message"); how to relax this or still make useful?
How about point-cuts? Aspect-oriented programming? It might solve logging.
Are callbacks a thing in functional programming?
Other Features / Inspiration:
Rust
- unsafe keyword to mark area using raw pointers etc.
- heavy use of static analysis
- but static analysis is not zero cost
- cost to compile time is still a cost (not that C++ is fast to compile)
- range-based loops can avoid range checks
- data locking, not code locking
- initialization of all variables
Qt
- ownership of objects - parent/child relationship, so clear ownership semantics
of pointers, so easy memory management
D
- slices - views in to arrays
Rust
- cargo - build for multiple architectures / unknown architecture
Variable initialization
- OS when giving pages to a process needs to clear/wipe them first anyway
- would be a security risk if a page of memory that was used by
another process was handed to us without being cleared
- within our process when memory is freed and then reused, it might not
be cleared.
- what reasons might it be valid to not initialize the memory of variables?
- when a large allocation happens, it might just be virtually mapped,
but after the first write it may then need to actually be assigned.
- what needs to be avoided is reading before the memory has been written.
- it can be avoided if we know it will be zero when we first read it.
- OS policy might make this so, so can avoid clearing to zero again
and avoid causing the memory to be committed. (if it is on demand
allocated when written to)
- imagine alloating GBs of memory, perhaps via a file mapping. That memory
is just page table entries at first. When write to a specific address in
the range, on the first write a fault occurs which is caught and this
causes the OS to then find a free page which it then clears and then
resumes execution at the instruction that caused the fault which will
retry writing to that address which will then write to that page. Clearing
the memory will cause the OS to then try to allocate all those pages,
even if virtually they exist and if were attempted to be read would have
returned zero, so in a virtual sense are all cleared already. It would
be a waste. But it wouldn't be much of a waste if we did immediately
need all of that memory and began reading and writing to all of it.
Dlang mixins
- instead of the C preprocessor and C++ templates, a mixin can be a function
which generates code from strings. A template mixin is like a template
pattern with substitution using the parameters to insert this in to the
program. Instead of typename T, they use 'alias' as a generic type.
- D mixins are not as general or as powerful as what can be done with the C
preprocessor, but that may be deliberate so as they are more hygenic and
are also not another language. There is more safety using mixins.
- D template mixins are not as powerful as C++ templates, as there are
limitations about type inference that C++ templates can use to generate
code based on more nuanced context. D template mixins are good for simple
cases.
TypeScript mixins
- similar to composition of interfaces with default implementations
- side-steps the multiple-inheritance diamond problem
TypeScript tsconfig.json
- instead of Makefiles, a json file has compiler flags and list of sources.
- perhaps defining how to invoke the compiler is part of the language?
Vue/webpack/hot-reload
- the hot-reload feature is great for fast prototyping
TypeScript & SASS transcription
- great idea to maintain compat, 'compiles' down to js/css
Refactoring tools
- being able for tools to parse the code is important and helps with better
tooling.
- eg: resharper
- rename variables, functions etc.
- reformat code according to a style guide
- alphabetically order imports or other unordered lists
- format tables to be aligned
- consistent whitespace and other style
- generate various code blocks automatically,
- eg: switch statement for a given enum
- a new function or module
GraphQL
- defined schema, query to this, result in preferred format
extensible
Atomontage
- point clouds
ECS
- Entity component systems
AOS/SOA
- structs/arrays - packing, access - easily switchable without large code changes (Jai)
Qt
- UI framework and libraries
Reflection/Object-tree/property pages
- reflect to UI the state in memory
Rendering backends
- OpenGL/Vulkan/DirectX/Metal
Scene-Graph
- rendering driven by a logical scene-graph (in-memory might not be a graph/tree)
SIMD optimizations
- instruction level optimization
Multi-threading
- automatic thread / worker splitting / dynamic
web
- integrate with websockets and browsers/html etc
security
- security primatives supplied. Makes it easy to make things secure.
Compile-time checks. Static strong typing.
Static pointer/reference checking. Bounds checking.
Interface/Implementation:
Binary only modules?
Need to be able to have either:
- module with source for interface and implementation
- or source of the interface and binary of the implementation
How to handle different platforms/cross-platform with binary distribution?
Publish process probably should be able to sign the binary
Publish ideally not to machine code, but to binary IR which can be
compiled and optimized for target.
Perhaps binary might be compiled C/C++ code, and the interface is just
the equivelent Kai for the interface in C/C++. This means linkage
with other languages is possible.
Mutation concept:
In games, there are frames. Perhaps in a given frame there is the state.
Then all the mutations are gathered for given input and given deltas and recalcs and are queued as
a list of transactions to apply ready for the next frame.
Then for the next frame the state is recalculated by appling all the mutations.
I wonder if the concept could be applied to not just games.
It means that can not make a state change request and then immediately get the new state. Must wait
a frame for this, so not synchronous, but async as to when a change happens.
class State
{
int GetValue() const { return m_value; }
State SetValue(int a_newValue) { State s = this; s.m_value = a_newValue; return s; }
int m_value;
};
class Frame
{
State m_state;
};
void frameLoop()
{
Frame current;
Frame next;
while (true)
{
// render using current
// operate / update based on current
next.m_state = current.m_state.SetValue(10);
// swap
current = next;
}
}
Now imagine if make multi-threaded. Avoids race between readers and writers to current.
However still race between multiple writers to next.
Nullable concept:
Is nullable a requied concept? In a pointer based or references based language/system it
can become required. But in a stack based system, there are no pointers so no concept of null.
The object always exists. So pointer based objects can be not pointed to, which is null, or
can point to an uninitialized or empty object, or can point to a valid object. A stack based
system can not implicitly get a null for free, but is it desirable. Null pointers are one
of the problems of pointer based systems. If fail to check a pointer and then dereference it,
the program will likely crash. But then the code becomes littered with null checks decreasing
readability. Both a real problems with having pointers. And it is not limited to pointers,
the same applies to references which are pointers in disguise. It is only by convention that
in C++ it is assumed a reference is a non-null pointer and an actual pointer is a nullable one.
You can have a function which takes a reference, and with a pointer to an object, dereference it
and pass it to this function without doing a null check and the reference will be a null one.
It is by convention that you wouldn't do that and would null check the pointer before
dereferencing it to ensure there are no null references, only nullable pointers.
Is a null semantically just meaning empty or invalid? Sometimes a value could have sentinel
value which means uninitialized or invalid or empty. Like a string which is empty or zero length.
But how about an int. A value of 0 or -1 might be valid so can not be the sentinel value to mean
empty or invalid. An extra bool needs to be attached to this to handle the concept. This might be
quite a cost in space for small objects but may be preferable to having pointers to individual ints.
With floats, there is a special sentinel value of NaN which could be used for this.
For a bool, it would need to be one of three values, null, true or false. In JavaScript, these
three are three different types as apposed to values. Javascript is not implicitly pointer based,
but the values that are used are dynamically typed which means the type is carried around with
the value. This allows coding in to the type a null state. The size of a null, true or false value
will all be the same size which could be a byte.
So there are two kinds of safety to consider. Run-time safety to prevent crashing, and semantic
safety to ensure correctness of the meaning/interpretation of the data. Run-time safety can be
ensured by not having nulls, instead relying on sentinel/default values which might be also valid
values. This sacrifices semantic safety as it is ambiguous as to if the sentinel is valid or not.
Semantic safety comes at the cost of runtime safety if it can not be guarenteed that all required
null checking has been done.
A language could perhaps give semantic safety and statically ensure runtime safety at compile
time by static analysis of the dereferences to ensure they are after the required null checks.
This might not need to be too costly if designed in to the language. This can be achieved with
C++ and other languages with linting/analysis tools, but it is not a defined compile time error
by the language, so it is an optional thing.
Rust is a good example of a language which does this.
Optional - compiler checks that you have checked it is not None
Borrow checker - can have one mutable reference and no const references, or can have one or many
const references. Can then pass things using the restrict keyword to say they don't alias.
In kai:
Int32! X; // nullable value
if (X)
{
Int32 x = X; // can now use it
}
translation:
template <typename T>
struct Nullable
{
bool m_isNull;
T m_value;
};
Nullable<Int32> _X;
if (!_X.m_isNull)
{
Int32& X = _X.m_value;
Int32 x = X.m_value;
}
Mutex - lock data, not code.
Mutex is added to a struct with templates and need to lock to use the data
The lock call on the data returns the accessible data with borrow-checker lifetime of the current scope.
Implementation:
In Kai code:
type Time := Int32; // strong typedef
type Duration := Time; // strong typedef, derivative type
Int8 x;
Time t;
x := 1;
t := 1;
Translates to:
Basic types:
struct Int8 { int8_t m_value = 0; } // always initialized
struct Int16 { int16_t m_value = 0; }
struct Int32 { int32_t m_value = 0; }
Derived types:
struct Time { int32_t m_value = 0; }
struct Duration { int32_t m_value = 0; } // The derivative type
Consistent access is always with .m_value to get to the value.
Int8 x;
Time t;
x.m_value = 1;
t.m_value = 1;
Rust lifetimes
Foo a;
let b = a; // lifetime of 'a' ends here
In C++:
b = std::move(a);
But in C++ it is up to the programmer to not use 'a' after this.
It is similar to swap. If 'b' was originally uninitialized, it would be
bad to use 'a' now.
The goal is to avoid doing deep-copies to transfer values from one
variable to another.
Using references has the problem of ownership semantics. Which is the
owning reference.
Another solution is shared_pointers. The cost is loss of explict control
of when the deallocation happens.
Syntax:
'//' - end of line comment
'/*'...'*/' - multi-line comment
Keywords:
- module, interface, enum, type, struct, implementation, function, tests
In C/C++, signed integer addition/subtraction have undefined behaviour
int x, y;
x = x + y; // UB, compile can set fire to computer if it wants
unsigned x, y;
x = x + y; // not-UB, this is fine
Assumes 2's complement int representation
- if a machine has an unusual representation, then users of
that machine pay this price not everyone else with normal hardware
- in many cases, sign vs unsigned are no different
- can trivially cast to unsigned, perform the operation, and cast back
- this avoids UB if one can assume 2's complement
In C, code like this can be accidently written:
if ( a = b )
To prevent accidental assignment when wanting to compare:
:= assignment
= comparison
Because functional style language, there are no ++ or += etc
Assignment only happens to a new variable. Perhaps can infer the
type.
In C++ templates have been tacked on, templated functions are written like this:
template <typename T>
T func(T a, T b)
{
return a + b;
}
More natural would be:
func(type T, T a, T b) => T
{
return a + b;
}
So in this way, a type is itself a type that can be passed around. This is C# like.
Iteration, C#, python and C++11 range-based loops:
for (x : arrayVector)
for (x in array)
for (x in range(0, 100))
map/reduce/filter, co-routines ?
how to make things like map/vector etc with functional programming?
vector might be a pointer + length
if add elements to it, then it might be the same pointer if there is space at the end
the original vector is unaltered if it contains the original pointer + original length
the new vector contains the same pointer but a new length
same with removing elements at the end
a linked list might be possible if can store some kind of edit history
perhaps think of transactional DBs and journals. Perhaps keep a journal until
no more references to the original.
actually, really just create a new vector/container
map,reduce,forEach,filter,zip,some/exists,every,find
creating a new copy of a vector is fine if the size is not too large or the reduced size is small
the internal implementation details can't construct a new vector and still be functional. I think
the constraint is on the passed in parameters, not on the local variables, so a local variable can
be mutated just fine. That is because it is on the stack and another thread shouldn't have a pointer
or reference to it. It begs the question, how to have objects with functions which mutate them but
restrict calling them. Perhaps the trick is that parameters to a function will always be passed by
const reference, and therefore can never be modified. We can always return by value (RVO will make
this cheap). Constructing the returned value can be done as the local variable will not be const
and can then call anything on it which mutates it (or constructs it).
sized types only (for portability and explicitness - but limits targetting really low-end devices)
null, bool, int8, int16, int32, int64, flt32, flt64
perhaps profiles? Can mark a module as not using floats or anything 64bit?
string is a first class type
string_view is not needed, strings contain the length so can construct a string from another string
strings are utf8
string formats are translatable
because functional, only operators which make a new output,
eg: a = b + c
not: a += b // this mutates a
so operator overload is limited to a few less things
+ (T, T) -> T
- (T, T) -> T
* (T, T) -> T
/ (T, T) -> T
% (T, T) -> T
null as a type and a value?
void?
pointers - ?
if functional, are pointers a thing?
Some concepts from rust could be 'borrowed'
borrowed pointers if need pointers
crates
unsafe
static_asserts supported
inline C code perhaps - so can mark as unsafe?
keywords:
module, interface, enum, type, struct, implementation, function, tests
alpha:
[a-zA-Z]
num:
[0-9]
alnum:
alpha|num
identifier:
alpha alnum*
number:
num [0-9a-fA-F]* [h
EBNF
Starting from a Pascal example:
Lexical Grammar
<letter> ::= [a-ZA-Z]
<digit> ::= [0-9]
<id> ::= <letter> { <letter> | <digit> | "_" }
<digits> ::= <digit> { <digit> }
<integer literal> ::= <digits>
<string literal> ::= "\"" { < a char or escape char > } "\""
<real literal> ::= <digits> "." <digits> [ "e" [ <sign> ] <digits>] <string literal> ::= "\"" { < a char or escape char > } "\""
<literal> ::= <integer literal> | <real literal> | <string literal>
<operators> ::= "+" | "-" | "*" | "%" | "=" | "!=" | "<" | ">" | "<=" | ">=" | "(" | ")" | "[" | "]" | ":=" | "." | "," | ";" | ":" |
"!" | "||" | "|" | "&" | "&&" | "^" | "{" | "}"
<keywords> ::= "if" | "then" | "else" | "of" | "while" | "do" | "begin" | "end" | "var" | "array" | "procedure" | "function" | "program" | "assert"
<predefined id> ::= "Boolean" | "false" | "integer" | "read" | "real" | "size" | "string" | "true" | "writeln"
Context-Free Grammar
<program> ::= "program" <id> ";" <block> "."
<declaration> ::= "var" <id> { , <id> } ":" <type> |
"procedure" <id> "(" parameters ")" ";" <block> |
"function" <id> "(" parameters ")" ":" <type> ";" <block> <parameters> ::= [ "var" ] <id> ":" <type> { "," [ "var" ] <id> ":" <type> } | <empty> <type> ::= <simple type> | <array type>
<array type> ::= "array" "[" [<integer expr>] "]" "of" <simple type>
<simple type> ::= <type id>
<block> ::= "begin" <statement> { ";" <statement> } [ ";" ] "end"
<statement> ::= <simple statement> | <structured statement> | <declaration>
<empty> ::=
<simple statement> ::= <assignment statement> | <call> | <return statement> |
< read statement> | <write statement> | <assert statement>
<assignment statement> ::= <variable> ":=" <expr> <call> ::= <id> "(" <arguments> ")"
<arguments> ::= expr { "," expr } | <empty> <return statement> ::= "return" [ expr ]
<read statement> ::= "read" "(" <variable> { "," <variable> } ")" <write statement> ::= "writeln" "(" <arguments> ")"
<assert statement> ::= "assert" "(" <Boolean expr> ")"
<structured statement> ::= <block> | <if statement> | <while statement> <if statement> ::= "if" <Boolean expr> "then" <statement> |
"if" <Boolean expr> "then" <statement> "else" <statement> <while statement> ::= "while" <Boolean expr> "do" <statement>
<expr> ::= <simple expr> |
<simple expr> <relational operator> <simple expr>
<simple expr> ::= [ <sign> ] <term> { <adding operator> <term> }
<term> ::= <factor> { <multiplying operator> <factor> }
<factor> ::= <call> | <variable> | <literal> | "(" <expr> ")" | "not" <factor> | < factor> "." "size" <variable> ::= <variable id> [ "[" <integer expr> "]" ]
<relational operator> ::= "=" | "<>" | "<" | "<=" | ">=" | ">" <sign> ::= "+" | "-"
<adding operator> ::= "+" | "-" | "or"
<multiplying operator> ::= "*" | "/" | "%" | "and"
Discussion:
UI should be an included part of the language / libraries.
Too bad C/C++ do not have a solution for this. They miss out
on learning about real problems programs need to solve to
be useful. For example i18n. C/C++ does a terrible job about
this, but despite attempting to be general, they are simply
not. This is because they don't even attempt to tackle UI
so it isn't even evident that they fail woefully at i18n.
For example uint8_t is an unsigned char, so stream operators
acting on it treat it like a char instead of an integer
value. There are subtle differences like that that make it
hard to do utf right. It can be done, but you have to fight
against C++ to do it. It shows horrible bias, leaning towards
anglo european users and is not inclusive of large portions
of the world that do not used latin based langauges. The
defacto standard for cross-platform UI is to use Qt which
helps solve these i18n issues. It mostly ignores C++ and
creates its own QString class. This causes fractures in
the whole C/C++ egosystem with code that is using Qt and
code which is not. Similar fractures occur with Boost.
Some developers avoid it like the plague and unwilling to
allow it to infect their code base, whilst others embrace
it. Again causing another division. As C++ has been evolving
Boost has proven to be a good testing ground for new C++
features and some of those have been woven in to C++.
This evolution does cause some ugly side-effects with
code which used Boost and migrates to the standardized
C++ versions of these features supporting both variations
with conditional compilation. This can be as a result of
on different platforms the adoption of those features in
the compilers happening at a different pace. Ideally a
more formal feature testing/preview system would be better
like the TR technical review, but on a larger community
wide contribution scale.
Governance - community consultation. RFCs. Preview
implementations. Test-bed servers, unit tests, regression
test suites. A reference implementation, not just a
standard which is a document. Aim for zero undefined
behavior. Everything is defined or it is a compile error.
Take a modern approach of targetting modern machines,
not hyperthetical machines or esoteric machines. The
code has to work on real machines to solve real problems
and now. Adding two ints in C++ can have undefined
behaviour if they overflow and the compiler is free to
do what it likes in this case. That is a crazy thing
as a programmer to accept, particulary given that
nearly any machine being targetted today or even in the
last several decades uses 2's complement and so is
pretty well defined at the machine level.
Bootstraping - how much of the libraries should be in
the language. Some part will be Core and the rest the
wider language / libraries.