Import/research/reflection/docs at 6c36eb4444568cc1b6fd763f75fbc38eff7bcc94

latest commit 4dc75bdc0f John authored on 29 Dec 2020
	..
	README.txt	bulk import from macbookpro checkouts	5 years ago

Kai language ============ - inspired by concept of writing own langauge by Jonathan Blow writing Jai - inspired by state of C/C++ with all the UB and complexity of C++11 to C++20 - too much legacy - too many ways to do the same thing - inspired by Nikolas Wirth interview in 2018 on youtube Motivation: C++ complexity, how many ways to just initialize a variable? - with C++11 there are atleast 3 ways to initialize an int. - atleast 4 ways to init a struct or class int main() { struct Struct1 { int x = 0; }; struct Struct2 { int x; }; struct Object { Object(int a = 0) {} }; int x1 = 0; int x2 = { 0 }; int x3{ 0 }; Struct1 s1; Struct2 s2 = { 1 }; Struct2 s3{ 1 }; Struct2 s4 = Struct2{ 1 }; Object obj1; Object obj2(1); Object obj3{1}; Object obj4 = Object(1); Object obj5 = Object{1}; } - the preferred way and styles keep changing - perhaps it is too much to design something which is enduring - backwards compatibility is great but C++ doesn't recognize that backwards compatibility has two edges - there is the backwards compat of the code with new compilers, but also the backwards compat of running new code on old machines. Not everywhere will get the new support for new compiler features - if it is just syntactic sugar, backwards compat with old machines could be maintained by having code adapters / converters, but no such thing exists (originally C++ was implemented with CFront which was this, but those days are long gone). Nothing converts C++20 in to C++98 or C but theoretically it is possible, but not a great idea anymore. - C was great because it thinly abstracted the underlying machine which allowed real progress because routine code didn't need to be coded in assembler. - C++ hitched a ride on the success of C and tried to elevate concepts to higher levels. Unfortunately this muddies the design of C, in C it is WYSIWYG in that when you add two variables you get addition, when you call a function, it goes to that function etc. C is close to the machine. C++ allowed for C code to compile with its compilers, so all C code was essentially C++ code too, and so it was a type of embrace and extend strategy that has continued with the evolution of C++. Generally embrace and extend is an evil strategy but it works. It is a kind of parasitic strategy. It is retrospect that when it succeeds it can be seen as evil, but failed or less successful attempts to embrace and extend don't seem to be quite seen that way. There were other attempts to embrace and extend C, but the most successful was C++ which became the main line branch on which C continues. Java on the otherhand broke backwards compatibility so was not really applying an embrace and extend strategy, so was less evil and more pure in this respect and became wildly successful. However Java is not Native, but can get native like performance but you also must take with it its memory model which can hurt performance. - C++ was designed around mutation which is bad. It is bad from a performance point of view for the current evolution of CPUs. This needs to be explained a bit. C++ has classes, a class is like a struct, but with member functions. These member functions change the state of the member variables. This is how C++ is taught at university. class Color { public: setColor(int a,..) { r = a; } private: int r,g,b; } In C you might not write it in such a way if you know what you are doing. In functional programming, you don't mutate the state, you return a new object each time. With CPUs we will reach a point where single threaded performance plateaus but multi-threaded performance continues to grow. Programming needs to be more multi-threaded. The issue with multi-threaded programming is sychronization of memory between threads. A CPU thread can work fast if it is operating on memory that no other thread is operating on. Things that are referenced that mutate memory are bad for this. Local variables on the stack are in the set of memory that a single thread uses so these are fine to mutate. Pointers to things which can be mutated or any global mutable state are bad. A object of a class in C++ though is pretty much this, the this pointer is a pointer to mutable state. If you have a pure function which doesn't mutate any of the input parameters or any global state, and then outputs a new value, you can safely run this on any thread. All of the program if written like this can more easily be made to run different parts of the program on different threads. Example in C: struct Color { int r,g,b; }; Color newColor(int r, int g, int b) { return Color{r,g,b}; } Color setRed(const Color& col, int r) { return Color{r,col.g,col.b}; } In the above example, setting the red value didn't mutate the orginal color object, instead it created a new color object with the specified red value. Cross-cutting concerns, aspect oriented programming, subject-oriented programming: Wikipedia: Subject-oriented programming as a "third dimension"[edit] Method dispatch in object oriented programming can be thought of as "two dimensional" in the sense that the code executed depends on both the method name and the object in question. This can be contrasted[20] with procedural programming, where a procedure name resolves directly, or one dimensionally, onto a subroutine, and also to subject oriented programming, where the sender or subject is also relevant to dispatch, constituting a third dimension. So concept here is dispatch depends on current context, not just the object. Aspect oriented programming is simpler, it gives a way to intercept function calls and inject code in to the preamble and postamble of any function. This might be something which could be adapted to be used in a functional programming context. Design goals: At first this langauge will be implemented as a translator which parses this and outputs C code which is then compiled with a C compiler (or C++ compiler). Rational - C compilers are mature at creating optimized code and will do better. Also aides in debugging the output - checking with linting tools etc. Can also be used to generate C code for use perhaps. Perhaps extension to this is outputting javascript, although enscripten/wasm allows converting C/C++ to web, perhaps the quality of javascript code from something like this might be better. A later stage of development will be able to output directly machine code without converting to C first, perhaps by adapting the output to be the IR (intermediate representation) for another compiler tool chain such as LLVM or GNU. Another direction perhaps is using this as a scripting langauge, so if can build an interpreter, or similar is JIT converting to JS and running it via V8 or similar. And another stretch goal might be to be able to convert to GLSL/CUDA/OpenCL to execute on a GPU or some mix. Definately some modules will make writing something similar to shader code in this language possible. Proposed language grammar/syntax sample Syntax inspired by a mix of pascal and C a context free grammar ideally no preprocessor, code can be processed in a single parse where practical, fewer characters are required for a task template programming should feel like normal programming language designed to prevent common programmer errors Functions are pure functions - they do not mutate the input parameters - they do not mutate any global state - they can output a new object - because they don't mutate global state or inputs, then the only thing a function can do is return something. This means that an expression will do nothing unless there is both a rhs and lhs. So need to be assigning every expression. eg: can't call log("some message"); how to relax this or still make useful? How about point-cuts? Aspect-oriented programming? It might solve logging. Are callbacks a thing in functional programming? Other Features / Inspiration: Rust - unsafe keyword to mark area using raw pointers etc. - heavy use of static analysis - but static analysis is not zero cost - cost to compile time is still a cost (not that C++ is fast to compile) - range-based loops can avoid range checks - data locking, not code locking - initialization of all variables Qt - ownership of objects - parent/child relationship, so clear ownership semantics of pointers, so easy memory management D - slices - views in to arrays Rust - cargo - build for multiple architectures / unknown architecture Variable initialization - OS when giving pages to a process needs to clear/wipe them first anyway - would be a security risk if a page of memory that was used by another process was handed to us without being cleared - within our process when memory is freed and then reused, it might not be cleared. - what reasons might it be valid to not initialize the memory of variables? - when a large allocation happens, it might just be virtually mapped, but after the first write it may then need to actually be assigned. - what needs to be avoided is reading before the memory has been written. - it can be avoided if we know it will be zero when we first read it. - OS policy might make this so, so can avoid clearing to zero again and avoid causing the memory to be committed. (if it is on demand allocated when written to) - imagine alloating GBs of memory, perhaps via a file mapping. That memory is just page table entries at first. When write to a specific address in the range, on the first write a fault occurs which is caught and this causes the OS to then find a free page which it then clears and then resumes execution at the instruction that caused the fault which will retry writing to that address which will then write to that page. Clearing the memory will cause the OS to then try to allocate all those pages, even if virtually they exist and if were attempted to be read would have returned zero, so in a virtual sense are all cleared already. It would be a waste. But it wouldn't be much of a waste if we did immediately need all of that memory and began reading and writing to all of it. Dlang mixins - instead of the C preprocessor and C++ templates, a mixin can be a function which generates code from strings. A template mixin is like a template pattern with substitution using the parameters to insert this in to the program. Instead of typename T, they use 'alias' as a generic type. - D mixins are not as general or as powerful as what can be done with the C preprocessor, but that may be deliberate so as they are more hygenic and are also not another language. There is more safety using mixins. - D template mixins are not as powerful as C++ templates, as there are limitations about type inference that C++ templates can use to generate code based on more nuanced context. D template mixins are good for simple cases. TypeScript mixins - similar to composition of interfaces with default implementations - side-steps the multiple-inheritance diamond problem TypeScript tsconfig.json - instead of Makefiles, a json file has compiler flags and list of sources. - perhaps defining how to invoke the compiler is part of the language? Vue/webpack/hot-reload - the hot-reload feature is great for fast prototyping TypeScript & SASS transcription - great idea to maintain compat, 'compiles' down to js/css Refactoring tools - being able for tools to parse the code is important and helps with better tooling. - eg: resharper - rename variables, functions etc. - reformat code according to a style guide - alphabetically order imports or other unordered lists - format tables to be aligned - consistent whitespace and other style - generate various code blocks automatically, - eg: switch statement for a given enum - a new function or module GraphQL - defined schema, query to this, result in preferred format extensible Atomontage - point clouds ECS - Entity component systems AOS/SOA - structs/arrays - packing, access - easily switchable without large code changes (Jai) Qt - UI framework and libraries Reflection/Object-tree/property pages - reflect to UI the state in memory Rendering backends - OpenGL/Vulkan/DirectX/Metal Scene-Graph - rendering driven by a logical scene-graph (in-memory might not be a graph/tree) SIMD optimizations - instruction level optimization Multi-threading - automatic thread / worker splitting / dynamic web - integrate with websockets and browsers/html etc security - security primatives supplied. Makes it easy to make things secure. Compile-time checks. Static strong typing. Static pointer/reference checking. Bounds checking. Interface/Implementation: Binary only modules? Need to be able to have either: - module with source for interface and implementation - or source of the interface and binary of the implementation How to handle different platforms/cross-platform with binary distribution? Publish process probably should be able to sign the binary Publish ideally not to machine code, but to binary IR which can be compiled and optimized for target. Perhaps binary might be compiled C/C++ code, and the interface is just the equivelent Kai for the interface in C/C++. This means linkage with other languages is possible. Mutation concept: In games, there are frames. Perhaps in a given frame there is the state. Then all the mutations are gathered for given input and given deltas and recalcs and are queued as a list of transactions to apply ready for the next frame. Then for the next frame the state is recalculated by appling all the mutations. I wonder if the concept could be applied to not just games. It means that can not make a state change request and then immediately get the new state. Must wait a frame for this, so not synchronous, but async as to when a change happens. class State { int GetValue() const { return m_value; } State SetValue(int a_newValue) { State s = this; s.m_value = a_newValue; return s; } int m_value; }; class Frame { State m_state; }; void frameLoop() { Frame current; Frame next; while (true) { // render using current // operate / update based on current next.m_state = current.m_state.SetValue(10); // swap current = next; } } Now imagine if make multi-threaded. Avoids race between readers and writers to current. However still race between multiple writers to next. Nullable concept: Is nullable a requied concept? In a pointer based or references based language/system it can become required. But in a stack based system, there are no pointers so no concept of null. The object always exists. So pointer based objects can be not pointed to, which is null, or can point to an uninitialized or empty object, or can point to a valid object. A stack based system can not implicitly get a null for free, but is it desirable. Null pointers are one of the problems of pointer based systems. If fail to check a pointer and then dereference it, the program will likely crash. But then the code becomes littered with null checks decreasing readability. Both a real problems with having pointers. And it is not limited to pointers, the same applies to references which are pointers in disguise. It is only by convention that in C++ it is assumed a reference is a non-null pointer and an actual pointer is a nullable one. You can have a function which takes a reference, and with a pointer to an object, dereference it and pass it to this function without doing a null check and the reference will be a null one. It is by convention that you wouldn't do that and would null check the pointer before dereferencing it to ensure there are no null references, only nullable pointers. Is a null semantically just meaning empty or invalid? Sometimes a value could have sentinel value which means uninitialized or invalid or empty. Like a string which is empty or zero length. But how about an int. A value of 0 or -1 might be valid so can not be the sentinel value to mean empty or invalid. An extra bool needs to be attached to this to handle the concept. This might be quite a cost in space for small objects but may be preferable to having pointers to individual ints. With floats, there is a special sentinel value of NaN which could be used for this. For a bool, it would need to be one of three values, null, true or false. In JavaScript, these three are three different types as apposed to values. Javascript is not implicitly pointer based, but the values that are used are dynamically typed which means the type is carried around with the value. This allows coding in to the type a null state. The size of a null, true or false value will all be the same size which could be a byte. So there are two kinds of safety to consider. Run-time safety to prevent crashing, and semantic safety to ensure correctness of the meaning/interpretation of the data. Run-time safety can be ensured by not having nulls, instead relying on sentinel/default values which might be also valid values. This sacrifices semantic safety as it is ambiguous as to if the sentinel is valid or not. Semantic safety comes at the cost of runtime safety if it can not be guarenteed that all required null checking has been done. A language could perhaps give semantic safety and statically ensure runtime safety at compile time by static analysis of the dereferences to ensure they are after the required null checks. This might not need to be too costly if designed in to the language. This can be achieved with C++ and other languages with linting/analysis tools, but it is not a defined compile time error by the language, so it is an optional thing. Rust is a good example of a language which does this. Optional - compiler checks that you have checked it is not None Borrow checker - can have one mutable reference and no const references, or can have one or many const references. Can then pass things using the restrict keyword to say they don't alias. In kai: Int32! X; // nullable value if (X) { Int32 x = X; // can now use it } translation: template <typename T> struct Nullable { bool m_isNull; T m_value; }; Nullable<Int32> _X; if (!_X.m_isNull) { Int32& X = _X.m_value; Int32 x = X.m_value; } Mutex - lock data, not code. Mutex is added to a struct with templates and need to lock to use the data The lock call on the data returns the accessible data with borrow-checker lifetime of the current scope. Implementation: In Kai code: type Time := Int32; // strong typedef type Duration := Time; // strong typedef, derivative type Int8 x; Time t; x := 1; t := 1; Translates to: Basic types: struct Int8 { int8_t m_value = 0; } // always initialized struct Int16 { int16_t m_value = 0; } struct Int32 { int32_t m_value = 0; } Derived types: struct Time { int32_t m_value = 0; } struct Duration { int32_t m_value = 0; } // The derivative type Consistent access is always with .m_value to get to the value. Int8 x; Time t; x.m_value = 1; t.m_value = 1; Rust lifetimes Foo a; let b = a; // lifetime of 'a' ends here In C++: b = std::move(a); But in C++ it is up to the programmer to not use 'a' after this. It is similar to swap. If 'b' was originally uninitialized, it would be bad to use 'a' now. The goal is to avoid doing deep-copies to transfer values from one variable to another. Using references has the problem of ownership semantics. Which is the owning reference. Another solution is shared_pointers. The cost is loss of explict control of when the deallocation happens. Syntax: '//' - end of line comment '/*'...'*/' - multi-line comment Keywords: - module, interface, enum, type, struct, implementation, function, tests In C/C++, signed integer addition/subtraction have undefined behaviour int x, y; x = x + y; // UB, compile can set fire to computer if it wants unsigned x, y; x = x + y; // not-UB, this is fine Assumes 2's complement int representation - if a machine has an unusual representation, then users of that machine pay this price not everyone else with normal hardware - in many cases, sign vs unsigned are no different - can trivially cast to unsigned, perform the operation, and cast back - this avoids UB if one can assume 2's complement In C, code like this can be accidently written: if ( a = b ) To prevent accidental assignment when wanting to compare: := assignment = comparison Because functional style language, there are no ++ or += etc Assignment only happens to a new variable. Perhaps can infer the type. In C++ templates have been tacked on, templated functions are written like this: template <typename T> T func(T a, T b) { return a + b; } More natural would be: func(type T, T a, T b) => T { return a + b; } So in this way, a type is itself a type that can be passed around. This is C# like. Iteration, C#, python and C++11 range-based loops: for (x : arrayVector) for (x in array) for (x in range(0, 100)) map/reduce/filter, co-routines ? how to make things like map/vector etc with functional programming? vector might be a pointer + length if add elements to it, then it might be the same pointer if there is space at the end the original vector is unaltered if it contains the original pointer + original length the new vector contains the same pointer but a new length same with removing elements at the end a linked list might be possible if can store some kind of edit history perhaps think of transactional DBs and journals. Perhaps keep a journal until no more references to the original. actually, really just create a new vector/container map,reduce,forEach,filter,zip,some/exists,every,find creating a new copy of a vector is fine if the size is not too large or the reduced size is small the internal implementation details can't construct a new vector and still be functional. I think the constraint is on the passed in parameters, not on the local variables, so a local variable can be mutated just fine. That is because it is on the stack and another thread shouldn't have a pointer or reference to it. It begs the question, how to have objects with functions which mutate them but restrict calling them. Perhaps the trick is that parameters to a function will always be passed by const reference, and therefore can never be modified. We can always return by value (RVO will make this cheap). Constructing the returned value can be done as the local variable will not be const and can then call anything on it which mutates it (or constructs it). sized types only (for portability and explicitness - but limits targetting really low-end devices) null, bool, int8, int16, int32, int64, flt32, flt64 perhaps profiles? Can mark a module as not using floats or anything 64bit? string is a first class type string_view is not needed, strings contain the length so can construct a string from another string strings are utf8 string formats are translatable because functional, only operators which make a new output, eg: a = b + c not: a += b // this mutates a so operator overload is limited to a few less things + (T, T) -> T - (T, T) -> T * (T, T) -> T / (T, T) -> T % (T, T) -> T null as a type and a value? void? pointers - ? if functional, are pointers a thing? Some concepts from rust could be 'borrowed' borrowed pointers if need pointers crates unsafe static_asserts supported inline C code perhaps - so can mark as unsafe? keywords: module, interface, enum, type, struct, implementation, function, tests alpha: [a-zA-Z] num: [0-9] alnum: alpha|num identifier: alpha alnum* number: num [0-9a-fA-F]* [h EBNF Starting from a Pascal example: Lexical Grammar <letter> ::= [a-ZA-Z] <digit> ::= [0-9] <id> ::= <letter> { <letter> | <digit> | "_" } <digits> ::= <digit> { <digit> } <integer literal> ::= <digits> <string literal> ::= "\"" { < a char or escape char > } "\"" <real literal> ::= <digits> "." <digits> [ "e" [ <sign> ] <digits>] <string literal> ::= "\"" { < a char or escape char > } "\"" <literal> ::= <integer literal> | <real literal> | <string literal> <operators> ::= "+" | "-" | "*" | "%" | "=" | "!=" | "<" | ">" | "<=" | ">=" | "(" | ")" | "[" | "]" | ":=" | "." | "," | ";" | ":" | "!" | "||" | "|" | "&" | "&&" | "^" | "{" | "}" <keywords> ::= "if" | "then" | "else" | "of" | "while" | "do" | "begin" | "end" | "var" | "array" | "procedure" | "function" | "program" | "assert" <predefined id> ::= "Boolean" | "false" | "integer" | "read" | "real" | "size" | "string" | "true" | "writeln" Context-Free Grammar <program> ::= "program" <id> ";" <block> "." <declaration> ::= "var" <id> { , <id> } ":" <type> | "procedure" <id> "(" parameters ")" ";" <block> | "function" <id> "(" parameters ")" ":" <type> ";" <block> <parameters> ::= [ "var" ] <id> ":" <type> { "," [ "var" ] <id> ":" <type> } | <empty> <type> ::= <simple type> | <array type> <array type> ::= "array" "[" [<integer expr>] "]" "of" <simple type> <simple type> ::= <type id> <block> ::= "begin" <statement> { ";" <statement> } [ ";" ] "end" <statement> ::= <simple statement> | <structured statement> | <declaration> <empty> ::= <simple statement> ::= <assignment statement> | <call> | <return statement> | < read statement> | <write statement> | <assert statement> <assignment statement> ::= <variable> ":=" <expr> <call> ::= <id> "(" <arguments> ")" <arguments> ::= expr { "," expr } | <empty> <return statement> ::= "return" [ expr ] <read statement> ::= "read" "(" <variable> { "," <variable> } ")" <write statement> ::= "writeln" "(" <arguments> ")" <assert statement> ::= "assert" "(" <Boolean expr> ")" <structured statement> ::= <block> | <if statement> | <while statement> <if statement> ::= "if" <Boolean expr> "then" <statement> | "if" <Boolean expr> "then" <statement> "else" <statement> <while statement> ::= "while" <Boolean expr> "do" <statement> <expr> ::= <simple expr> | <simple expr> <relational operator> <simple expr> <simple expr> ::= [ <sign> ] <term> { <adding operator> <term> } <term> ::= <factor> { <multiplying operator> <factor> } <factor> ::= <call> | <variable> | <literal> | "(" <expr> ")" | "not" <factor> | < factor> "." "size" <variable> ::= <variable id> [ "[" <integer expr> "]" ] <relational operator> ::= "=" | "<>" | "<" | "<=" | ">=" | ">" <sign> ::= "+" | "-" <adding operator> ::= "+" | "-" | "or" <multiplying operator> ::= "*" | "/" | "%" | "and" Discussion: UI should be an included part of the language / libraries. Too bad C/C++ do not have a solution for this. They miss out on learning about real problems programs need to solve to be useful. For example i18n. C/C++ does a terrible job about this, but despite attempting to be general, they are simply not. This is because they don't even attempt to tackle UI so it isn't even evident that they fail woefully at i18n. For example uint8_t is an unsigned char, so stream operators acting on it treat it like a char instead of an integer value. There are subtle differences like that that make it hard to do utf right. It can be done, but you have to fight against C++ to do it. It shows horrible bias, leaning towards anglo european users and is not inclusive of large portions of the world that do not used latin based langauges. The defacto standard for cross-platform UI is to use Qt which helps solve these i18n issues. It mostly ignores C++ and creates its own QString class. This causes fractures in the whole C/C++ egosystem with code that is using Qt and code which is not. Similar fractures occur with Boost. Some developers avoid it like the plague and unwilling to allow it to infect their code base, whilst others embrace it. Again causing another division. As C++ has been evolving Boost has proven to be a good testing ground for new C++ features and some of those have been woven in to C++. This evolution does cause some ugly side-effects with code which used Boost and migrates to the standardized C++ versions of these features supporting both variations with conditional compilation. This can be as a result of on different platforms the adoption of those features in the compilers happening at a different pace. Ideally a more formal feature testing/preview system would be better like the TR technical review, but on a larger community wide contribution scale. Governance - community consultation. RFCs. Preview implementations. Test-bed servers, unit tests, regression test suites. A reference implementation, not just a standard which is a document. Aim for zero undefined behavior. Everything is defined or it is a compile error. Take a modern approach of targetting modern machines, not hyperthetical machines or esoteric machines. The code has to work on real machines to solve real problems and now. Adding two ints in C++ can have undefined behaviour if they overflow and the compiler is free to do what it likes in this case. That is a crazy thing as a programmer to accept, particulary given that nearly any machine being targetted today or even in the last several decades uses 2's complement and so is pretty well defined at the machine level. Bootstraping - how much of the libraries should be in the language. Some part will be Core and the rest the wider language / libraries.